5. Run-time analysis¶
5.1. General remarks¶
Frequently, the numerical treatment of scientific problems leads to time-consuming code, and the question arises how its performance can be improved. There are several different kinds of approaches. One could, for example, make use of more powerful hardware or distribute the task over numerous compute nodes of a compute cluster. Even though nowadays human resources tend to be more costly than hardware resources, the latter should not be wasted by very inefficient code.
In certain cases, speed can be improved by making use of graphics processing units (GPUs), which are able to handle larger data sets in parallel. While writing code for GPUs may be cumbersome, there exists for example the CuPy library [1], which can serve as a drop-in replacement for most of the NumPy library and will run on NVIDIA GPUs.
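To give an idea of what "drop-in replacement" means in practice, consider the following minimal sketch. It assumes CuPy is installed and a CUDA-capable NVIDIA GPU is available; the variable names are ours.

import cupy as cp   # instead of: import numpy as np

x = cp.linspace(0, 2*cp.pi, 1_000_000)  # array is allocated on the GPU
y = cp.sin(x)**2                        # evaluated on the GPU
y_host = cp.asnumpy(y)                  # copy the result back to host memory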
On the software side, one option is to make use of optimized code provided by numerical libraries like NumPy and SciPy, discussed in Section 4. Sometimes, it can make sense to implement particularly time-consuming parts in the programming language C, for which highly optimizing compilers are available. This approach does not necessarily require writing C code oneself. One can instead use Cython [2], which autogenerates C code from Python-like code.
Another option is the use of just-in-time (JIT) compilers, as is done in PyPy [3] and Numba [4]. The latter is readily available through the Anaconda distribution and offers an easy way to speed up time-critical parts of the code by simply adding a decorator to a function. Generally, JIT compilers analyze the code before its first execution and create machine code which allows the code to run faster in subsequent calls.
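A minimal sketch of the decorator approach, assuming Numba is installed (the function and its name are our own example):

from numba import njit

@njit
def sum_of_squares(n):
    # plain Python loop, compiled to machine code on the first call
    total = 0
    for i in range(n):
        total += i*i
    return total

sum_of_squares(10)          # first call triggers compilation
sum_of_squares(10_000_000)  # subsequent calls run at compiled speed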
While some of the methods just mentioned can be implemented easily, others may require a significant investment of time. One therefore needs to assess whether the gain in compute time really exceeds the cost in developer time. It is worth following the advice of the eminent computer scientist Donald E. Knuth [5], who wrote 45 years ago [6]:
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97 % of the time: premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
Before beginning to optimize code, it is important to identify the parts where most of the time is spent, and the rest of this chapter will be devoted to techniques which allow one to do so. At this point, it is worth emphasizing that a piece of code which executes relatively fast can be more relevant than a slowly executing piece of code if the first one is executed very often while the second one is executed only once.
Code optimization often entails the risk of introducing bugs into the code. Obviously, fast-running code is not worth anything if it does not produce correct results. It can then be very reassuring if one can rely on a comprehensive test suite (Section 3) and if one has made use of a version control system (Section 2) during code development and code optimization.
Before discussing techniques for timing code execution, we will discuss a few potential pitfalls.
5.2. Some pitfalls in run-time analysis¶
A simple approach to performing a run-time analysis of code could consist in taking the time before and after the code is executed. The time module in the Python standard library offers the possibility to determine the current time:
>>> import time
>>> time.ctime()
'Thu Dec 27 11:13:33 2018'
While this result is nicely readable, it is not well suited to calculate time differences. For this purpose, the seconds passed since the beginning of the epoch are better suited. On Unix systems, the epoch starts on January 1, 1970 at 00:00:00 UTC:
>>> time.time()
1545905769.0189064
Now, it is straightforward to determine the time elapsed during the execution of a piece of code. The following code repeats the execution several times to convey an idea of the fluctuations to be expected.
import time

for _ in range(10):
    sum_of_ints = 0
    start = time.time()
    for n in range(1000000):
        sum_of_ints = sum_of_ints + 1
    end = time.time()
    print(f'{end-start:5.3f}s', end=' ')
Executing this code yields:
0.142s 0.100s 0.093s 0.093s 0.093s 0.093s 0.092s 0.091s 0.091s 0.091s
Doing a second run on the same hardware, we obtained:
0.131s 0.095s 0.085s 0.085s 0.088s 0.085s 0.084s 0.085s 0.085s 0.085s
While these numbers give an idea of the execution time, they should not be taken too literally. In particular, it makes sense to average over several loops. This is facilitated by the timeit module in the Python standard library, which we will discuss in the following section.
When performing a run-time analysis as just described, one should be aware that a computer may be occupied by other tasks as well. In general, the total elapsed time will thus differ from the time actually needed to execute a specific piece of code. The time module therefore provides two functions. In addition to the time function, which records the wall clock time, there exists a process_time function which counts the time attributed to the specific process running our Python script. The following example demonstrates the difference by intentionally letting the program pause for a second once in a while. Note that although the execution of time.sleep occurs within the process under consideration, the time it needs is ignored by process_time. Therefore, we can use time.sleep to simulate other activities of the computer, even if it does so in a somewhat inappropriate way.
import time

sum_of_ints = 0
start = time.time()
start_proc = time.process_time()
for n in range(10):
    for m in range(100000):
        sum_of_ints = sum_of_ints + 1
    time.sleep(1)
end = time.time()
end_proc = time.process_time()
print(f'total time: {end-start:5.3f}s')
print(f'process time: {end_proc-start_proc:5.3f}s')
In a run on the same hardware as used before, we find the following result:
total time: 10.207s
process time: 0.197s
The difference basically consists of the ten seconds spent while the code was sleeping.
One should also be aware that enclosing the code in question in a function leads to an additional contribution to the execution time. This poses a problem particularly if the execution of the code itself requires only little time. We compare the two scripts
import time

sum_of_ints = 0
start_proc = time.process_time()
for n in range(10000000):
    sum_of_ints = sum_of_ints + 1
end_proc = time.process_time()
print(f'process time: {end_proc-start_proc:5.3f}s')
and
import time

def increment_by_one(x):
    return x+1

sum_of_ints = 0
start_proc = time.process_time()
for n in range(10000000):
    sum_of_ints = increment_by_one(sum_of_ints)
end_proc = time.process_time()
print(f'process time: {end_proc-start_proc:5.3f}s')
The first script takes 0.9 seconds on average over 10 runs while the second script takes 1.1 seconds and thus runs about 20% slower.
Independently of the method used, even one of the methods discussed later, a run-time analysis will always influence the execution of the code. The measured run time will therefore be larger than the run time without any timing. However, we should still be able to identify the parts of the code which take most of the time.
A disadvantage of the methods discussed so far is that they require a modification of the code. Usually, it is desirable to avoid such modifications as much as possible. In the following sections, we will present a few timing techniques which can be used according to the specific needs.
5.3. The timeit module¶
Short isolated pieces of code can conveniently be analyzed by functions provided by the timeit module. By default, the average execution time will be determined on the basis of one million runs. As a first example, let us determine the execution time for the evaluation of the square of 0.5:
>>> import timeit
>>> timeit.timeit('0.5**2')
0.02171438499863143
The result is given in seconds. In view of one million code executions, we obtain an execution time of about 22 nanoseconds per evaluation. If we want to use a variable, we cannot define it in the outer scope:
>>> x = 0.5
>>> timeit.timeit('x**2')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.6/timeit.py", line 233, in timeit
return Timer(stmt, setup, timer, globals).timeit(number)
File "/opt/anaconda3/lib/python3.6/timeit.py", line 178, in timeit
timing = self.inner(it, self.timer)
File "<timeit-src>", line 6, in inner
NameError: name 'x' is not defined
Instead, we can pass the global namespace through the globals argument:
>>> x = 0.5
>>> timeit.timeit('x**2', globals=globals())
0.103586286000791
As an alternative, one can explicitly assign the variable x in the second argument, which is intended for setup code. Its execution time is not taken into account:
>>> timeit.timeit('x**2', 'x=0.5')
0.08539198899961775
If we want to compare with the pow function of the math module, we have to add the import statement to the setup code as well:
>>> timeit.timeit('math.pow(x, 2)', 'import math; x=0.5')
0.2346674630025518
A more complex example of the use of the timeit module compares the evaluation of a trigonometric function by means of a NumPy universal function with the use of the corresponding function of the math module:
import math
import timeit
import numpy as np
import matplotlib.pyplot as plt

def f_numpy(nmax):
    x = np.linspace(0, np.pi, nmax)
    result = np.sin(x)

def f_math(nmax):
    dx = math.pi/(nmax-1)
    result = [math.sin(n*dx) for n in range(nmax)]

x = []
y = []
for n in np.logspace(0.31, 6, 300):
    nint = int(n)
    t_numpy = timeit.timeit('f_numpy(nint)', number=10, globals=globals())
    t_math = timeit.timeit('f_math(nint)', number=10, globals=globals())
    x.append(nint)
    y.append(t_math/t_numpy)

plt.rc('text', usetex=True)
plt.plot(x, y, 'o')
plt.xscale('log')
plt.xlabel('vector size', fontsize=20)
plt.ylabel(r'$t_\mathrm{math}/t_\mathrm{numpy}$', fontsize=20)
plt.show()
The result is displayed in Figure 5.1.
We close this section with two remarks. If one wants to assess the fluctuations of the measured execution times, one can replace the timeit function by the repeat function:
>>> x = 0.5
>>> timeit.repeat('x**2', repeat=10, globals=globals())
[0.1035151930009306, 0.07390781700087246, 0.06162133299949346,
0.05376200799946673, 0.05260805999932927, 0.05276966699966579,
0.05227632500100299, 0.052304120999906445, 0.0523306600007345,
0.05286436900132685]
For users of the IPython shell or the Jupyter notebook, the magics %timeit and %%timeit provide a simple way to time the execution of a single line of code or a code cell, respectively. These magics choose a reasonable number of repetitions to obtain good statistics within a reasonable amount of time.
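For instance, in an IPython session one might time the squaring operation as follows; the output shown is merely illustrative and will depend on the hardware:

In [1]: x = 0.5

In [2]: %timeit x**2
41.8 ns ± 0.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)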
5.4. The cProfile module¶
The timeit module discussed in the previous section is useful for determining the execution time of one-liners or very short code segments. It is not of much use, though, when it comes to determining the compute-time intensive parts of a bigger program. If the program is nicely modularized into functions and methods, the cProfile module will be of help. It determines how much time is spent in the individual functions and methods and thereby gives valuable information about which parts will benefit from code optimization.
We consider as a specific example the quantum mechanical time evolution of a narrow Gaussian wave packet initially localized at the center of an infinite potential well [7]. The initial state is decomposed into the appropriately truncated eigenbasis of the potential well. Once the coefficients of the expansion are known, it is straightforward to determine the state at any later time. The time evolution of the probability density is shown in Figure 5.2.
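In formulas, and in the dimensionless units used by the script below, where the eigenenergies are proportional to $(n+1)^2$, the time-evolved state reads (our notation, with $\varphi_n$ the eigenfunctions of a well of width $w$):

$$\psi(x, t) = \sum_{n=0}^{n_\mathrm{base}-1} c_n\,\varphi_n(x)\,\mathrm{e}^{-\mathrm{i}(n+1)^2 t},\qquad c_n = \int_{-w/2}^{w/2} \varphi_n(x)\,\psi_0(x)\,\mathrm{d}x$$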
This figure has been obtained by means of the following Python script called carpet.py.
 1 from math import cos, exp, pi, sin, sqrt
 2 from cmath import exp as cexp
 3 import numpy as np
 4 import matplotlib.pyplot as plt
 5 from matplotlib import cm
 6
 7 class InfiniteWell:
 8     def __init__(self, psi0, width, nbase, nint):
 9         self.width = width
10         self.nbase = nbase
11         self.nint = nint
12         self.coeffs = self.get_coeffs(psi0)
13
14     def eigenfunction(self, n, x):
15         if n % 2:
16             return sqrt(2/self.width)*sin((n+1)*pi*x/self.width)
17         return sqrt(2/self.width)*cos((n+1)*pi*x/self.width)
18
19     def get_coeffs(self, psi):
20         coeffs = []
21         for n in range(self.nbase):
22             f = lambda x: psi(x)*self.eigenfunction(n, x)
23             c = trapezoidal(f, -0.5*self.width, 0.5*self.width, self.nint)
24             coeffs.append(c)
25         return coeffs
26
27     def psi(self, x, t):
28         psit = 0
29         for n, c in enumerate(self.coeffs):
30             psit = psit + c*cexp(-1j*(n+1)**2*t)*self.eigenfunction(n, x)
31         return psit
32
33 def trapezoidal(func, a, b, nint):
34     delta = (b-a)/nint
35     integral = 0.5*(func(a)+func(b))
36     for k in range(1, nint):
37         integral = integral+func(a+k*delta)
38     return delta*integral
39
40 def psi0(x):
41     sigma = 0.005
42     return exp(-x**2/(2*sigma))/(pi*sigma)**0.25
43
44 w = InfiniteWell(psi0=psi0, width=2, nbase=100, nint=1000)
45 x = np.linspace(-0.5*w.width, 0.5*w.width, 500)
46 ntmax = 1000
47 z = np.zeros((500, ntmax))
48 for n in range(ntmax):
49     t = 0.25*pi*n/(ntmax-1)
50     y = np.array([abs(w.psi(x, t))**2 for x in x])
51     z[:, n] = y
52 z = z/np.max(z)
53 plt.rc('text', usetex=True)
54 plt.imshow(z, cmap=cm.hot)
55 plt.xlabel('$t$', fontsize=20)
56 plt.ylabel('$x$', fontsize=20)
57 plt.show()
This code is by no means optimal. After all, we want to discuss strategies to find out where most of the compute time is spent and what we can do to improve the situation. Before doing so, let us get a general idea of how the code works.
The initial wave function is the Gaussian defined in the function psi0 in lines 40-42. Everything related to the basis functions is collected in the class InfiniteWell. During the instantiation, we define the initial wave function psi0, the total width of the well width, the number of basis states nbase, and the number of integration points nint to be used when determining the coefficients. The expansion coefficients of the initial state are determined in get_coeffs, defined in lines 19-25 and called from line 12. The value of the eigenfunction corresponding to eigenvalue n at position x is obtained by means of the method eigenfunction defined in lines 14-17. The integration is carried out very simply according to the trapezoidal rule as defined in the function trapezoidal in lines 33-38. The wave function at a given point x and a given time t is calculated by the method psi defined in lines 27-31. The code from line 44 to the end serves to calculate the time evolution and to render the image shown in Figure 5.2.
In this version of the code, we deliberately do not make use of NumPy except to obtain the image. Of course, NumPy would provide a significant speedup right away and one would probably never write the code in the way shown here. But it provides a good starting point to learn about run-time analysis. Where does the code spend most of its time?
To address this question, we make use of the cProfile module contained in the Python standard library. Among the various ways of using this module, we choose one which avoids having to change our script:
$ python -m cProfile -o carpet.prof carpet.py
This command runs the script carpet.py under the control of the cProfile module. The option -o carpet.prof indicates that the results of this profiling run are stored in the file carpet.prof. This binary file allows the obtained data to be analyzed in various ways by means of the pstats module. Let us try it out:
>>> import pstats
>>> p = pstats.Stats('carpet.prof')
>>> p.sort_stats('time').print_stats(15)
Tue Jan 22 17:04:46 2019 carpet.prof
202073424 function calls (202065686 primitive calls) in 633.175 seconds
Ordered by: internal time
List reduced from 3684 to 15 due to restriction <15>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 50100100  232.867    0.000  354.468    0.000 carpet.py:14(eigenfunction)
   500000  184.931    0.000  623.191    0.001 carpet.py:27(psi)
 50000000   84.417    0.000   84.417    0.000 {built-in method cmath.exp}
 50100101   55.301    0.000   55.301    0.000 {built-in method math.sqrt}
 25050064   33.170    0.000   33.170    0.000 {built-in method math.cos}
 25050064   33.129    0.000   33.129    0.000 {built-in method math.sin}
        1    3.326    3.326    4.341    4.341 {built-in method exec_}
     1000    1.878    0.002  625.763    0.626 carpet.py:50(<listcomp>)
   503528    0.699    0.000    0.699    0.000 {built-in method builtins.abs}
     1430    0.372    0.000    0.372    0.000 {method 'read' of '_io.BufferedReader' objects}
   100100    0.348    0.000    1.386    0.000 carpet.py:22(<lambda>)
        2    0.300    0.150    0.300    0.150 {built-in method statusBar}
   100100    0.294    0.000    0.413    0.000 carpet.py:40(psi0)
      100    0.154    0.002    1.540    0.015 carpet.py:33(trapezoidal)
   100101    0.119    0.000    0.119    0.000 {built-in method math.exp}
After having imported the pstats module, we load our profiling file carpet.prof to obtain a statistics object p. The data can then be sorted with the sort_stats method according to different criteria. Here, we have chosen the time spent in a function. Since the list is potentially very long, we have restricted the output to 15 entries by means of the print_stats method.
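Other sort criteria can be used in the same way. For instance, sorting by cumulative time often makes the call hierarchy more apparent; a usage sketch, output omitted:

>>> p.sort_stats('cumulative').print_stats(15)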
Let us take a look at the information provided by the run-time statistics. Each line corresponds to one of the 15 most time-consuming functions and methods out of a total of 3684. The total time of about 633 seconds is mostly spent in the method psi listed in the second line. There are actually two times given here. The total time (tottime) of about 185 seconds counts only the time actually spent inside the method itself. The time required to execute functions called from psi is not counted. In contrast, these times do count towards the cumulative time (cumtime). An important part of the difference can be explained by the evaluation of the eigenfunctions as listed in the first line.
Since we have sorted according to time, which actually corresponds to tottime, the first line lists the method consuming most of the time. Even though the time needed per call is so small that it is given as 0.000 seconds, this method is called very often. In the column ncalls, the corresponding value is listed as 50100100. In such a situation, it makes sense to check whether this number can be understood.
In a first step, the expansion coefficients need to be determined. We use 100 basis functions as specified by nbase and 1001 nodes, which is one more than the value of nint. This results in a total of 100100 evaluations of eigenfunctions, actually less than a percent of the total number of evaluations. In order to determine the data displayed in Figure 5.2, we evaluate 100 eigenfunctions on a grid of size 500×1000, resulting in a total of 50000000 evaluations, as the quick check below confirms.
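The bookkeeping can be verified with a few lines of Python, using the parameters of the script:

nbase, nint = 100, 1000          # parameters used in carpet.py
coeff_calls = nbase*(nint+1)     # coefficients: 100 basis functions, 1001 nodes
grid_calls = nbase*500*1000      # 100 eigenfunctions on a 500x1000 grid
print(coeff_calls, grid_calls, coeff_calls+grid_calls)
# 100100 50000000 50100100 -> matches ncalls for eigenfunction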
These considerations show that we do not need to bother with improving the code determining the expansion coefficients. However, the situation might be quite different if we were not calculating data on a relatively large grid. Thinking a bit more about it, we realize that the number of 50000000 evaluations for the time evolution is much too big. After all, we are evaluating the eigenfunctions at 500 different positions for 100 eigenfunctions, which requires only 50000 evaluations. For each of the 1000 time values, we are unnecessarily recalculating the eigenfunctions for the same arguments. Avoiding this waste of compute time could speed up our script significantly.
There are basically two ways to do so. We could restructure our program in such a way that we evaluate the grid for constant position along the time direction. Then, we just need to keep the values of the 100 eigenfunctions at a given position. If we want to have the freedom to evaluate the wave function at a given position and time on a certain grid, we could also store the values of all eigenfunctions at all positions on the grid in a cache for later reuse. This is an example of trading compute time against memory. We will implement the latter idea in the next version of our script listed below. [8]
 1 from math import cos, exp, pi, sin, sqrt
 2 from cmath import exp as cexp
 3 import numpy as np
 4 import matplotlib.pyplot as plt
 5 from matplotlib import cm
 6
 7 class InfiniteWell:
 8     def __init__(self, psi0, width, nbase, nint):
 9         self.width = width
10         self.nbase = nbase
11         self.nint = nint
12         self.coeffs = self.get_coeffs(psi0)
13         self.eigenfunction_cache = {}
14
15     def eigenfunction(self, n, x):
16         if n % 2:
17             return sqrt(2/self.width)*sin((n+1)*pi*x/self.width)
18         return sqrt(2/self.width)*cos((n+1)*pi*x/self.width)
19
20     def get_coeffs(self, psi):
21         coeffs = []
22         for n in range(self.nbase):
23             f = lambda x: psi(x)*self.eigenfunction(n, x)
24             c = trapezoidal(f, -0.5*self.width, 0.5*self.width, self.nint)
25             coeffs.append(c)
26         return coeffs
27
28     def psi(self, x, t):
29         if not x in self.eigenfunction_cache:
30             self.eigenfunction_cache[x] = [self.eigenfunction(n, x)
31                                            for n in range(self.nbase)]
32         psit = 0
33         for n, (c, ef) in enumerate(zip(self.coeffs, self.eigenfunction_cache[x])):
34             psit = psit + c*ef*cexp(-1j*(n+1)**2*t)
35         return psit
36
37 def trapezoidal(func, a, b, nint):
38     delta = (b-a)/nint
39     integral = 0.5*(func(a)+func(b))
40     for k in range(1, nint):
41         integral = integral+func(a+k*delta)
42     return delta*integral
43
44 def psi0(x):
45     sigma = 0.005
46     return exp(-x**2/(2*sigma))/(pi*sigma)**0.25
47
48 w = InfiniteWell(psi0=psi0, width=2, nbase=100, nint=1000)
49 x = np.linspace(-0.5*w.width, 0.5*w.width, 500)
50 ntmax = 1000
51 z = np.zeros((500, ntmax))
52 for n in range(ntmax):
53     t = 0.25*pi*n/(ntmax-1)
54     y = np.array([abs(w.psi(x, t))**2 for x in x])
55     z[:, n] = y
56 z = z/np.max(z)
57 plt.rc('text', usetex=True)
58 plt.imshow(z, cmap=cm.hot)
59 plt.xlabel('$t$', fontsize=20)
60 plt.ylabel('$x$', fontsize=20)
61 plt.show()
Now, we check in the method psi whether the eigenfunction cache already contains data for a given position x. If this is not the case, the required values are calculated and the cache is updated.
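As an aside, the Python standard library provides memoization out of the box through functools.lru_cache, which could replace the hand-rolled dictionary. A minimal sketch of this alternative, written here as a module-level function with hashable arguments, is our own variant and not the version used in the script above:

from functools import lru_cache
from math import cos, pi, sin, sqrt

@lru_cache(maxsize=None)
def eigenfunction(n, x, width):
    # results are cached per (n, x, width), so repeated calls are cheap
    if n % 2:
        return sqrt(2/width)*sin((n+1)*pi*x/width)
    return sqrt(2/width)*cos((n+1)*pi*x/width)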
As a result of this modification of the code, the profiling data change considerably:
Tue Jan 22 17:10:33 2019 carpet.prof
52205250 function calls (52197669 primitive calls) in 185.581 seconds
Ordered by: internal time
List reduced from 3670 to 15 due to restriction <15>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   500000   97.612    0.000  177.453    0.000 carpet.py:28(psi)
 50000000   79.415    0.000   79.415    0.000 {built-in method cmath.exp}
     1000    2.162    0.002  180.342    0.180 carpet.py:54(<listcomp>)
        1    1.176    1.176    2.028    2.028 {built-in method exec_}
   503528    0.732    0.000    0.732    0.000 {built-in method builtins.abs}
   150100    0.596    0.000    0.954    0.000 carpet.py:15(eigenfunction)
   100100    0.353    0.000    1.349    0.000 carpet.py:23(<lambda>)
     1430    0.323    0.000    0.323    0.000 {method 'read' of '_io.BufferedReader' objects}
        2    0.301    0.151    0.301    0.151 {built-in method statusBar}
   100100    0.278    0.000    0.396    0.000 carpet.py:44(psi0)
   150101    0.171    0.000    0.171    0.000 {built-in method math.sqrt}
      100    0.140    0.001    1.489    0.015 carpet.py:37(trapezoidal)
   100101    0.119    0.000    0.119    0.000 {built-in method math.exp}
    75064    0.095    0.000    0.095    0.000 {built-in method math.sin}
    75064    0.093    0.000    0.093    0.000 {built-in method math.cos}
We observe a speed-up by a factor of about 3.4 by investing about 500×100×8 bytes of memory, i.e. roughly 400 kB. The exact value will be slightly different because we have stored the data in a dictionary and not in an array, but clearly we are not talking about a huge amount of memory. The time needed to evaluate the eigenfunctions has dropped so much that it can be neglected compared to the time required by the method psi and the evaluation of the complex exponential function.
The compute time could be reduced further by caching the values of the complex exponential function as well; a possible implementation is sketched below. In fact, we unnecessarily recalculate each value 500 times. However, there are still almost 98 seconds spent in the rest of the psi method. We will see in the following section how one can find out which lines of the code are responsible for this important contribution to the total run time.
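To give an idea of what such an additional cache might look like, here is a hedged sketch of a further modified psi method. It assumes that an empty dictionary self.phase_cache is created in __init__ and is our own variant, not part of the scripts discussed in this chapter:

    def psi(self, x, t):
        if not x in self.eigenfunction_cache:
            self.eigenfunction_cache[x] = [self.eigenfunction(n, x)
                                           for n in range(self.nbase)]
        if not t in self.phase_cache:
            # the phase factors depend only on n and t, not on the position x
            self.phase_cache[t] = [cexp(-1j*(n+1)**2*t)
                                   for n in range(self.nbase)]
        psit = 0
        for c, ef, phase in zip(self.coeffs, self.eigenfunction_cache[x],
                                self.phase_cache[t]):
            psit = psit + c*ef*phase
        return psit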
Before doing so, we want to present a version of the code designed to profit from NumPy from the very beginning:
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

class InfiniteWell:
    def __init__(self, psi0, width, nbase, nint):
        self.width = width
        self.nbase = nbase
        self.nint = nint
        self.coeffs = trapezoidal(lambda x: psi0(x)*self.eigenfunction(x),
                                  -0.5*self.width, 0.5*self.width, self.nint)

    def eigenfunction(self, x):
        assert x.ndim == 1
        normalization = sqrt(2/self.width)
        args = (np.arange(self.nbase)[:, np.newaxis]+1)*np.pi*x/self.width
        result = np.empty((self.nbase, x.size))
        result[0::2, :] = normalization*np.cos(args[0::2])
        result[1::2, :] = normalization*np.sin(args[1::2])
        return result

    def psi(self, x, t):
        coeffs = self.coeffs[:, np.newaxis]
        eigenvals = np.arange(self.nbase)[:, np.newaxis]
        tvals = t[:, np.newaxis, np.newaxis]
        psit = np.sum(coeffs * self.eigenfunction(x)
                      * np.exp(-1j*(eigenvals+1)**2*tvals), axis=-2)
        return psit

def trapezoidal(func, a, b, nint):
    delta = (b-a)/nint
    x = np.linspace(a, b, nint+1)
    integrand = func(x)
    integrand[..., 0] = 0.5*integrand[..., 0]
    integrand[..., -1] = 0.5*integrand[..., -1]
    return delta*np.sum(integrand, axis=-1)

def psi0(x):
    sigma = 0.005
    return np.exp(-x**2/(2*sigma))/(np.pi*sigma)**0.25

w = InfiniteWell(psi0=psi0, width=2, nbase=100, nint=1000)
x = np.linspace(-0.5*w.width, 0.5*w.width, 500)
t = np.linspace(0, np.pi/4, 1000)
z = np.abs(w.psi(x, t))**2
z = z/np.max(z)
plt.rc('text', usetex=True)
plt.imshow(z.T, cmap=cm.hot)
plt.xlabel('$t$', fontsize=20)
plt.ylabel('$x$', fontsize=20)
plt.show()
The structure of the code is essentially unchanged, but we are making use of universal functions in several places. In the method psi, a three-dimensional array is used, with axes 0 to 2 corresponding to time, eigenvalue, and position, respectively; the shape check below illustrates the broadcasting at work.
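The broadcasting can be made explicit with dummy arrays of the dimensions used here (an illustration, not part of the script):

import numpy as np

nt, nbase, nx = 1000, 100, 500
coeffs = np.ones((nbase, 1))           # axes: eigenvalue, position
efuncs = np.ones((nbase, nx))
phases = np.ones((nt, 1, 1))           # axes: time, eigenvalue, position
product = coeffs * efuncs * phases     # broadcasts to (nt, nbase, nx)
print(product.shape)                   # (1000, 100, 500)
print(np.sum(product, axis=-2).shape)  # (1000, 500): summed over eigenvalues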
A run-time analysis yields the following result:
Tue Jan 22 17:12:41 2019 carpet.prof
457912 function calls (450291 primitive calls) in 4.644 seconds
Ordered by: internal time
List reduced from 3682 to 15 due to restriction <15>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.707    1.707    2.543    2.543 {built-in method exec_}
        1    0.410    0.410    0.481    0.481 carpet.py:23(psi)
     1430    0.284    0.000    0.284    0.000 {method 'read' of '_io.BufferedReader' objects}
        2    0.276    0.138    0.276    0.138 {built-in method statusBar}
      407    0.074    0.000    0.074    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    48/46    0.056    0.001    0.061    0.001 {built-in method _imp.create_dynamic}
    35469    0.049    0.000    0.053    0.000 {built-in method builtins.isinstance}
      284    0.048    0.000    0.048    0.000 {built-in method marshal.loads}
        1    0.039    0.039    0.040    0.040 {built-in method show}
        2    0.036    0.018    0.210    0.105 /opt/anaconda3/lib/python3.7/site-packages/matplotlib/font_manager.py:1198(_findfont_cached)
  418/185    0.028    0.000    0.099    0.001 /opt/anaconda3/lib/python3.7/sre_parse.py:475(_parse)
    23838    0.028    0.000    0.028    0.000 {method 'lower' of 'str' objects}
       63    0.027    0.000    0.027    0.000 {built-in method io.open}
    21637    0.025    0.000    0.025    0.000 {method 'startswith' of 'str' objects}
1169/1101    0.025    0.000    0.200    0.000 {built-in method builtins.__build_class__}
Since the run time obtained under profiling is longer than the actual run time, we have measured the latter for both the first version of the script and the NumPy version, resulting in a speed-up by a factor of 90. Even though this improvement is impressive, the code given above could be improved even further. However, as this is not the main purpose of this chapter, we rather turn our attention to how one can do profiling on a line-by-line basis.
5.5. Line oriented run-time analysis¶
In the second version of the script discussed in the previous section, we had seen that by far most of the time was spent in the method psi. Almost half of the time was spent in the complex exponential function, so a significant amount of time must be spent elsewhere in the method. At this point, the cProfile module is not of much help, as it only works on the function level.
Fortunately, there is a line profiling tool available. However, it is not part of the Anaconda distribution and needs to be installed separately. The package is called line_profiler and can be found on the Python Package Index (PyPI). It can be installed either into a virtual environment or into a conda environment.
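Depending on the environment, the installation might look as follows; the conda package may be provided by the conda-forge channel:

$ pip install line_profiler

or

$ conda install -c conda-forge line_profiler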
Line profiling adds some overhead to the code execution, so it makes sense to limit it to the most important function or a few of them. This can easily be done by decorating the function in question with @profile, as sketched below. Since we know that the psi method constitutes the bottleneck of our calculation, we only decorate that method.
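Schematically, the decorated method in carpet.py then looks as follows; note that the @profile decorator is injected by kernprof at run time, so no import is required:

    @profile
    def psi(self, x, t):
        ...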
Running the line profiler on our script carpet.py is then done by [9]:
$ kernprof -l -v carpet.py
Here, the option -l requests the line-by-line profiler and -v allows us to view the results immediately, in addition to storing them in a file with the extension lprof. We obtain the following result:
Wrote profile results to carpet.py.lprof
Timer unit: 1e-06 s
Total time: 306.266 s
File: carpet.py
Function: psi at line 27
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    27                                           @profile
    28                                           def psi(self, x, t):
    29    500000    1171076.0      2.3      0.4      if not self.coeffs:
    30         1     348711.0 348711.0      0.1          self.get_coeffs(psi0)
    31    500000    1435067.0      2.9      0.5      if not x in self.eigenfunction_cache:
    32       500       1256.0      2.5      0.0          self.eigenfunction_cache[x] = [self.eigenfunction(n, x)
    33       500     105208.0    210.4      0.0              for n in range(self.nbase)]
    34    500000    1190160.0      2.4      0.4      psit = 0
    35  50500000  132196091.0      2.6     43.2      for n, (c, ef) in enumerate(zip(self.coeffs, self.eigenfunction_cache[x])):
    36  50000000  168606042.0      3.4     55.1          psit = psit + c*ef*cexp(-1j*(n+1)**2*t)
    37    500000    1212643.0      2.4      0.4      return psit
The timing information only refers to the function on which the line profiler is run. We can see here that the for loop is responsible for a significant portion of the execution time. Making use of NumPy arrays can improve the performance of the code dramatically as we have seen at the end of the previous section.
By using the option -v, we were able to see the result of the profiling run immediately. In addition, a file carpet.py.lprof has been created. The profiling results can be obtained from it later by means of:
python -m line_profiler carpet.py.lprof