Agenda:

- the timeit module
- matplotlib

The time module contains functions for obtaining and interpreting the current system time.
import time
time.time()
time.localtime(time.time())
By taking start and stop "timestamps", we can measure the runtime of code:
start = time.time()
time.sleep(0.5)
end = time.time()
end - start
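The start/stop pattern above can also be packaged as a reusable context manager (a sketch using the standard library's contextlib; the `timer` name is our own invention):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer():
    # yields a list that receives the elapsed time once the block exits
    start = time.time()
    elapsed = []
    try:
        yield elapsed
    finally:
        elapsed.append(time.time() - start)

with timer() as t:
    time.sleep(0.5)

t[0]  # elapsed seconds, roughly 0.5
```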
We can build a utility function for timing the execution of a passed-in function:
def timeit(fn):
    start = time.time()
    fn()
    end = time.time()
    return end - start
sum(range(10_000))
timeit(lambda: sum(range(10_000)))
To make timings more stable, we can run the passed-in function multiple times:
def timeit(fn, number=1):
    total = 0
    for _ in range(number):
        start = time.time()
        fn()
        end = time.time()
        total += end - start
    return total
timeit(lambda: sum(range(10_000)), number=1000)
Often we want to time just a portion of a function, or we need to run some setup code (that we don't wish to time) before running the code being timed. The built-in timeit module supports both:
import timeit
timeit.timeit('sum(r)', setup='r = range(10_000)', number=1000)
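A related helper is timeit.repeat, which runs the entire timing loop several times; taking the minimum of the repetitions gives a measurement less contaminated by transient system load:

```python
import timeit

# five independent runs of 1000 executions each
times = timeit.repeat('sum(r)', setup='r = range(10_000)', number=1000, repeat=5)
min(times)
```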
We can easily use this to gather timings for multiple input values:
[timeit.timeit('sum(r)',
setup='r = range({})'.format(n),
number=1000)
for n in range(1000, 10_000, 1000)]
Sometimes we might want to make use of functions defined in our notebook in the timed/setup code passed to timeit. We need to use the globals argument for this:
def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)
[timeit.timeit('fib({})'.format(n),
number=100,
globals=globals()) # recall: "globals()" returns a dictionary of everything
# defined in this module; timeit needs it to access `fib`
for n in range(1, 11)]
matplotlib

The matplotlib library supports the creation of all sorts of visualizations. We will use it for drawing simple 2-dimensional plots.
The primary plotting function we will use is matplotlib.pyplot.plot (see its full documentation), which, when passed two "array-like" objects of equal length, will interpret and plot their contents as x and y axis coordinates. We will generally use tuples, lists, and ranges as array-like objects. Note that generators are not considered array-like by matplotlib.

Some examples (note that we use a semicolon after the call to plot to hide its return value):
import matplotlib.pyplot as plt
import numpy as np
import math
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 6] # set size of plot
plt.plot([1, 2, 3, 4, 5], [50, 20, 30, 10, 40]);
xs = range(11)
ys = [x*2 for x in xs]
plt.plot(xs, ys);
We can also provide an optional format string to plot, which controls the marker, line-style, and color for the plot. Here's a shortened list of options copied from the full documentation of plot:

Markers

- '.' : point marker
- 'o' : circle marker
- 's' : square marker
- 'd' : diamond marker

Line-styles

- '-' : solid line style
- '--' : dashed line style
- ':' : dotted line style

Colors

- 'k' : black
- 'r' : red
- 'g' : green
- 'b' : blue
- 'y' : yellow
- 'c' : cyan
- 'm' : magenta

Here are the above plots with some color and styling (if we omit a line style, no connecting line is drawn between data points):
plt.plot([1, 2, 3, 4, 5], [50, 20, 30, 10, 40], 'o--r');
xs = range(11)
ys = [x*2 for x in xs]
plt.plot(xs, ys, 'dg');
Instead of regular range objects, which only allow for integral start/stop/step values, we typically prefer to use the numpy library's arange and linspace functions with matplotlib. arange is like range, except we can use floating point values for start/stop/step. linspace lets us specify start and stop values (both inclusive), and the number of values to return in that interval.

Examples of arange and linspace calls (note that both functions return numpy arrays, which are iterable and can be passed to plot):
np.arange(0.5, 2.5, 0.1)
np.linspace(10, 20, 41)
np.linspace(1, 100_000, 50, dtype=int) # we can specify the data type to coerce values into integers
plot can be called multiple times in the same cell to draw multiple lines in the same chart. Below we use this facility together with linspace and a handful of list comprehensions to plot some common runtime complexity bounding functions (more on that soon) over a small interval:
count = 100
xs = np.linspace(0.1, 4, count)
ys_const = [1] * count
ys_log = [math.log(x) for x in xs]
ys_linear = [x for x in xs]
ys_linearithmic = [x * math.log(x) for x in xs]
ys_quadratic = [x**2 for x in xs]
ys_exponential = [2**x for x in xs]
plt.plot(xs, ys_const, 'c')
plt.plot(xs, ys_log, 'r')
plt.plot(xs, ys_linear, 'b')
plt.plot(xs, ys_linearithmic, 'g')
plt.plot(xs, ys_quadratic, 'y');
plt.plot(xs, ys_exponential, 'm');
Plotting timing data collected from functions may help give us a sense of how their runtimes scale with increasing input sizes.
ns = np.linspace(10, 10_000, 50, dtype=int)
ts = [timeit.timeit('sum(range({}))'.format(n), number=100)
for n in ns]
plt.plot(ns, ts, 'or');
Clearly, the runtime of sum is directly proportional to the number of values it operates on.

If we assume a linear relationship, we can compute the average slope between adjacent data points to come up with a line of approximate fit that may help us predict the runtime of sum.
total = 0
for i in range(len(ns)-1):
    x0, x1 = ns[i:i+2]
    y0, y1 = ts[i:i+2]
    total += (y1-y0) / (x1-x0) # recall: slope is (rise/run)
avg_slope = total / (len(ns)-1)
plt.plot(ns, ts, 'or')
plt.plot(ns, [avg_slope*n for n in ns], '--b');
# i.e., for input of size N, runtime is estimated at:
for n in np.linspace(1, 100_000_000, 11, dtype=int):
    print('Runtime of sum(range({:>11,})) ~ {:>5.2f} s'.format(n, avg_slope*n/100))
We can also use polyfit to compute a best-fitting polynomial function of arbitrary degree for our data:
degree = 10
coeffs = np.polyfit(ns, ts, degree)
p = np.poly1d(coeffs)
plt.plot(ns, ts, 'or')
plt.plot(ns, [p(n) for n in ns], '-b');
Is there a downside to this approach?
# i.e., for input of size N, runtime is estimated at:
for n in np.linspace(1, 100_000_000, 11, dtype=int):
    print('Runtime of sum(range({:>11,})) ~ {:>5.2f} s'.format(n, p(n)/100))
Choosing an ill-fitting function will likely result in inaccurate runtime predictions. Worse, inaccuracies are compounded as input sizes grow large!
How do we know what class of function to use (e.g., linear, nth-degree polynomial, exponential) for modeling the runtime behavior of algorithms?
Can we reliably determine this through empirical observation?
What is the runtime behavior of list-indexing?
lst = list(range(1_000_000))
ns = np.linspace(0, len(lst), 1000, endpoint=False, dtype=int)
ts = [timeit.timeit('_ = lst[{}]'.format(n),
globals=globals(),
number=10000)
for n in ns]
plt.plot(ns, ts, 'or');
Observation: accessing an element in a list by index -- regardless of where in the list the element is located -- takes a uniform/constant amount of time.
How? A Python list uses an array as its underlying data storage mechanism. Every "slot" of an array is a reference (i.e., a fixed-width address) to an object, and to access an element at a particular index, the underlying code:

1. computes the slot's location by adding index × reference-width to the array's base address,
2. reads the reference stored in that slot, and
3. follows the reference to the target object.

Each of the steps above can be performed in constant time.
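One observable consequence of the fixed-width-reference layout: a list's own memory footprint depends only on its length, not on the objects it refers to. A quick check with sys.getsizeof (which reports the list's size, excluding the referenced objects):

```python
import sys

small_ints = [0] * 1000
big_strings = ['x' * 10_000] * 1000

# both lists hold 1000 fixed-width references, so the lists themselves
# occupy the same amount of memory even though the referenced objects
# differ wildly in size
sys.getsizeof(small_ints), sys.getsizeof(big_strings)
```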
What is the runtime behavior of searching for an element in an unsorted list?
def contains(lst, x):
    for y in lst:
        if x == y:
            return True
    else:
        return False
import random
lst = list(range(100))
random.shuffle(lst)
contains(lst, 10)
# runtimes when searching for a present element in a randomly shuffled list
ns = np.linspace(10, 10_000, 100, dtype=int)
ts = [timeit.timeit('contains(lst, 0)',
setup='lst=list(range({})); random.shuffle(lst)'.format(n),
globals=globals(),
number=100)
for n in ns]
plt.plot(ns, ts, 'or');
# runtimes when searching for an element that is not present
ns = np.linspace(1_000, 10_000, 100, dtype=int)
ts = [timeit.timeit('contains(lst, -1)',
setup='lst=list(range({}))'.format(n),
globals=globals(),
number=100)
for n in ns]
plt.plot(ns, ts, 'or');
What is the runtime behavior of searching for an element in a sorted list using binary search?
def contains(lst, x):
    lo = 0
    hi = len(lst)-1
    while lo <= hi:
        mid = (lo + hi) // 2
        if x < lst[mid]:
            hi = mid - 1
        elif x > lst[mid]:
            lo = mid + 1
        else:
            return True
    else:
        return False
lst = list(range(1000))
contains(lst, 10)
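The standard library's bisect module implements the same halving strategy; a cross-check of our contains against it (contains_bisect is a name of our own invention):

```python
import bisect

def contains_bisect(lst, x):
    # find the leftmost insertion point for x in the sorted list, then
    # check whether the element already at that position equals x
    i = bisect.bisect_left(lst, x)
    return i < len(lst) and lst[i] == x

lst = list(range(1000))
contains_bisect(lst, 10), contains_bisect(lst, -1)
```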
# runtimes when searching for different values in a fixed-size list
lst = list(range(1000))
ns = range(1000)
ts = [timeit.timeit('contains(lst, {})'.format(x),
globals=globals(),
number=1000)
for x in range(1000)]
plt.plot(ns, ts, 'or');
# runtimes when searching for an edge-value in lists of increasing size
ns = np.linspace(10, 10_000, 100, dtype=int)
ts = [timeit.timeit('contains(lst, 0)',
setup='lst=list(range({}))'.format(n),
globals=globals(),
number=1000)
for n in ns]
plt.plot(ns, ts, 'or');
What is the runtime behavior of insertion sort?
def insertion_sort(lst):
    for i in range(1, len(lst)):
        for j in range(i, 0, -1):
            if lst[j-1] > lst[j]:
                lst[j-1], lst[j] = lst[j], lst[j-1]
            else:
                break
import random
lst = list(range(1000))
random.shuffle(lst)
plt.plot(lst, 'og');
insertion_sort(lst)
plt.plot(lst, 'og');
# timings for a randomized list
ns = np.linspace(100, 2000, 15, dtype=int)
ts = [timeit.timeit('insertion_sort(lst)',
setup='lst=list(range({})); random.shuffle(lst)'.format(n),
globals=globals(),
number=1)
for n in ns]
plt.plot(ns, ts, 'or');
# timings for an already sorted list
ns = np.linspace(100, 2000, 15, dtype=int)
ts = [timeit.timeit('insertion_sort(lst)',
setup='lst=list(range({}))'.format(n),
globals=globals(),
number=1)
for n in ns]
plt.plot(ns, ts, 'or');
# timings for a reversed list
ns = np.linspace(100, 2000, 15, dtype=int)
ts = [timeit.timeit('insertion_sort(lst)',
setup='lst=list(reversed(range({})))'.format(n),
globals=globals(),
number=1)
for n in ns]
plt.plot(ns, ts, 'or');
ns = np.linspace(100, 2000, 15, dtype=int)
ts1 = [timeit.timeit('insertion_sort(lst)',
setup='lst=list(range({}))'.format(n),
globals=globals(),
number=1)
for n in ns]
ts2 = [timeit.timeit('insertion_sort(lst)',
setup='lst=list(range({})); random.shuffle(lst)'.format(n),
globals=globals(),
number=1)
for n in ns]
ts3 = [timeit.timeit('insertion_sort(lst)',
setup='lst=list(reversed(range({})))'.format(n),
globals=globals(),
number=1)
for n in ns]
plt.plot(ns, ts1, 'og');
plt.plot(ns, ts2, 'ob');
plt.plot(ns, ts3, 'or');
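The three curves correspond to insertion sort's best case (already sorted), average case (shuffled), and worst case (reversed). An instrumented variant that counts comparisons instead of measuring wall-clock time makes the gap concrete (insertion_sort_count is our own name for this sketch):

```python
import random

def insertion_sort_count(lst):
    # same algorithm as insertion_sort, but returns the number of
    # comparisons performed instead of relying on timing
    comparisons = 0
    for i in range(1, len(lst)):
        for j in range(i, 0, -1):
            comparisons += 1
            if lst[j-1] > lst[j]:
                lst[j-1], lst[j] = lst[j], lst[j-1]
            else:
                break
    return comparisons

n = 1000
shuffled = list(range(n))
random.shuffle(shuffled)
(insertion_sort_count(list(range(n))),            # sorted: n-1 comparisons
 insertion_sort_count(shuffled),                  # shuffled: ~n^2/4 on average
 insertion_sort_count(list(reversed(range(n)))))  # reversed: n(n-1)/2 comparisons
```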
The Collatz conjecture defines a series of numbers starting with any positive integer $n$, where subsequent terms in the series are computed with the following function:
$f(n) = \begin{cases} n/2 &\mbox{if $n$ is even} \\ 3n+1 & \mbox{if $n$ is odd} \end{cases}$
The conjecture is that regardless of the starting integer, the series ends in 1.
What is the runtime behavior of the Collatz series generating function, for increasing values of $n$?
def collatz(n):
    while n != 1:
        # print(n)
        if n % 2 == 0:
            n = n // 2
        else:
            n = 3*n + 1
collatz(9)
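Timing collatz really measures the length of the series it generates; a step-counting variant (collatz_steps, our own helper) makes that quantity explicit:

```python
def collatz_steps(n):
    # counts the terms produced before the series reaches 1; the runtime
    # of collatz(n) is proportional to this count
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3*n + 1
        steps += 1
    return steps

collatz_steps(9)  # the series starting at 9 takes 19 steps to reach 1
```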
ns = np.linspace(1, 100_000, 200, dtype=int)
ts = [timeit.timeit('collatz({})'.format(n),
globals=globals(),
number=100)
for n in ns]
plt.plot(ns, ts, 'or');
Proving the conjecture is an open research problem!