Timing and Plotting

Agenda

  1. Timing
  2. Building a timing utility
  3. The timeit module
  4. Drawing plots with matplotlib
  5. Plotting timings
  6. Timing examples
    • list indexing
    • linear search
    • binary search
    • insertion sort
    • Collatz conjecture
  7. Takeaways

1. Timing

The time module contains functions for obtaining and interpreting the current system time.

In [1]:
import time
time.time()
Out[1]:
1590552448.431144
In [2]:
time.localtime(time.time())
Out[2]:
time.struct_time(tm_year=2020, tm_mon=5, tm_mday=26, tm_hour=23, tm_min=7, tm_sec=29, tm_wday=1, tm_yday=147, tm_isdst=1)

By taking start and stop "timestamps", we can measure the runtime of code:

In [3]:
start = time.time()
time.sleep(0.5)
end = time.time()
end - start
Out[3]:
0.5038840770721436

2. Building a timing utility

We can build a utility function for timing the execution of a passed-in function:

In [4]:
def timeit(fn):
    start = time.time()
    fn()
    end = time.time()
    return end - start
In [5]:
sum(range(10_000))
Out[5]:
49995000
In [6]:
timeit(lambda: sum(range(10_000)))
Out[6]:
0.00039696693420410156

To make timings more stable, we can run the passed-in function multiple times:

In [7]:
def timeit(fn, number=1):
    total = 0
    for _ in range(number):
        start = time.time()
        fn()
        end = time.time()
        total += end - start
    return total
In [8]:
timeit(lambda: sum(range(10_000)), number=1000)
Out[8]:
0.20748686790466309

Often, we want to time just a portion of a function, or we need to run some setup code (that we don't wish to time) before running the code we want to time ...
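One way our homemade utility could accommodate this is to accept a separate, untimed setup callable. A minimal sketch (the timeit_with_setup name and its setup parameter are our own invention, not part of any library):

import time, random

def timeit_with_setup(fn, setup=lambda: None, number=1):
    total = 0
    for _ in range(number):
        setup()                   # untimed setup work, run before each timed call
        start = time.time()
        fn()                      # only this call is timed
        end = time.time()
        total += end - start
    return total

# e.g., time list.sort on a freshly shuffled list, without timing the shuffle
lst = list(range(10_000))
timeit_with_setup(lambda: lst.sort(), setup=lambda: random.shuffle(lst), number=100)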

3. The timeit module

The timeit module is a built-in library for measuring the execution of code passed in as a string. It also supports passing in "setup" code that is not timed.

In [9]:
import timeit
timeit.timeit('sum(r)', setup='r = range(10_000)', number=1000)
Out[9]:
0.20611036699999907
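The same measurement can also be written using the module's Timer class, which bundles the statement and setup strings into a reusable object; a minimal equivalent of the call above:

t = timeit.Timer('sum(r)', setup='r = range(10_000)')
t.timeit(number=1000)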

We can easily use this to gather timings for multiple input values:

In [10]:
[timeit.timeit('sum(r)',
               setup='r = range({})'.format(n),
               number=1000)
 for n in range(1000, 10_000, 1000)]
Out[10]:
[0.03044588699999906,
 0.049285558999999424,
 0.050086780000000886,
 0.06476718200000064,
 0.07875248400000068,
 0.09941238999999946,
 0.11709398100000001,
 0.1421522520000007,
 0.1569357680000003]

Sometimes we might want to make use of functions defined in our notebook in the timed/setup code passed to timeit. We need to use the globals argument for this:

In [11]:
def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)
In [12]:
[timeit.timeit('fib({})'.format(n),
               number=100,
               globals=globals()) # recall: "globals()" returns a dictionary of everything
                                  # defined in this module; timeit needs it to access `fib`
 for n in range(1, 11)]
Out[12]:
[2.2306999998278343e-05,
 6.963099999879319e-05,
 0.00017223500000085323,
 0.0003711939999995195,
 0.0008683090000012328,
 0.0019842690000011487,
 0.003189415000001361,
 0.006689989000001617,
 0.013757366000000104,
 0.022243749999997675]

4. Drawing plots with matplotlib

The matplotlib library supports the creation of all sorts of visualizations. We will use it for drawing simple 2-dimensional plots.

The primary plotting function we will use is matplotlib.pyplot.plot (see the matplotlib documentation for full details), which, when passed two "array-like" objects of equal length, will interpret and plot their contents as x- and y-axis coordinates. We will generally use tuples, lists, and ranges as array-like objects. Note that generators are not considered array-like by matplotlib (see the sketch after the examples below).

Some examples (note that we use a semicolon after the call to plot to hide its return value):

In [85]:
import matplotlib.pyplot as plt
import numpy as np
import math

%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 6] # set size of plot
In [86]:
plt.plot([1, 2, 3, 4, 5], [50, 20, 30, 10, 40]);
In [87]:
xs = range(11)
ys = [x*2 for x in xs]
plt.plot(xs, ys);
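As noted above, generators are not array-like, so data produced by a generator expression should be materialized into a list (or tuple) first; a minimal sketch:

ys_gen = (x*2 for x in xs)       # a generator expression: not array-like
plt.plot(xs, list(ys_gen));      # convert to a list before passing to plot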

We can also provide an optional format string to plot, which controls marker, line-style, and color for the plot.

Here's a shortened list of options copied from the full documentation of plot:

Markers

  • . : point marker
  • o : circle marker
  • s : square marker
  • d : diamond marker

Line-styles

  • - : solid line style
  • -- : dashed line style
  • : : dotted line style

Colors

  • k : black
  • r : red
  • g : green
  • b : blue
  • y : yellow
  • c : cyan
  • m : magenta

Here are the above plots with some color and styling (if we omit a line style, no connecting line is drawn between the data points):

In [88]:
plt.plot([1, 2, 3, 4, 5], [50, 20, 30, 10, 40], 'o--r');
In [89]:
xs = range(11)
ys = [x*2 for x in xs]
plt.plot(xs, ys, 'dg');

Instead of regular range objects, which only allow for integral start/stop/step values, we typically prefer to use the numpy library's arange and linspace functions with matplotlib. arange is like range, except we can use floating point values for start/stop/step. linspace lets us specify start and stop values (both inclusive), and the number of values to return in that interval.

Examples of arange and linspace calls (note that both functions return numpy arrays, which are iterable and can be passed to plot):

In [18]:
np.arange(0.5, 2.5, 0.1)
Out[18]:
array([0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7,
       1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4])
In [19]:
np.linspace(10, 20, 41)
Out[19]:
array([10.  , 10.25, 10.5 , 10.75, 11.  , 11.25, 11.5 , 11.75, 12.  ,
       12.25, 12.5 , 12.75, 13.  , 13.25, 13.5 , 13.75, 14.  , 14.25,
       14.5 , 14.75, 15.  , 15.25, 15.5 , 15.75, 16.  , 16.25, 16.5 ,
       16.75, 17.  , 17.25, 17.5 , 17.75, 18.  , 18.25, 18.5 , 18.75,
       19.  , 19.25, 19.5 , 19.75, 20.  ])
In [20]:
np.linspace(1, 100_000, 50, dtype=int) # we can specify the data type to coerce values into integers
Out[20]:
array([     1,   2041,   4082,   6123,   8164,  10204,  12245,  14286,
        16327,  18368,  20408,  22449,  24490,  26531,  28572,  30612,
        32653,  34694,  36735,  38776,  40816,  42857,  44898,  46939,
        48980,  51020,  53061,  55102,  57143,  59184,  61224,  63265,
        65306,  67347,  69388,  71428,  73469,  75510,  77551,  79592,
        81632,  83673,  85714,  87755,  89796,  91836,  93877,  95918,
        97959, 100000])

plot can be called multiple times in the same cell to draw multiple lines in the same chart. Below we use this facility together with linspace and a handful of list comprehensions to plot some common runtime complexity bounding functions (more on that soon) over a small interval:

In [90]:
count = 100
xs = np.linspace(0.1, 4, count)
ys_const        = [1] * count
ys_log          = [math.log(x) for x in xs]
ys_linear       = [x for x in xs]
ys_linearithmic = [x * math.log(x) for x in xs]
ys_quadratic    = [x**2 for x in xs]
ys_exponential  = [2**x for x in xs]

plt.plot(xs, ys_const, 'c')
plt.plot(xs, ys_log, 'r')
plt.plot(xs, ys_linear, 'b')
plt.plot(xs, ys_linearithmic, 'g')
plt.plot(xs, ys_quadratic, 'y');
plt.plot(xs, ys_exponential, 'm');

5. Plotting timings

Plotting timing data collected from functions may help give us a sense of how their runtimes scale with increasing input sizes.

In [99]:
ns = np.linspace(10, 10_000, 50, dtype=int)
ts = [timeit.timeit('sum(range({}))'.format(n), number=100)
      for n in ns]

plt.plot(ns, ts, 'or');

Clearly, the runtime of sum is directly proportional to the number of values it operates on.

If we assume a linear relationship, we can compute the average slope between adjacent data points to come up with an approximate line of fit that may help us predict the runtime of sum.

In [100]:
total = 0
for i in range(len(ns)-1):
    x0, x1 = ns[i:i+2]
    y0, y1 = ts[i:i+2]
    total += (y1-y0) / (x1-x0) # recall: slope is (rise/run)
avg_slope = total / (len(ns)-1)
In [101]:
plt.plot(ns, ts, 'or')
plt.plot(ns, [avg_slope*n for n in ns], '--b');
In [102]:
# i.e., for input of size N, runtime is estimated at:
for n in np.linspace(1, 100_000_000, 11, dtype=int):
    print('Runtime of sum(range({:>11,})) ~ {:>5.2f} s'.format(n, avg_slope*n/100))
Runtime of sum(range(          1)) ~  0.00 s
Runtime of sum(range( 10,000,000)) ~  0.16 s
Runtime of sum(range( 20,000,000)) ~  0.33 s
Runtime of sum(range( 30,000,000)) ~  0.49 s
Runtime of sum(range( 40,000,000)) ~  0.65 s
Runtime of sum(range( 50,000,000)) ~  0.81 s
Runtime of sum(range( 60,000,000)) ~  0.98 s
Runtime of sum(range( 70,000,000)) ~  1.14 s
Runtime of sum(range( 80,000,000)) ~  1.30 s
Runtime of sum(range( 90,000,000)) ~  1.46 s
Runtime of sum(range(100,000,000)) ~  1.63 s

We can also use polyfit to compute a best-fitting polynomial function of arbitrary degree for our data:

In [103]:
degree = 10
coeffs = np.polyfit(ns, ts, degree)
p = np.poly1d(coeffs)
plt.plot(ns, ts, 'or')
plt.plot(ns, [p(n) for n in ns], '-b');

Is there a downside to this approach?

In [104]:
# i.e., for input of size N, runtime is estimated at:
for n in np.linspace(1, 100_000_000, 11, dtype=int):
    print('Runtime of sum(range({:>11,})) ~ {:>5.2f} s'.format(n, p(n)/100))
Runtime of sum(range(          1)) ~  0.00 s
Runtime of sum(range( 10,000,000)) ~ -1061166038612148214858296328192.00 s
Runtime of sum(range( 20,000,000)) ~ -1089385486750761257780562888753152.00 s
Runtime of sum(range( 30,000,000)) ~ -62872419687240159361226066408505344.00 s
Runtime of sum(range( 40,000,000)) ~ -1116941744122669540916974065576574976.00 s
Runtime of sum(range( 50,000,000)) ~ -10404960494037728402994110768519053312.00 s
Runtime of sum(range( 60,000,000)) ~ -64435630500995599779798973663525470208.00 s
Runtime of sum(range( 70,000,000)) ~ -301055283557341317105883636077512622080.00 s
Runtime of sum(range( 80,000,000)) ~ -1144471359156596341124961755651702259712.00 s
Runtime of sum(range( 90,000,000)) ~ -3716726857822636185875970114759235207168.00 s
Runtime of sum(range(100,000,000)) ~ -10660067275857277685396062067567789867008.00 s

Choosing an ill-fitting model (here, a degree-10 polynomial that overfits the noise in our measurements) will likely result in wildly inaccurate runtime predictions. Worse, the inaccuracies compound as input sizes grow beyond the measured range!
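For comparison (our own addition, assuming ns and ts from the cells above are still defined), a degree-1 fit to the same data extrapolates much more sensibly:

lin_p = np.poly1d(np.polyfit(ns, ts, 1))   # degree-1 (linear) least-squares fit
for n in np.linspace(1, 100_000_000, 5, dtype=int):
    # divide by 100 because each timing above ran the statement 100 times
    print('Runtime of sum(range({:>11,})) ~ {:>5.2f} s'.format(n, lin_p(n)/100))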

How do we know what class of function to use (e.g., linear, nth-degree polynomial, exponential) for modeling the runtime behavior of algorithms?

Can we reliably determine this through empirical observation?

6. Timing examples

Built-in list indexing

What is the runtime behavior of list-indexing?

In [135]:
lst = list(range(1_000_000))
ns = np.linspace(0, len(lst), 1000, endpoint=False, dtype=int)
ts = [timeit.timeit('_ = lst[{}]'.format(n),
                    globals=globals(), 
                    number=10000) 
      for n in ns]

plt.plot(ns, ts, 'or');

Observation: accessing an element in a list by index -- regardless of where in the list the element is located -- takes a uniform/constant amount of time.

How? A Python list uses an array as its underlying data storage mechanism. Every "slot" of an array is a reference (i.e., a fixed-width address) to an object, and to access an element at a particular index, the underlying code:

  1. Computes an offset into the array by multiplying the index by the size of a reference
  2. Adds the computed offset to the base address of the array, giving us the address of the reference
  3. Accesses the reference and uses it to load the associated element

Each of the steps above can be performed in constant time.
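A minimal sketch of that address arithmetic with made-up numbers (an illustration only, not actual CPython internals):

REF_SIZE = 8                        # assume 8-byte references on a 64-bit machine
BASE_ADDRESS = 0x1000               # hypothetical base address of the array's storage

def address_of_slot(index):
    offset = index * REF_SIZE       # step 1: offset = index * size of a reference
    return BASE_ADDRESS + offset    # step 2: base address + offset = address of the slot

hex(address_of_slot(0)), hex(address_of_slot(10))   # ('0x1000', '0x1050')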

Linear search

What is the runtime behavior of searching for an element in an unsorted list?

In [106]:
def contains(lst, x):
    for y in lst:
        if x == y:
            return True
    else:
        return False
In [107]:
import random
lst = list(range(100))
random.shuffle(lst)

contains(lst, 10)
Out[107]:
True
In [108]:
# runtimes when searching for a present element in a randomly shuffled list

ns = np.linspace(10, 10_000, 100, dtype=int)
ts = [timeit.timeit('contains(lst, 0)', 
                    setup='lst=list(range({})); random.shuffle(lst)'.format(n),
                    globals=globals(),
                    number=100)
      for n in ns]

plt.plot(ns, ts, 'or');
In [109]:
# runtimes when searching for an element that is not present

ns = np.linspace(1_000, 10_000, 100, dtype=int)
ts = [timeit.timeit('contains(lst, -1)', 
                    setup='lst=list(range({}))'.format(n),
                    globals=globals(),
                    number=100)
      for n in ns]

plt.plot(ns, ts, 'or');

Binary search

What is the runtime behavior of searching for an element in a sorted list using binary search?

In [110]:
def contains(lst, x):
    lo = 0
    hi = len(lst)-1
    while lo <= hi:
        mid = (lo + hi) // 2
        if x < lst[mid]:
            hi = mid - 1
        elif x > lst[mid]:
            lo = mid + 1
        else:
            return True
    else:
        return False
In [111]:
lst = list(range(1000))
contains(lst, 10)
Out[111]:
True
In [113]:
# runtimes when searching for different values in a fixed-size list

lst = list(range(1000))
ns = range(1000)
ts = [timeit.timeit('contains(lst, {})'.format(x), 
                    globals=globals(), 
                    number=1000)
      for x in range(1000)]

plt.plot(ns, ts, 'or');
In [114]:
# runtimes when searching for an edge-value in lists of increasing size

ns = np.linspace(10, 10_000, 100, dtype=int)
ts = [timeit.timeit('contains(lst, 0)', 
                    setup='lst=list(range({}))'.format(n),
                    globals=globals(),
                    number=1000)
      for n in ns]

plt.plot(ns, ts, 'or');

Insertion sort

What is the runtime behavior of insertion sort?

In [115]:
def insertion_sort(lst):
    for i in range(1, len(lst)):
        for j in range(i, 0, -1):
            if lst[j-1] > lst[j]:
                lst[j-1], lst[j] = lst[j], lst[j-1]
            else:
                break
In [116]:
import random
lst = list(range(1000))
random.shuffle(lst)
plt.plot(lst, 'og');
In [117]:
insertion_sort(lst)
plt.plot(lst, 'og');
In [119]:
# timings for a randomized list

ns = np.linspace(100, 2000, 15, dtype=int)
ts = [timeit.timeit('insertion_sort(lst)',
                    setup='lst=list(range({})); random.shuffle(lst)'.format(n),
                    globals=globals(),
                    number=1)
         for n in ns]

plt.plot(ns, ts, 'or');
In [125]:
# timings for an already sorted list

ns = np.linspace(100, 2000, 15, dtype=int)
ts = [timeit.timeit('insertion_sort(lst)',
                    setup='lst=list(range({}))'.format(n),
                    globals=globals(),
                    number=1)
         for n in ns]

plt.plot(ns, ts, 'or');
In [126]:
# timings for a reversed list

ns = np.linspace(100, 2000, 15, dtype=int)
ts = [timeit.timeit('insertion_sort(lst)',
                    setup='lst=list(reversed(range({})))'.format(n),
                    globals=globals(),
                    number=1)
         for n in ns]

plt.plot(ns, ts, 'or');
In [127]:
ns = np.linspace(100, 2000, 15, dtype=int)
ts1 = [timeit.timeit('insertion_sort(lst)',
                     setup='lst=list(range({}))'.format(n),
                     globals=globals(),
                     number=1)
       for n in ns]
ts2 = [timeit.timeit('insertion_sort(lst)',
                     setup='lst=list(range({})); random.shuffle(lst)'.format(n),
                     globals=globals(),
                     number=1)
       for n in ns]

ts3 = [timeit.timeit('insertion_sort(lst)',
                     setup='lst=list(reversed(range({})))'.format(n),
                     globals=globals(),
                     number=1)
       for n in ns]

plt.plot(ns, ts1, 'og');
plt.plot(ns, ts2, 'ob');
plt.plot(ns, ts3, 'or');

Collatz conjecture

The Collatz conjecture concerns the sequence of numbers obtained by starting with any positive integer $n$ and repeatedly applying the following function:

$f(n) = \begin{cases} n/2 &\mbox{if $n$ is even} \\ 3n+1 & \mbox{if $n$ is odd} \end{cases}$

The conjecture is that regardless of the starting integer, the sequence eventually reaches 1.

What is the runtime behavior of the Collatz sequence-generating function for increasing values of $n$?

In [132]:
def collatz(n):
    while n != 1:
        # print(n)  # enabled when producing the trace below; commented out so printing doesn't skew the timings
        if n%2 == 0:
            n = n // 2
        else:
            n = 3*n + 1
In [131]:
collatz(9)
9
28
14
7
22
11
34
17
52
26
13
40
20
10
5
16
8
4
2
In [134]:
ns = np.linspace(1, 100_000, 200, dtype=int)
ts = [timeit.timeit('collatz({})'.format(n),
                     globals=globals(),
                     number=100)
      for n in ns]

plt.plot(ns, ts, 'or');

Proving the conjecture is an open research problem!

7. Takeaways

  • timing and plotting libraries allow us to systematically measure and visualize the runtime behavior of algorithms over different inputs
  • different characteristics of input (e.g., shuffled, ordered, reversed) can have a profound impact on the runtime of algorithms
  • empirical runtime measurements do not always paint a clear, accurate, or consistent picture of the long-term runtime behavior of a function
  • choosing the wrong class of function to describe the runtime behavior of an algorithm can result in disastrously wrong predictions
  • timing results are useful, but we need a more systematic and rigorous way of describing and comparing the runtime behavior of algorithms!