Between the array-backed and linked list we have:

- $O(1)$ indexing (array-backed)
- $O(1)$ appending (array-backed & linked)
- $O(1)$ insertion/deletion without indexing (linked)
- $O(\log N)$ binary search, when sorted (array-backed)

`dict`

¶In [1]:

```
import timeit
def lin_search(lst, x):
for i in range(len(lst)):
if lst[i] == x:
return i
raise ValueError(x)
def bin_search(lst, x):
# assume that lst is sorted!!!
low = 0
hi = len(lst)
mid = (low + hi) // 2
while lst[mid] != x and low <= hi:
if lst[mid] < x:
low = mid + 1
else:
hi = mid - 1
mid = (low + hi) // 2
if lst[mid] == x:
return mid
else:
raise ValueError(x)
def time_lin_search(size):
return timeit.timeit('lin_search(lst, random.randrange({}))'.format(size), # interpolate size into randrange
'import random ; from __main__ import lin_search ;'
'lst = [x for x in range({})]'.format(size), # interpolate size into list range
number=100)
def time_bin_search(size):
return timeit.timeit('bin_search(lst, random.randrange({}))'.format(size), # interpolate size into randrange
'import random ; from __main__ import bin_search ;'
'lst = [x for x in range({})]'.format(size), # interpolate size into list range
number=100)
def time_dict(size):
return timeit.timeit('dct[random.randrange({})]'.format(size),
'import random ; '
'dct = {{x: x for x in range({})}}'.format(size),
number=100)
lin_search_timings = [time_lin_search(n)
for n in range(10, 10000, 100)]
bin_search_timings = [time_bin_search(n)
for n in range(10, 10000, 100)]
dict_timings = [time_dict(n)
for n in range(10, 10000, 100)]
```

In [2]:

```
%matplotlib inline
import matplotlib.pyplot as plt
# plt.plot(lin_search_timings, 'ro')
plt.plot(bin_search_timings, 'gs')
plt.plot(dict_timings, 'b^')
plt.show()
```

In [3]:

```
class LookupDS:
def __init__(self):
self.data = []
def __setitem__(self, key, value):
self.data.append((key, value))
def __getitem__(self, key):
for k, v in self.data:
if k == key:
return v
raise KeyError
def __contains__(self, key):
try:
_ = self[key]
return True
except:
return False
```

Demonstrates the relevant APIs, but is $O(N)$ for lookup (via `__getitem__`

) operations.

Hashes (a.k.a. hash codes or hash values) are simply numerical values computed for objects.

In [4]:

```
hash('hello')
```

Out[4]:

In [5]:

```
[hash(s) for s in ['different', 'objects', 'have', 'very', 'different', 'hashes']]
```

Out[5]:

Note that a given object will always hash to the same value (e.g., the string `'different'`

).

In [6]:

```
hash('hello') % 100
```

Out[6]:

In [7]:

```
# reminder of the semantics of the modulus operator
n = 7
for x in range(20):
print('{} % {} = {}'.format(x, n, x % n))
```

In [8]:

```
[hash(s) % 100 for s in ['different', 'objects', 'have', 'very', 'different', 'hashes']]
```

Out[8]:

In [9]:

```
class Hashtable:
def __init__(self, n_buckets=1000):
self.buckets = [None] * n_buckets
def __setitem__(self, key, val):
bucket_idx = hash(key) % len(self.buckets)
self.buckets[bucket_idx] = (key, val)
def __getitem__(self, key):
bucket_idx = hash(key) % len(self.buckets)
tup = self.buckets[bucket_idx]
if tup and tup[0] == key:
return tup[1]
raise KeyError
def __contains__(self, key):
try:
_ = self[key]
return True
except:
return False
```

In [10]:

```
h = Hashtable(10)
h.buckets
```

Out[10]:

In [11]:

```
h['batman'] = 'bruce wayne'
h['superman'] = 'clark kent'
```

In [12]:

```
h['batman']
```

Out[12]:

In [13]:

```
h['superman']
```

Out[13]:

In [14]:

```
h.buckets
```

Out[14]:

In [15]:

```
h = Hashtable(1)
h['batman'] = 'bruce wayne'
h['superman'] = 'clark kent'
```

In [16]:

```
h['batman']
```

In [17]:

```
h.buckets
```

Out[17]:

Given a limited number of buckets, more than one key will inevitably map to the same bucket, resulting in a **collision**.

How likely are collisions to happen before they become an absolute certainty (i.e., when keys outnumber buckets)?

Problem statement: Given $N$ people at a party, how likely is it that at least two people will have the same birthday?

This is easiest to calculate by considering the inverse: what is the probability that everyone has a unique birthday?

For $1$ person, the probability is 1 (100%).

For $2$ people, it is simply the likelihood that the second person is born on any of the other 364 days of the year, i.e., $\frac{364}{365}$.

For $3$ people, the third person has to be born on one of the remaining 363 days. We combine ("and" together) probabilities by multiplying them, so the likelihood that three people are born on different days is $1 \times \frac{364}{365} \times \frac{363}{365}$.

For $N \le 365$ people, the likelihood that everyone is born on a different day is simply: $\frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{365-(N-1)}{365}$

Having computed the inverse probability, we can determine how likely it is that two or more people will have the same birthday by simply subtracting the inverse from 1. The function below does this:

In [18]:

```
def birthday_p(n_people):
p_inv = 1
for n in range(365, 365-n_people, -1):
p_inv *= n / 365
return 1 - p_inv
```

In [19]:

```
birthday_p(2)
```

Out[19]:

In [20]:

```
1 - 364/365 # note equivalence to above
```

Out[20]:

In [21]:

```
birthday_p(23) # 50 percent likelihood with 23 people!
```

Out[21]:

In [22]:

```
birthday_p(57) # over 99% likelihood with 57 people!
```

Out[22]:

In [23]:

```
%matplotlib inline
import matplotlib.pyplot as plt
n_people = range(1, 80)
plt.plot(n_people, [birthday_p(n) for n in n_people])
plt.show()
```

In [24]:

```
def collision_p(n_values, n_buckets):
p_inv = 1
for n in range(n_buckets, n_buckets-n_values, -1):
p_inv *= n / n_buckets
return 1 - p_inv
```

In [25]:

```
collision_p(23, 365) # same as birthday problem, for 23 people
```

Out[25]:

In [26]:

```
collision_p(10, 100)
```

Out[26]:

In [27]:

```
collision_p(100, 1000)
```

Out[27]:

In [28]:

```
# keeping number of values fixed at 100, but vary number of buckets: visualize probability of collision
%matplotlib inline
import matplotlib.pyplot as plt
n_buckets = range(100, 100001, 1000)
plt.plot(n_buckets, [collision_p(100, nb) for nb in n_buckets])
plt.show()
```

In [29]:

```
def avg_num_collisions(n, b):
"""Returns the expected number of collisions for n values uniformly distributed
over a hashtable of b buckets. Based on (fairly) elementary probability theory.
(Pay attention in MATH 474!)"""
return n - b + b * (1 - 1/b)**n
```

In [30]:

```
avg_num_collisions(28, 365)
```

Out[30]:

In [31]:

```
avg_num_collisions(1000, 1000)
```

Out[31]:

In [32]:

```
avg_num_collisions(1000, 10000)
```

Out[32]:

In [33]:

```
class Hashtable:
class Node:
def __init__(self, key, val, next=None):
self.key = key
self.val = val
self.next = next
def __init__(self, n_buckets=1000):
self.buckets = [None] * n_buckets
def __setitem__(self, key, val):
bucket_idx = hash(key) % len(self.buckets)
self.buckets[bucket_idx] = Hashtable.Node(key, val, self.buckets[bucket_idx])
def __getitem__(self, key):
bucket_idx = hash(key) % len(self.buckets)
n = self.buckets[bucket_idx]
while n:
if n.key == key:
return n.val
raise KeyError
def __contains__(self, key):
try:
_ = self[key]
return True
except:
return False
```

In [34]:

```
h = Hashtable(1)
h['batman'] = 'bruce wayne'
h['superman'] = 'clark kent'
```

In [35]:

```
n = h.buckets[0]
while n:
print(n.key, n.val)
n = n.next
```

In [36]:

```
def prep_ht(size):
ht = Hashtable(size*10)
for x in range(size):
ht[x] = x
return ht
def time_ht(size):
return timeit.timeit('ht[random.randrange({})]'.format(size),
'import random ; from __main__ import prep_ht ;'
'ht = prep_ht({})'.format(size),
number=100)
ht_timings = [time_ht(n)
for n in range(10, 10000, 100)]
```

In [37]:

```
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(ht_timings, 'gs')
plt.plot(dict_timings, 'b^')
plt.show()
```

In [43]:

```
class Hashtable(Hashtable):
def __iter__(self):
for b in self.buckets:
if not b:
next
else:
while b:
yield (b.key, b.val)
b = b.next
```

In [44]:

```
h = Hashtable()
for k, v in (('batman', 'bruce wayne'), ('superman', 'clark kent'), ('spiderman', 'peter parker')):
h[k] = v
```

In [47]:

```
for k, v in h:
print(k, ' => ', v)
```

(Why out of order?)

It doesn't often make sense to start with a large number of buckets, unless we know in advance that the number of keys is going to be vast — also, the user of the hashtable would typically prefer to not be bothered with implementation details (i.e., bucket count) when using the data structure.

Instead: start with a relatively small number of buckets, and if the ratio of keys to the number of buckets (known as the **load factor**) is above some desired threshold — which we can determine using collision probabilities — we can dynamically increase the number of buckets. This requires, however, that we *rehash* all keys and potentially move them into new buckets (since the `hash(key) % num_buckets`

mapping will likely be different with more buckets).

- FIXED
`__setitem__`

(to update value for existing key) `__delitem__`

`keys`

&`values`

(return iterators for keys and values)`setdefault`

For a hashtable with $N$ key/value entries:

- Insertion: $O(N)$ (pathological case: if all keys hash to the same bucket = linear search!)
- Lookup: $O(N)$ (same as above)
- Deletion: $O(N)$ (same as above)

*But*, if hash codes are uniformly distributed across buckets, we can predict likelihood of collisions using birthday-problem-like analysis. With no (or minimal) collisions, lookups (and deletions) are practically constant time! Tradeoff: require a relatively large number of buckets — many of which aren't populated — and a concomitantly low load factor to pull this off.

- hashtable
- hashing and hashes
- collision
- hash buckets & chains
- birthday problem
- load factor
- rehashing

Remember: *a given object must always hash to the same value*. This is required so that we can always map the object to the same hash bucket.

Hashcodes for collections of objects are usually computed from the hashcodes of its contents, e.g., the hash of a tuple is a function of the hashes of the objects in said tuple:

In [68]:

```
hash(('two', 'strings'))
```

Out[68]:

This is useful. It allows us to use a tuple, for instance, as a key for a hashtable.

However, if the collection of objects is *mutable* — i.e., we can alter its contents — this means that we can potentially change its hashcode.`

If we were to use such a collection as a key in a hashtable, and alter the collection after it's been assigned to a particular bucket, this leads to a serious problem: the collection may now be in the wrong bucket (as it was assigned to a bucket based on its original hashcode)!

For this reason, only immutable types are, by default, hashable in Python. So while we can use integers, strings, and tuples as keys in dictionaries, lists (which are mutable) cannot be used. Indeed, Python marks built-in mutable types as "unhashable", e.g.,

In [70]:

```
hash([1, 2, 3])
```

In [89]:

```
class Student:
def __init__(self, fname, lname):
self.fname = fname
self.lname = lname
```

In [90]:

```
s = Student('John', 'Doe')
hash(s)
```

Out[90]:

In [91]:

```
s.fname = 'Jane'
hash(s) # same as before mutation
```

Out[91]:

We can change the default behavior by providing our own hash function in `__hash__`

, e.g.,

In [92]:

```
class Student:
def __init__(self, fname, lname):
self.fname = fname
self.lname = lname
def __hash__(self):
return hash(self.fname) + hash(self.lname)
```

In [93]:

```
s = Student('John', 'Doe')
hash(s)
```

Out[93]:

In [94]:

```
s.fname = 'Jane'
hash(s)
```

Out[94]: