Key Word(s): Dynamic arrays, memory layouts, I/O
In Python you can append to lists. So what's an ob_size doing in our struct then?
typedef struct {
    long ob_refcnt;
    PyTypeObject *ob_type;
    Py_ssize_t ob_size;
    PyObject **ob_item;
    long allocated;
} PyListObject;
It turns out Python lists are implemented as something called a dynamic array.
Big O Notation¶
- Part of a larger notion of limiting behavior of functions
- Including Small O, Big Theta, Big Omega, Small Omega
- For us, it just means "bounded above"
- More strictly: $$f\left(n\right) = \mathcal{O}\left(g\left(n\right)\right)$$ means $\left|f\right|$ is bounded above by $g$ asymptotically.
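For reference, here is the standard formal definition (stated for completeness; the notes above only gloss it):
$$f\left(n\right) = \mathcal{O}\left(g\left(n\right)\right) \iff \exists\, C > 0,\ n_{0} \text{ such that } \left|f\left(n\right)\right| \leq C\, g\left(n\right) \text{ for all } n \geq n_{0}.$$
The constant $C$ is why we can drop constant factors when reporting running times.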
Example: Binary Search Trees¶
- We have stated that the time-complexity of BST operations (e.g. search) is $\mathcal{O}\left(\log\left(n\right)\right)$
- This means that the time-complexity is bounded above by $\log\left(n\right)$, up to a constant factor.
A complete binary tree has $$n = 1 + 2 + \ldots + 2^{h-1} + 2^{h}$$ nodes where $h$ is the height of the tree.
We can write this as $$n=2^{h+1} - 1.$$
Solving for the height gives $$h = \log_{2}\left(n+1\right) - 1.$$
So the height scales like $\log_{2}\left(n\right)$.
Going just a little bit further we have \begin{align*} h &= \log_{2}\left(n+1\right) - 1 \\ & < \log_{2}\left(n+1\right). \end{align*}
So the height is bounded from above by $\log_{2}\left(n+1\right)$.
But asymptotically, as $n\to\infty$, $\log_{2}\left(n+1\right)$ behaves like $\log_{2}\left(n\right)$.
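As a quick numerical sanity check (a sketch of ours, not part of the original notes), we can confirm that $h = \log_{2}(n+1) - 1$ holds exactly for complete trees:
import math

# For a complete binary tree of height h there are n = 2**(h+1) - 1 nodes,
# so log2(n + 1) - 1 should recover h exactly.
for h in range(6):
    n = 2**(h + 1) - 1
    print(h, n, math.log2(n + 1) - 1)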
Arrays¶
A static array is a contiguous slab of memory of known size, such that $n$ items can fit in it. This is a great data structure. Why?
- constant time index access: a[i] is $\mathcal{O}(1)$, since we just go directly to offset i * sizeof(int) from the start of the array
- linear time traversal or search: $1$ unit of work per loop iteration means $\mathcal{O}(n)$ for the loop
- locality in memory: it's one int after another
Tuples in Python are fixed-size static arrays.
But the big problem is: what if we want to add something beyond the end of the array?
Then we must use dynamic arrays.
Note that this shifts some of the focus to space complexity (rather than just time complexity).
Dynamic Arrays¶
What Python does is first create, on the heap, a fixed-size array of PyObject* pointers (the ob_item array in the struct above). Then, as you append, it uses its own algorithm to figure out when to expand the size of the array.
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* Add padding to make the allocated size multiple of 4.
* The growth pattern is: 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, ...
* Note: new_allocated won't overflow because the largest possible value
* is PY_SSIZE_T_MAX * (9 / 8) + 6 which always fits in a size_t.
*/
new_allocated = ((size_t)newsize + (newsize >> 3) + 6) & ~(size_t)3;
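To see what this formula does, here is a small Python sketch of ours (list_growth is our hypothetical name, mirroring the C expression above rather than CPython's actual implementation) that simulates repeated appends and records each new allocation:
def list_growth(newsize):
    # Mirror of the C expression: over-allocate by ~1/8 plus a small
    # constant, then round down to a multiple of 4.
    return (newsize + (newsize >> 3) + 6) & ~3

allocated = 0
growth = []
for n in range(1, 70):       # append the n-th element
    if n > allocated:        # no room left: grow the array
        allocated = list_growth(n)
        growth.append(allocated)
print(growth)                # [4, 8, 16, 24, 32, 40, 52, 64, 76]
After the initial empty allocation of $0$, this reproduces the growth pattern $4, 8, 16, 24, 32, 40, 52, 64, 76, \ldots$ documented in the comment.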
alist = [1, 2, 3, 4]
alist.append(5)
alist
[1, 2, 3, 4, 5]
import sys
sys.getsizeof([]), sys.getsizeof([1]), sys.getsizeof([1, 1]), sys.getsizeof([1, 1, 1])
(72, 80, 88, 96)
An empty list is 72 bytes. Each int adds an 8-byte pointer (the list stores a pointer to the int object, not the int itself).
aa = []
sys.getsizeof(aa)
72
aa.append(1)
sys.getsizeof(aa)
104
aa.append(1)
sys.getsizeof(aa)
104
aa = []
for j in range(20):
    aa.append(1)
    print(aa, " ", sys.getsizeof(aa))
[1]   104
[1, 1]   104
[1, 1, 1]   104
[1, 1, 1, 1]   104
[1, 1, 1, 1, 1]   136
[1, 1, 1, 1, 1, 1]   136
[1, 1, 1, 1, 1, 1, 1]   136
[1, 1, 1, 1, 1, 1, 1, 1]   136
[1, 1, 1, 1, 1, 1, 1, 1, 1]   200
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   200
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   200
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   200
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   200
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   200
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   200
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   200
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   272
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   272
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   272
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   272
Note: The growth factor is less than $2$! This is quite technical, but here is a reference: https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor.
Performance of Dynamic Arrays¶
Let's assume we start with an array of size $1$ (one slot) and then double the size each time it fills up. After $d$ doublings, we have an array with $2^{d}$ slots.
The cartoons below are meant to help you see how to count the number of movements. The variable $d$ represents the number of doublings.
Doublings¶
After $4$ doublings, the array looks something like the figure below. The numbers in each cell represent the number of times that cell was moved (or copied); they are not the values stored in the array.
It then takes $\log(n)$ doublings for the array to have $n$ slots (note, $\log(n)$ means $\log_{2}(n)$).
$$s = 2^{d} \qquad \text{# of slots}$$
$$d = \log_{2}(s) \qquad \text{# of doublings}$$
$$d_{n} = \log_{2}(n) \qquad \text{# of doublings to reach } n \text{ slots}$$
Notice that we might not get the contiguously allocated memory we want, so we'll have to recopy the contents to a larger array.
By the time the fourth doubling occurs, the first cell has been copied $4$ times, the next cell has been copied $3$ times, the next $2$ cells have been copied $2$ times each and so on.
The last $n/2$ numbers in the array don't move at all (they're the new ones). The previous $n/4$ numbers in the array would have moved once, the previous $n/8$ twice, and so on.
Bookkeeping¶
The figure below shows how you might count the number of movements in each chunk. None of the cells in the most recent chunk have been moved yet. Each cell in the second-most recent chunk has been copied one time, which corresponds to $4$ total movements in that chunk. This same pattern can be observed in all the chunks. Note: $i=0$ starts at rightmost (most recent) chunk.
Note that the very first chunk has been copied $d$ times (the number of doublings).
So the $i$-th chunk of numbers will have moved $$i \dfrac{n}{2^{i+1}}$$ times.
We need to add all these movements up.
Addition¶
Now we simply add all of these movements up. The first chunk has moved $d$ times, the second chunk has moved $\left(d-1\right)\times\dfrac{n}{2^{d}}$ times, etc.
Thus the total number of movements is
$$\sum_{i=1}^{\log_{2}(n)} i\frac{n}{2^{i+1}} \leq \frac{n}{2} \sum_{i=1}^{\infty} \frac{i}{2^i} = n.$$
This is an amazing result. The total work of reallocation is still $\mathcal{O}(n)$, as if a single array of the final size had been allocated in advance! Each append therefore costs $\mathcal{O}(1)$ amortized.
Here's the calculation.
\begin{align*} \sum_{i=1}^{\infty}{ix^{i}} &= \sum_{i=0}^{\infty}{\left(i+1\right)x^{i+1}} \\ &= x\sum_{i=0}^{\infty}{\left(i+1\right)x^{i}} \\ &= x\frac{\mathrm{d}}{\mathrm{d}x}\sum_{i=0}^{\infty}{x^{i+1}} \\ &= x\frac{\mathrm{d}}{\mathrm{d}x}\left[x\sum_{i=0}^{\infty}{x^{i}}\right] \\ &= x\frac{\mathrm{d}}{\mathrm{d}x}\left[\frac{x}{1-x}\right] = \frac{x}{\left(1-x\right)^{2}}. \end{align*}
When $x = 1/2$ we have $$\sum_{i=1}^{\infty}{\frac{i}{2^{i}}} = \frac{1/2}{\left(1/2\right)^{2}} = 2.$$
So, to summarize, the total number of movements is
$$M = d + \sum_{i=1}^{\log_{2}\left(n\right)-1}{n\dfrac{i}{2^{i+1}}} \leq d + \sum_{i=1}^{\infty}{n\dfrac{i}{2^{i+1}}} = d + \dfrac{n}{2}\sum_{i=1}^{\infty}{\dfrac{i}{2^{i}}} = d + \dfrac{n}{2}\times 2 = d + n = \log_{2}\left(n\right) + n.$$
For large $n$, this behaves like $n$.
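We can check this result empirically. Below is a small simulation of ours (using pure doubling, rather than CPython's milder growth rule) that counts every element copy performed during reallocations:
def total_copies(n):
    # Append n items to an array that starts with one slot and doubles
    # whenever it fills, counting every element copy along the way.
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size   # a reallocation copies all existing elements
            capacity *= 2
        size += 1
    return copies

for n in (2**10, 2**15, 2**20):
    print(n, total_copies(n), total_copies(n) / n)
The ratio of total copies to $n$ stays below $1$ and approaches it, consistent with the bound $M \leq \log_{2}(n) + n$ derived above.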
Breakout Room (10 mins)¶
Discuss the following:
- Why we were counting the number of movements
- Why this is such a cool result
Try to help each other understand the steps.
Containers vs Flats¶
Earlier we saw how Python lists contain references to integer structs (a "digit" array plus metadata) on the heap.
We call sequences that hold such "references" to objects on the heap Container Sequences. Examples of such container sequences are list, tuple, and collections.deque.
There are collections in Python which contain contiguous "typed" memory (which itself is allocated on the heap). We call these Flat Sequences. Such containers in Python 3 are: str, bytes, bytearray, memoryview, and array.array.
You have probably extensively used a type of flat sequence not mentioned yet: NumPy's ndarray, np.array.
All of these are faster as they work with contiguous blocks of uniformly formatted memory.
From Fluent Python:
Container sequences hold references to the objects they contain, which may be of any type, while flat sequences physically store the value of each item within its own memory space, and not as distinct objects. Thus, flat sequences are more compact, but they are limited to holding primitive values like characters, bytes, and numbers.
The data structures we have discussed fall into two general classes:
Contiguously-allocated structures are composed of single slabs of memory, and include arrays, matrices, heaps, and hash tables. These are the Flat Sequences we described above.
Linked data structures are composed of independent chunks of memory bound together by pointers, and include lists, trees, and graph adjacency lists. These are the Container Sequences we described above.
(Steven S. Skiena, The Algorithm Design Manual)
- A critical advantage of something like a contiguous memory array is that indexing is a constant time operation, as opposed to worst-case $\mathcal{O}(n)$, as we saw in linked lists (a quick demonstration follows this list).
- Other benefits include a tighter size and a locality of memory which benefits cache and general memory transport.
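To make the first point concrete, here is a small timing sketch of ours (not from the original notes), comparing constant-time indexing into a contiguous list against indexing into a collections.deque, which is a linked structure of blocks and must walk $\mathcal{O}(n)$ of them to reach the middle:
import timeit
from collections import deque

n = 10**6
lst = list(range(n))   # contiguous array of pointers: O(1) indexing
dq = deque(lst)        # doubly-linked list of blocks: O(n) indexing in the middle

print(timeit.timeit(lambda: lst[n // 2], number=1000))
print(timeit.timeit(lambda: dq[n // 2], number=1000))
On a typical machine the deque lookup is orders of magnitude slower, precisely because there is no contiguous slab to index into.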
Mutable vs Immutable¶
The mutability of objects has recurred often in this course. One can also classify sequences by their mutability.
Mutable sequences in Python 3 are: list, bytearray, array.array, collections.deque, memoryview.
Immutable sequences in Python 3 are: tuple, str, bytes.
Let's learn about some of these collections in Python.
array.array¶
The list type is nice and very flexible, but if you need to store many (millions of) floating-point values, array.array is a better option.
array.array stores just the bytes representing the declared type, so it's just like a contiguous C array of values in RAM, and also just like a numpy array.
array.array is mutable, and you don't need to allocate ahead of time (reallocation is handled for you).
from array import array
from random import random
# generator expression instead of list comprehension
floats_aa = array('d', (random() for i in range(10**9)))
print(floats_aa.itemsize, " ", type(floats_aa), " ", floats_aa[5])
8 <class 'array.array'> 0.07379464440936112
Now let's do the same thing with a list comprehension:
floats_list = [random() for i in range(10**9)]
Let's compare some timing (warning, this may take a while on your machine!).
%%time
for f in floats_aa:
    pass
CPU times: user 31.2 s, sys: 15.6 s, total: 46.8 s
Wall time: 1min 12s
%%time
for f in floats_list:
    pass
CPU times: user 32.7 s, sys: 1min 2s, total: 1min 35s
Wall time: 2min 36s
A Few Observations¶
- Looks like a regular Python list on a billion floats only costs double (seems like it should be even slower...).
- Why is accessing floats in an array.array so slow?
  - Because each float is boxed by the Python runtime. You saw this earlier!
In an array.array, or a numpy.array for that matter, when you "iterate" over the array and use the values you get, what Python does is take those 32 or 64 bits from memory, wrap them up into one of these structs, and hand the result to you. You asked for a Python float (or int) after all.
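You can watch this boxing happen (a small sketch of ours): every index access into an array.array builds a fresh Python float object from the raw bytes:
from array import array

a = array('d', [1.5, 2.5, 3.5])
print(type(a[0]))    # <class 'float'>: 8 raw bytes boxed into a full float object
print(a[0] is a[0])  # False in CPython: each access creates a brand-new object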
Operations on array.array which can be done in C are fast; access from Python is slow.
This is why numpy.ndarray is written in C, with operations like numpy.dot written in C.
(array.array does not expose any complex operations implemented under the hood in C, so its current use remains limited.)
If you want to do numerical work, use numpy arrays.
See https://www.python.org/doc/essays/list2str/ for a discussion of when it's a good idea to use array.array.
memoryviews¶
Memoryviews, inspired by NumPy and SciPy, let you handle slices of arrays without expensively copying bytes.
Travis Oliphant, as quoted in Fluent Python:
A memoryview is essentially a generalized NumPy array structure in Python itself (without the math). It allows you to share memory between data-structures (things like PIL images, SQLite databases, NumPy arrays, etc.) without first copying. This is very important for large data sets.
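As a small demonstration of ours (not from the original notes): slicing a memoryview produces another view onto the same underlying buffer, so writing through the slice mutates the original array without any bytes being copied:
from array import array

numbers = array('d', [0.0, 1.0, 2.0, 3.0, 4.0])
view = memoryview(numbers)  # a view onto the array's buffer: no copy
middle = view[1:3]          # slicing the view copies nothing either
middle[0] = 40.0            # write through the slice...
print(numbers)              # array('d', [0.0, 40.0, 2.0, 3.0, 4.0]): the array changed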