Key Word(s): Dynamic arrays, memory layouts
In Python you can append to lists. So what's an ob_size doing in our struct then?
typedef struct {
    long ob_refcnt;
    PyTypeObject *ob_type;
    Py_ssize_t ob_size;
    PyObject **ob_item;
    long allocated;
} PyListObject;
It turns out Python lists are implemented as something called a dynamic array.
Big O Notation¶
- Part of a larger notion of limiting behavior of functions
- Including Small O, Big Theta, Big Omega, Small Omega
- For us, it just means "bounded above"
- More strictly: $$f\left(n\right) = \mathcal{O}\left(g\left(n\right)\right)$$ means $\left|f\right|$ is bounded above by a constant multiple of $g$ asymptotically.
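For example, $f(n) = 3n + 5$ is $\mathcal{O}(n)$: for every $n \geq 5$ we have $\left|3n + 5\right| \leq 4n$, so $f$ is bounded above by a constant multiple of $n$ asymptotically.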
Example: Binary Search Trees¶
- We have stated that the time-complexity of operations on (balanced) BSTs is $\mathcal{O}\left(\log\left(n\right)\right)$
- This means that the time-complexity is bounded above by $\log\left(n\right)$.
A complete binary tree has $$n = 1 + 2 + \ldots + 2^{h-1} + 2^{h}$$ nodes where $h$ is the height of the tree.
We can write this as $$n=2^{h+1} - 1.$$
Solving for the height gives $$h = \log_{2}\left(n+1\right) - 1.$$
So the height scales like $\log_{2}\left(n\right)$.
Going just a little bit further we have \begin{align*} h &= \log_{2}\left(n+1\right) - 1 \\ & < \log_{2}\left(n+1\right). \end{align*}
So the height is bounded from above by $\log_{2}\left(n+1\right)$.
But asymptotically, as $n\to\infty$, $\log_{2}\left(n+1\right)$ behaves like $\log_{2}\left(n\right)$.
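As a quick numerical sanity check (a minimal sketch, not part of the derivation), we can verify that for a complete binary tree with $n = 2^{h+1} - 1$ nodes, $\log_{2}(n+1) - 1$ recovers the height exactly, while $\log_{2}(n)$ is only slightly larger:
import math
for h in range(1, 11):
    n = 2**(h + 1) - 1                 # nodes in a complete binary tree of height h
    print(h, n, math.log2(n + 1) - 1, math.log2(n))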
Arrays¶
A static array is a contiguous slab of memory of known size, large enough to hold $n$ items. This is a great data structure. Why?
- constant time index access: a[i] is $\mathcal{O}(1)$. Just go directly to offset i * sizeof(int) (see the sketch after this list).
- linear time traversal or search: $1$ unit / loop iteration means $\mathcal{O}(n)$ in a loop.
- locality in memory: it's one int after another.
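Here is the sketch promised above. array.array exposes the address of its underlying C buffer via buffer_info(); the pointer arithmetic below is purely illustrative (you would never index this way in Python), but it shows why a[i] is a single offset computation:
from array import array
a = array('i', range(10))
addr, n_items = a.buffer_info()        # address of the first element, number of items
# element i lives at addr + i * a.itemsize, so indexing never scans the array
print(a.itemsize, addr, addr + 5 * a.itemsize)
print(a[5])                            # same element, reached by one offset computation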
Tuples in Python are fixed-size, static arrays.
But the big problem is, what if we want to add something more beyond the end of the array?
Then we must use dynamic arrays.
Note that this shifts the focus more to space complexity (rather than time complexity).
Dynamic Arrays¶
What Python does is first create a fixed-size array of PyObject* pointers on the heap. Then, as you append, it uses its own algorithm to figure out when to expand the size of the array.
/* This over-allocates proportional to the list size, making
* room for additional growth. The over-allocation is mild,
* but is enough to give linear-time amortized behavior over a
* long sequence of appends() in the presence of a poorly-
* performing system realloc(). The growth pattern is: 0, 4,
* 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
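Here is a minimal Python sketch of this growth rule (an illustration, not the real C code). In the surrounding CPython source the over-allocation computed above is then added to newsize, which is what reproduces the quoted pattern:
def new_allocation(newsize):
    # old CPython rule: mild over-allocation plus the requested size
    return newsize + (newsize >> 3) + (3 if newsize < 9 else 6)

allocated = 0
growth = [allocated]
for n in range(1, 100):                # simulate 99 appends
    if n > allocated:                  # current allocation is full, so grow
        allocated = new_allocation(n)
        growth.append(allocated)
print(growth)                          # [0, 4, 8, 16, 25, 35, 46, 58, 72, 88, 106]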
alist = [1,2,3,4]
alist.append(5)
alist
import sys
sys.getsizeof([]),sys.getsizeof([1]),sys.getsizeof([1,1]), sys.getsizeof([1,1,1])
An empty list is 64 bytes. Each int adds an 8-byte pointer.
a = []
sys.getsizeof(a)
a.append(1)
sys.getsizeof(a)
a.append(1)
sys.getsizeof(a)
for j in range(16):
    a.append(1)
    print(a, " ", sys.getsizeof(a))
Note: The growth factor is less than $2$! This is quite technical, but here is a reference: https://en.wikipedia.org/wiki/Dynamic_array#Growth_factor.
Performance of Dynamic Arrays¶
Let's assume we start with an array of size $1$ (one slot) and then double the size each time. After $d$ doublings, we have an array with $2^{d}$ slots, so it takes $\log(n)$ doublings for the array to have $n$ slots (note, $\log(n)$ means $\log_{2}(n)$).
$$s = 2^{d} \qquad \text{(\# of slots)}$$
$$d = \log_{2}(s) \qquad \text{(\# of doublings)}$$
$$d_{n} = \log_{2}(n)$$
Notice that we might not get the contiguously allocated memory we want, so we'll have to recopy to a larger array.
The last $n/2$ numbers in the array don't move at all (they're the new ones). The previous $n/4$ numbers in the array would have moved once, the previous $n/8$ twice, and so on.
Therefore, the $i$-th most recent chunk contains $\dfrac{n}{2^{i+1}}$ numbers, each of which has moved $i$ times, contributing $$i \dfrac{n}{2^{i+1}}$$ movements in total. Show this!!
We need to add all these movements up.
Thus the total number of movements is
$$\sum_{i=1}^{\log_{2}(n)} i\frac{n}{2^{i+1}} \leq \frac{n}{2} \sum_{i=1}^{\infty} \frac{i}{2^i} = n.$$
This is an amazing result. The work of reallocation is still $\mathcal{O}(n)$ on the average, as if a single array had been allocated in advance!
Here's the calculation.
\begin{align*} \sum_{i=1}^{\infty}{ix^{i}} &= \sum_{i=0}^{\infty}{\left(i+1\right)x^{i+1}} \\ &= x\sum_{i=0}^{\infty}{\left(i+1\right)x^{i}} \\ &= x\frac{\mathrm{d}}{\mathrm{d}x}\sum_{i=0}^{\infty}{x^{i+1}} \\ &= x\frac{\mathrm{d}}{\mathrm{d}x}\left[x\sum_{i=0}^{\infty}{x^{i}}\right] \\ &= x\frac{\mathrm{d}}{\mathrm{d}x}\left[\frac{x}{1-x}\right] = \frac{x}{\left(1-x\right)^{2}}. \end{align*}
When $x = 1/2$ we have $$\sum_{i=0}^{\infty}{\left(i+1\right)x^{i+1}} = 2.$$
So, to summarize, $$\dfrac{n}{2}\sum_{i=1}^{\infty}{\dfrac{i}{2^{i}}} = \dfrac{n}{2}\sum_{i=0}^{\infty}{\dfrac{i+1}{2^{i+1}}} = \dfrac{n}{2}\times 2 = n.$$
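As a quick numerical check (a small sketch, independent of the derivation above), we can simulate a doubling array and count how many element copies the reallocations cost. The total stays close to $n$:
def total_copies(n):
    capacity, copies = 1, 0
    for size in range(1, n + 1):       # append n items one at a time
        if size > capacity:            # out of room: reallocate and copy
            copies += capacity         # every existing element moves once
            capacity *= 2
    return copies

for n in [2**10, 2**15, 2**20]:
    print(n, total_copies(n), total_copies(n) / n)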
Containers vs Flats¶
Earlier we saw how Python lists contain references to integer structs (the "digit" data plus metadata) on the heap.
We call sequences that hold such "references" to objects on the heap Container Sequences. Examples of such container sequences are list, tuple, and collections.deque.
There are collections in Python which contain contiguous "typed" memory (which itself is allocated on the heap). We call these Flat Sequences. Such containers in Python 3 are: str, bytes, bytearray, memoryview, and array.array.
You have probably extensively used a type of flat sequence not mentioned yet: NumPy's ndarray, np.array.
All of these are faster as they work with contiguous blocks of uniformly formatted memory.
From Fluent Python:
Container sequences hold references to the objects they contain, which may be of any type, while flat sequences physically store the value of each item within its own memory space, and not as distinct objects. Thus, flat sequences are more compact, but they are limited to holding primitive values like characters, bytes, and numbers.
The data structures we have discussed fall into two general classes:
Contiguously-allocated structures are composed of single slabs of memory, and include arrays, matrices, heaps, and hash tables. These are the Flat Sequences we described above.
Linked data structures are composed of independent chunks of memory bound together by pointers, and include lists, trees, and graph adjacency lists. These are the Container Sequences we described above.
(Steven S Skiena. The Algorithm Design Manual)
- A critical advantage of something like a contiguous memory array is that indexing is a constant time operation, as opposed to worst-case $\mathcal{O}(n)$, as we saw in linked lists.
- Other benefits include a tighter size and a locality of memory which benefits cache and general memory transport.
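To make the compactness point concrete, here is a small sketch (exact sizes vary by Python version and platform) comparing the footprint of a container sequence with that of a flat sequence holding the same integers:
import sys
from array import array

nums = list(range(1000))
flat = array('l', nums)                # one contiguous block of C longs

# the list stores pointers; the int objects themselves live elsewhere on the heap
list_total = sys.getsizeof(nums) + sum(sys.getsizeof(x) for x in nums)
print(list_total)                      # pointer array plus boxed ints
print(sys.getsizeof(flat))             # typed, contiguous storage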
Mutable vs Immutable¶
The mutability of objects has recurred often in this course.
One can also study containers based on their mutability.
Mutable sequences in Python 3 are: list, bytearray, array.array, collections.deque, memoryview.
Immutable sequences in Python 3 are: tuple, str, bytes.
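A tiny illustration of the distinction (a minimal sketch): item assignment works on a mutable sequence and raises a TypeError on an immutable one.
mutable_seq = [1, 2, 3]
immutable_seq = (1, 2, 3)
mutable_seq[0] = 99                    # fine: lists support item assignment
try:
    immutable_seq[0] = 99              # tuples do not
except TypeError as err:
    print(err)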
Let's learn about some of these collections in Python.
array.array¶
The list type is nice and very flexible, but if you need to store many, many (millions of) floating-point values, array.array is a better option.
array.array stores just the bytes representing the type, so it's just like a contiguous C array of things in RAM, and also just like a numpy array.
array.array is mutable, and you don't need to allocate ahead of time (reallocation will be done for you).
from array import array
from random import random
#generator expression instead of list comprehension
floats_aa = array('d', (random() for i in range(10**9)))
print(floats_aa.itemsize, " ", type(floats_aa), " ", floats_aa[5])
Now let's do the same thing with a list comprehension:
floats_list = [random() for i in range(10**9)]
%%time
for f in floats_aa:
    pass
%%time
for f in floats_list:
    pass
A Few Observations¶
- Looks like a regular Python list on a billion floats only costs double (seems like it should be even slower...).
- Why is accessing floats in an array.array so slow?
  - Because each float is boxed by the Python runtime. You saw this earlier!
In an array.array, or in a numpy.array for that matter, when you "iterate" over the array and use the ints you get, what Python does is take those 32 or 64 bits from memory, wrap them up into one of these structs, and hand it to you. You asked for a Python int, after all.
Operations on array.array which can be done in C are fast; access into Python is slow.
This is why numpy.ndarray is written in C, with operations like numpy.dot written in C.
(array.array does not expose any complex operations implemented under the hood in C, so its current use remains limited.)
If you want to do numerical work, use numpy arrays.
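As a rough illustration of the boxing cost (a sketch; absolute timings depend on your machine), compare a C-level reduction on a numpy array with the built-in sum, which must box every element into a Python float:
import numpy as np
floats_np = np.random.random(10**6)    # one flat block of 8-byte doubles
print(floats_np.dtype, floats_np.nbytes)
%timeit floats_np.sum()                # the loop runs in C
%timeit sum(floats_np)                 # each element gets boxed on the way out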
See https://www.python.org/doc/essays/list2str/ for a discussion on when it's a good idea to use array.array.
memoryviews¶
Memoryviews, inspired by NumPy and SciPy, let you handle slices of arrays without expensively copying bytes.
Travis Oliphant, as quoted in Fluent Python:
A memoryview is essentially a generalized NumPy array structure in Python itself (without the math). It allows you to share memory between data-structures (things like PIL images, SQLlite databases, NumPy arrays, etc.) without first copying. This is very important for large data sets.
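Here is a short sketch of that no-copy behavior (a minimal example, not taken from the quote above): a memoryview slice shares memory with the underlying array, so writing through the slice changes the original.
from array import array

numbers = array('h', [-2, -1, 0, 1, 2])   # signed short integers
memv = memoryview(numbers)
memv_slice = memv[1:4]                    # no bytes are copied here
memv_slice[0] = 99                        # writes through to the shared buffer
print(numbers)                            # array('h', [-2, 99, 0, 1, 2])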
Additional Details on Dynamic Arrays¶
The cartoons below are meant to help you see how to count the number of movements. The variable $d$ represents the number of doublings.
Doublings¶
After $4$ doublings, the array looks something like the figure below. The numbers in each cell represent the number of times that cell was moved (or copied). These numbers are not the numbers stored in the array.
By the time the fourth doubling occurs, the first cell has been copied $4$ times, the next cell has been copied $3$ times, the next $2$ cells have been copied $2$ times each and so on.
Bookkeeping¶
The figure below shows how you might count the number of movements in each chunk. None of the cells in the most recent chunk have been moved yet. Each cell in the second-most recent chunk has been copied one time, which corresponds to $4$ total movements in that chunk. This same pattern can be observed in all the chunks.
Note that the very first chunk has been copied $d$ times (the number of doublings).
Addition¶
Now we simply add all of these movements up. The first chunk has moved $d$ times, the second chunk contributes $\left(d-1\right)\times\dfrac{n}{2^{d}}$ total movements, and so on. The final result is illustrated in the figure below.
I/O Intro¶
- Input Files
- XML
- YAML
- JSON
- Comments on pickling
from IPython.display import HTML
Input Files and Parsing¶
We usually want to read data into our software:
- Input parameters to the code (e.g. time step, linear algebra solvers, physical parameters, etc)
- Input fields (e.g. fields to visualize)
- Calibration data
- $\vdots$
This data can be provided by us, or the client, or come from a database somewhere.
There are many ways of reading in and parsing data. In fact, this is often a non-trivial exercise depending on the quality of the data as well as its size.
XML Intro¶
id="test_mechanism">
reversible="yes" type="Elementary" id="reaction01">
H + O2 [=] OH + O
units="cm3/mol/s">3.52e+16
-0.7
units="kJ/mol">71.4
H:1 O2:1
OH:1 O:1
reversible="yes" type="Elementary" id="reaction02">
H2 + O [=] OH + H
units="cm3/mol/s">5.06e+4
2.7
units="kJ/mol">26.3
H2:1 O:1
OH:1 H:1
What is XML?¶
Note: Material presented here is taken from the following sources
Some basic XML comments:
- XML stands for Extensible Markup Language
- XML is just information wrapped in tags
- It doesn't do anything per se
- Its format is both machine- and human-readable
Some Basic XML Anatomy¶
id="dog1">
Cloe
3
Border Collie
Yes
id="dog2">
Karl
7
Beagle
Yes
Note that all XML elements have a closing tag!
Some More Basic XML Anatomy¶
See the w3schools XML tutorial for a very nice summary of the essential XML rules.
XML elements: a few things to be aware of:
- Elements can contain text, attributes, and other elements
- XML names are case sensitive and cannot contain spaces
- Be consistent in your naming convention
XML attributes: a few things to be aware of:
- XML attributes must be in quotes
- There are no rules about when to use elements or attributes
- You could make an attribute an element and it might read better
- Rule of thumb: Data should be stored as elements. Metadata should be stored as attributes.
Python and XML¶
We will use the ElementTree class to read in and parse XML input files in Python.
A very nice tutorial can be found in the Python ElementTree documentation.
We'll work with the shelterdogs.xml file to start.
import xml.etree.ElementTree as ET
tree = ET.parse('shelterdogs.xml')
dogshelter = tree.getroot()
print(dogshelter)
print(dogshelter.tag)
print(dogshelter.attrib)
Looping Over Child Elements¶
for child in dogshelter:
    print(child.tag, child.attrib)
Accessing Children by Index¶
print(dogshelter[0][0].text)
print(dogshelter[1][0].text)
print(dogshelter[0][2].text)
The Element.iter() Method¶
From the documentation:
Creates a tree iterator with the current element as the root. The iterator iterates over this element and all elements below it, in document (depth first) order.
for age in dogshelter.iter('age'):
    print(age.text)
The Element.findall() Method¶
From the documentation:
Finds all matching subelements, by tag name or path. Returns a list containing all matching elements in document order.
print(dogshelter.findall('dog'))
for dog in dogshelter.findall('dog'):                 # Iterate over each child
    print('ID: {}'.format(dog.get('id')))             # Use the get() method to get the attribute of the child
    print('----------')
    print('Name: {}'.format(dog.find('name').text))   # Use the find() method to find a specific subchild
    age = float(dog.find('age').text)
    if 'months' in dog.find('age').attrib.values():   # age may be given in months via an attribute
        years = age / 12.0
        print('Age: {} years'.format(years))
    else:
        print('Age: {} years'.format(age))
    print('Breed: {}'.format(dog.find('breed').text))
    if dog.find('playgroup').text.split()[0] == 'Yes':
        print('PLAYGROUP')
    else:
        print('NO PLAYGROUP')
    print('\n::::::::::::::::::::')
What is JSON?¶
- Stands for JavaScript Object Notation
- It's actually language agnostic
- No need to learn JavaScript to use it
- Like XML, it's a human-readable format
Some Basic JSON Anatomy¶
{
    "dogShelter": "MSPCA-Angell",
    "dogs": [
        {
            "name": "Cloe",
            "age": 3,
            "breed": "Border Collie",
            "attendPlaygroup": "Yes"
        },
        {
            "name": "Karl",
            "age": 7,
            "breed": "Beagle",
            "attendPlaygroup": "Yes"
        }
    ]
}
JSON and Python¶
- Python supports JSON natively
- Saving Python data to JSON format is called serialization
- Loading a JSON file into Python data is called deserialization
Deserialization¶
Since we're interested in reading in some fancy input file, we'll begin by discussing deserialization.
We'll work with the shelterdogs.json
file.
import json
with open("shelterdogs.json", "r") as shelterdogs_file:
    shelterdogs = json.load(shelterdogs_file)
print(shelterdogs["dogs"])
print(type(shelterdogs))
Comments on Deserialization¶
That was pretty nice! We got a Python
dictionary out. We sure know how to work with Python
dictionaries.
Serialization¶
You can also write data out to JSON format. Let's just do a brief example.
somedogs = {"shelterDogs": [{"name": "Cloe", "age": 3, "breed": "Border Collie", "attendPlaygroup": "Yes"},
{"name": "Karl", "age": 7, "breed": "Beagle", "attendPlaygroup": "Yes"}]}
with open("shelterdogs_write.json", "w") as write_dogs:
json.dump(somedogs, write_dogs, indent=4)
Some JSON References¶
What is YAML?¶
- The official website: YAML
- From the official website: YAML stands for YAML Ain't Markup Language
- Example of a recursive acronym (like Linux!)
- "What It Is: YAML is a human-friendly data serialization standard for all programming languages."
- YAML is quite friendly to use and continues to gain in popularity
YAML Anatomy¶
shelterDogs:
- {age: 3, attendPlaygroup: 'Yes', breed: Border Collie, name: Cloe}
- {age: 7, attendPlaygroup: 'Yes', breed: Beagle, name: Karl}
shelterStaff:
- {Job: dogWalker, age: 100, name: Bob}
- {Job: PlaygroupLeader, age: 47, name: Sally}
someshelter = {"shelterDogs": [{"name": "Cloe", "age": 3, "breed": "Border Collie", "attendPlaygroup": "Yes"},
{"name": "Karl", "age": 7, "breed": "Beagle", "attendPlaygroup": "Yes"}],
"shelterStaff": [{"name": "Bob", "age": 100, "Job": "dogWalker"},
{"name": "Sally", "age": 47, "Job": "PlaygroupLeader"}]}
import yaml # Use conda install pyyaml (or pip install pyyaml) if you need to install it
print(yaml.dump(someshelter))
Serialization¶
with open("shelter_write.yaml", "w") as write_dogs:
yaml.dump(someshelter, write_dogs)
Deserialization¶
with open("shelterdogs.yaml", "r") as shelter_dogs:
    some_shelter = yaml.safe_load(shelter_dogs)  # safe_load avoids executing arbitrary YAML tags
print(some_shelter)
print(some_shelter["shelterStaff"])
What is pickle?¶
- Python has its own module for loading and writing Python data
- Part of the Python standard library
- Fast
- Can store arbitrarily complex Python data structures
Some caveats¶
- Python specific: no guarantee of cross-language compatibility
- Not every Python data structure can be serialized by pickle
- Older versions of Python don't support newer serialization formats
  - The latest format can handle the most Python data structures
  - It can also read in older data structures
  - Older formats cannot read in newer formats
- Make sure to use binary mode when opening pickle files
  - Data will get corrupted otherwise
import pickle
someshelter = {"shelterDogs": [{"name": "Cloe", "age": 3, "breed": "Border Collie", "attendPlaygroup": "Yes"},
{"name": "Karl", "age": 7, "breed": "Beagle", "attendPlaygroup": "Yes"}],
"shelterStaff": [{"name": "Bob", "age": 100, "Job": "dogWalker"},
{"name": "Sally", "age": 47, "Job": "PlaygroupLeader"}]}
with open('data.pickle', 'wb') as f:
    pickle.dump(someshelter, f, pickle.HIGHEST_PROTOCOL) # highest protocol is the most recent one
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)
print(data)
%%bash
cat "data.pickle"