Key Word(s): Datastructures, Lists, Linked lists

Lecture 15¶

Data Structures I¶

Tuesday, October 27th 2020¶

Data Structures¶

Computer programs don't only perform calculations; they also store and retrieve information

Data structures and the algorithms that operate on them are at the core of computer science

Data structures are quite general
- Any data representation and associated operations
- e.g. integers, floats, arrays, classes, ...

Need to develop a "toolkit" of data structures and know when/how to use the right one for a given problem

Changing a data structure in a slow program can work the same way an organ transplant does in a sick patient. Important classes of abstract data types such as containers, dictionaries, and priority queues, have many different but functionally equivalent data structures that implement them.

Changing the data structure does not change the correctness of the program, since we presumably replace a correct implementation with a different correct implementation. However, the new implementation of the data type realizes different tradeoffs in the time to execute various operations, so the total performance can improve dramatically.

Like a patient in need of a transplant, only one part might need to be replaced in order to fix the problem.

- Steven S. Skiena, The Algorithm Design Manual

Common data structures¶

Lists
Stacks/queues
Hashes
Heaps
Trees

We'll focus on lists today.

We'll tour some data structures in Python.

First up: sequences.

Sequences and their Abstractions¶

What is a sequence?¶

Consider the notion of Abstract Data Types.

The idea there is that one data type might be implemented in terms of another, or some underlying code not even in Python.

As long as the interface and contract presented to the user is solid, we can change the implementation below.

Python's dunder (double-underscore) methods serve this purpose: they are the hooks through which a class presents a standard interface.

In Python, a sequence is any object that follows the "sequence protocol". A Python list is one example.

Following the protocol entails defining the __len__ and __getitem__ methods, as we mentioned in previous lectures.
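As a minimal sketch of the protocol (the class name `Countdown` is hypothetical, not from the lecture): a class that defines only `__len__` and `__getitem__`, and raises `IndexError` for out-of-range indices, already works with `len`, indexing, `for` loops, and `in`. Python falls back to calling `__getitem__` with 0, 1, 2, ... until it sees `IndexError`.

```python
class Countdown:
    """Hypothetical example: the sequence n, n-1, ..., 1."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, index):
        # Only non-negative integer indices are handled in this sketch.
        if not 0 <= index < self._n:
            raise IndexError('Countdown index out of range')
        return self._n - index


c = Countdown(3)
print(len(c))     # 3
print(c[0])       # 3
print(list(c))    # [3, 2, 1]
print(2 in c)     # True
```

No `__iter__` or `__contains__` is defined here; iteration and membership tests come from the two protocol methods alone.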

Example¶

In [1]:

alist = [1,2,3,4]
len(alist) # calls alist.__len__

Out[1]:

4

In [2]:

alist[2] # calls alist.__getitem__(2)

Out[2]:

3

Lists also support slicing¶

In [3]:

alist[2:4]

Out[3]:

[3, 4]

How does this work?¶

We will create a dummy sequence that allocates no storage: it only implements the two protocol methods.

In [4]:

class DummySeq:
    # It just implements the protocol.
    def __len__(self):
        return 42
    
    def __getitem__(self, index):
        return index

In [5]:

d = DummySeq()
len(d)

Out[5]:

42

In [6]:

d[5]

Out[6]:

5

In [7]:

d[67:98]

Out[7]:

slice(67, 98, None)

The "slice object"¶

Slicing creates a slice object for us of the form slice(start, stop, step) and then Python calls seq.__getitem__(slice(start, stop, step)).
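A small sketch of that equivalence, using a hypothetical helper class `ShowIndex` (not part of the lecture) whose `__getitem__` returns whatever index object it receives:

```python
class ShowIndex:
    """Hypothetical helper: returns whatever index object it receives."""

    def __getitem__(self, index):
        return index


s = ShowIndex()

# The bracket syntax and the explicit call produce equal slice objects.
print(s[1:10:2] == s.__getitem__(slice(1, 10, 2)))   # True

# Omitted parts become None.
print(s[:5])      # slice(None, 5, None)
print(s[::-1])    # slice(None, None, -1)

# A slice object exposes its parts as attributes.
sl = s[2:8:3]
print(sl.start, sl.stop, sl.step)   # 2 8 3
```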

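Multi-dimensional indexing is also possible: comma-separated indices and slices inside the brackets are passed to __getitem__ as a single tuple.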

In [8]:

d[67:98:2,1]

Out[8]:

(slice(67, 98, 2), 1)

In [9]:

d[67:98:2,1:10]

Out[9]:

(slice(67, 98, 2), slice(1, 10, None))
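To illustrate how a class might act on such a tuple, here is a minimal sketch of a two-dimensional container. The class `Grid` and its list-of-rows storage are assumptions made for this example, not something defined in the lecture.

```python
class Grid:
    """Hypothetical 2-D container backed by a list of rows."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    def __getitem__(self, index):
        # g[i, j] arrives as the tuple (i, j); each part may be an int or a slice.
        if not (isinstance(index, tuple) and len(index) == 2):
            raise TypeError('Grid indices must be a (row, column) pair')
        row_idx, col_idx = index
        if isinstance(row_idx, slice):
            # Several rows: apply the column index to each selected row.
            return [row[col_idx] for row in self._rows[row_idx]]
        return self._rows[row_idx][col_idx]


g = Grid([[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]])
print(g[1, 2])       # 6
print(g[0:2, 1])     # [2, 5]
print(g[0:2, 1:3])   # [[2, 3], [5, 6]]
print(g[2, 0:2])     # [7, 8]
```

The sketch returns plain lists for simplicity; a fuller design would probably return another `Grid` for slice results.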

So what is slice() exactly?

In [10]:

dir(slice)

Out[10]:

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'indices',
 'start',
 'step',
 'stop']

Interesting! There's an indices method!

In [11]:

help(slice.indices)

Help on method_descriptor:

indices(...)
    S.indices(len) -> (start, stop, stride)
    
    Assuming a sequence of length len, calculate the start and stop
    indices, and the stride length of the extended slice described by
    S. Out of bounds indices are clipped in a manner consistent with the
    handling of normal slices.

In [14]:

slice(1, 10, 2).indices(100)

Out[14]:

(1, 10, 2)

Of course, you only need indices when you implement slicing yourself. If your class wraps an underlying sequence such as a list, you can pass the slice object straight through to it.
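A sketch of the case where `indices` is useful: a sequence that computes its elements on demand and so has no stored list to delegate to. The class `Evens` is hypothetical and exists only for this illustration.

```python
class Evens:
    """Hypothetical lazy sequence 0, 2, 4, ... with no stored elements."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, index):
        if isinstance(index, slice):
            # indices() resolves None, negative, and out-of-range values
            # against the length, giving arguments suitable for range().
            start, stop, step = index.indices(len(self))
            return [2 * i for i in range(start, stop, step)]
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError('Evens index out of range')
        return 2 * index


# Clipping and defaults, as described in the help text:
print(slice(1, 200, 2).indices(100))      # (1, 100, 2)
print(slice(None, None, -1).indices(5))   # (4, -1, -1)
print(slice(-3, None).indices(10))        # (7, 10, 1)

e = Evens(10)
print(e[3])      # 6
print(e[-1])     # 18
print(e[2:5])    # [4, 6, 8]
print(e[::-4])   # [18, 10, 2]
```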

In [15]:

# Adapted from Example 10-6 from Fluent Python
import numbers # See https://www.python.org/dev/peps/pep-3141/
import reprlib # like repr but w/ limits on sizes of returned strings

class NewSeq:
    def __init__(self, iterator):
        self._storage = list(iterator)
        
    def __repr__(self):
        components = reprlib.repr(self._storage)
        return 'NewSeq({})'.format(components)

    def __len__(self):
        return len(self._storage)
     
    def __getitem__(self, index):
        cls = type(self)
        if isinstance(index, slice):
            return cls(self._storage[index])
        elif isinstance(index, numbers.Integral): 
            return self._storage[index]
        else:
            msg = '{cls.__name__} indices must be integers' 
            raise TypeError(msg.format(cls=cls))

In [16]:

d2 = NewSeq(range(10))
len(d2)

Out[16]:

10

In [17]:

repr(d2)

Out[17]:

'NewSeq([0, 1, 2, 3, 4, 5, ...])'

In [18]:

d2

Out[18]:

NewSeq([0, 1, 2, 3, 4, 5, ...])

In [19]:

d2[4]

Out[19]:

4

In [20]:

d2[2:4]

Out[20]:

NewSeq([2, 3])

In [21]:

d2[1,4]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-86ebd9ff5e90> in <module>
----> 1 d2[1,4]

<ipython-input-15-b55bf2b4aeca> in __getitem__(self, index)
     22         else:
     23             msg = '{cls.__name__} indices must be integers'
---> 24             raise TypeError(msg.format(cls=cls))

TypeError: NewSeq indices must be integers

Breakout Room (15 mins)¶

Determine who stayed up latest last night (their local time).
Walk through the example and be ready to explain the following:
- The implementation of __repr__.
- The implementation of __getitem__.
  - What is numbers.Integral?
- The choice of error message. Is it sensible? How can we know?

Note: An explanation could involve demo-ing the method in a notebook cell!

In [23]:

l = list(range(10))

In [24]:

reprlib.repr(l)

Out[24]:

'[0, 1, 2, 3, 4, 5, ...]'

In [26]:

ind = slice(2,4,None)

In [27]:

isinstance(ind, slice)

Out[27]:

True

In [28]:

ind = 4

In [29]:

isinstance(ind, slice)

Out[29]:

False

In [30]:

isinstance(ind, numbers.Integral)

Out[30]:

True

In [31]:

l

Out[31]:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [32]:

l[4]

Out[32]:

4

In [33]:

l[4.0]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-799917c621e3> in <module>
----> 1 l[4.0]

TypeError: list indices must be integers or slices, not float

In [34]:

l[4.0, 3.0]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-9c08704c5d55> in <module>
----> 1 l[4.0, 3.0]

TypeError: list indices must be integers or slices, not tuple
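The two list tracebacks above suggest one possible refinement of the NewSeq message: name both accepted index kinds and the offending type, as list does. A minimal sketch, where the class `NewSeq2` is hypothetical and trimmed to the parts needed here:

```python
import numbers


class NewSeq2:
    """Hypothetical variant of NewSeq with a list-style error message."""

    def __init__(self, iterable):
        self._storage = list(iterable)

    def __len__(self):
        return len(self._storage)

    def __getitem__(self, index):
        cls = type(self)
        if isinstance(index, slice):
            return cls(self._storage[index])
        if isinstance(index, numbers.Integral):
            return self._storage[index]
        # Report what is accepted and what was actually passed.
        msg = '{} indices must be integers or slices, not {}'
        raise TypeError(msg.format(cls.__name__, type(index).__name__))


s = NewSeq2(range(10))
for bad in (4.0, (1, 4)):
    try:
        s[bad]
    except TypeError as err:
        print(err)
# NewSeq2 indices must be integers or slices, not float
# NewSeq2 indices must be integers or slices, not tuple
```

This is only one reasonable choice; the point of the exercise is to compare a class's message against what the built-in types report.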

Linked Lists¶

Remember, a name in Python points to its value.
We've seen lists whose last element is actually a pointer to another list.
This leads to the idea of a linked list, which we'll use to illustrate sequences.

Nested Pairs¶

See Berkeley CS61a, "Nested Pairs", for the box-and-pointer notation.

In Python:

In [35]:

pair = (1,2)

This representation lacks a certain power. A generalization:

pair = (1, (-1, None))

Another generalization:

linked_list = (1, (2, (3, (4, None))))

The second generalization leads to something like recursive lists: each pair holds one value and a reference to the rest of the list, and None marks the end.

Here's what things look like in PythonTutor: PythonTutor Example.

In [36]:

import IPython
IPython.display.IFrame('http://pythontutor.com/iframe-embed.html#code=ll%20%3D%20%281,%20%282,%20%283,%20%284,%20None%29%29%29%29&codeDivHeight=400&codeDivWidth=350&cumulative=false&curInstr=0&heapPrimitives=nevernest&origin=opt-frontend.js&py=3&rawInputLstJSON=%5B%5D&textReferences=false', 
                      width=900, height=300)

Out[36]:

Quick Linked List implementation¶

In [37]:

empty_ll = None

def make_ll(first, rest): # Make a linked list
    return (first, rest)

def first(ll): # Get the first entry of a linked list
    return ll[0]

def rest(ll): # Get the rest of a linked list (everything after the first entry)
    return ll[1]

In [38]:

ll_1 = make_ll(1, make_ll(2, make_ll(3, empty_ll))) # Build a linked list from nested make_ll calls

ll_1

Out[38]:

(1, (2, (3, None)))

In [39]:

my_ll = make_ll(10,ll_1) # Make another one
my_ll

Out[39]:

(10, (1, (2, (3, None))))

In [40]:

print(first(my_ll), "     ", rest(my_ll), "     ", first(rest(my_ll)))

10       (1, (2, (3, None)))       1
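To connect this back to sequences, here is a sketch of how length and indexing could be written for the pair-based linked list. The helpers `len_ll` and `getitem_ll` are hypothetical additions; `make_ll`, `first`, and `rest` are repeated from above so the block runs on its own.

```python
empty_ll = None

def make_ll(first, rest):
    return (first, rest)

def first(ll):
    return ll[0]

def rest(ll):
    return ll[1]

def len_ll(ll):
    """Count the entries by walking to the end of the list."""
    count = 0
    while ll is not empty_ll:
        count += 1
        ll = rest(ll)
    return count

def getitem_ll(ll, i):
    """Return entry i (i >= 0) by following i links from the front."""
    if i < 0:
        raise IndexError('negative indices are not handled in this sketch')
    for _ in range(i):
        if ll is empty_ll:
            raise IndexError('linked list index out of range')
        ll = rest(ll)
    if ll is empty_ll:
        raise IndexError('linked list index out of range')
    return first(ll)


my_ll = make_ll(10, make_ll(1, make_ll(2, make_ll(3, empty_ll))))
print(len_ll(my_ll))          # 4
print(getitem_ll(my_ll, 0))   # 10
print(getitem_ll(my_ll, 3))   # 3
```

Both helpers walk the list one link at a time, so their cost grows with the length of the list (or with i), unlike `len` and indexing on a Python list.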

Some reasons for linked lists:¶

You allocate memory only when you want to use it.
Inserting a new element is cheaper than in a fixed-size array
- Given a reference to the insertion point, it can be done with a constant number of operations!
Gateway to other pointer-like and hierarchical structures.
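The constant-cost insertion can be sketched with the tuple representation used above. The helper name `prepend` is an assumption for this example; it does the same thing as `make_ll`.

```python
def prepend(value, ll):
    """Return a new linked list with value in front of ll (one new pair)."""
    return (value, ll)


ll_1 = (1, (2, (3, None)))
my_ll = prepend(10, ll_1)

print(my_ll)              # (10, (1, (2, (3, None))))
# The old list is reused as the tail rather than copied.
print(my_ll[1] is ll_1)   # True
# The original list is unchanged.
print(ll_1)               # (1, (2, (3, None)))
```

By contrast, inserting at the front of an array-backed Python list (`some_list.insert(0, x)`) has to shift every existing element by one position.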

Comments about linked lists:¶

Not so useful in Python but can be useful in C/C++
There are singly-linked lists and doubly-linked lists
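Larger memory footprint than arrays (each node needs a reference to the next node)
No direct (constant-time) access to individual elements: reaching element i means following i links from the head
Lose memory locality: nodes can be scattered in memory, unlike the contiguous elements of an array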
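The trade-off between cheap insertion and element access can be sketched with an explicit node class, closer to how a singly-linked list would look in C/C++. The names `Node`, `insert_after`, `nth`, and `to_list` are hypothetical and chosen for this illustration.

```python
class Node:
    """Hypothetical singly-linked node: a value plus a reference to the next node."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def insert_after(node, value):
    """Splice a new node in after `node`: constant work, no shifting."""
    node.next = Node(value, node.next)


def nth(head, i):
    """Reach element i (i >= 0) by following i links; the work grows with i."""
    node = head
    for _ in range(i):
        if node is None:
            raise IndexError('index out of range')
        node = node.next
    if node is None:
        raise IndexError('index out of range')
    return node.value


def to_list(head):
    """Collect the values into a Python list for display."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


head = Node(1, Node(2, Node(4)))
insert_after(head.next, 3)   # insert 3 after the node holding 2
print(to_list(head))         # [1, 2, 3, 4]
print(nth(head, 2))          # 3
```

A doubly-linked list would add a second reference per node pointing to the previous node, which allows traversal in both directions at the cost of more memory.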