Lecture 18

Data Structures III

Tuesday, November 5th 2019

Last time:

  • Iterators and Iterables
  • Trees, Binary trees, and BSTs

This time:

  • BST Traversal
  • Heaps
  • Start Generators

BST Traversal

  • We've stored our data in a BST
  • This seemed like a good idea at the time because BSTs have some nice properties
  • To be able to access/use our data, we need to be able to traverse the tree

Traversal Choices

There are three traversal choices based on an implicit ordering of the tree from left to right:

  1. In-order: Traverse left subtree, then current root, then right subtree
  2. Post-order: Traverse left subtree, then right subtree, then current root
  3. Pre-order: Current root, then traverse left subtree, then traverse right subtree
  • Traversing a tree means performing some operation
  • In our examples, the operation will be "displaying the data"
  • However, an operation could be "deleting files"
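
The three orders can be sketched in a few lines. This is a minimal illustration, not the course's own BST code: the `Node` class, the `visit` callback, and the example keys are all assumptions made for this sketch.

```python
class Node:
    """Hypothetical minimal tree node (not the course's Node class)."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def in_order(node, visit):
    """Left subtree, then current root, then right subtree."""
    if node is None:
        return
    in_order(node.left, visit)
    visit(node.key)
    in_order(node.right, visit)

def pre_order(node, visit):
    """Current root, then left subtree, then right subtree."""
    if node is None:
        return
    visit(node.key)
    pre_order(node.left, visit)
    pre_order(node.right, visit)

def post_order(node, visit):
    """Left subtree, then right subtree, then current root."""
    if node is None:
        return
    post_order(node.left, visit)
    post_order(node.right, visit)
    visit(node.key)

# A small BST:      4
#                 /   \
#                2     6
#               / \   / \
#              1   3 5   7
root = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))

for traverse in (in_order, pre_order, post_order):
    visited = []
    traverse(root, visited.append)   # the "operation" here is collecting keys
    print(traverse.__name__, visited)
# in_order   [1, 2, 3, 4, 5, 6, 7]
# pre_order  [4, 2, 1, 3, 6, 5, 7]
# post_order [1, 3, 2, 5, 7, 6, 4]
```

Note that the in-order traversal of a BST visits the keys in sorted order.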

Exercise 1

Heaps

We listed several types of data structures at the beginning of our data structures unit.

So far, we have discussed lists and trees (in particular binary trees and binary search trees).

Heaps are a type of binary tree, a little different from binary search trees.

Some Motivation: priority queues

  • People may come to your customer service counter in a certain order, but you might want to serve your VIPs first!
  • In other words, there is an "ordering" on your customers and you want to serve the most important customer first.
  • This problem requires us to sort things by importance and then process them in this sorted order.
  • A priority queue is a data structure for this, which allows us to do things more efficiently than simply sorting every time a new thing comes in.

In an ordinary queue, items are inserted at one end and deleted from the other end. In a priority queue, items are removed in order of their key instead.

The basic priority queue supports three primary operations:

  1. Insert: insert an item with "key" (e.g. an importance) $k$ into priority queue $Q$.
  2. Find Minimum: get the item whose key value is smaller than any other key in $Q$.
    • Note: Depending on implementation, this may also be Find Maximum.
  3. Delete Minimum: Remove the item with minimum $k$ from $Q$.
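
As an illustration of these three operations, here is a small sketch using Python's standard `heapq` module, which treats a plain list as a min-heap. The customer labels and key values are made up for the example, and a smaller key is assumed to mean higher importance.

```python
import heapq

queue = []  # heapq operates on an ordinary list

# Insert: push (key, item) pairs; tuples compare by their first element.
heapq.heappush(queue, (2, "regular customer"))
heapq.heappush(queue, (1, "VIP"))
heapq.heappush(queue, (3, "walk-in"))

# Find Minimum: the smallest entry always sits at index 0.
print(queue[0])              # (1, 'VIP')

# Delete Minimum: remove and return the smallest entry.
print(heapq.heappop(queue))  # (1, 'VIP')
print(queue[0])              # (2, 'regular customer')
```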

Comments on Implementation of Priority Queues

One could use an unsorted array and store a pointer to the minimum index; accessing the minimum is an $\mathcal{O}(1)$ operation.

  • It's cheap to update the pointer when new items are inserted into the array because we update it in $\mathcal{O}(1)$ only when the new value is less than the current one.
  • Finding a new minimum after deleting the old one requires a scan of the array ($\mathcal{O}(n)$ operation) and then resetting the pointer.
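
A rough sketch of this unsorted-array approach follows. The class name `UnsortedArrayPQ` and its method names are hypothetical, chosen only to mirror the three operations above.

```python
class UnsortedArrayPQ:
    """Hypothetical priority queue backed by an unsorted list."""

    def __init__(self):
        self.items = []
        self.min_index = None   # "pointer" to the current minimum

    def insert(self, key):
        # O(1): append, and move the pointer only if the new key is smaller.
        self.items.append(key)
        if self.min_index is None or key < self.items[self.min_index]:
            self.min_index = len(self.items) - 1

    def find_min(self):
        # O(1): just follow the pointer.
        if self.min_index is None:
            raise IndexError("priority queue is empty")
        return self.items[self.min_index]

    def delete_min(self):
        smallest = self.find_min()
        # Fill the gap with the last item so no shifting is needed.
        last = self.items.pop()
        if self.min_index < len(self.items):
            self.items[self.min_index] = last
        # O(n): rescan the whole array to reset the pointer.
        if self.items:
            self.min_index = min(range(len(self.items)),
                                 key=self.items.__getitem__)
        else:
            self.min_index = None
        return smallest

pq = UnsortedArrayPQ()
for key in [5, 3, 8]:
    pq.insert(key)
print(pq.find_min())    # 3
print(pq.delete_min())  # 3
print(pq.find_min())    # 5
```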

One could alternatively implement the priority queue with a balanced binary tree structure. Then insertion and deletion of the minimum each cost $\displaystyle\mathcal{O}\left(\log(n)\right)$!

This leads us to heaps. Heaps are a type of balanced binary tree.

  • A heap providing access to minimum values is called a min-heap
  • A heap providing access to maximum values is called a max-heap
  • Note that a single heap is ordered one way only: it serves as either a min-heap or a max-heap, not both

Priority queues are often implemented using heaps.

Heapsort

  • Sorting $n$ items by repeatedly extracting the minimum from an unsorted array (selection sort) takes $\displaystyle\mathcal{O}\left(n^{2}\right)$ operations
  • Doing the same with a heap takes $\mathcal{O}(n\log(n))$ operations

Implementing a sorting algorithm using a heap is called heapsort.

Heapsort can be done in place, in which case it needs only a constant amount of extra memory.

Note that there are many sorting algorithms nowadays. Python uses Timsort.
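
The idea behind heapsort can be sketched with the standard `heapq` module: build a heap, then repeatedly remove the minimum. Note that this simplified version is not the in-place variant; it copies the input and builds a second list. The function name `heapsort_sketch` is made up for this example.

```python
import heapq

def heapsort_sketch(values):
    """Return a sorted list by draining a min-heap (not in place)."""
    heap = list(values)   # copy so the caller's data is left untouched
    heapq.heapify(heap)   # rearrange the list into a min-heap
    # Each heappop removes the current minimum in O(log n).
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heapsort_sketch([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```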

Let's get back to heaps.

A heap has two properties.

  1. Shape property
    • A leaf node at depth $k>0$ can exist only if all the nodes at the previous depth exist. Nodes at any partially filled level are added "from left to right".
  2. Heap property
    • For a min-heap, each node in the tree contains a key less than or equal to either of its two children (if they exist).
      • This is also known as the labeling of a "parent node" dominating that of its children.
    • For a max-heap, a parent node must be greater-than-or-equal to its children.

Heap Mechanics

  • Heaps are a special binary tree that can be stored in arrays
    • This is more memory-efficient than the Node class and pointer logic used in BSTs
  • The first element in the array is the root key.
  • The next two elements make up the first level of children. This is done from left to right.
  • Then the next four and so on.

Note: With indexing that starts from $1$, if a parent node is at index $i$, then its children will be at indices $2i$ and $2i+1$.
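
To make the index arithmetic concrete, here is a small sketch that keeps indexing starting from $1$ by leaving slot 0 of a Python list unused. The helper names (`children`, `parent`, `is_min_heap`) and the placeholder convention are assumptions of this sketch.

```python
def children(i):
    """Indices of the two children of the node at index i."""
    return 2 * i, 2 * i + 1

def parent(i):
    """Index of the parent of the node at index i (for i > 1)."""
    return i // 2

def is_min_heap(heap):
    """Check the min-heap property; heap[0] is an unused placeholder."""
    n = len(heap) - 1                 # number of stored keys
    for i in range(2, n + 1):         # every node except the root
        if heap[parent(i)] > heap[i]:
            return False
    return True

# Tree:      1
#          /   \
#         3     2
#        / \
#       7   4
heap = [None, 1, 3, 2, 7, 4]

print(children(1))                   # (2, 3)
print(children(2))                   # (4, 5)
print(is_min_heap(heap))             # True
print(is_min_heap([None, 5, 3, 8]))  # False: the root 5 is larger than its child 3
```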

Question: What if the indexing starts from $0$?

Construct a Heap

To construct a heap, insert each new element that comes in at the left-most open spot.

This maintains the shape property but not the heap property.

Restore the Heap Property by "Bubbling Up"

Look at the parent and if the child "dominates" we swap parent and child. Repeat this process until the parent dominates or we have bubbled up to the root.

Identifying the dominant key is now easy because it will be at the top of the tree.

This process is called heapify and must also be done at the first construction of the heap.
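
Insertion followed by bubbling up might look like the following for a min-heap. This sketch assumes the list layout with an unused slot 0 so that the parent of index $i$ is at $i/2$ rounded down; the function name `heap_insert` is hypothetical.

```python
def heap_insert(heap, key):
    """Insert key into a min-heap stored in a list (heap[0] is unused)."""
    heap.append(key)          # left-most open spot = end of the array
    i = len(heap) - 1
    # Bubble up: swap with the parent while the child dominates (is smaller).
    while i > 1 and heap[i] < heap[i // 2]:
        heap[i], heap[i // 2] = heap[i // 2], heap[i]
        i //= 2

heap = [None]                 # placeholder only, no keys yet
for key in [5, 3, 8, 1]:
    heap_insert(heap, key)

print(heap)     # [None, 1, 3, 8, 5]
print(heap[1])  # 1, the dominant key sits at the root
```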

Deletion

Removing the dominant key creates a hole at the top (the first position in the array).

Fill this hole with the last element in the array, i.e. the rightmost leaf node on the deepest level.

This destroys the heap property!

So we now bubble this key down, swapping it with its dominant child at each step, until it dominates all its children.
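
Deletion with bubbling down can be sketched for a min-heap as follows. As before, this assumes a list with an unused slot 0, and `heap_delete_min` is a hypothetical name.

```python
def heap_delete_min(heap):
    """Remove and return the smallest key of a min-heap list (heap[0] unused)."""
    if len(heap) < 2:
        raise IndexError("heap is empty")
    smallest = heap[1]
    last = heap.pop()             # last array element = rightmost leaf
    n = len(heap) - 1             # number of keys that remain
    if n >= 1:
        heap[1] = last            # fill the hole at the root
        i = 1
        while 2 * i <= n:         # while node i has at least one child
            child = 2 * i
            if child + 1 <= n and heap[child + 1] < heap[child]:
                child += 1        # choose the smaller (dominant) child
            if heap[i] <= heap[child]:
                break             # heap property restored
            heap[i], heap[child] = heap[child], heap[i]
            i = child
    return smallest

heap = [None, 1, 3, 8, 5]
print(heap_delete_min(heap))  # 1
print(heap)                   # [None, 3, 5, 8]
```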

Exercise 2

Iterables/Iterators Again

We have been discussing data structures and simultaneously exploring iterators and iterables.

In [1]:
import reprlib  # needed by Sentence.__repr__ below

class SentenceIterator:
    def __init__(self, words): 
        self.words = words 
        self.index = 0
        
    def __next__(self): 
        try:
            word = self.words[self.index] 
        except IndexError:
            raise StopIteration() 
        self.index += 1
        return word 

    def __iter__(self):
        return self

class Sentence: # An iterable
    def __init__(self, text): 
        self.text = text
        self.words = text.split()
        
    def __iter__(self):
        return SentenceIterator(self.words)
    
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)

Example Usage

In [2]:
a = Sentence("Dogs will save the world and cats will eat it.")
for item in a:
    print(item)
Dogs
will
save
the
world
and
cats
will
eat
it.
In [3]:
print("\n")
it = iter(a) # it is an iterator
while True:
    try:
        nextval = next(it)
        print(nextval)
    except StopIteration:
        del it
        break

Dogs
will
save
the
world
and
cats
will
eat
it.

Every collection in Python is iterable.

We have already seen that iterators drive for loops. They are also used:

  • To loop over a file line by line from disk
  • In the making of list, dict, and set comprehensions
  • In unpacking tuples
  • In parameter unpacking in function calls (*args syntax)

An iterator defines both __iter__ and __next__ (the first is required only so that an iterator is itself an iterable).

Recap: An iterator retrieves items from a collection. The collection must implement __iter__.
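
The distinction between an iterable and an iterator can be checked directly on a built-in list. This is a small illustrative sketch; the variable names are arbitrary.

```python
numbers = [10, 20, 30]   # a list is an iterable, but not an iterator
it = iter(numbers)       # calling iter() on it produces an iterator

print(hasattr(numbers, "__next__"))  # False: the list itself has no __next__
print(hasattr(it, "__next__"))       # True
print(iter(it) is it)                # True: an iterator's __iter__ returns itself

# Unpacking also consumes an iterator.
first, *rest = it
print(first, rest)                   # 10 [20, 30]

# The iterator is now exhausted; the list is unaffected and can be iterated again.
print(list(it))                      # []
print(list(numbers))                 # [10, 20, 30]
```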

Generators

  • A generator function looks like a normal function, but yields values instead of returning them.
  • The syntax is (unfortunately) the same otherwise (PEP 255 -- Simple Generators).
  • A generator is a different beast from a function. When a function with a yield keyword in its body is called, it returns a generator; the body does not start running yet.
  • The generator is an iterator and gets an internal implementation of __iter__ and __next__.
In [4]:
def gen123():
    print("A")
    yield 1
    print("B")
    yield 2
    print("C")
    yield 3

g = gen123()

print(gen123, "   ", type(gen123), "   ", type(g))
          
In [5]:
print("A generator is an iterator.")
print("It has {} and {}".format(g.__iter__, g.__next__))
A generator is an iterator.
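It has <method-wrapper '__iter__' of generator object at 0x...> and <method-wrapper '__next__' of generator object at 0x...>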

Some notes on generators

  • When next is called on the generator, the function proceeds until the first yield.
  • The function body is now suspended and the value in the yield is then passed to the calling scope as the outcome of the next.
  • When next is called again, the generator's __next__ runs (implicitly), the body resumes right after the previous yield, and the next value is yielded.
  • This continues until we reach the end of the function body; returning from it raises StopIteration in next.

Any Python function that has the yield keyword in its body is a generator function.

In [6]:
print(next(g))
print("\n")
print(next(g))
print("\n")
print(next(g))
print("\n")
print(next(g))
A
1


B
2


C
3


---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-6-...> in <module>
      5 print(next(g))
      6 print("\n")
----> 7 print(next(g))

StopIteration: 

More notes on generators

  • Generators yield one item at a time
  • In this way, they feed the for loop one item at a time
In [7]:
for i in gen123():
    print(i, "\n")
A
1 

B
2 

C
3
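
Since a generator is an iterator, a generator function can stand in for a hand-written iterator class. As a sketch, the Sentence iterable from earlier could be rewritten so that __iter__ is a generator function, which removes the need for a separate SentenceIterator. The class name `SentenceGen` is made up here to avoid clashing with the earlier definition.

```python
import reprlib

class SentenceGen:  # An iterable whose __iter__ is a generator function
    def __init__(self, text):
        self.text = text
        self.words = text.split()

    def __iter__(self):
        # Each call creates a fresh generator, so the sentence
        # can be looped over more than once.
        for word in self.words:
            yield word

    def __repr__(self):
        return 'SentenceGen(%s)' % reprlib.repr(self.text)

s = SentenceGen("Dogs will save the world")
print(list(s))  # ['Dogs', 'will', 'save', 'the', 'world']
print(list(s))  # same result again: a new generator per loop
```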