Key Word(s): Dask, Python, Parallel Computing



AC295: Advanced Practical Data Science

Lecture 4: Dask

Harvard University
Spring 2020
Instructors: Pavlos Protopapas
Authors: Andrea Porelli and Pavlos Protopapas


Part 1: Scalable computing

1.1 Dask API

  • What makes Dask unique:
    • it lets you work with larger-than-memory datasets by parallelizing computation
    • it helps when data size grows to the point where even "simple" sorting and aggregating would otherwise spill onto persistent storage
    • it reduces the cost of moving to more complex infrastructure
    • it is easy to learn for data scientists with a Python background (similar syntax) and it is flexible
    • it brings distributed computing to data science projects, but its usefulness depends on dataset size:
      • small datasets: not of great help; complex operations already fit in memory without spilling to disk, so Dask would only add overhead
      • medium datasets: very useful; it lets you work with medium-sized data on a local machine. Python was not designed to make sharing work between processes on multicore systems particularly easy, so it is difficult to take advantage of parallelism within Pandas.
      • large datasets: essential; Pandas, NumPy, and scikit-learn were not built to operate on distributed datasets, and Dask helps scale these libraries (see the size guide below)
Dataset type     Size range          Fits in RAM?   Fits on local disk?
Small dataset    Less than 2–4 GB    Yes            Yes
Medium dataset   Less than 2 TB      No             Yes
Large dataset    Greater than 2 TB   No             No

Adapted from Data Science with Python and Dask

  • Dask consists of several different components and APIs, which can be categorized into three layers:
    • the scheduler: coordinates the execution of task graphs, distributing the work across threads, processes, or machines in a cluster;
    • low-level APIs: Delayed and Futures, which wrap arbitrary Python functions and objects so you can build custom task graphs;
    • high-level APIs: Arrays, Bags, and DataFrames, which mirror the NumPy and Pandas interfaces and generate their task graphs for you (see the sketch after the figure below).


Adapted from Data Science with Python and Dask
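To make these layers concrete, here is a minimal sketch (not part of the original lecture code) that builds a result through the high-level DataFrame API and through the low-level Delayed API; in both cases the scheduler layer executes the task graph when .compute() is called.

import pandas as pd
import dask.dataframe as dd
from dask import delayed

# high-level API: a Dask DataFrame mirrors the Pandas interface
pdf = pd.DataFrame({'x': range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)
high_level_total = ddf['x'].sum()        # lazy: only builds a task graph

# low-level API: delayed wraps arbitrary Python functions into graph nodes
lazy_total = delayed(sum)([delayed(lambda v: v + 1)(i) for i in range(10)])

# the scheduler layer runs both graphs
print(high_level_total.compute(), lazy_total.compute())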

1.2 Directed acyclic graphs (DAGs)

graph: a representation of a set of objects that have relationships with one another >>> good for representing a wide variety of information. A graph is composed of:

- *node*: a function, an object, or an action
- *line*: symbolizes a relationship between two nodes

directed acyclic graph: there is only one logical way to traverse the graph, and no node is visited twice.

cyclic graph: contains a feedback loop that allows you to revisit a node and repeat the actions within it.
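Under the hood, Dask encodes a DAG as a plain Python dict that maps node names to (function, arguments) tuples. A minimal sketch (assumed for illustration, not from the lecture), executed with the threaded scheduler:

from dask.threaded import get

def increment(i):
    return i + 1

def add(x, y):
    return x + y

# 'z' depends on 'x' and 'y'; there are no cycles, so this is a valid DAG
dag = {
    'x': 1,
    'y': (increment, 'x'),   # y = increment(x)
    'z': (add, 'x', 'y'),    # z = add(x, y)
}

print(get(dag, 'z'))  # traverses the graph once and returns 3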

handling computational resources: as the problem we solve requires more resources, we have two options:

- *scale up*: increase the size of the available resource, i.e. invest in more powerful technology; the downside is diminishing returns
- *scale out*: add more resources (Dask's main idea), i.e. invest in many cheaper machines; the downside is having to distribute the workload

concurrency: as the amount of work to be completed grows, some resources might not be fully exploited. For instance, some workers might sit idle because of insufficient shared resources (i.e. resource starvation). Schedulers handle this issue by making sure each worker gets sufficient resources.

failures:

- worker failures: a worker leaves, and its task has to be reassigned to another worker. This may slow down execution, but it does not affect work already completed (unlike data loss)
- data loss: an accident corrupts or destroys intermediate results, and the computation has to start over. The scheduler stops and reruns the whole process from the beginning.




1.3 Review Part 1

  • Dask can be used to scale popular Python libraries such as Pandas and NumPy, allowing you to analyse datasets larger than memory (e.g. >8 GB)
  • Dask uses directed acyclic graphs to coordinate execution of parallelized code across processors
  • Directed acyclic graphs are made up of nodes, with a clearly defined start and end, and can be traversed in only one logical direction (no looping).
  • Upstream nodes are completed before downstream nodes.
  • Scaling out (i.e. adding workers) can improve performance on complex workloads, but it creates overhead that can reduce the gains.
  • In case of failure, the steps leading to a node can be repeated from the beginning without disturbing the rest of the process.

Part 2: Introduction to DASK

  • Warm up with a short example of data cleaning using Dask DataFrames
  • Visualize the directed acyclic graphs generated by Dask workloads with graphviz
  • Explore how the scheduler uses DAGs to coordinate the execution of code
  • You will learn:
    • the Dask DataFrame API
    • how to use diagnostic tools
    • how to use the low-level Delayed API to create custom graphs

2.1 Exploratory data analysis with DASK

  • Set up environment and working directory
  • Load data
  • Check data quality issues (e.g. missing values and outliers)
  • Drop columns that are not useful for the analysis (i.e. columns with many missing values)

2.1.1 Set up environment and working directory

In [1]:
# import libraries
import sys
import os

## import dask libraries
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# import libraries
import pandas as pd
In [3]:
# assign working directory [CHANGE THIS]. The dataset is too large for GitHub, so download the files and point this path to your local copy
os.chdir('/nyc-parking-tickets')
cwd = os.getcwd()

# print 
print('', sys.executable)
print('', cwd)
 /Users/haibui/.pyenv/versions/3.7.7/bin/python3.7
 /Users/haibui/00_MIT_Harvard_CS_DS/harvard_data_science/daskdemo/notebook/nyc-parking-tickets

2.1.2 Load data

In [4]:
## read data using DataFrame API
df = dd.read_csv('Parking_Violations_Issued_-_Fiscal_Year_2017.csv')
df
Out[4]:
Dask DataFrame Structure:
Summons Number Plate ID Registration State Plate Type Issue Date Violation Code Vehicle Body Type Vehicle Make Issuing Agency Street Code1 Street Code2 Street Code3 Vehicle Expiration Date Violation Location Violation Precinct Issuer Precinct Issuer Code Issuer Command Issuer Squad Violation Time Time First Observed Violation County Violation In Front Of Or Opposite House Number Street Name Intersecting Street Date First Observed Law Section Sub Division Violation Legal Code Days Parking In Effect From Hours In Effect To Hours In Effect Vehicle Color Unregistered Vehicle? Vehicle Year Meter Number Feet From Curb Violation Post Code Violation Description No Standing or Stopping Violation Hydrant Violation Double Parking Violation
npartitions=33
int64 object object object object int64 object object object int64 int64 int64 int64 float64 int64 int64 int64 object object object object object object object object object int64 int64 object object object object object object float64 int64 object int64 object object float64 float64 float64
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: read-csv, 33 tasks

note that

  • metadata are shown in the frame instead of a data sample
  • the syntax is very similar to the Pandas API
  • npartitions is the number of splits used to break up the dataset. Dask splits the data into smaller Pandas DataFrame chunks, in this case ~64 MB each (i.e. dataset size / npartitions = 2 GB / 33). If we have one worker, Dask will cycle through the partitions one at a time.
  • data types are reported under each column name (similar to what Pandas shows, but here inferred from a sample of the data, since the data may be scattered across multiple physical machines). Good practice is to specify data types explicitly instead of relying on Dask's inference (and ideally to store data in a binary format; see the links on writing and reading Parquet). A short example of this follows below.
  • Dask Name reports the name of the DAG (here, read-csv)
  • # tasks is the number of nodes in the DAG. You can think of a task as a Python function; here reading the CSV produces one task per partition (33 tasks for 33 partitions), with each task reading the raw data, splitting it into a block, and initializing that partition's DataFrame object. Every task comes with some overhead (between 200 µs and 1 ms).


Partition process from Data Science with Python and Dask

Want to know more?
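As noted above, it is good practice to pass explicit data types to dd.read_csv instead of relying on inference; this also addresses warnings like the DtypeWarning about mixed-type columns that appears later. A hedged sketch; the column names and dtypes below are illustrative assumptions, not the dataset's full schema.

import dask.dataframe as dd

# illustrative (assumed) dtypes; extend with the columns you actually need
assumed_dtypes = {
    'Summons Number': 'int64',
    'Plate ID': 'object',
    'House Number': 'object',
}

df_typed = dd.read_csv('Parking_Violations_Issued_-_Fiscal_Year_2017.csv',
                       dtype=assumed_dtypes)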

2.1.3 Check data quality issue

In [5]:
# count missing values
missing_values = df.isnull().sum()
missing_values
Out[5]:
Dask Series Structure:
npartitions=1
Date First Observed    int64
Violation Time           ...
dtype: int64
Dask Name: dataframe-sum-agg, 100 tasks

note that

  • the Dask object created is a Series containing metadata, and the syntax is very similar to the Pandas API
  • processing hasn't been performed yet; instead, Dask prepared a DAG and stored it in the missing_values variable (the advantage is that the graph is built quickly, without waiting for any computation)
  • # tasks increased to 100 because two tasks (checking for missing values and summing) were added for each of the 33 partitions, plus a final aggregation across partitions: 33 + (2 × 33) + 1 = 100
In [6]:
# calculate percent missing values
mysize = df.index.size
missing_count = ((missing_values / mysize) * 100)
missing_count
Out[6]:
Dask Series Structure:
npartitions=1
Date First Observed    float64
Violation Time             ...
dtype: float64
Dask Name: mul, 169 tasks

note that

  • the Dask object created is a Series, and the computation hasn't been performed yet
  • df.index.size is a Dask object of type dask.dataframe.core.Scalar. You cannot access its value or length directly as you would with a list (e.g. with len()); doing so would require reading the whole dataset, which goes against the whole idea of Dask (see the short sketch below)
  • # tasks increased because computing df.index.size adds tasks for each partition, plus the division and multiplication steps
  • the data type has changed from int64 to float64. Dask converted it automatically because the output type of a division may not match the input type
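To make the point about the lazy Scalar concrete, a tiny sketch (not in the original notebook): only .compute() turns mysize into a plain number.

print(type(mysize))        # dask.dataframe.core.Scalar, not an int
n_rows = mysize.compute()  # triggers the read and returns a plain integer
print(n_rows)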
In [8]:
# run computations using compute method
with ProgressBar():
    missing_count_percent = missing_count.compute()
missing_count_percent
[##################################      ] | 86% Completed | 56.3s
/Users/haibui/.pyenv/versions/3.7.7/lib/python3.7/multiprocessing/pool.py:121: DtypeWarning: Columns (18,38) have mixed types.Specify dtype option on import or set low_memory=False.
  result = (True, func(*args, **kwds))
[########################################] | 100% Completed | 59.5s
Out[8]:
Summons Number                         0.000000
Plate ID                               0.006739
Registration State                     0.000000
Plate Type                             0.000000
Issue Date                             0.000000
Violation Code                         0.000000
Vehicle Body Type                      0.395361
Vehicle Make                           0.676199
Issuing Agency                         0.000000
Street Code1                           0.000000
Street Code2                           0.000000
Street Code3                           0.000000
Vehicle Expiration Date                0.000000
Violation Location                    19.183510
Violation Precinct                     0.000000
Issuer Precinct                        0.000000
Issuer Code                            0.000000
Issuer Command                        19.093212
Issuer Squad                          19.101506
Violation Time                         0.000583
Time First Observed                   92.217488
Violation County                       0.366073
Violation In Front Of Or Opposite     20.005826
House Number                          21.184968
Street Name                            0.037110
Intersecting Street                   68.827675
Date First Observed                    0.000000
Law Section                            0.000000
Sub Division                           0.007155
Violation Legal Code                  80.906214
Days Parking In Effect                25.107923
From Hours In Effect                  50.457575
To Hours In Effect                    50.457548
Vehicle Color                          1.410179
Unregistered Vehicle?                 89.562223
Vehicle Year                           0.000000
Meter Number                          83.472476
Feet From Curb                         0.000000
Violation Post Code                   29.530489
Violation Description                 10.436611
No Standing or Stopping Violation    100.000000
Hydrant Violation                    100.000000
Double Parking Violation             100.000000
dtype: float64

note that

  • the .compute() method is necessary to run the actions embedded in each node of the DAG
  • the result of the compute method is stored in a Pandas Series
  • ProgressBar() is a context manager that keeps track of running tasks and shows how much of the work has been completed
  • from a quick visual inspection we can see that some columns are largely incomplete and should be dropped
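Besides ProgressBar, dask.diagnostics also ships task and resource profilers, which cover the "diagnostic tools" learning goal. A hedged sketch (the visualize call renders a Bokeh plot, so it assumes bokeh, and psutil for ResourceProfiler, are installed):

from dask.diagnostics import Profiler, ResourceProfiler, visualize

# profile task execution and CPU/memory usage while the graph runs
with Profiler() as prof, ResourceProfiler() as rprof:
    missing_count.compute()

# render an interactive plot of both profiles
visualize([prof, rprof])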

2.1.4 Drop columns

In [9]:
# filter sparse columns (greater than 60% missing values) and store their names
columns_to_drop = missing_count_percent[missing_count_percent > 60].index
print(columns_to_drop)

# drop sparse columns
with ProgressBar():
    #df_dropped = df.drop(columns_to_drop, axis=1).persist()
    df_dropped = df.drop(columns_to_drop, axis=1).compute()
Index(['Time First Observed', 'Intersecting Street', 'Violation Legal Code',
       'Unregistered Vehicle?', 'Meter Number',
       'No Standing or Stopping Violation', 'Hydrant Violation',
       'Double Parking Violation'],
      dtype='object')
[#####################################   ] | 93% Completed |  1min  5.2s
/Users/haibui/.pyenv/versions/3.7.7/lib/python3.7/site-packages/dask/core.py:121: DtypeWarning: Columns (18,38) have mixed types.Specify dtype option on import or set low_memory=False.
  return func(*(_execute_task(a, cache) for a in args))
[########################################] | 100% Completed |  1min  6.2s

note that

  • the Pandas Index of column names can be used to drop columns from a Dask DataFrame because each partition is itself a Pandas DataFrame
  • in this local case the Index is simply made available to all threads; on a cluster it would be serialized and broadcast to all the worker nodes
  • the .persist() method stores intermediate computations in memory so they can be reused (see the sketch below)
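A small sketch of the .persist() alternative that is commented out in the cell above: the result stays a Dask DataFrame, but its partitions are computed once and cached in memory, so later operations do not re-read the CSV. The value_counts call is just an illustrative follow-up, not part of the original analysis.

with ProgressBar():
    df_dropped_persisted = df.drop(columns_to_drop, axis=1).persist()

# reuses the cached partitions instead of re-reading the file
df_dropped_persisted['Vehicle Year'].value_counts().compute()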

2.2 Visualize directed acyclic graphs (DAGs)

  • Dask uses the graphviz library to generate visual representations of the DAGs created by the scheduler
  • Use the .visualize() method to inspect the DAGs of DataFrames, Series, Bags, and Arrays
  • For simplicity we will use Dask Delayed objects instead of DataFrames, since DataFrame graphs grow quite large and become hard to visualize
  • delayed is a constructor that wraps functions to create Dask Delayed objects, each equivalent to a node in a DAG. By chaining delayed objects together, you build the DAG
  • Below are two examples: in the first you build a DAG with a single node that has dependencies, and in the second you build one with multiple nodes that have dependencies

2.2.1 Example 1: DAG with one node with dependencies

In [16]:
# import library
from dask import delayed
In [17]:
def increment(i):
    return i + 1

def add(x, y):
    return x + y

# wrap functions within dealyed object and chain
x = delayed(increment)(1)
y = delayed(increment)(2)
z = delayed(add)(x, y)

# visualize the DAG
z.visualize()
Out[17]:
In [18]:
# show the result
z.compute()
Out[18]:
5

note that

  • to build a node, wrap the function with delayed and then pass the function's arguments. You could also use delayed as a decorator (see the documentation and the sketch below)
  • circles symbolize functions and computations, while squares symbolize intermediate or final results
  • incoming arrows represent dependencies. The increment calls have no dependencies, while the add call has two; thus add has to wait until the objects x and y have been calculated
  • functions without dependencies can be computed independently, and a worker can be assigned to each of them
  • use the .visualize() method on the last node of the chain to peek at the DAG
  • Dask does not compute the DAG until asked. Use the .compute() method on the last node to see the result
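As mentioned above, delayed can also be used as a decorator. A small equivalent sketch; the *_d names are hypothetical, chosen to avoid shadowing the functions defined earlier.

from dask import delayed

@delayed
def increment_d(i):
    return i + 1

@delayed
def add_d(x, y):
    return x + y

z_d = add_d(increment_d(1), increment_d(2))
z_d.compute()   # returns 5, the same DAG as above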

2.2.2 Example 2: DAG with more than one node with dependencies

  • we are going to build a more complex DAG with two layers:
    • layer1 is built by looping over a list of data with a list comprehension to create Dask Delayed objects as "leaf" nodes. This layer combines the previously created increment function with the values in the list, then uses the built-in function sum to combine the results;
    • layer2 is built by looping over each object created in layer1, applying a new double function, and again combining the results with the built-in sum
In [19]:
data = [1, 2, 3, 4, 5]

# compile first layer and visualize
layer1 = [delayed(increment)(i) for i in data]
total1 = delayed(sum)(layer1)
total1.visualize()
Out[19]:
In [20]:
def double(x):
    return x * 2

# compile second layer and visualize
layer2 = [delayed(double)(j) for j in layer1]
total2 = delayed(sum)(layer2)#.persist()
total2.visualize()
Out[20]:
In [21]:
z = total2.compute()
z
Out[21]:
40

note that

  • built-in functions such as sum can be wrapped with delayed just like user-defined functions
  • intermediate results can be kept in memory with .persist() (commented out above); a persisted result is represented as a rectangle in the graph
In [22]:
## TASK
# visualize DAGs built from the DataFrame 
missing_count.visualize()
Out[22]:

2.3 Task scheduling

  • Dask performs so-called lazy computation. Remember, until you run the .compute() method, all Dask does is split the process into smaller logical pieces (which avoids loading the whole dataset)
  • Even though the process is fully defined, the resources assigned to it and the place where the results will be stored are not fixed in advance, because the scheduler assigns them dynamically. This allows Dask to recover from worker failures and network unreliability, and to cope with workers completing their tasks at different speeds.
  • Dask uses a central scheduler to orchestrate all this work. It splits the workload among different servers, which are unlikely to have exactly the same load, computing power, or access to the data. Under these conditions, the scheduler needs to react promptly to avoid bottlenecks that would affect the overall runtime.
  • For best performance, a Dask cluster should use a distributed file system (e.g. S3 or HDFS) to back its data storage. Assume there are two nodes, as in the image below, and the data is stored on only one of them. To perform computation on the other node, we have to move the data from one node to the other, creating overhead proportional to the size of the data. The remedy is to split the data so that the amount broadcast across machines is minimized.


Local disk from Data Science with Python and Dask

  • The Dask scheduler takes data locality into consideration when deciding where to perform computation (see the sketch below for choosing a scheduler).
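Which scheduler actually runs the graph can be chosen at compute time. A hedged sketch of the single-machine schedulers, plus the (commented-out) distributed scheduler, which is the one that takes data locality into account; the latter assumes dask.distributed is installed.

import dask

# run the graph on the local thread pool or process pool
missing_count.compute(scheduler='threads')
missing_count.compute(scheduler='processes')

# or set a default scheduler globally
dask.config.set(scheduler='threads')

# the distributed scheduler adds the locality-aware central scheduler
# from dask.distributed import Client
# client = Client()   # local cluster by default; pass an address for a real one
# missing_count.compute()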

2.4 Review Part 2

  • The beauty of this code is that you can reuse it on one machine or on a thousand
  • The similar syntax eases the transition from Pandas to Dask (and is good in general for refactoring your code)
  • Dask DataFrames parallelize the popular Pandas DataFrame, allowing you to clean and analyze large datasets
  • Dask parallelizes Pandas DataFrames and, more generally, any work expressed as a DAG
  • Computations are structured by the task scheduler using DAGs
  • Computations are constructed lazily and executed with the .compute() method
  • Use the .visualize() method to get a visual representation of the DAG
  • Computations can be persisted in memory to avoid the slowdown of recomputing results
  • Data locality helps minimize network and I/O latency

Part 3: Exercise with DASK

3.1 Learn how to manipulate structured data

  • Build a Dask DataFrame from a Pandas DataFrame
In [23]:
# create lists with actions
action_IDs = [i for i in range(0,10)]
action_description = ['Clean dish', 'Dress up', 'Wash clothes','Take shower','Groceries','Take shower','Dress up','Gym','Take shower','Movie']
action_date = ['2020-1-16', '2020-1-16', '2020-1-16', '2020-1-16', '2020-1-16', '2020-1-17', '2020-1-17', '2020-1-17', '2020-1-17', '2020-1-17']

# store list into Pandas DataFrame
action_pandas_df = pd.DataFrame({'Action ID': action_IDs, 
                                 'Action Description': action_description, 
                                 'Date': action_date},
                                 columns=['Action ID', 'Action Description', 'Date'])
action_pandas_df
Out[23]:
Action ID Action Description Date
0 0 Clean dish 2020-1-16
1 1 Dress up 2020-1-16
2 2 Wash clothes 2020-1-16
3 3 Take shower 2020-1-16
4 4 Groceries 2020-1-16
5 5 Take shower 2020-1-17
6 6 Dress up 2020-1-17
7 7 Gym 2020-1-17
8 8 Take shower 2020-1-17
9 9 Movie 2020-1-17
In [24]:
# convert Pandas DataFrame to a Dask DataFrame
action_dask_df = dd.from_pandas(action_pandas_df, npartitions=3)
In [25]:
action_dask_df
Out[25]:
Dask DataFrame Structure:
Action ID Action Description Date
npartitions=3
0 int64 object object
4 ... ... ...
8 ... ... ...
9 ... ... ...
Dask Name: from_pandas, 3 tasks
In [26]:
# info Dask Dataframe
print('', action_dask_df.divisions)
print('<# partitions>', action_dask_df.npartitions)
 (0, 4, 8, 9)
<# partitions> 3
In [27]:
# count rows per partition
action_dask_df.map_partitions(len).compute()
Out[27]:
0    4
1    4
2    2
dtype: int64
In [28]:
# filter entries in dask dataframe
print('\n ')
action_filtered = action_dask_df[action_dask_df['Action Description'] != 'Take shower']
print(action_filtered.map_partitions(len).compute())

print('\n ')
action_filtered_reduced = action_filtered.repartition(npartitions=1)
print(action_filtered_reduced.map_partitions(len).compute())
 
0    3
1    3
2    1
dtype: int64

 
0    7
dtype: int64

3.2 Summary of Part 3 and Dask limitations

  • Dask DataFrames are immutable; functions such as pop and insert are not supported
  • functions that require a lot of shuffling, such as stack/unstack and melt, are not supported
  • limit such operations to after major filtering and preprocessing
  • join, merge, groupby, and rolling are supported but expensive due to shuffling
  • reset_index restarts the sequential count in each partition
  • apply and iterrows are known to be inefficient in Pandas, and the same holds for Dask
  • use the .divisions attribute to inspect how a DataFrame has been partitioned
  • for best performance, partitions should be roughly equal in size; use the .repartition() method to rebalance them
  • for best performance, sort by a logical column, partition by the index, and make sure the index is presorted (see the sketch below)
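A short sketch of the last two recommendations, reusing action_dask_df from Part 3: set_index sorts by a logical column so the divisions are known, and repartition rebalances partition sizes (npartitions=2 is just an illustrative choice).

# sort by a logical column and partition by that index
action_indexed = action_dask_df.set_index('Date')
print(action_indexed.divisions)

# rebalance into roughly equal partitions
action_balanced = action_indexed.repartition(npartitions=2)
print(action_balanced.map_partitions(len).compute())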