Key Word(s): TOPIC_1, TOPIC_2, TOPIC_3



Lecture 4: Dask

Dask is a useful tool 1. Add general description.

Dask is a useful tool 2. Add infrastructure/platform description. Link to the Dask documentation, why it is cool, and the Dask developer community on Twitter.

Dask is a useful tool 3. a) What you can do, b) what other advanced things people do.

We'll go over the basics of some Dask services, but we should point out that a lot of talented people have given tutorials; check them out.


Table of Contents

ADD TO LECTURE PRESENTATION

<1.1 Why Dask>

<1.2 Cooking with DAGs>

<1.3 Scaling out, concurrency, and autorecovery>

<2.1 Hello Dask: A first look at the DataFrame API>

<2.2 Visualizing DAG>

<2.3 Task Scheduling>

<3.1 Why to use DataFrames>

<3.2 Dask and Pandas: DataFrame partitioning and shuffle>

<3.3 Limitations>

GO TO NOTEBOOK (MAYBE ONE SUMMARY SLIDE AND JUMP INTO NOTEBOOK)

<4.1 Read data, datasets, and datatypes>

<4.2 Reading from relational database, HDFS, and S3>

<5.1 Indexes, selecting, and dropping> <5.1.1 Filtering and Reindexing>

<5.2 Joining, concatenating, and unioning> <5.2.1 Recording data>

<5.3 Elementwise operations>

<5.4 Dealing with missing values>

<6.1 Descriptive Statistics>

<6.2 Built-In aggregate functions>

<6.3 Custom aggregate functions>

<6.4 (Rolling window) functions>

<7.1 Prepare-reduce-collect-plot>

<7.2 Continuous and categorical>

<7.3 Density plots>

<7.4 Random samples>

<7.5 Heatmap>

SKIP NEXT OR THIS IS THE MOST INTERESTING?

SHOW CASE NOTEBOOK IN CLASS

<9.1 Reading and parsing unstructured data with Bags> <9.1.1 Selecting and viewing data from a Bag> <9.1.2 Common parsing issue> <9.1.3 Delimiters>

<9.2 Transforming, filtering, and folding> <9.2.1 Map, filter, and aggregate functions (foldby)> <9.2.2 Building Arrays from bags> <9.2.3 Summary stats on bags: parallel text analysis> <9.2.3.1 Bigrams> <9.2.3.2 Tokens and filtering stopwords> <9.2.3.3 Analyze bigrams>

<10.1 Linear Models with Dask-ML>

<10.2 Evaluating and tuning Dask-ML models>

<10.3 Persisting Dask-ML models>

EXERCISE: YOU LEARNED IT, NOW DO IT. REPRODUCE A PAPER FROM READING LIST

<11.1 Building a Dask cluster on Amazon AWS with Docker> <11.1.1 to 11.1.7>

<11.2 Running and monitoring Dask jobs on a cluster>

<11.3 Cleaning up the Dask clusters on AWS>

Part 1: Scalable computing

If you ask different people, you'll get different answers, but one commonality is that most people don't realize that, even though these services come with costs (both monetary and in training time), they provide great resources that social scientists should start exploring themselves. Here are some highlights:

  • You can use them “anytime, anywhere”: public cloud users can access cloud services almost all the time and keep their data stored safely in the provider's infrastructure.
  • You won't need to plan far ahead for provisioning: public cloud users can draw on seemingly unlimited computing and storage resources, available on demand. In this way, the user can offload some problems, such as maintaining both hardware and software, to the service provider.
  • You can buy what you need, when you need it: the public cloud lets you use services without any up-front commitment.
  • You can collaborate as a team: the public cloud allows you to share data and collaborate more easily.

Public cloud basics

Cloud computing refers to both the applications delivered as services over the Internet and the hardware and systems software in the datacenters that provide those services. The datacenter hardware and software is what we will call a Cloud. When a Cloud is made available to the public (through pay-as-you-go services), it is called a Public Cloud. The service sold from the datacenter to the provider, such as Microsoft, Google, Amazon, or IBM (the datacenter owner and the provider may or may not be the same entity), is called utility computing [1], and the service sold from the provider to the user is what we will refer to as a web application or, more generally, a service. Current examples of public computing services include Amazon Web Services, Google App Engine, and Microsoft Azure. The alternative to a public deployment is a private one: a Private Cloud refers to the internal datacenters of a business or other organization that are not made available to the public. The figure below shows the roles of people as users or providers of the different layers of the Cloud.

cloud_basics_from_article_1

Image adapted from Armbrust et al., 2009

The National Institute of Standards and Technology (NIST) defines cloud as [2]:

“a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”

Cloud computing has 5 essential characteristics:

  1. Measured Service
  2. Rapid Elasticity
  3. Broad Network Access
  4. Resource Pooling
  5. On Demand Self Service

You can think about cloud computing as a composition of three different service models:

Iass_Paas_SaaS_image

  • Infrastructure as a Service (IaaS) is a cloud of resources provided to users. IaaS provides the basic functionality of storage and computing as a service, consisting of network servers, storage systems, networking equipment, and data centers.
  • Platform as a Service (PaaS) is a development environment provided as a service, on top of which end users can build higher-level services. PaaS offers virtualized servers on which multiple users can run existing applications or develop new ones, without having to worry about maintaining operating systems, server hardware, load balancing, or computing capacity.
  • Software as a Service (SaaS) is an application offered to consumers over the Internet. A single instance of the service runs in the cloud and serves multiple end users. With SaaS, you submit data through a software interface and the data is processed remotely, without your computer doing the work. SaaS thus relieves clients of concerns about storage, servers, application development, and related common IT issues.

In this tutorial we are going to show you examples that encompass PaaS, when you build the experiment using the two tutorials, and SaaS, when you learn how to use the cloud services in the three guides.

Cloud pros and cons

Overall, the advantages of using a cloud outweigh the disadvantages. When deciding to build your own application (in this series of guides, the experiment), it is important to consider two aspects: cloud security and cloud service accuracy. Regarding the first, we recommend making sure that data stored in the cloud meets the requirements of your institution and research, and checking the guarantees offered by the chosen provider. Regarding the second, we suggest considering whether the accuracy of the algorithm provided by the service is adequate for the purpose of your research, although this falls outside the scope of this workshop. Naeem et al. [3] discuss some of the advantages and disadvantages of cloud computing. We reproduce them from their study in the table below:

| Disadvantages | Advantages |
|---|---|
| Security in the Cloud | Almost Unlimited Storage |
| Technical Issues | Quick Development |
| Non-Interoperability | Cost Efficient |
| Dependency and vendor lock-in | Automatic Software Integration |
| Internet Required | Shared Resources |
| Less Reliability | Easy Access to Information |
| Less management | Mobility |
| Raised Vulnerability | Better Hardware Management |
| Prone to Attack | Backup and Recovery |

Getting started: first access to Azure

Create your Azure free account

To access Azure cloud computing services, you will have to sign up for an Azure free account if you don’t already have one. If you do not have a Microsoft account either, you will be asked to create one; otherwise, enter your Outlook account (e.g. your_email_address@outlook.com). To create your Azure account you will be asked to provide your credentials as well as a credit card. The card will not be charged unless you exceed the credit provided with the one-month trial [4]. We recommend cancelling the account after the first month if you are not interested in the service.

Follow the next steps to set up a free account:

free_account

  • Click on start free

start_free

  • Create an Azure account: choose name, set password, add security info, add credit/debit card information

create_azure_account

Log in to the Azure Dashboard and create your first service

Once you have created your Azure free account, you just need to go to the Azure portal and log in using your credentials.

  • Go to https://portal.azure.com/ and sign in to your account to access the Azure Dashboard (note: familiarize with the relevant services?)

select_account

You are now ready to deploy your first public cloud service! Follow the next steps:

  • Click on create a resource

create_resource

  • In the search bar, type the name of the service you want to subscribe to. For convenience, we show the Storage Account that we will use in the next guide. Type storage account in the bar and then press Enter

select_resource

  • You will be directed to a view containing a short description of the service, as well as links to the documentation and pricing information. Click on create to start deploying the service.

create_storage_account

  • Next, enter the following information and click on create once finished:
    • Account name: enter a lowercase and globally unique name (e.g. "mycloudstorageplayground")
    • Deployment model: click on Resource manager
    • Account kind: Storage v1
    • Location: East US
    • Replication: Locally-redundant storage (LRS)
    • Performance: click on Standard
    • Secure transfer required: Enabled
    • Subscription: Free Trial
    • Select server region: Eastus
    • Resource Group: create a new entering name (e.g. myresourcegroup)
    • Virtual networks: click on Enabled
    • Pin to dashboard (optional): [x]

storage_setup

  • Once the service is deployed, you will see a white box with your storage account's name on your Dashboard. Click on the box with your storage_account_name to access the storage account interface.

storage_account_dashboard

  • The storage account interface shows a summary of the settings defined in the previous steps, along with other utilities. In the top right box you can see the region from which your service is deployed, the type of storage you have chosen, and the replication option you selected (i.e. Locally Redundant Storage, which keeps multiple copies of your data within a single data center). You can also find the ID of your subscription, which is erased in the image below for privacy purposes.

storage_account_interface
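
The account name entered during setup has to follow Azure's naming rules before the portal accepts it. As a minimal sketch, the check below encodes the rule that a storage account name is 3 to 24 characters long and uses only lowercase letters and digits. The function name is hypothetical and not part of any Azure SDK, and a local check like this cannot tell you whether the name is globally unique; only the portal can confirm that.

```python
import re

# Azure storage account names: 3 to 24 characters, lowercase letters and digits only.
_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name):
    """Hypothetical helper: check the format of a storage account name locally."""
    return _NAME_PATTERN.fullmatch(name) is not None


# Try a few candidate names before typing them into the portal.
for candidate in ["mycloudstorageplayground", "My_Storage", "ab"]:
    print(candidate, is_valid_storage_account_name(candidate))
```

The example name used in this guide, "mycloudstorageplayground", is exactly 24 characters long, so it sits right at the upper limit.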

Create API key to use Microsoft Azure service

We have shown you how to log in to the Azure portal and how to create a Storage Account; now it is time to retrieve the key necessary to use it. We will use the key in the next guide to make requests to Microsoft Azure through an Application Programming Interface (API). An API functions as an intermediary that allows two applications to talk to each other, in our case our software and the Azure service. The API key allows Azure to identify your subscription and to bill it (your account will not be billed unless you switch your free account to a pay-as-you-go account).

  • To retrieve your Storage Account key, start by going to the dashboard and clicking on the box with your storage_account_name. Then, click on Access keys in the side bar.
  • Copy the storage account name and key1 by clicking on the icon on the left, and paste them in the script below.

access_and_keys
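
To make the role of the key more concrete, here is a minimal sketch of how a key typically travels with a request. For storage, the account name and key are combined into a connection string in the same format shown on the Access keys page; for the Cognitive Services REST endpoints, the key is sent in the `Ocp-Apim-Subscription-Key` HTTP header. Both function names are hypothetical helpers written for this illustration, not part of any Azure SDK, and the values are placeholders.

```python
# Hypothetical helpers: the names are illustrative, not part of any Azure SDK.

def storage_connection_string(account_name, account_key):
    """Build a storage connection string from an account name and key."""
    if not account_name or not account_key:
        raise ValueError("account name and key must both be non-empty")
    return (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};"
        f"AccountKey={account_key};"
        "EndpointSuffix=core.windows.net"
    )


def subscription_key_header(api_key):
    """Return the HTTP header that carries a Cognitive Services key."""
    return {"Ocp-Apim-Subscription-Key": api_key}


# Placeholder values only; never commit a real key to a notebook or repository.
conn = storage_connection_string("mycloudstorageplayground", "PLACEHOLDER_KEY")
headers = subscription_key_header("PLACEHOLDER_KEY")
print(conn)
print(headers)
```

Anyone who holds the key can act on your subscription, so treat it like a password.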

Script: create a set of keys for using Azure services

In the next guides, we are going to explore several Azure services. We recommend that you create all the services in the list below and save the name you give to each service, together with its primary key, in the cell below. Here is a complete list of the services that you should create:

  • Storage Account
  • Face
  • Computer Vision
  • Bing Speech Recognition
  • Text Analytics

When looking for a service, we recommend clicking on the Create a resource button and copying each service name into the search bar as shown before. This saves you from having to look for each service one at a time, and spares you some headaches from navigating through the myriad of services available. Run the cell when you are done, and a file with your keys will be automatically generated and stored in the folder public_cloud_computing/guides/keys.

In [ ]:
###############################################################
# copy and paste your services' account name and primary  key #
###############################################################

# STORAGE_ACCOUNT
STORAGE_ACCOUNT_NAME = '' #add your account name
STORAGE_ACCOUNT_API_KEY = '' #add your account key1

# COGNITIVE_SERVICES_FACE_ACCOUNT
FACE_ACCOUNT_NAME = ''
FACE_API_KEY = ''

# COGNITIVE_SERVICES_COMPUTER_VISION_ACCOUNT
COMPUTER_VISION_NAME = ''
COMPUTER_VISION_API_KEY = ''

# SPEECH_RECOGNITION_ACCOUNT
SPEECH_RECOGNITION_NAME = ''
SPEECH_RECOGNITION_KEY = ''

# TEXT_ANALYTICS_ACCOUNT
TEXT_ANALYTICS_NAME = ''
TEXT_ANALYTICS_API_KEY = '' 

#run this cell to write a copy of your Azure services information (NAME and API's key)
#write a dictionary
azure_services_keys = {'STORAGE': {'NAME': STORAGE_ACCOUNT_NAME, 'API_KEY': STORAGE_ACCOUNT_API_KEY}, 
                       'FACE': {'NAME': FACE_ACCOUNT_NAME, 'API_KEY': FACE_API_KEY},
                       'COMPUTER_VISION': {'NAME': COMPUTER_VISION_NAME, 'API_KEY': COMPUTER_VISION_API_KEY}, 
                       'SPEECH_RECOGNITION': {'NAME': SPEECH_RECOGNITION_NAME, 'API_KEY': SPEECH_RECOGNITION_KEY},
                       'TEXT_ANALYTICS': {'NAME': TEXT_ANALYTICS_NAME, 'API_KEY': TEXT_ANALYTICS_API_KEY}}

#dump the dictionary to a file saved in the folder < /guides/keys >
#import modules
import json
import os
#create the keys folder if it does not exist yet
os.makedirs("../keys", exist_ok=True)
#open a .json file in text mode and write the dictionary with all your keys as JSON
with open("../keys/azure_services_keys.json", 'w') as f:
    json.dump(azure_services_keys, f, indent=4)
    
################################
# run this cell once completed #
################################
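
Before moving on, it can help to confirm that no service was left blank. The sketch below is one possible check, assuming a dictionary with the same layout as `azure_services_keys` (a service name mapped to a `NAME` and an `API_KEY`). The function `missing_credentials` and the example values are hypothetical and use placeholders rather than real keys.

```python
def missing_credentials(services):
    """Return the sorted names of services whose NAME or API_KEY is empty."""
    return sorted(
        service
        for service, info in services.items()
        if not info.get("NAME") or not info.get("API_KEY")
    )


# Placeholder data with the same layout as azure_services_keys.
example_services = {
    "STORAGE": {"NAME": "mycloudstorageplayground", "API_KEY": "PLACEHOLDER_KEY"},
    "FACE": {"NAME": "", "API_KEY": ""},
    "TEXT_ANALYTICS": {"NAME": "mytextservice", "API_KEY": ""},
}

print(missing_credentials(example_services))  # → ['FACE', 'TEXT_ANALYTICS']
```

Calling `missing_credentials(azure_services_keys)` after filling in the cell above should return an empty list once every service has both a name and a key.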

Recap

What you have learnt

  • What the cloud is and what its advantages are
  • How to access the Azure portal
  • How to deploy a public cloud service

What you will learn in the next guide

  • How to use public cloud services:
    • What cloud storage is
    • How to access Azure cloud storage using the Storage Explorer UI and the Python SDK
    • How to create a BLOB container and upload a BLOB (Binary Large Object, e.g. image, audio, etc.)

Question for you

  • Now that you know more about cloud, what do you think about it?
  • When would it be useful in your work or research?
Footnotes
  • [1] Armbrust et al., 2009. Above the Clouds: A Berkeley View of Cloud Computing
  • [2] Peter Mell and Timothy Grance, 2011. The NIST Definition of Cloud Computing: Recommendations of the National Institute of Standards and Technology
  • [3] Naeem et al., 2016. Cluster Computing vs Cloud Computing: a comparison and overview
  • [4] When you sign up for a free account you receive 200 dollars of credit, valid for 30 days, to try pay-as-you-go cloud services, plus a free account for a year. Once you exceed the 200 dollars or the 30-day free trial expires, you will be asked to upgrade your subscription.
In [ ]:
#import library to display notebook as HTML
import os
from IPython.display import HTML

#path to the .css style file, two folders above the current working directory
new_path = os.path.join(os.pardir, os.pardir, 'styles', 'custom_styles_public_cloud_computing.css')

#function to display notebook
def css():
    #read the stylesheet and close the file when done
    with open(new_path, "r") as f:
        style = f.read()
    return HTML(style)
In [ ]:
#run this cell to apply HTML style
css()