CS109A Introduction to Data Science

Lecture 1: Example

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner


Title

Hub data, part 1: Reading data, examining them, and formulating questions.

Description

Introduction: Hubway was metro-Boston’s public bike share program, with more than 1600 bikes at 160+ stations across the Greater Boston area. Hubway was owned by four municipalities in the area.

By 2016, Hubway operated 185 stations and 1750 bicycles, with 5 million rides since launching in 2011.

The Data: In April 2017, Hubway held a Data Visualization Challenge at the Microsoft NERD Center in Cambridge, releasing 5 years of trip data.

The Question: What does the data tell us about the ride share program?

The original question, ‘What does the data tell us about the ride share program?’, is a reasonable slogan to promote a hackathon, but it is too vague to guide a scientific investigation.

Before we can refine the question, we have to look at the data!

Note: Here we switch the order of the "data science process": we look at the data before refining the question.

In [1]:
import sys
import zipfile
import datetime
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from math import radians, cos, sin, asin, sqrt
from sklearn.linear_model import LinearRegression

sns.set(style="ticks")
%matplotlib inline
In [13]:
import os

DATA_HOME = os.getcwd()
if 'ED_USER_NAME' in os.environ:
    DATA_HOME = '/course/data'

HUBWAY_STATIONS_FILE = os.path.join(DATA_HOME, 'hubway_stations.csv')
HUBWAY_TRIPS_FILE = os.path.join(DATA_HOME, 'hubway_trips_sample.csv')
In [14]:
hubway_data = pd.read_csv(HUBWAY_TRIPS_FILE, index_col=0, low_memory=False)
hubway_data.head()
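
Beyond `head()`, a quick look at the table's size and its missing values often helps before summarizing. The following is a minimal sketch on a small made-up table. The column names `duration`, `start_date`, and `gender` and all values are hypothetical stand-ins, not necessarily what the Hubway file contains.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the trips file (hypothetical columns and values).
toy_csv = io.StringIO(
    "id,duration,start_date,gender\n"
    "0,540,7/28/2011 10:12:00,Male\n"
    "1,1200,7/28/2011 10:21:00,\n"
    "2,300,7/28/2011 10:33:00,Female\n"
)

# index_col=0 uses the first column as the row index, as in the cell above.
toy = pd.read_csv(toy_csv, index_col=0)

n_rows, n_cols = toy.shape             # number of rows and columns
missing_per_column = toy.isna().sum()  # count of missing entries per column

print(n_rows, n_cols)
print(missing_per_column)
```

The same two calls, `hubway_data.shape` and `hubway_data.isna().sum()`, should work unchanged on the real table.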

Basic Summaries

In [15]:
hubway_data.describe()
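
On a table with mixed column types, `describe()` by default summarizes only the numeric columns, so text columns are silently left out. A small sketch of the difference, using a made-up table whose column names (`duration`, `gender`) are assumptions for illustration:

```python
import pandas as pd

# Made-up data: one numeric column and one text column (hypothetical names).
toy = pd.DataFrame(
    {
        "duration": [540, 1200, 300, 760],
        "gender": ["Male", "Female", "Female", "Male"],
    }
)

numeric_summary = toy.describe()           # numeric columns only
full_summary = toy.describe(include="all")  # adds count/unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column (for example the mean of a text column) appear as `NaN` in the combined summary.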

What Type Of

In [16]:
hubway_data.dtypes
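
Dates read from a CSV are usually stored as plain strings unless they are parsed explicitly, which shows up in `dtypes`. A hedged sketch of converting such a column with `pd.to_datetime`; the column name `start_date` and the date format are assumptions and may not match the Hubway file:

```python
import pandas as pd

# Made-up timestamps stored as text (hypothetical column name and format).
toy = pd.DataFrame(
    {"start_date": ["7/28/2011 10:12:00", "7/28/2011 17:45:00"]}
)

# Parse the strings into datetime64 values using an explicit format.
toy["start_date"] = pd.to_datetime(toy["start_date"], format="%m/%d/%Y %H:%M:%S")

# Once parsed, date parts are available through the .dt accessor.
start_hours = toy["start_date"].dt.hour.tolist()

print(toy.dtypes)
print(start_hours)
```

Having real datetimes makes questions about time of day or day of week straightforward to ask later.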

Go to the Part 1 quiz and enter your questions. Once you are done, return to the main room.
