CS109A Introduction to Data Science

Standard Section 2: Prediction using kNN and Linear Regression

Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Section Leaders: Cecilia Garraffo, Mehul Smriti Raje, Ken Arnold, Karan Motwani


In [4]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML
styles = requests.get("http://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[4]:

For this section, our goal is to get you familiarized with k-Nearest Neighbors and Linear. These methods find powerful applications in all walks of life and are centered around prediction.

Specifically, we will:

1. Review Basic Python Data Structures
2. Import Data and Manipulates Rows and Columns
3. Load in the Bikeshare dataset which is split into a training and testing dataset
3. Do some basic exploratory analysis of the dataset and go through a scatterplot
5. Write out the algorithm for kNN WITH AND WITHOUT using the sklearn package
6. Learn to use the sklearn package for Linear Regression.
7. What is and how to extract information about Confidence Intervals.

For this section we will be using the following packages:

In [ ]:
#Check Python Version
import sys
assert(sys.version_info.major==3),print(sys.version)

#Matrices, Dataframe and Plotting Operations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

#Model Packages for k-NN and Linear Regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from statsmodels.api import OLS
import statsmodels.api as sm

#Metrics, Performance Evaluation and Helpful fucntions
from sklearn import metrics, datasets
from collections import Counter
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

#Aesthetic settings
from IPython.display import display
pd.set_option('display.max_rows', 999)
pd.set_option('display.width', 500)
pd.set_option('display.notebook_repr_html', True)

Review – Python : List, Dictionary Comprehensions and Zip

**Exercise**
In [ ]:
#Makes a list of all the even numbers from 0 to 20 with a For Loop

##Your Code Here

Solution

In [ ]:
# %load sol1
**Exercise**
In [ ]:
#Create a dictionary with the list A elements as keys and list B elements 
#as values using Zip and Dictionary Comprehension

list_a = ['a','b','c','d','e']
list_b = [1,2,3,4,5]

##Your Code Here
In [ ]:
# %load sol2

Working with Dataframes

Importing Data

In [ ]:
#Import Data
data_path = 'data/states.csv'
states = pd.read_csv(data_path, index_col=0)
states.head()

Data describing the population per state from 1900 to 2010

Select Specific Columns

**Exercise**
In [ ]:
#Select by Name : Select only '1930','1940','1950','1960' columns.

#Select by Index : Select only the first 3 columns

#Select by List : Select the following list of columns
columns_list = ['1920','1840']

Solution

In [ ]:
# %load sol3.py

Learn to "display", always!

In the above statements, we use the display function which can be imported using from IPython.display import display. It provides a very aesthetically pleasing representation of the data with great demarcation for row and column information. Let's see a comparison below :

In [ ]:
print(states[columns_list].head())
In [ ]:
display(states[columns_list].head())

Select Specific Rows

**Exercise**
In [ ]:
#Filter out by First N Rows
n = 25

#Select by Index, rows 5 to 10.

#Select by List of Indices below
row_list = np.arange(1,25,2)
print(row_list)

Solution

In [ ]:
# %load sol4.py

Load in the Bikeshare dataset and perform EDA:

The specific task is to build a regression model for a bike share system that can predict the total number of bike rentals in a given day, based on attributes about the day. Such a demand forecasting model would be useful in planning the number of bikes that need to be available in the system on any given day, and also in monitoring traffic in the city. The data for this problem was collected from the Capital Bikeshare program in Washington D.C. over two years.

The data set is provided in the file 'bikeshare.csv'. Each row in these files contains 10 attributes describing a day and its weather:

  • season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
  • month (1 through 12, with 1 denoting Jan)
  • holiday (1 = the day is a holiday, 0 = otherwise)
  • day_of_week (0 through 6, with 0 denoting Sunday)
  • workingday (1 = the day is neither a holiday or weekend, 0 = otherwise)
  • weather
    • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp (temperature in Celsius)
  • atemp (apparent temperature, or relative outdoor temperature, in Celsius)
  • humidity (relative humidity)
  • windspeed (wind speed)

and the last column 'count' contains the response variable, i.e. total number of bike rentals on the day.

In [ ]:
#Load the BikeShare dataset
bikeshare = pd.read_csv('data/bikeshare.csv')
del bikeshare['Unnamed: 0']
print("Length of Dataset:",len(bikeshare))
display(bikeshare.head())
In [ ]:
display(bikeshare.describe())

We can also use the groupby function to look at mean stats aggregated by month:

In [ ]:
bikeshare.groupby('month').mean()

Let's plot the variation of count with month? Is there a seasonal change?

In [ ]:
plt.plot(bikeshare.groupby('month').mean()['count'])
plt.xlabel('Month')
plt.ylabel('Count')
plt.title('Bikeshare Rental Count as a function of Month')
plt.show()

What is temp, a_temp, is there a difference?

In [ ]:
plt.plot(bikeshare['temp'], bikeshare['atemp'])
plt.xlabel('Temp')
plt.ylabel('A-Temp')
plt.title('A-Temp vs Temp')
plt.show()

What did we do wrong here? Why does the plot look like this?

Sorting! Whenever your plot makes zig-zag changes across the scale, it is because matplotlib is trying to connect the points sequentially from the top (using a line plot) and skipping across the scale when $x_{i+1}$ is lower than $x_{i}$. So let's sort.

In [ ]:
new = bikeshare.sort_values(['temp'])
plt.plot(new['temp'], new['atemp'])
plt.xlabel('Temp')
plt.ylabel('A-Temp')
plt.title('A-Temp vs Temp')
plt.show()

It still looks weird, why?

Let's have a closer look at the dataframe:

In [ ]:
display(new.head())

There are multiple atemp values for each temp value, which if not sorted will bounce around at the same x-value. Thus, we need to sort both axes simultaneously.

In [ ]:
new = bikeshare.sort_values(['temp','atemp'])
plt.plot(new['temp'], new['atemp'])
plt.xlabel('Temp')
plt.ylabel('A-Temp')
plt.title('A-Temp vs Temp')
plt.show()

By plotting efficiently, we found an anomaly we would have otherwise overlooked. It looks like there is a problem with the data around temp greater than 30 and atemp less than 10.

**Exercise**

Solution

In [ ]:
# %load sol5.py

Anomaly! atemp and temp are usually lineary related except at this one datapoint. Now, we get to make a judgement call as to whether we should keep the datapoint? We'll come back to this question after the lecture on Missing Data and Imputation. Worth a thought though.

In [ ]:
bikeshare= bikeshare.drop([188])
In [ ]:
display(bikeshare[(bikeshare['temp']>30) & (bikeshare['atemp']<10)])

Normalize

In general, it is good practice to normalize data before proceeding. As such, we can create the following functions:

In [ ]:
#Function to takes in a dataset and normalizes it
def normalize(x):
    #print(x.shape)
    #print(np.min(x).shape)
    num = x - np.min(x)
    denom = np.max(x) - np.min(x)
    return (num / denom)

#bikeshare_norm = normalize(bikeshare.iloc[:, 0:10])
bikeshare_norm = normalize(bikeshare)
display(bikeshare_norm.head())
bikeshare_norm.describe()
**Exercise**

Solution

In [ ]:
# %load sol6.py
In [ ]:
#Adapt the normalization function above to take column names as input

#Function to takes in a dataset and normalizes it
def normalize_cols(x, columns):
    for i in columns:  x[i] = normalize(x[i]) 
    return x

bikeshare_norm2= normalize_cols(bikeshare ,bikeshare.columns )
In [ ]:
display(bikeshare_norm2.head())
bikeshare.describe()

Let's see what the describe function AFTER normalization:

Broadcasting

In [ ]:
from IPython.display import Image
Image('http://www.scipy-lectures.org/_images/numpy_broadcasting.png')

image from Scipy Lecture Notes

Choose one predictor

In [ ]:
bikeshare.columns
bikeshare.iloc[:,0:10].head()
In [ ]:
#sns.pairplot(bikeshare, x_vars=bikeshare.iloc[:,:10] , y_vars =bikeshare['count'])
In [ ]:
bikeshare = bikeshare[['temp', 'count']]
In [ ]:
# %load sol8.py
In [ ]:
bikeshare.head()

Split up the data into a training set and a test set:

Now that we have an idea of what the data looks like, we would like to predict count. Therefore, we will be breaking up the data into a training and a testing set. The training set will be used to train the model, while the testing set will be used to gauge how well our model does in general. The testing set is a way for us to ensure our model doesn't overfit our training data.

Let us first create a function that will randomly split the data up into a 70-30 split, with 70% of the data going into the testing set:

**Exercise**
In [ ]:
#Function to Split data into Train and Test Set, Enter Code Here
def split_data(data):
    
    #Calculate Length of Dataset
    
    #Define Split
    
    #Set a random Seed For Shuffling
    
    #Generate a Mask with a X:Y Split
    
    #Separate train and test data using mask
    
    #Return Train and Test Data Separately
    return data_train, data_test

Solution

In [ ]:
# %load sol6.py
In [ ]:
#Split data using defined function
train_data, test_data = split_data(bikeshare)
print("Length of Training set:",len(train_data))
print("Length of Testing set:",len(test_data))
**Exercise**
In [ ]:
## Check that the ratio between test and train sets is right
#Your Code Here

Solution

In [ ]:
# %load sol7.py

Old vs new indices

In [ ]:
train_data.head()
In [ ]:
train_data.iloc[:5]
In [ ]:
train_data.loc[3:9]
In [ ]:
train_data.iloc[3:9]

Alternative approach using Standard 'train_test_split' function from sklearn

In [ ]:
train_data, test_data = train_test_split(bikeshare, test_size=0.30, random_state=13)
print("Length of Training set:",len(train_data))
print("Length of Testing set:",len(test_data))

Implementing the kNN Algorithm by hand:

To really understand how the kNN algorithm works, it helps to go through the algorithm line by line in code.

In [ ]:
#kNN Algorithm
def knn_algorithm(train, test, k):
    
    #Create any empty list to store our predictions in
    predictions = []
    
    #Separate the response and predictor variables from training and test set:
    train_x = train['temp']
    train_y = train['count']
    test_x  = test['temp']
    test_y  = test['count']
    
    for i, ele in enumerate(test_x):
        
        #For each test point, store the distance between all training points and test point
        distances = pd.DataFrame((train_x.values - ele)**2 , index=train.index)
        distances.columns =['dist']
        
        #display(distances)
        #Then, we sum across the columns per row to obtain the Euclidean distance squared
        ##distances = vec_distances.sum(axis = 1)
        
        #Sort the distances to training points (in ascending order) and take first k points
        nearest_k = distances.sort_values(by='dist').iloc[:k]
        
        #For simplicity, we omitted the square rooting of the Euclidean distance because the
        #square root function preserves order. 
        
        #Take the mean of the y-values of training set corresponding to the nearest k points
        k_mean = train_y[nearest_k.index].mean()
        
        #Add on the mean to our predicted y-value list
        predictions.append(k_mean)
    
 
    
    #Create a dataframe with the x-values from test and predicted y-values  
    predict = test.copy()  
    predict['predicted_count'] = pd.Series(predictions, index=test.index)
    
    return predict

Now to run the algorithm on our dataset with $k = 5$:

In [ ]:
#Run the kNN function 

k = 5
predicted_knn = knn_algorithm(train_data, test_data, k)
predicted_knn.head()

We want to have a way to evaluate our predictions from the kNN algorithm with $k=5$. One way is to compute the $R^2$ coefficient. Let's create a function for that:

In [ ]:
#Test predictions in comparison to true value of test set
def evaluate(predicted, true):
    
    #Find the squared error:
    squared_error = (predicted['predicted_count'] - true['count'])**2
    
    #Finding the mean squared error:
    error_var = squared_error.sum()
    sample_var = ((true['count'] - true['count'].mean())**2).sum()
    r = (1 - (error_var / sample_var))
    return r

Then let's apply this function to our predictions:

In [ ]:
print("Length of Test Data:",len(test_data))
print("R^2 Score of kNN - test:", evaluate(predicted_knn, test_data))
In [ ]:
predicted_knn_train = knn_algorithm(train_data, train_data, k)

print("R^2 Score of kNN - train:", evaluate(predicted_knn_train, train_data))

We see that the coefficient for the Nearest Neighbors implementation with $k=5$ is $R^2 = 0.134$, which should more or less match what we get with the sklearn package.

Now using sklearn to implement kNN:

We will now use the sklearn package to implement kNN. Then, we can fit the model and use various metrics to assess our accuracy.

What is sklearn?

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.

It is licensed under a permissive simplified BSD license and is distributed under many Linux distributions, encouraging academic and commercial use. The library is built upon the SciPy (Scientific Python) that must be installed before you can use scikit-learn. This stack that includes:

  • NumPy: Base n-dimensional array package
  • SciPy: Fundamental library for scientific computing
  • Matplotlib: Comprehensive 2D/3D plotting
  • IPython: Enhanced interactive console
  • Sympy: Symbolic mathematics
  • Pandas: Data structures and analysis
  • Extensions or modules for SciPy care conventionally named SciKits. As such, the module provides learning algorithms and is named scikit-learn.

The vision for the library is a level of robustness and support required for use in production systems. This means a deep focus on concerns such as easy of use, code quality, collaboration, documentation and performance.

General sklearn model fitting code-structure :

#Split Data into Train and Test Set
x_train, y_train = training_data.drop('Response_Variable', axis=1), training_data['Response_Variable']
x_test, y_test = test_data.drop('Response_Variable', axis=1), test_data['Response_Variable']

#Define Model
model = sklearn_model_name(hyper_parameter1 = value1, hyper_parameter2 = value2)

#Fit Model
model.fit(x_train, y_train)

#Get Prediction
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)

#Evaluate Model
r2_train = model.score(y_train, y_pred_train)
r2_test = model.score(y_test, y_pred_test)

#Print Results
print("Score for Model (Training):", r2_train)
print("Score for Model (Testing) :", r2_test)
  • Every model has a list of hyperparameters that can be set using sklearn for the specific problem. In practice it is advisable to cross-validate a list of values to find best model fit.

  • model.fit calculates the parameters of your model corresponding to the training data and hyperparameters you provided.

  • model.predict(X) is the standard method called to make the model predict values for a specific X. Depending on if you feed x_train or x_test, you will get a y_prediction_train or y_prediction_test respectively.

  • Evaluation of model can vary according to the task at hand i.e. Regression or Classification. For Regression, $R^2$ Score is standard while for Classification, Accuracy (%) is standard.

In [ ]:
# Set kNN parameter:
k = 100

# Now we can fit the model, predict our variable of interest, and then evaluate our fit:
# First, we create the classifier object:
neighbors = KNeighborsRegressor(n_neighbors=k)

# Then, we fit the model using x_train as training data and y_train as target values:
neighbors.fit(train_data[['temp']], train_data['count'])

# Retreieve our predictions:
prediction_knn = neighbors.predict(test_data[['temp']])

# This returns the mean accuracy on the given test data and labels, or in other words, 
# the R squared value -- A constant model that always predicts the expected value of y, 
# disregarding the input features, would get a R^2 score of 1.
r2_train = neighbors.score(train_data[['temp']], train_data['count'])
r2_test = neighbors.score(test_data[['temp']], test_data['count'])
print("Length of Test Data:", len(test_data['count']))
print("R^2 Score of kNN on test set:", r_test)
print("R^2 Score of kNN on training set:", r_train)
In [ ]:
# SubPlots

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
#axes[0].set_xlim([0.5, 0.7])
axes[0].plot(train_data['temp'], train_data['count'], 'o', label = 'Data' )#, '*',  label='Predicted')
axes[0].plot(train_data['temp'], neighbors.predict(train_data[['temp']]), '*', label = 'Prediction')
axes[0].set_xlabel('Normalized Temperature')
axes[0].set_ylabel('# of Rides')
axes[0].set_title("Training Set")
axes[0].legend()

axes[1].plot(test_data['temp'], test_data['count'],'o', label = 'Data' )#, '*')
axes[1].plot(test_data['temp'], prediction_knn, '*', label = 'Prediction')
axes[1].set_xlabel('Normalized Temperature')
axes[1].set_ylabel('# of Rides')
axes[1].set_title("Test Set")
axes[1].legend()

fig.suptitle("Bike Rides");
**Exercise**
In [ ]:
# Can you predict what will happen for k=1 and for k = number-of-observations-in-training-set? Try it!

Linear Regression

We just went over the kNN prediction method. Now, we will fit the same data, but Linear Regression. We will use a the same training/testing dataset as before and create our linear regression objects.

In [ ]:
#Split Data into X,Y
x_train, y_train = train_data.drop(['count'],axis=1), train_data['count']
x_test, y_test = test_data.drop(['count'],axis=1), test_data['count']

#Add constant
x_train_ca = sm.add_constant(x_train)
x_test_ca = sm.add_constant(x_test)

StatsModels use a Y followed by X structure while feeding data in contrast to sklearn that uses X followed by Y.

In [ ]:
#We must first create the linear regression object from stats model

model = sm.OLS(y_train, x_train_ca)
results = model.fit()
print(results.params)

Now, we will compute metrics that can be used to assess fit.

Note: sklearn.metrics is class of functions that consists of all the metrics we care about to evaluate our models. While it is not hard to implement them yourself, it is helpful to go through http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

In [ ]:
#To compute the mean squared error (notice that we are now using the TEST set):
print("R^2 Score for Linear Regression (Training):", metrics.r2_score(y_train, results.predict(x_train_ca)))
print("R^2 Score for Linear Regression (Testing) :", metrics.r2_score(y_test, results.predict(x_test_ca)))
In [ ]:
#Find the squared error:
y_pred_train = results.predict(x_train_ca)
squared_error_train = (y_pred_train - y_train)**2
 #Finding the mean squared error:
error_var_train = squared_error_train.mean()

sample_var_train = ((y_train - y_train.mean())**2).mean()



y_pred_test = results.predict(x_test_ca)
squared_error_test = (y_pred_test - y_test)**2
 #Finding the mean squared error:
error_var_test = squared_error_test.mean()

sample_var_test = ((y_test - y_test.mean())**2).mean()

print(error_var_train, sample_var_train, 1 - error_var_train/sample_var_train)
print(error_var_test, sample_var_test, 1 - error_var_test/sample_var_test)
In [ ]:
results.summary()
In [ ]:
plt.plot(x_train, y_train, 'b+')
plt.plot(x_test, y_test, 'r+')

x_forpredict = np.linspace(0,1, 100)
line_y = results.predict(sm.add_constant(x_forpredict))
plt.plot(x_forpredict, line_y)

Confidence Intervals

In Data Science, a confidence interval (CI) is a type of interval estimate, computed from the statistics of the observed data, that might contain the true value of an unknown population parameter. Simply speaking, a Confidence Interval is a range of values we are fairly sure our true value lies in.

It is important to remind ourselves here that Confidence Intervals belong to a parameter and not a statistic. Thus, they represent the window in which the true value exists for the entire population when all we have is a sample.

Example :

We measure the heights of 40 randomly chosen men, and get a mean height of 175cm, we also know the standard deviation of men's heights is 20cm. The 95% Confidence Interval is thus: 175cm ± 6.2cm. This says the true mean of ALL men (if we could measure all their heights) is likely to be between 168.8cm and 181.2cm.

But it might not be! The "95%" says that 95% of experiments like we just did will include the true mean, but 5% won't. So there is a 1-in-20 chance (5%) that our Confidence Interval does NOT include the true mean.

To get Confidence Intervals using StatsModels:

In [ ]:
#Confidence Interval using Stats Model Summary
thresh = 0.05
intervals = results.conf_int(alpha=thresh)
intervals = intervals.rename(index=str, columns={0:str(thresh/2*100)+"%",1:str((1-thresh/2)*100)+"%"})
display(intervals)

In the above block of code, results.conf_int(alpha=thresh) returns a dataframe with columns 0 and 1. We explained Confidence Intervals above where because we assume normal symetric distribution of data, the 95% Confidence Interval means there's 2.5% chance of the true value lying below the values in Column 0 and 2.5% chance of the true value lying above Column 1.

Therefore, for a better understanding, given a constant threshold, you can rename the columns to suit the Confidence Interval values. The syntax is dataframe.rename(index=str, columns={old_col_name1:new_col_name1, old_col_name2:new_col_name2})

In [ ]:
train_x.shape