CS109A Introduction to Data Science
Standard Section 2: kNN and Linear Regression¶
Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Section Leaders: Marios Mattheakis, Hayden Joy, Lauren Baker, and Kaela Nelson
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
For this section, our goal is to get you familiarized with k-Nearest Neighbors (kNN) and Linear Regression. In the course thus far, we have discussed some aspects of dealing with data, including scraping data from the web, organizing it using dictionaries and Pandas dataframes, and visualizing it using Matplotlib plotting functionality. Now we're moving on to data modeling!
It is useful to make models to fit and predict data. Why? To understand the underlying behavior of your data.
By the end of this section, you should feel comfortable:
- Performing exploratory data analysis (EDA) on a dataset
- Splitting this dataset into a training and test set (and understanding why you need to do this!)
- Applying simple models (kNN and Linear Regression) to your data using sklearn and statsmodels packages
- Using these models to understand relationships between the response variable and the predictors (also called features or descriptors)
- Evaluating model performance using metrics such as $R^2$
For this section we will be using the following packages:
#Matrices, Dataframe and Plotting Operations
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#Model packages
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from statsmodels.api import OLS
import statsmodels.api as sm
Load in the Bikeshare dataset and perform preliminary EDA¶
Here, we will work with a dataset collected from the Capital Bikeshare program in Washington D.C. This dataset contains over two years of data on the total number of bike rentals per day, as well as 10 attributes describing the day and its weather (see below for description of these variables as recorded in the dataset). The data set is provided in the file 'bikeshare.csv'.
The task is to build a regression model to predict the total number of bike rentals on a given day (known as the response variable) based on attributes of the day (known as the descriptors). Such a demand forecasting model would be useful in planning the number of bikes that need to be available in the system on any given day, and also in monitoring traffic in the city.
Description of variables
- season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
- month (1 through 12, with 1 denoting Jan)
- holiday (1 = the day is a holiday, 0 = otherwise)
- day_of_week (0 through 6, with 0 denoting Sunday)
- workingday (1 = the day is neither a holiday nor a weekend, 0 = otherwise)
- weather
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp (temperature in Celsius)
- atemp (apparent, or "real feel", outdoor temperature in Celsius)
- humidity (relative humidity)
- windspeed (wind speed)
- count (response variable i.e. total number of bike rentals on the day)
Load the BikeShare dataset
bikeshare = pd.read_csv('../data/bikeshare.csv')
print("Length of Dataset:",len(bikeshare))
display(bikeshare.head())
Drop unnecessary columns
bikeshare = bikeshare.drop(columns=['Unnamed: 0'])
print("Length of Dataset:",len(bikeshare))
display(bikeshare.head())
Use the describe feature of Pandas to summarize data
display(bikeshare.describe())
Note, we can also use the groupby function to look at mean stats, aggregated by month in this case
bikeshare.groupby('month').mean()
Plot the count of bike rentals by month
plt.figure(figsize=[8,5])
plt.plot(bikeshare.groupby('month').mean()['count'],'-ob')
plt.xlabel('Month')
plt.ylabel('Count')
plt.title('Total Count of Bikeshare Rentals per Month')
plt.xlim([0,13])
plt.show()
Is there a difference between temp and atemp? Let's plot them both
plt.figure(figsize=[8,5])
plt.plot(bikeshare['temp'], bikeshare['atemp'],'.-b', alpha = 0.5) # toggle alpha to 1
plt.xlabel('Temp')
plt.ylabel('A-Temp')
plt.title('A-Temp vs. Temp')
plt.show()
What did we do wrong here? Why does the plot look like this?
Sorting! Whenever a line plot zig-zags back and forth across the scale, it is because matplotlib connects the points in the order they appear in the data, jumping backwards across the scale whenever $x_{i+1}$ is lower than $x_{i}$. So let's sort.
# Sorting
new = bikeshare.sort_values(['temp'])
plt.figure(figsize=[8,5])
plt.plot(new['temp'], new['atemp'],'-b',alpha=1)
plt.xlabel('Temp')
plt.ylabel('A-Temp')
plt.title('A-Temp vs Temp')
plt.show()
It still looks weird, why?
Let's have a closer look at the dataframe:
display(new.head(10))
There are multiple atemp values for each temp value; when these are unsorted, the line bounces around at the same x-value. Thus, we need to sort by both columns simultaneously.
new = bikeshare.sort_values(['temp','atemp'])
plt.figure(figsize=[8,5])
plt.plot(new['temp'], new['atemp'],'-b')
plt.xlabel('Temp')
plt.ylabel('A-Temp')
plt.title('A-Temp vs Temp')
plt.show()
By plotting carefully, we found an anomaly we would otherwise have overlooked. It looks like there is a problem with the data where temp is greater than 30 and atemp is less than 10.
Show all rows in the dataframe where the temp is greater than 30 and the atemp is less than 10
display(bikeshare[(bikeshare['temp'] > 30) & (bikeshare['atemp'] < 10)])
Anomaly! atemp and temp are usually linearly related, except at this one data point. Now we get to make a judgement call as to whether we should keep the data point. For this example we will drop it, but we will come back to the question of how to handle abnormal/missing data after the lecture on Missing Data and Imputation.
bikeshare = bikeshare.drop([188])
We can now re-run the check from above, and we should find no rows in the dataframe where temp is greater than 30 and atemp is less than 10
display(bikeshare[(bikeshare['temp'] > 30) & (bikeshare['atemp'] < 10)])
This EDA enabled us to spot the anomaly and clean the data (dropping a data point that may have unduly influenced the model).
Splitting up the data into a training set and a test set using the 'train_test_split' function from sklearn¶
Now that we have an idea of what the data looks like, we want to predict the count. We will randomly split the data into a training and a testing set. Scikit-learn (sklearn) has a function that does this for us, called train_test_split.
What is the need for training and testing data sets? The training set will be used to train the model, while the testing set will be used to quantify how well that model does on data it has never seen before. Evaluating the accuracy of a model on a testing set ensures that the model doesn't overfit our training data.
Why random split? Randomization helps manage uncontrollable confounding variables within the data.
Let us first randomly split the data 70-30, with 70% of the data going into the training set
# from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(bikeshare, test_size=0.30, random_state=42)
print("Length of Training set = ",len(train_data))
print("Length of Testing set = ",len(test_data))
A random state of 42 is arbitrary (Pavlos' favorite number; google "the answer to the universe"), but fixing this value will produce the same randomization of the data every time (useful for homework grading, research, etc.).
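As a quick check, splitting twice with the same random state produces identical splits (a minimal sketch; the variable names here are just for illustration):
# A minimal sketch: the same random_state reproduces the same split
split_a, _ = train_test_split(bikeshare, test_size=0.30, random_state=42)
split_b, _ = train_test_split(bikeshare, test_size=0.30, random_state=42)
print(split_a.index.equals(split_b.index))  # True -- same rows, same order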
Calculate the ratio of the number of points in the training set to the number of points in the testing set to see if we have split the data correctly
print('The training set is', 100*len(train_data)/len(bikeshare), '% of the total dataset.')
print('The testing set is', 100*len(test_data)/len(bikeshare), '% of the total dataset.')
Time for Break Out Room 1¶
Goal: Practice using train_test_split and visualizing data using matplotlib
Directions:
- Load "sim_data.csv" file into dataframe
- Inspect data
- Drop any unnecessary columns
- Split data into 80% training data and 20% test data ("80-20 train-test split")
- Plot training and test data in single plot
- Color training data in blue and test data in red. Also, use different markers for the two sets, e.g. o and *
- Include a legend, title, and axes labels!
Hint: don't forget to sort!
# your code here
# %load ../solutions/breakout_1_sol.py
# read in data
data = pd.read_csv("../data/sim_data.csv")
# drop Unnamed column
data.drop(columns=["Unnamed: 0"], inplace=True)
# split into training and testing with 80/20 split, random_state=42
train_data1, test_data1 = train_test_split(data, test_size=0.20, random_state=42)
# plot results 80/20 split
plt.figure(figsize=[8,5])
plt.scatter(train_data1.x, train_data1.y, c="blue", marker='o', label="train data")
plt.scatter(test_data1.x, test_data1.y,c="red", marker='*', label="test data")
plt.xlabel('x')
plt.ylabel('y')
plt.title("80/20 Split")
plt.legend()
plt.show()
Now let's do some modeling¶
General sklearn model fitting code structure:
#Split Data into Train and Test Set
x_train, y_train = training_data.drop('Response_Variable', axis=1), training_data['Response_Variable']
x_test, y_test = test_data.drop('Response_Variable', axis=1), test_data['Response_Variable']
#Define Model
model = sklearn_model_name(hyper_parameter1 = value1, hyper_parameter2 = value2)
#Fit Model
model.fit(x_train, y_train)
#Get Prediction
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)
#Evaluate Model
r2_train = model.score(x_train, y_train)
r2_test = model.score(x_test, y_test)
#Print Results
print("Score for Model (Training):", r2_train)
print("Score for Model (Testing) :", r2_test)
Every model has a list of hyperparameters that can be set using sklearn for the specific problem. We find optimal hyperparameters through exploration (one way is cross-validation, which we will discuss soon).
- model.fit calculates the parameters of your model for the training data and hyperparameters you provide.
- model.predict(X) is the standard method used to make the model predict values for a specific X. Depending on whether you feed it x_train or x_test, you get y_prediction_train or y_prediction_test respectively.
- Evaluation of a model varies according to the task at hand, i.e. regression or classification. For regression, the $R^2$ score is standard, while for classification, accuracy (%) is standard.
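As a preview of the cross-validation idea mentioned above, here is a minimal sketch comparing a few values of k for the kNN regressor on the bikeshare training data (the candidate k values and cv=5 are arbitrary choices for illustration):
# Sketch: compare candidate k values via 5-fold cross-validation (covered in detail later)
from sklearn.model_selection import cross_val_score
for k in [1, 5, 10, 50]:
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k),
                             train_data[['temp']], train_data['count'], cv=5)
    print("k = {:2d}: mean CV R^2 = {:.3f}".format(k, scores.mean()))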
We will use the temperature parameter to predict total bike rental count. We can use sklearn to implement kNN, fit the model, and use various metrics to assess our accuracy.
# from sklearn.neighbors import KNeighborsRegressor
# Set kNN hyperparameter:
k = 10
# First, we create the regression object:
neighbors = KNeighborsRegressor(n_neighbors=k)
# Then, we fit the model using x_train as training data and y_train as target values:
neighbors.fit(train_data[['temp']], train_data['count'])
# Retrieve our predictions:
prediction_knn = neighbors.predict(test_data[['temp']])
# score returns the coefficient of determination R^2 of the prediction --
# a constant model that always predicts the mean of y, disregarding the
# input features, would get an R^2 score of 0 (a perfect model scores 1).
r2_train = neighbors.score(train_data[['temp']], train_data['count'])
r2_test = neighbors.score(test_data[['temp']], test_data['count'])
print("Length of Test Data:", len(test_data['count']))
print("R^2 Score of kNN on training set:", r2_train)
print("R^2 Score of kNN on testing set: ", r2_test)
# SubPlots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
# train data
axes[0].set_ylim([0,10000])
axes[0].plot(train_data['temp'], train_data['count'], 'bo', alpha = 0.5, label = 'Data' )
sorted_temp = train_data.sort_values(['temp'])
prediction_knn = neighbors.predict(sorted_temp[['temp']])
axes[0].plot(sorted_temp['temp'], prediction_knn, 'k-', linewidth = 5, markersize = 10, label = 'Prediction')
axes[0].set_xlabel('Temperature', fontsize = 15)
axes[0].set_ylabel('# of Rentals', fontsize = 15)
axes[0].set_title("Train Set", fontsize = 18)
axes[0].legend(loc = 'upper right', fontsize = 12)
# test data
axes[1].set_ylim([0,10000])
axes[1].plot(test_data['temp'], test_data['count'],'r*', alpha = 0.5, label = 'Data' )
sorted_temp = test_data.sort_values(['temp'])
prediction_knn = neighbors.predict(sorted_temp[['temp']])
axes[1].plot(sorted_temp['temp'], prediction_knn, 'g-', linewidth = 5, markersize = 10, label = 'Prediction')
axes[1].set_xlabel('Temperature', fontsize = 15)
axes[1].set_ylabel('# of Rentals', fontsize = 15)
axes[1].set_title("Test Set", fontsize = 18)
axes[1].legend(loc = 'upper right', fontsize = 12)
fig.suptitle("kNN Regression (k={}): Temp vs Rental Count".format(k), fontsize = 20)
plt.show()
Time for Break Out Room 2¶
Goal: Practice using sklearn's kNN regression
Directions:
- Use same dataset from Break Out Room 1 ("sim_data.csv")
- Perform 70-30 train-test split using a random state of 42
- Create a function that implements kNN regression with your choice of k (explore a few different k's)
- Predict on both training and test data
- For all kNN models generated, plot the following on the same plot:
- Original train data = blue
- Original test data = red
- Predicted train data = black
- Predicted test data = green
- Calculate $R^2$ score
Hints:
- don't forget to sort!
- can make plot colors more transparent using "alpha" and lines thicker using "linewidth"
# your code here
# %load ../solutions/breakout_2_sol.py
# from sklearn.neighbors import KNeighborsRegressor
# split into 70/30, random_state=42
sim_train_data, sim_test_data = train_test_split(data, test_size=0.30, random_state=42)
def knn_model(k, train_data, test_data):
# create the classifier object
neighbors = KNeighborsRegressor(n_neighbors=k)
# fit the model using x_train as training data and y_train as target values
neighbors.fit(train_data[['x']], train_data['y'])
sorted_train = train_data.sort_values(['x'])
sorted_test = test_data.sort_values(['x'])
# Retrieve our predictions:
train_preds = neighbors.predict(sorted_train[['x']])
test_preds = neighbors.predict(sorted_test[['x']])
# find r^2
r2_train = neighbors.score(train_data[['x']], train_data['y'])
r2_test = neighbors.score(test_data[['x']], test_data['y'])
print("R^2 Score of kNN on training set with k={}:".format(k), r2_train)
print("R^2 Score of kNN on testing set: with k={}".format(k), r2_test)
return sorted_train, sorted_test, train_preds, test_preds, r2_train, r2_test
def plot_predictions_same_plot(k, train_data,test_data, train_preds, test_preds):
# plot all results on same plot
plt.figure(figsize=[8,6])
plt.plot(train_data['x'], train_data['y'], 'bo', alpha = 0.5, label = 'Train Set' )
plt.plot(train_data['x'], train_preds, 'k-', linewidth = 2, markersize = 10, label = 'Train Preds')
plt.plot(test_data['x'], test_data['y'], 'r*', alpha = 0.5, label = 'Test Set' )
plt.plot(test_data['x'], test_preds, 'g-', linewidth = 2, markersize = 10, label = 'Test Preds')
plt.xlabel('x')
plt.ylabel('y')
plt.title("x vs y kNN Regression (k={})".format(k))
plt.legend()
plt.show()
knn_train_preds = []
knn_test_preds = []
knn_r2_train_scores = []
knn_r2_test_scores = []
for k in [1,10,70]:
sim_sorted_train, sim_sorted_test, sim_train_preds, sim_test_preds, knn_r2_train, knn_r2_test = knn_model(k, sim_train_data,sim_test_data)
plot_predictions_same_plot(k,sim_sorted_train, sim_sorted_test, sim_train_preds, sim_test_preds)
knn_train_preds.append(sim_train_preds)
knn_test_preds.append(sim_test_preds)
knn_r2_train_scores.append(knn_r2_train)
knn_r2_test_scores.append(knn_r2_test)
Linear Regression¶
We just went over the kNN prediction method. Now, we will fit the same data using a linear regression model.
What is the main difference between a kNN model and a linear regression model? Linear regression specifies the form of the model up front (whatever the data look like, it will fit a straight line), whereas kNN is non-parametric: it learns the shape of the fit from the data itself.
Advantages of linear regression models are that they are very fast and yield an exact optimal solution. For a more in-depth discussion on generalized linear models, please see the Advanced Section on this.
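Concretely, "exact optimal solution" means the least-squares coefficients have a closed form (the normal equations): $\hat{\beta} = (X^TX)^{-1}X^Ty$, where $X$ is the design matrix (with a constant column for the intercept) and $y$ is the response vector; no iterative optimization is required.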
We will use the same training/testing dataset as before and create linear regression objects. We can do this using sklearn (as we did for kNN) as well as with another package called statsmodels.
# Label data as X,Y for ease
x_train, y_train = train_data['temp'], train_data['count']
x_test, y_test = test_data['temp'], test_data['count']
You can also split x and y into train and test sets directly using train_test_split
X_train, X_test, y_train, y_test = train_test_split(bikeshare['temp'], bikeshare['count'])
Fit a Linear Regression model using sklearn and take a look at the model's parameters
from sklearn.linear_model import LinearRegression
lr_sklearn = LinearRegression().fit(x_train.values.reshape(-1,1), y_train) # x data must be 2D array
print('Coefficients:', lr_sklearn.coef_)
print('Intercept:', lr_sklearn.intercept_)
Note: there is only one coefficient here since we are using only one descriptor variable (temp)
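For illustration (a sketch; the extra columns are an arbitrary choice), fitting on several predictors gives one coefficient per column:
# Sketch: with multiple predictors, coef_ holds one coefficient per column
lr_multi = LinearRegression().fit(train_data[['temp', 'humidity', 'windspeed']], train_data['count'])
print('Coefficients:', lr_multi.coef_)   # three values, one per predictor
print('Intercept:', lr_multi.intercept_)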
Use model to predict on training and testing data and plot prediction
# predict
y_preds_train = lr_sklearn.predict(x_train.values.reshape(-1,1))
y_preds_test = lr_sklearn.predict(x_test.values.reshape(-1,1))
# plot predictions
fig, axes = plt.subplots(1,2,figsize=(20,6))
axes = axes.ravel()
axes[0].scatter(x_train, y_train, color = 'b', alpha = 0.5, label = 'Data')
axes[0].plot(x_train, y_preds_train, 'k', linewidth = 5, label = 'Prediction')
axes[0].set_title('Train Set', fontsize = 18)
axes[1].scatter(x_test, y_test, color = 'r',marker='*', alpha = 0.5, label = 'Data')
axes[1].plot(x_test, y_preds_test, 'g', linewidth = 5, label = 'Prediction')
axes[1].set_title('Test Set', fontsize = 18)
for i, ax in enumerate(axes):
ax.set_ylim(0,10000)
ax.set_xlabel('Temperature', fontsize = 15)
ax.set_ylabel('# of Rentals', fontsize = 15)
ax.legend(loc = 'upper right', fontsize = 12)
fig.suptitle("Linear Regression: Temp vs Rental Count", fontsize = 20)
plt.show()
Compute performance metrics for both training and testing
# from sklearn import metrics
# Mean Squared Error (MSE)
print("MSE Train: {:.3f}".format(metrics.mean_squared_error(y_train, y_preds_train)))
print("MSE Test: {:.3f}".format(metrics.mean_squared_error(y_test, y_preds_test)))
# R^2 score
print("R^2 Train: {:.3f}".format(metrics.r2_score(y_train, y_preds_train)))
print("R^2 Test: {:.3f}".format(metrics.r2_score(y_test, y_preds_test)))
Recall that more accurate models have higher $R^2$ scores (a value of 1 indicates a perfectly fitted line) and lower MSEs (lower error). Notice that the $R^2$ for training is higher, and the MSE lower, than those of the test set, indicating some overfitting to the training set. For more info on these metrics, check out the sklearn metrics documentation. Take a look at the end of the notebook for calculations of the MSE and $R^2$ metrics by hand.
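For reference, with $\hat{y}_i$ the predicted values and $\bar{y}$ the mean of the observed values, the two metrics are $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$.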
Linear Regression using statsmodels¶
Fit a Linear Regression model using statsmodels and print out the coefficients of temp and const
Note:
- OLS = ordinary least squares
- Must add a constant to the X data since an intercept is not included by default (unlike sklearn's linear regression)
- statsmodels' OLS takes the response variable (Y) first, followed by X, whereas sklearn uses X followed by Y
# from statsmodels.api import OLS
# import statsmodels.api as sm
# Add constant to x data
x_train_ca = sm.add_constant(x_train)
x_test_ca = sm.add_constant(x_test)
# Create Linear Regression object
model = sm.OLS(y_train, x_train_ca)
# Fit
results = model.fit()
print(results.params)
Use the fitted model (saved as "results") to predict on train and test data and plot prediction
# Plotting our model
fig, axes = plt.subplots(1,2,figsize=(20,6))
axes = axes.ravel()
# train data
axes[0].plot(x_train, y_train, 'bo', alpha = 0.5, label = 'Data')
sorted_temp = train_data.sort_values(['temp'])
prediction_lr = results.predict(sm.add_constant(sorted_temp[['temp']]))
axes[0].plot(sorted_temp['temp'], prediction_lr, 'k-', linewidth = 5, label = 'Prediction')
axes[0].set_title('Train Set', fontsize = 18)
# test data
axes[1].plot(x_test, y_test, 'r*', alpha = 0.5, label = 'Data')
sorted_temp = test_data.sort_values(['temp'])
prediction_lr = results.predict(sm.add_constant(sorted_temp[['temp']]))
axes[1].plot(sorted_temp['temp'], prediction_lr, 'g-', linewidth = 5, label = 'Prediction')
axes[1].set_title('Test Set', fontsize = 18)
for i, ax in enumerate(axes):
ax.set_ylim(0,10000)
ax.set_xlabel('Temperature', fontsize = 15)
ax.set_ylabel('# of Rentals', fontsize = 15)
ax.legend(loc = 'upper right', fontsize = 12)
fig.suptitle('Linear Regression: Temp vs Rental Count (statsmodels)', fontsize = 20)
plt.show()
Check out $R^2$ (remember 1 is perfect prediction)
print("R^2 Score Train (statsmodels linear regression):", metrics.r2_score(y_train, results.predict(x_train_ca)))
print("R^2 Score Test (statsmodels linear regression):", metrics.r2_score(y_test, results.predict(x_test_ca)))
Notice that these $R^2$ scores are indeed the same as what we found above using sklearn's linear regression.
Check out results.summary() and see the nice table that statsmodels provides!
results.summary()
Time for Break Out Room 3¶
Goal: Compare kNN and linear regression for the same dataset
Directions:
- Use same dataset from previous Break Out Rooms, with 70-30 train-test split and random state of 42
- Create a function that implements linear regression with sklearn or statsmodels
- Predict on both training and test data
- Create 2 subplots with the following plotted:
- Subplot 1: Train set
- Plot training data in blue
- Plot linear regression prediction in black
- Plot kNN prediction (using k = 10) in magenta ('m')
- Subplot 2: Test set
- Plot testing data in red
- Plot linear regression prediction in green
- Plot kNN prediction (using k = 10) in yellow ('y')
- Subplot 1: Train set
- Calculate $R^2$ scores for both train and test sets for both kNN and linear regression
Hints:
- don't forget to sort!
- plt.subplots(...) creates subplots
# %load ../solutions/breakout_3_sol.py
# linear regression
# from statsmodels.api import OLS
# import statsmodels.api as sm
def linreg_model(train_data,test_data):
# sort
sorted_train = train_data.sort_values(['x'])
sorted_test = test_data.sort_values(['x'])
x_train, x_test, y_train, y_test = sorted_train['x'], sorted_test['x'], sorted_train['y'], sorted_test['y']
# Add constant to x data
x_train_ca = sm.add_constant(x_train)
x_test_ca = sm.add_constant(x_test)
# Create Linear Regression object
linreg_model = sm.OLS(y_train, x_train_ca)
# Fit
results = linreg_model.fit()
# predict
train_preds = results.predict(x_train_ca)
test_preds = results.predict(x_test_ca)
# find r^2
r2_train = metrics.r2_score(y_train, results.predict(x_train_ca))
r2_test = metrics.r2_score(y_test, results.predict(x_test_ca))
return train_preds, test_preds, r2_train, r2_test
def plot_predictions2(k,train_data,test_data, knn_train_preds, knn_test_preds, linreg_train_preds, linreg_test_preds):
# SubPlots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
axes[0].plot(train_data['x'], train_data['y'], 'bo',alpha = 0.5, label = 'Data' )
axes[0].plot(train_data['x'], knn_train_preds,'m-', linewidth = 2, markersize = 10, label = 'KNN Preds')
axes[0].plot(train_data['x'], linreg_train_preds,'k-', linewidth = 2, markersize = 10, label = 'Linreg Preds')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title("Train Data")
axes[0].legend()
axes[1].plot(test_data['x'], test_data['y'], 'r*', alpha = 0.5, label = 'Data' )
axes[1].plot(test_data['x'], knn_test_preds, 'y-', linewidth = 2, markersize = 10, label = 'KNN Preds')
axes[1].plot(test_data['x'], linreg_test_preds, 'g-', linewidth = 2, markersize = 10, label = 'Linreg Preds')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].set_title("Test Data")
axes[1].legend()
fig.suptitle("KNN vs Linear Regression")
plt.show()
# get predictions
linreg_train_preds,linreg_test_preds, linreg_r2_train, linreg_r2_test = linreg_model(sim_train_data,sim_test_data)
# plot linreg predictions side by side with knn predictions
k=10
plot_predictions2(k, sim_sorted_train, sim_sorted_test, knn_train_preds[1], knn_test_preds[1], linreg_train_preds, linreg_test_preds)
# print r2 scores for knn with k=10 and linreg
print("R^2 Score of kNN on training set with k={}:".format(k), knn_r2_train_scores[1])
print("R^2 Score of kNN on testing set: with k={}".format(k), knn_r2_test_scores[1])
print("R^2 Score of linear regression on training set", linreg_r2_train)
print("R^2 Score of linear regression on testing set", linreg_r2_test)
Understanding model uncertainty¶
Confidence Intervals¶
In data science, a confidence interval (CI) is a type of interval estimate, computed from the statistics of the observed data, that might contain the true value of an unknown population parameter. Simply speaking, a confidence interval is a range of values we are fairly sure the true value lies in.
It is important to remind ourselves that confidence intervals describe a parameter, not a statistic: they represent the window in which we expect the true population value to lie, when all we have is a sample.
See if you can implement a 95% confidence interval using statsmodels
# Confidence Interval using Stats Model Summary
thresh = 0.05
intervals = results.conf_int(alpha=thresh)
# Renaming column names
first_col = str(thresh/2*100)+"%"
second_col = str((1-thresh/2)*100)+"%"
intervals = intervals.rename(columns={0:first_col,1:second_col})
display(intervals)
In the above block of code, results.conf_int(alpha=thresh) returns a dataframe with columns 0 and 1. Because we assume a symmetric (normal) distribution, the 95% confidence interval means there is a 2.5% chance of the true value lying below the value in column 0 and a 2.5% chance of it lying above the value in column 1.
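To see where those endpoints come from, here is a sketch reconstructing them by hand from the coefficient estimates and their standard errors, using the t distribution with the model's residual degrees of freedom:
# Sketch: reconstruct the 95% CI endpoints from params and standard errors
from scipy.stats import t
t_crit = t.ppf(1 - thresh/2, results.df_resid)  # critical t value
print(results.params - t_crit * results.bse)    # lower bounds (2.5%)
print(results.params + t_crit * results.bse)    # upper bounds (97.5%)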
Extra: Train-Test Split using a mask¶
#Function to Split data into Train and Test Set
def split_data(data):
#Calculate Length of Dataset
length = len(data)
#Define Split
split = 0.7
#Set a random Seed For Shuffling
np.random.seed(9001)
#Generate a boolean mask for a 70:30 split
mask = np.random.rand(length) < split
#Separate train and test data
data_train = data[mask]
data_test = data[~mask]
#Return Separately
return data_train, data_test
#Split data using defined function
train_data_manual, test_data_manual = split_data(bikeshare)
print("Length of Training set:",len(train_data_manual))
print("Length of Testing set:",len(test_data_manual))
## Check that the ratio between test and train sets is right
test_data_manual.shape[0]/(test_data_manual.shape[0]+train_data_manual.shape[0])
Extra: Implementing the kNN Algorithm by hand¶
To really understand how the kNN algorithm works, it helps to go through the algorithm line by line in code.
#kNN Algorithm
def knn_algorithm(train, test, k):
#Create an empty list to store our predictions in
predictions = []
#Separate the response and predictor variables from training and test set:
train_x = train['temp']
train_y = train['count']
test_x = test['temp']
test_y = test['count']
for i, ele in enumerate(test_x):
#For each test point, store the distance between all training points and test point
distances = pd.DataFrame((train_x.values - ele)**2 , index=train.index)
distances.columns =['dist']
#display(distances)
#With a single predictor, the squared difference is already the squared Euclidean distance
#Sort the distances to training points (in ascending order) and take first k points
nearest_k = distances.sort_values(by='dist').iloc[:k]
#For simplicity, we omitted the square rooting of the Euclidean distance because the
#square root function preserves order.
#Take the mean of the y-values of training set corresponding to the nearest k points
k_mean = train_y[nearest_k.index].mean()
#Add on the mean to our predicted y-value list
predictions.append(k_mean)
#Create a dataframe with the x-values from test and predicted y-values
predict = test.copy()
predict['predicted_count'] = pd.Series(predictions, index=test.index)
return predict
Now to run the algorithm on our dataset with $k = 5$:
#Run the kNN function
k = 5
predicted_knn = knn_algorithm(train_data, test_data, k)
predicted_knn.head()
We want to have a way to evaluate our predictions from the kNN algorithm with $k=5$. One way is to compute the $R^2$ coefficient. Let's create a function for that:
#Test predictions in comparison to true value of test set
def evaluate(predicted, true):
#Find the squared error:
squared_error = (predicted['predicted_count'] - true['count'])**2
#Sum the squared errors:
error_var = squared_error.sum()
sample_var = ((true['count'] - true['count'].mean())**2).sum()
r = (1 - (error_var / sample_var))
return r
Then let's apply this function to our predictions:
print("Length of Test Data:",len(test_data))
print("R^2 Score of kNN test:", evaluate(predicted_knn, test_data))
predicted_knn_train = knn_algorithm(train_data, train_data, k)
print("R^2 Score of kNN train:", evaluate(predicted_knn_train, train_data))
Extra: Computing different performance metrics by hand¶
Now, we will compute metrics that can be used to assess fit.
Note: sklearn.metrics is a module of functions containing all the metrics we care about for evaluating our models. While it is not hard to implement them yourself, it is helpful to go through http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
model = sm.OLS(y_train, x_train_ca)
results = model.fit()
#Predict on train and test
y_pred_train = results.predict(x_train_ca)
y_pred_test = results.predict(x_test_ca)
#Calc squared error
squared_error_train = (y_pred_train - y_train)**2
squared_error_test = (y_pred_test - y_test)**2
#Calc mean squared error
error_var_train = squared_error_train.mean()
error_var_test = squared_error_test.mean()
#Calc variance
sample_var_train = ((y_train - y_train.mean())**2).mean()
sample_var_test = ((y_test - y_test.mean())**2).mean()
#Calc R^2
r_sq_train = 1 - error_var_train/sample_var_train
r_sq_test = 1 - error_var_test/sample_var_test
print('MSE train:', error_var_train, 'R^2 train:', r_sq_train)
print('MSE test:', error_var_test, 'R^2 test:', r_sq_test)
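As a final sanity check, these hand-computed values should match sklearn's built-in metrics:
# Sanity check: compare the hand computations against sklearn's metrics
print('sklearn MSE test:', metrics.mean_squared_error(y_test, y_pred_test))
print('sklearn R^2 test:', metrics.r2_score(y_test, y_pred_test))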