CS 109A/STAT 121A/AC 209A/CSCI E-109A

Standard Section 3: Prediction using kNN and Regression Methods

Harvard University
Fall 2017
Section Leaders: Nathaniel Burbank, Albert Wu, Matthew Holman
Instructors: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

**Download this notebook from the CS109 repo or here:**
**http://bit.ly/109_S3**

For this section, our goal is to get you familiar with k-Nearest Neighbors (kNN), linear regression, and polynomial regression. These are widely used methods, all centered around prediction.

Specifically, we will:

1. Review some data selection basics
2. Load in the iris dataset which is split into a training and testing dataset
3. Do some basic exploratory analysis of the dataset and go through a scatterplot
4. Write out the algorithm for kNN WITHOUT using the sklearn package
5. Use the sklearn package to implement kNN and compare to the one we did by hand
6. Extend the sklearn package to linear and polynomial regression 

For this section we will be using the following packages:

In [ ]:
import sys
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 999)
pd.set_option('display.width', 500)
pd.set_option('display.notebook_repr_html', True)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Note: requires sklearn version 0.18 or higher (for sklearn.model_selection)
from sklearn import metrics, datasets
from collections import Counter
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline

assert sys.version_info.major == 3, sys.version
# Python 3 or higher is required

Review – Python list comprehensions

In [ ]:
# Make a list of all the even numbers from 0 to 20 (inclusive)
# with a for loop
l = []
for i in range(21):
    if i % 2 == 0:
        l.append(i)
l

Make a list of even numbers from 0 to 20 using a Python list comprehension

In [ ]:
## Your code here
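
One possible solution is sketched below:

In [ ]:
# Even numbers from 0 to 20 (inclusive) in a single comprehension
[i for i in range(21) if i % 2 == 0]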

Selecting by assigned index vs integer position

In [ ]:
data_url = 'https://raw.githubusercontent.com/nathanielburbank/CS109/master/data/states.csv'
states = pd.read_csv(data_url,index_col=0)
states.head()

Select just the columns from 1930 to 1960

In [ ]:
states[['1930','1940','1950','1960']].head()

Select just New York, New Jersey, and Connecticut

In [ ]:
tristate = ['New York','New Jersey','Connecticut']
## Your code here
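
One possible solution, using label-based selection with .loc (a sketch):

In [ ]:
# Select rows by their index labels (the index is the state name)
states.loc[tristate]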

Select just New York, New Jersey, and Connecticut from 1930 to 1960

In [ ]:
## Your code here
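
One possible solution (sketch), selecting rows and columns by label in a single .loc call:

In [ ]:
# Label slicing on the columns assumes the year columns appear in increasing order
states.loc[tristate, '1930':'1960']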

Select the states between Nevada and New York, from 1930 to 1960

In [ ]:
## Your code here
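
One possible solution (sketch). Note that label-based slices with .loc include both endpoints; this assumes the index is in alphabetical order:

In [ ]:
states.loc['Nevada':'New York', '1930':'1960']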

Select rows 10 to 15 by position

In [ ]:
## Your code here
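
One possible solution using .iloc for integer-position selection (sketch):

In [ ]:
# .iloc slices exclude the right endpoint, so 10:16 gives positions 10 through 15
states.iloc[10:16]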

Select rows 10 to 15, and the last five columns (by position)

In [ ]:
## Your code here
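
One possible solution (sketch); negative positions count from the end:

In [ ]:
states.iloc[10:16, -5:]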

Load in the iris dataset and EDA:

The iris dataset can be found within the sklearn package and contains measurement data for three species of iris (a kind of flower): 1) Setosa, 2) Versicolour, and 3) Virginica. For each flower, we have recorded the sepal length, sepal width, petal length, and petal width in centimeters. (The sepal can be thought of as the outermost petal of a flower.) These four measurements were taken on 50 unique Setosa, 50 unique Versicolour, and 50 unique Virginica flowers, for a total of 150 unique flower measurements.

In the dataset below, we will let the target variable designate the flower type by letting 0 represent Setosa, 1 represent Versicolour, and 2 represent Virginica. Now, we will load in the dataset:

In [ ]:
# Load in the dataset, which is contained in the sklearn package
# The initial version of the dataset is a dict-like container object
iris_bunch = datasets.load_iris()

# np.c_ stacks the data array and the target array column-wise.
# The target array is our "Y" variable and the data array holds the "X" variables.
iris = pd.DataFrame(data= np.c_[iris_bunch['data'], iris_bunch ['target']],
                     columns= iris_bunch['feature_names'] + ['target'])

iris.head()

We can use the describe() function to summarize our dataset. When using describe(), some care should be taken to interpret the values in light of what the data represent. For example, the count row shows we have 150 observations. The summary statistics for the target column are less meaningful than those for the other columns, since 0, 1, and 2 are simply categorical labels indicating which flower type was measured.

In [ ]:
iris.describe()

We can also use the groupby function to look at mean statistics aggregated by flower type:

In [ ]:
iris.groupby('target').mean()

It is good practice to normalize data before using a distance-based method like kNN, so that no single variable dominates the distance calculation. To do so, we can create the following function:

In [ ]:
# Min-max scales each column of x to the range [0, 1].
def normalize(x):
    num = x - np.min(x)
    denom = np.max(x) - np.min(x)
    return (num / denom)

iris.iloc[:, 0:4] = normalize(iris.iloc[:, 0:4])

iris.head()

Let's see what the describe() function returns AFTER normalization:

In [ ]:
iris.describe()

We can also use the pairplot() function to create a scatterplot matrix of our data:

In [ ]:
features_cols = iris.columns[:4]
sns.pairplot(data=iris,hue='target',vars=features_cols)

Split up the data into a training set and a test set:

Now that we have an idea of what the data look like, we would like to create a way to predict flower type (Setosa, Versicolour, or Virginica) from our 4 predictor variables. Whatever method we use, we need a way to assess how accurate the model is. Therefore, we will break the data up into a training set and a testing set. The training set will be used to train the model, while the testing set will be used to gauge how well our model does on data it has not seen, which helps us detect whether the model is overfitting.

Let us first create a function that will randomly split the data into a 70-30 split, with 70% of the data going into the training set:

In [ ]:
def split_data(data):
    # Determine the number of observations we have in our entire data set:
    # YOUR CODE GOES HERE
    
    # Create a list of integer indices ranging over our number of observations:
    # YOUR CODE GOES HERE
    
    # Use numpy's random.shuffle() function to randomly shuffle over our index:
    # YOUR CODE GOES HERE
    
    # Create a list for the first 70% of the shuffled indices and set to training: 
    # YOUR CODE GOES HERE
    
    # Create a list for the remaining 30% of the shuffled indices and set to testing:
    # YOUR CODE GOES HERE
    
    # Use the list of training indices to find the corresponding data entries:
    # YOUR CODE GOES HERE
    
    # Use the list of testing indices to find the corresponding data entries:
    # YOUR CODE GOES HERE
    
    # Return two dataframes, one with the testing data and one with the training data:
    return train, test
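
One way to fill in the skeleton above is sketched below, using numpy to shuffle integer positions and .iloc to pull out the corresponding rows:

In [ ]:
def split_data(data):
    # Number of observations in the dataset
    length = len(data)

    # Integer indices ranging over our number of observations
    indices = list(range(length))

    # Shuffle the indices in place
    np.random.shuffle(indices)

    # First 70% of the shuffled indices become the training set, the rest the test set
    train_cutoff = int(0.7 * length)
    train_indices = indices[:train_cutoff]
    test_indices = indices[train_cutoff:]

    # Use the index lists to select the corresponding rows by position
    train = data.iloc[train_indices]
    test = data.iloc[test_indices]

    return train, test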

We will now run the function and see if it actually returns what we want:

In [ ]:
iris_train,iris_test  = split_data(iris)

# Return the dimensions of our training dataframe after using the split_data function:
# YOUR CODE GOES HERE
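# One possibility (sketch): .shape returns (rows, columns),
# so a 70-30 split of 150 rows should give about 105 training rows
iris_train.shape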

Alternative approach using train_test_split from sklearn

In [ ]:
train, test = train_test_split(iris, test_size=0.3)
train.shape

Implementing the kNN Algorithm by hand:

To really understand how the kNN algorithm works, it helps to go through the algorithm line by line in code.

In [ ]:
def knn_algorithm(train, test, k):
    
    # Create an empty list to store our predictions in:
    predictions = []
    
    predictor_cols = [col for col in train.columns if col != 'target']
    
    # Separate the response and predictor variables from training and test set:
    train_x = train[predictor_cols]
    train_y = train['target']
    test_x  = test[predictor_cols]
    test_y  = test['target']
    
    for index, row in test_x.iterrows():

        # For each test point, store the distance between all training points and test point
        vec_distances = pd.DataFrame((train_x.values - row.values)**2, index=train.index, columns = train_x.columns)

        # Then, we sum across the columns per row to obtain the Euclidean distance squared
        distances = vec_distances.sum(axis = 1)
        
        # Sort the distances to training points (in ascending order) and take first k points
        nearest_k = distances.sort_values().iloc[:k]
        # For simplicity, we omitted the square rooting of the Euclidean distance because the
        # square root function preserves order. 
        
        # Take the mean of the y-values of the training set corresponding to the nearest k points
        k_mean = train_y[nearest_k.index].mean()
        
        # Add on the mean to our predicted y-value list
        predictions.append(k_mean)
    
    # Create a dataframe with the x-values from test and predicted y-values  
    predict = test.copy()  
    predict['target'] = pd.Series(predictions, index=test.index)
    
    return predict

Now to run the algorithm on our dataset with $k = 5$:

In [ ]:
k = 5
predicted_knn = knn_algorithm(iris_train, iris_test, k)
predicted_knn

We want to have a way to evaluate our predictions from the kNN algorithm with $k=5$. One way is to compute the $R^2$ coefficient. Let's create a function for that:
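
Recall that for observed values $y_i$ with mean $\bar{y}$ and predictions $\hat{y}_i$,
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$
so a perfect fit gives $R^2 = 1$, while a model that always predicts the mean gives $R^2 = 0$.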

In [ ]:
def evaluate(predicted, true):
    # Find the squared error:
    squared_error = (predicted['target'] - true['target'])**2
    
    # Compute the residual sum of squares and the total sum of squares about the mean:
    error_var = squared_error.sum()
    sample_var = ((true['target'] - true['target'].mean())**2).sum()
    
    r = (1 - (error_var / sample_var))
    
    return r

Then let's apply this function to our predictions:

In [ ]:
evaluate(predicted_knn, iris_test)

We see that the $R^2$ coefficient for the nearest-neighbors implementation with $k=5$ is roughly $0.97$ (the exact value depends on the random split), which should more or less match what we get with the sklearn package.

Now using sklearn to implement kNN:

We will now use the sklearn package to implement kNN. First, we split our data using the train_test_split function from sklearn.

In [ ]:
# Now we can use sklearn's train_test_split function to split our data:
train, test =  train_test_split(iris, test_size=.3)

x_train, x_test = train[features_cols], test[features_cols]
y_train, y_test = train['target'], test['target']

Then, we can fit the model and use various metrics, including a confusion matrix, to assess our accuracy:

In [ ]:
# Set kNN parameter:
k = 10

# Now we can fit the model, predict our variable of interest, and then evaluate our fit:
# First, we create the classifier object:
neighbors = KNeighborsClassifier(n_neighbors=k)

# Then, we fit the model using x_train as training data and y_train as target values:
neighbors.fit(x_train, y_train)

# Retrieve our predictions:
prediction_knn = neighbors.predict(x_test)

# score() returns the mean accuracy on the given test data and labels -- that is,
# the fraction of test points whose class is predicted correctly
# (note this is classification accuracy, not the R^2 used above).
accuracy = neighbors.score(x_test, y_test)
accuracy
In [ ]:
expected_knn = y_test
predicted_knn = neighbors.predict(x_test)
print(metrics.classification_report(expected_knn, predicted_knn))
In [ ]:
pd.DataFrame(metrics.confusion_matrix(expected_knn, predicted_knn))

In the matrix above, diagonal entries count correct predictions, while off-diagonal entries count misclassifications; for example, an entry in row 2, column 1 counts Virginica flowers (class 2) that were mis-classified as Versicolour (class 1). (The exact counts will vary with the random split.) A confusion matrix lets us view where our inaccurate predictions lie in a simple "snapshot"-style matrix.

Linear and Polynomial Regression

We just went over the kNN prediction method. Now, we will fit the same data with linear and polynomial regression models.

Linear Regression:

We will use the training/testing dataset as before and create our linear regression objects.

In [ ]:
# We must first create the linear regression object from sklearn:
regr = LinearRegression()
# Then, we will put in the training sets in for the .fit() function:
regr.fit(x_train, y_train)
# This prints the regression coefficients of our model:
print(regr.coef_)
In [ ]:
import statsmodels.api as sm
# We must first create the linear regression object from stats model:
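# Note: unlike sklearn's LinearRegression, statsmodels' OLS does not add an intercept
# by default; wrap the predictors with sm.add_constant(x_train) if an intercept is desired.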
model = sm.OLS(y_train.values, x_train)
regr = model.fit()
# This prints the regression coefficients of our model:
regr.params

Now, we will compute metrics that can be used to assess fit:

In [ ]:
# To compute the mean squared error (notice that we are now using the TEST set):
np.mean((regr.predict(x_test)-y_test)**2)
In [ ]:
regr.summary()

Instead of focusing on $R^2$, let's look at the classification statistics. To do so, we round the continuous regression predictions to the nearest class label:

In [ ]:
predicted_linear = np.round(regr.predict(x_test))
print(metrics.classification_report(expected_knn, predicted_linear))
In [ ]:
pd.DataFrame(metrics.confusion_matrix(expected_knn, predicted_linear))

Polynomial Regression

Polynomial regression is useful when you suspect a non-linear relationship between the predictor variables $x$ and the conditional expectation of $y$. Specifically, it is a special case of linear regression where the predictors are expanded into the terms of an $n$th-degree polynomial. In Python, we can create the polynomial features through scikit-learn's PolynomialFeatures class and then use ordinary linear regression on the expanded features to implement a polynomial regression model.
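
As a quick illustration (a sketch with a made-up two-column input), a degree-2 PolynomialFeatures transform maps each row $[a, b]$ to $[1,\, a,\, b,\, a^2,\, ab,\, b^2]$:

In [ ]:
# Sketch: degree-2 polynomial features of a toy input
toy = np.array([[2., 3.]])
PolynomialFeatures(degree=2).fit_transform(toy)
# -> array([[1., 2., 3., 4., 6., 9.]])   (bias, a, b, a^2, a*b, b^2)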

We first need to create the PolynomialFeatures object and specify the degree of the polynomial we wish to use:

In [ ]:
# Create the PolynomialFeatures object and specify the number of degrees:
degree = 2
poly = PolynomialFeatures(degree)

# Fit the transformer on the training set, then apply the same transformation to the test set
x_train_poly = poly.fit_transform(x_train)
x_test_poly = poly.transform(x_test)
pd.DataFrame(x_train_poly).shape
In [ ]:
# Create a linear regression object
lg = LinearRegression()

# Fit our training data with polynomial features 
lg.fit(x_train_poly, y_train)

# Obtain coefficients
lg.coef_

Now we can also do prediction:

In [ ]:
# Predict from our fitted polynomial regression model on our test set:
predicted_poly = lg.predict(x_test_poly)
predicted_poly
In [ ]:
# Compute the mean squared error for polynomial regression:
np.mean((predicted_poly - y_test)**2)
In [ ]:
predicted_poly_class = np.round(predicted_poly)
print(metrics.classification_report(expected_knn, predicted_poly_class))
In [ ]:
pd.DataFrame(metrics.confusion_matrix(expected_knn, predicted_poly_class))