Key Word(s): models, parametric, nonparametric, debugging, training data, testing data

Title

Exercise: 1 - Debugging

Description

For this exercise, we will be using the Boston house prices dataset that comes with sklearn. Specifically, we only care about one feature (aka predictor), the CRIM (crime) column, which is the very first column in the dataset and represents the per capita crime rate in each town. More info about the dataset and its columns is available here.

First, run every cell. Try to find the errors and fix them. To be clear, bugs may exist in lines of code that are commented or not commented. Bugs may require you to add new lines of code or delete some lines of code. Also, our comments represent what we (the hypothetical student) wish for our code to do, so ensure that the code agrees with our comments. In other words, the bugs aren't in the comments, but are in the code.
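
One way to start hunting for bugs like these is to inspect the shape of each array before handing it to a model, since many errors come from arrays with unexpected dimensions. The sketch below is only an illustration on made-up data; `describe_array` is a hypothetical helper, not part of this exercise or of any library.

```python
import numpy as np

def describe_array(name, arr):
    """Return a one-line summary of an array's shape and dtype (hypothetical helper)."""
    arr = np.asarray(arr)
    return f"{name}: shape={arr.shape}, ndim={arr.ndim}, dtype={arr.dtype}"

# made-up data, NOT the exercise dataset
features = np.arange(12, dtype=float).reshape(4, 3)
first_column = features[:, 0]

print(describe_array("features", features))
print(describe_array("first_column", first_column))
```

Printing a summary like this after each step makes it easier to spot where an array stops looking the way the comments say it should.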

NOTE: The funky comment lines like ### edTest(test_a) ### allow us to unit test and evaluate your code. Ignore those lines; do not edit them. Upon 'marking' your code, if you see that a particular unit test doesn't pass, you can pinpoint the erroneous cell by its funky name (e.g., test_a).

CS109A Introduction to Data Science

Lecture 13, Exercise 1: Debugging

Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner

In [ ]:

###########################################
##   DO NOT EDIT ANYTHING IN THIS CELL   ##
###########################################
import matplotlib.pyplot as plt
import math
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# ensures consistency between everyone's runs
np.random.seed(100)
###########################################
##   DO NOT EDIT ANYTHING IN THIS CELL   ##
###########################################

REMINDER: After running every cell, be sure to auto-grade your work by clicking 'Mark' in the lower-right corner. Otherwise, no credit will be given.

In [ ]:

# function to plot data with model predictions
def plot_model(x, y_true, y_pred):
    plt.figure()
    ax = plt.gca()
    ax.scatter(x, y_true)

    # sort both x and y_pred for plot
    y_pred_sorted = [y for _,y in sorted(zip(x, y_pred))]
    x_sorted = sorted(x)
    ax.plot(x_sorted, y_pred_sorted, "C1", linewidth=4)

    # label the x-axis "Crime" and the y-axis "Median Value"
    ax.xlabel("Crime")
    ax.xlabel("Median Value")
    plt.show()
    
    return ax ### DO NOT EDIT THIS LINE

In [ ]:

### edTest(test_a) ###
x, y = load_boston(return_X_y=True)

# let's use just the 1st column (crime statistics)
x = x[:,0]

# create a random train/test split with the test size being 0.2 and a random_state of 2
x_train, x_test = train_test_split(x, test_size=0.2, random_state=2)
y_train, y_test = train_test_split(y, test_size=0.2, random_state=2)

In [ ]:

### edTest(test_b) ###

# create polynomial (cubic) features 
poly = PolynomialFeatures(3)
x_poly_train = poly.fit_transform(x_train)
x_poly_test = poly.transform(x_test)

# fit the model
lr = LinearRegression()
lr.fit(x_poly_train, y_train)

In [ ]:

### edTest(test_c) ###

# make predictions
y_pred = lr.predict(x_poly_train)

# plot the model
# NOTE: WHATEVER CODE YOU CHOOSE TO WRITE, MAKE SURE YOU ASSIGN
# `ax` to equal the return value of plot_model()
# otherwise, our unit test will be unable to auto-grade your work
ax = plot_model(x_poly_train, y_train, y_pred)

NOTE: The above cell checks to see if your predictions are good, but we cannot auto-check to see if your plot values are correct. Your plot's values should look as follows:

In [ ]:

# inputs: a polynomial linear regression model
# purpose: prints the intercept and coefficients of model
# returns: the MSE of the polynomial model
def get_metrics(linreg):

    # view coefficients
    print("Intersects:", lr.intercept)
    print("Coefficients:", lr.coef)
    
    # calculate test MSE
    mse = mean_squared_error(y_test, y_pred)

    return mse ## DO NOT EDIT THIS LINE
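
When a metric looks suspicious, it can help to recompute it by hand on a tiny example whose answer is easy to verify. The sketch below computes mean squared error directly with NumPy on made-up numbers; the names `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the variables in this exercise.

```python
import numpy as np

# made-up values, NOT taken from the exercise
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# MSE is the average of the squared differences between truth and prediction
squared_errors = (y_true_demo - y_pred_demo) ** 2
mse_demo = squared_errors.mean()

print("squared errors:", squared_errors)
print("MSE:", mse_demo)
```

Here the squared errors are 1, 0, and 4, so the mean is 5/3. Note that both arrays must have the same length and describe the same observations for the result to be meaningful.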

In [ ]:

### edTest(test_d) ###
mse = get_metrics(lr) ## DO NOT EDIT THIS LINE