Exercise: B.1 - Finding best k in kNN regression


The goal here is to find the value of k that gives the best-performing kNN regression model, as judged by the test MSE.


  • Read the data into a dataframe object using pandas.read_csv
  • Select the sales column as the response variable and TV budget column as the predictor variable
  • Make a train-test split using sklearn.model_selection.train_test_split
  • Create a list of integer k values using numpy.linspace
  • For each value of k
    • Fit a knn regression on train set
    • Calculate MSE on test set and store it
  • Plot the test MSE values for each k
  • Find the k value associated with the lowest test MSE


train_test_split(X,y) : Split arrays or matrices into random train and test subsets.

np.linspace() : Returns evenly spaced numbers over a specified interval.

KNeighborsRegressor(n_neighbors=k_value) : Regression based on k-nearest neighbors.

mean_squared_error() : Computes the mean squared error regression loss.

dict.keys() : Returns a view object of all the keys in the dictionary.

dict.values() : Returns a view object of all the values in the dictionary.

dict.items() : Returns a view object of the dictionary's (key, value) tuple pairs.
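
As a small illustration of the three dictionary methods above, the sketch below uses a made-up dictionary (`prices` and its contents are hypothetical and unrelated to the Advertising data). It shows that the views can be converted to lists, and that a list comprehension over `items()` can recover the key(s) that hold the smallest value.

```python
# Hypothetical toy dictionary: item name -> price
prices = {"apple": 1.5, "pear": 0.9, "plum": 2.1}

# The view objects can be turned into ordinary lists when needed
keys = list(prices.keys())
values = list(prices.values())

# Smallest value, then every key whose value equals it
lowest = min(prices.values())
cheapest = [name for name, price in prices.items() if price == lowest]

print(keys, values, lowest, cheapest)
```

The comprehension returns a list because more than one key could share the minimum value.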

Note: This exercise is auto-graded and you can try multiple attempts.

In [1]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

%matplotlib inline

Reading the standard Advertising dataset

In [ ]:
# Data set used in this exercise

data_filename = 'Advertising.csv'

# Read advertising.csv file using the pandas library (using pandas.read_csv)
df = pd.read_csv(data_filename)
In [ ]:
# Take a quick look at the data
df.head()

In [3]:
# Select the 'TV' column as predictor variable and 'Sales' column as response variable 

x = df[[___]]
y = df[___]

Train-Test split

In [21]:
### edTest(test_shape) ###
# Split the dataset into a training set (60%) and a testing set (40%)

x_train, x_test, y_train, y_test = train_test_split(___,___,train_size=___,random_state=66)
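
To see what `train_size` does to the shapes of the returned arrays, here is a minimal sketch on made-up data (`X_demo` and `y_demo` are hypothetical, and `random_state=0` is an arbitrary choice, not the value used in this exercise). With 10 rows and `train_size=0.6`, 6 rows should go to the training set and 4 to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, one feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = 2 * X_demo.ravel()

# 60% of the rows go to the training set, the rest to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.6, random_state=0
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```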
In [28]:
### edTest(test_nums) ###
# Choosing k range from 1 to 70
k_value_min = 1
k_value_max = 70

# creating list of integer k values between k_value_min and k_value_max using linspace
k_list = np.linspace(k_value_min,k_value_max,num=70,dtype=int)
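
A note on `np.linspace` with `dtype=int`: the integers come out evenly spaced only when the step between points is itself a whole number, as it is for 70 points between 1 and 70. The sketch below uses smaller, illustrative ranges (not the ones in this exercise) to show both cases.

```python
import numpy as np

# Step is exactly 1, so the result is every integer from 1 to 10
even = np.linspace(1, 10, num=10, dtype=int)

# Step is 2.25 (1, 3.25, 5.5, 7.75, 10); the integer cast drops the fractions
uneven = np.linspace(1, 10, num=5, dtype=int)

print(even)
print(uneven)
```

When every integer in a range is wanted, `np.arange(1, 11)` would give the same values as `even`.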

Model fit

In [29]:
# Hint: use mean_squared_error
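
For reference, `mean_squared_error` is the average of the squared differences between the true and predicted values. The sketch below checks this on a few made-up numbers (`y_true` and `y_hat` are hypothetical, not taken from the Advertising data).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Manual MSE: squared errors are 1, 0 and 4, so the mean is 5/3
manual = np.mean((y_true - y_hat) ** 2)

# The library call should agree with the manual computation
library = mean_squared_error(y_true, y_hat)

print(manual, library)
```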
In [ ]:
fig, ax = plt.subplots(figsize=(10,6))

# creating a dictionary for storing k value against MSE fit {k: MSE@k} 

knn_dict = {}

# index into `colors` (and line width) for the highlighted fits
j = 0
# Looping over k values
for k_value in k_list:   
    # creating KNN Regression model 
    model = KNeighborsRegressor(n_neighbors=int(___))
    # fitting model on the training set
    model.fit(x_train, y_train)
    # predictions
    y_pred = model.predict(___)
    # Calculating MSE 
    MSE = ____

    # Storing the MSE values of each k value in a dictionary
    knn_dict[k_value] = ___
    ## Plotting
    colors = ['grey','r','b']
    if k_value in [1,10,70]:
        xvals = np.linspace(x.min(),x.max(),100)
        ypreds = model.predict(xvals)
        ax.plot(xvals, ypreds,'-',label = f'k = {int(k_value)}',linewidth=j+2,color = colors[j])
        j += 1
# plot the training points before building the legend so their label is included
ax.plot(x_train, y_train,'x',label='train',color='k')
ax.legend(loc='lower right',fontsize=20)
ax.set_xlabel('TV budget in $1000',fontsize=20)
ax.set_ylabel('Sales in $1000',fontsize=20)
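
To build intuition for why the fitted curves differ so much across k, the sketch below looks at the two extremes on a made-up four-point dataset (`X_toy` and `y_toy` are hypothetical). With k = 1 each training point is its own nearest neighbour, so the training targets are reproduced exactly; with k equal to the number of training points, every prediction is the mean of all targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: 4 distinct points, one feature
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([1.0, 3.0, 2.0, 6.0])

# k = 1: each prediction copies the target of the single closest point
knn_one = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy)
pred_one = knn_one.predict(X_toy)

# k = 4 (all points): each prediction is the average of every target
knn_all = KNeighborsRegressor(n_neighbors=4).fit(X_toy, y_toy)
pred_all = knn_all.predict(X_toy)

print(pred_one)  # same as y_toy
print(pred_all)  # constant, equal to y_toy.mean()
```

Small k follows the training data closely, while large k smooths toward the overall mean, which is why the test MSE is usually lowest somewhere in between.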

Graph plot

In [ ]:
# Plot k against MSE

plt.plot(___, ___,'k.-',alpha=0.5,linewidth=2)

plt.xlabel('k',fontsize = 20)
plt.ylabel('MSE',fontsize = 20)
plt.title('Test $MSE$ values for different k values - KNN regression',fontsize=20)

Find the best knn model

In [ ]:
### edTest(test_mse) ###

# Looking for k with minimum MSE

min_mse = min(___)
best_model = ___       # HINT YOU MAY USE LIST COMPREHENSION 
print("The best k value is", best_model, "with an MSE of", min_mse)

How good is your model?

From the options below, how would you classify your model?

  1. Good
  2. Satisfactory
  3. Bad
In [ ]:
# Your answer here 
In [ ]:
# Run this cell to calculate the R2_score of your best model
model = KNeighborsRegressor(n_neighbors=best_model[0])
# the model must be fit on the training set before it can predict
model.fit(x_train, y_train)
y_pred_test = model.predict(x_test)

print(f"The R2 score for your model is {r2_score(y_test, y_pred_test)}")
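
As a reminder of what `r2_score` measures, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below verifies this on made-up numbers (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 1.5, 3.5, 3.5])

# Sum of squared residuals: four errors of 0.5 each, so 4 * 0.25 = 1.0
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares around the mean (2.5): 2.25 + 0.25 + 0.25 + 2.25 = 5.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A model that always predicts the mean of the true values scores 0 on this measure, and a perfect model scores 1, which gives a scale for judging the value printed above.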

After observing the $R^2$ value, how would you now classify your model?

  1. Good
  2. Satisfactory
  3. Bad
In [1]:
# your answer here