{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"Exercise: Finding the Best k in kNN Regression\n",
"\n",
"## Description :\n",
"The goal here is to find the value of k of the best performing model based on the test MSE.\n",
"\n",
"
\n",
"\n",
"## Data Description:\n",
"\n",
"## Instructions:\n",
"- Read the data into a Pandas dataframe object. \n",
"- Select the sales column as the response variable and TV budget column as the predictor variable.\n",
"- Make a train-test split using sklearn.model_selection.train_test_split .\n",
"- Create a list of integer k values using numpy.linspace .\n",
"- For each value of k\n",
" - Fit a kNN regression on train set.\n",
" - Calculate MSE on test set and store it.\n",
"- Plot the test MSE values for each k.\n",
"- Find the k value associated with the lowest test MSE.\n",
"\n",
"\n",
"## Hints: \n",
"\n",
"train_test_split(X,y)\n",
"Split arrays or matrices into random train and test subsets. \n",
"\n",
"np.linspace()\n",
"Returns evenly spaced numbers over a specified interval.\n",
"\n",
"KNeighborsRegressor(n_neighbors=k_value)\n",
"Regression-based on k-nearest neighbors. \n",
"\n",
"model.predict()\n",
"Predict the target for the provided data.\n",
"\n",
"mean_squared_error()\n",
"Computes the mean squared error regression loss.\n",
"\n",
"dict.keys()\n",
"Returns a view object that displays a list of all the keys in the dictionary.\n",
"\n",
"dict.values()\n",
"Returns a list of all the values available in a given dictionary.\n",
"\n",
"plt.plot()\n",
"Plot y versus x as lines and/or markers.\n",
"\n",
"dict.items()\n",
"Returns a list of dict's (key, value) tuple pairs.\n",
"\n",
"\n",
"**Note:** This exercise is auto-graded and you can try multiple attempts. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import numpy as np\n",
"import pandas as pd \n",
"import matplotlib.pyplot as plt\n",
"from sklearn.utils import shuffle\n",
"from sklearn.metrics import r2_score\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.neighbors import KNeighborsRegressor\n",
"from sklearn.model_selection import train_test_split\n",
"%matplotlib inline\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading the standard Advertising dataset"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Read the file 'Advertising.csv' into a Pandas dataset\n",
"df = pd.read_csv('Advertising.csv')\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Take a quick look at the data\n",
"df.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Set the 'TV' column as predictor variable\n",
"x = df[[___]]\n",
"\n",
"# Set the 'Sales' column as response variable \n",
"y = df[___]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train-Test split"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_shape) ###\n",
"# Split the dataset in training and testing with 60% training set and \n",
"# 40% testing set \n",
"x_train, x_test, y_train, y_test = train_test_split(___,___,train_size=___,random_state=66)\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_nums) ###\n",
"# Choosing k range from 1 to 70\n",
"k_value_min = 1\n",
"k_value_max = 70\n",
"\n",
"# Create a list of integer k values between k_value_min and \n",
"# k_value_max using linspace\n",
"k_list = np.linspace(k_value_min,k_value_max,num=70,dtype=int)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model fit"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Setup a grid for plotting the data and predictions\n",
"fig, ax = plt.subplots(figsize=(10,6))\n",
"\n",
"# Create a dictionary to store the k value against MSE fit {k: MSE@k} \n",
"knn_dict = {}\n",
"\n",
"# Variable used for altering the linewidth of values kNN models\n",
"j=0\n",
"\n",
"# Loop over all k values\n",
"for k_value in k_list: \n",
" \n",
" # Create a KNN Regression model for the current k\n",
" model = KNeighborsRegressor(n_neighbors=int(___))\n",
" \n",
" # Fit the model on the train data\n",
" model.fit(x_train,y_train)\n",
" \n",
" # Use the trained model to predict on the test data\n",
" y_pred = model.predict(___)\n",
" \n",
" # Calculate the MSE of the test data predictions\n",
" MSE = ____\n",
"\n",
" # Store the MSE values of each k value in the dictionary\n",
" knn_dict[k_value] = ___\n",
" \n",
" \n",
" # Helper code to plot the data and various kNN model predictions\n",
" colors = ['grey','r','b']\n",
" if k_value in [1,10,70]:\n",
" xvals = np.linspace(x.min(),x.max(),100)\n",
" ypreds = model.predict(xvals)\n",
" ax.plot(xvals, ypreds,'-',label = f'k = {int(k_value)}',linewidth=j+2,color = colors[j])\n",
" j+=1\n",
" \n",
"ax.legend(loc='lower right',fontsize=20)\n",
"ax.plot(x_train, y_train,'x',label='test',color='k')\n",
"ax.set_xlabel('TV budget in $1000',fontsize=20)\n",
"ax.set_ylabel('Sales in $1000',fontsize=20)\n",
"plt.tight_layout()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Graph plot"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Plot a graph which depicts the relation between the k values and MSE\n",
"plt.figure(figsize=(8,6))\n",
"plt.plot(___, ___,'k.-',alpha=0.5,linewidth=2)\n",
"\n",
"# Set the title and axis labels\n",
"plt.xlabel('k',fontsize=20)\n",
"plt.ylabel('MSE',fontsize = 20)\n",
"plt.title('Test $MSE$ values for different k values - KNN regression',fontsize=20)\n",
"plt.tight_layout()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Find the best knn model"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mse) ###\n",
"\n",
"# Find the lowest MSE among all the kNN models\n",
"min_mse = min(___)\n",
"\n",
"# Use list comprehensions to find the k value associated with the lowest MSE\n",
"best_model = [key for (key, value) in knn_dict.items() if value == min_mse]\n",
"\n",
"# Print the best k-value\n",
"print (\"The best k value is \",best_model,\"with a MSE of \", min_mse)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ⏸ From the options below, how would you classify the \"goodness\" of your model?\n",
"\n",
"#### A. Good\n",
"#### B. Satisfactory\n",
"#### C. Bad"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_chow1) ###\n",
"# Submit an answer choice as a string below (eg. if you choose option A, put 'A')\n",
"answer1 = '___'\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to compute the R2_score of your best model\n",
"model = KNeighborsRegressor(n_neighbors=best_model[0])\n",
"model.fit(x_train,y_train)\n",
"y_pred_test = model.predict(x_test)\n",
"\n",
"# Print the R2 score of the model\n",
"print(f\"The R2 score for your model is {r2_score(y_test, y_pred_test)}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ⏸ After observing the $R^2$ value, how would you now classify your model?\n",
"\n",
"#### A. Good\n",
"#### B. Satisfactory\n",
"#### C. Bad\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_chow2) ###\n",
"# Submit an answer choice as a string below (eg. if you choose option A, put 'A')\n",
"answer2 = '___'\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}