{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Finding the Best k in kNN Regression\n", "\n", "## Description :\n", "The goal here is to find the value of k of the best performing model based on the test MSE.\n", "\n", "\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "- Read the data into a Pandas dataframe object. \n", "- Select the sales column as the response variable and TV budget column as the predictor variable.\n", "- Make a train-test split using sklearn.model_selection.train_test_split .\n", "- Create a list of integer k values using numpy.linspace .\n", "- For each value of k\n", " - Fit a kNN regression on train set.\n", " - Calculate MSE on test set and store it.\n", "- Plot the test MSE values for each k.\n", "- Find the k value associated with the lowest test MSE.\n", "\n", "\n", "## Hints: \n", "\n", "train_test_split(X,y)\n", "Split arrays or matrices into random train and test subsets. \n", "\n", "np.linspace()\n", "Returns evenly spaced numbers over a specified interval.\n", "\n", "KNeighborsRegressor(n_neighbors=k_value)\n", "Regression-based on k-nearest neighbors. \n", "\n", "model.predict()\n", "Predict the target for the provided data.\n", "\n", "mean_squared_error()\n", "Computes the mean squared error regression loss.\n", "\n", "dict.keys()\n", "Returns a view object that displays a list of all the keys in the dictionary.\n", "\n", "dict.values()\n", "Returns a list of all the values available in a given dictionary.\n", "\n", "plt.plot()\n", "Plot y versus x as lines and/or markers.\n", "\n", "dict.items()\n", "Returns a list of dict's (key, value) tuple pairs.\n", "\n", "\n", "**Note:** This exercise is auto-graded and you can try multiple attempts. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import numpy as np\n", "import pandas as pd \n", "import matplotlib.pyplot as plt\n", "from sklearn.utils import shuffle\n", "from sklearn.metrics import r2_score\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.model_selection import train_test_split\n", "%matplotlib inline\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading the standard Advertising dataset" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Read the file 'Advertising.csv' into a Pandas dataset\n", "df = pd.read_csv('Advertising.csv')\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Take a quick look at the data\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Set the 'TV' column as predictor variable\n", "x = df[[___]]\n", "\n", "# Set the 'Sales' column as response variable \n", "y = df[___]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train-Test split" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "### edTest(test_shape) ###\n", "# Split the dataset in training and testing with 60% training set and \n", "# 40% testing set \n", "x_train, x_test, y_train, y_test = train_test_split(___,___,train_size=___,random_state=66)\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "### edTest(test_nums) ###\n", "# Choosing k range from 1 to 70\n", "k_value_min = 1\n", "k_value_max = 70\n", "\n", "# Create a list of integer k values between k_value_min and \n", "# k_value_max using linspace\n", "k_list = np.linspace(k_value_min,k_value_max,num=70,dtype=int)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model fit" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Setup a grid for plotting the data and predictions\n", "fig, ax = plt.subplots(figsize=(10,6))\n", "\n", "# Create a dictionary to store the k value against MSE fit {k: MSE@k} \n", "knn_dict = {}\n", "\n", "# Variable used for altering the linewidth of values kNN models\n", "j=0\n", "\n", "# Loop over all k values\n", "for k_value in k_list: \n", " \n", " # Create a KNN Regression model for the current k\n", " model = KNeighborsRegressor(n_neighbors=int(___))\n", " \n", " # Fit the model on the train data\n", " model.fit(x_train,y_train)\n", " \n", " # Use the trained model to predict on the test data\n", " y_pred = model.predict(___)\n", " \n", " # Calculate the MSE of the test data predictions\n", " MSE = ____\n", "\n", " # Store the MSE values of each k value in the dictionary\n", " knn_dict[k_value] = ___\n", " \n", " \n", " # Helper code to plot the data and various kNN model predictions\n", " colors = ['grey','r','b']\n", " if k_value in [1,10,70]:\n", " xvals = np.linspace(x.min(),x.max(),100)\n", " ypreds = model.predict(xvals)\n", " ax.plot(xvals, ypreds,'-',label = f'k = {int(k_value)}',linewidth=j+2,color = colors[j])\n", " j+=1\n", " \n", "ax.legend(loc='lower right',fontsize=20)\n", "ax.plot(x_train, y_train,'x',label='test',color='k')\n", "ax.set_xlabel('TV budget in $1000',fontsize=20)\n", "ax.set_ylabel('Sales in $1000',fontsize=20)\n", "plt.tight_layout()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graph plot" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Plot a graph which depicts the relation between the k values and MSE\n", "plt.figure(figsize=(8,6))\n", "plt.plot(___, ___,'k.-',alpha=0.5,linewidth=2)\n", "\n", "# Set the title and axis labels\n", "plt.xlabel('k',fontsize=20)\n", "plt.ylabel('MSE',fontsize = 20)\n", "plt.title('Test $MSE$ values for different k values - KNN regression',fontsize=20)\n", "plt.tight_layout()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find the best knn model" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mse) ###\n", "\n", "# Find the lowest MSE among all the kNN models\n", "min_mse = min(___)\n", "\n", "# Use list comprehensions to find the k value associated with the lowest MSE\n", "best_model = [key for (key, value) in knn_dict.items() if value == min_mse]\n", "\n", "# Print the best k-value\n", "print (\"The best k value is \",best_model,\"with a MSE of \", min_mse)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ⏸ From the options below, how would you classify the \"goodness\" of your model?\n", "\n", "#### A. Good\n", "#### B. Satisfactory\n", "#### C. Bad" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Submit an answer choice as a string below (eg. if you choose option A, put 'A')\n", "answer1 = '___'\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to compute the R2_score of your best model\n", "model = KNeighborsRegressor(n_neighbors=best_model[0])\n", "model.fit(x_train,y_train)\n", "y_pred_test = model.predict(x_test)\n", "\n", "# Print the R2 score of the model\n", "print(f\"The R2 score for your model is {r2_score(y_test, y_pred_test)}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ⏸ After observing the $R^2$ value, how would you now classify your model?\n", "\n", "#### A. Good\n", "#### B. Satisfactory\n", "#### C. Bad\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow2) ###\n", "# Submit an answer choice as a string below (eg. if you choose option A, put 'A')\n", "answer2 = '___'\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }