{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Simple kNN Regression\n", "\n", "## Description :\n", "The goal of this exercise is to **re-create the plots** given below. You will have come across these graphs in the lecture as well.\n", "\n", "\n", "\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "\n", "Part 1: KNN by hand for k=1\n", "- Read the Advertising data.\n", "- Get a subset of the data from row 5 to row 13.\n", "- Apply the kNN algorithm by hand and plot the first graph as given above.\n", "\n", "Part 2: Using the sklearn package\n", "- Read the Advertising dataset.\n", "- Split the data into train and test sets using the `train_test_split()` function.\n", "- Set `k_list` as the possible k values ranging from 1 to 70.\n", "- For each value of `k` in `k_list`:\n", "    - Use sklearn's `KNeighborsRegressor()` to fit the train data.\n", "    - Predict on the test data.\n", "    - Use the helper code to get the second plot above for k=1,10,70.\n", "\n", "\n", "\n", "## Hints:\n", "\n", "np.argsort()\n", "Returns the indices that would sort an array.\n", "\n", "df.iloc[]\n", "Returns a subset of the dataframe at the integer row/column positions passed as the argument.\n", "\n", "plt.plot()\n", "Plot y versus x as lines and/or markers.\n", "\n", "df.values\n", "Returns a NumPy representation of the DataFrame.\n", "\n", "pd.Series.idxmin()\n", "Returns the index of the first occurrence of the minimum over the requested axis.\n", "\n", "np.min()\n", "Returns the minimum along a given axis.\n", "\n", "np.max()\n", "Returns the maximum along a given axis.\n", "\n", "model.fit()\n", "Fit the k-nearest neighbors regressor from the training dataset.\n", "\n", "model.predict()\n", "Predict the target for the provided data.\n", "\n", "np.zeros()\n", "Returns a new array of given shape and type, filled with zeros.\n", "\n", "train_test_split(X,y)\n", "Split arrays or matrices into random train and test subsets.\n", "\n", "np.linspace()\n", "Returns evenly spaced numbers over a specified interval.\n", "\n", "KNeighborsRegressor(n_neighbors=k_value)\n", "Regression based on k-nearest neighbors.\n", "\n", "**Note:** This exercise is auto-graded, so please remember to set all the parameters to the values mentioned in the scaffold before marking." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import required libraries\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.model_selection import train_test_split\n", "%matplotlib inline\n" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Read the data from the file \"Advertising.csv\"\n", "filename = 'Advertising.csv'\n", "df_adv = pd.read_csv(filename)\n" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "      TV  Radio  Newspaper  Sales\n", "0  230.1   37.8       69.2   22.1\n", "1   44.5   39.3       45.1   10.4\n", "2   17.2   45.9       69.3    9.3\n", "3  151.5   41.3       58.5   18.5\n", "4  180.8   10.8       58.4   12.9" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Take a quick look at the dataset\n", "df_adv.head()\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: KNN by hand for $k=1$" ] },
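{ "cell_type": "markdown", "metadata": {}, "source": [ "*(Optional, not graded.)* Before working through the scaffold below, here is a minimal sketch of the $k=1$ idea described in the instructions: for a query value, compute the absolute distance to every training point and return the response of the closest one. The names `x_demo`, `y_demo` and `x_query` are made-up toy values for illustration only; they are not taken from Advertising.csv." ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only (not part of the graded scaffold):\n", "# a k = 1 nearest-neighbor lookup with plain NumPy on toy data.\n", "x_demo = np.array([10., 40., 90., 150.])   # toy predictor values\n", "y_demo = np.array([7., 9., 15., 19.])      # toy response values\n", "\n", "x_query = 100.\n", "\n", "# Index of the training point closest to the query\n", "nearest_idx = np.argmin(np.abs(x_demo - x_query))\n", "\n", "# The k = 1 prediction is simply that point's response\n", "print(nearest_idx, y_demo[nearest_idx])   # -> 2 15.0\n" ] },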
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Get a subset of the data i.e. rows 5 to 13\n", "# Use the TV column as the predictor\n", "x_true = df_adv.TV.iloc[5:13]\n", "\n", "# Use the Sales column as the response\n", "y_true = df_adv.Sales.iloc[5:13]\n", "\n", "# Sort the data to get indices ordered from lowest to highest TV values\n", "idx = np.argsort(x_true).values\n", "\n", "# Get the predictor data in the order given by idx above\n", "x_true = x_true.iloc[idx].values\n", "\n", "# Get the response data in the order given by idx above\n", "y_true = y_true.iloc[idx].values\n" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "### edTest(test_findnearest) ###\n", "# Define a function that finds the index of the nearest neighbor\n", "# and returns the value of the nearest neighbor.\n", "# Note that this is just for k = 1, where the distance function is\n", "# simply the absolute value.\n", "\n", "def find_nearest(array, value):\n", "\n", "    # Hint: To find idx, use the .idxmin() function on the series\n", "    idx = pd.Series(np.abs(array - value)).idxmin()\n", "\n", "    # Return the nearest neighbor index and value\n", "    return idx, array[idx]\n" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Create some synthetic x-values (these might not be in the actual dataset)\n", "x = np.linspace(np.min(x_true), np.max(x_true))\n", "\n", "# Initialize the y-values for the length of the synthetic x-values to zero\n", "y = np.zeros((len(x)))\n" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Apply the kNN algorithm to predict the y-value for each synthetic x-value\n", "for i, xi in enumerate(x):\n", "\n", "    # Get the Sales value of the true TV value closest to the given x-value\n", "    y[i] = y_true[find_nearest(x_true, xi)[0]]\n" ] },
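{ "cell_type": "markdown", "metadata": {}, "source": [ "*(Optional, not graded.)* As a quick sanity check, a `KNeighborsRegressor` with `n_neighbors=1` fit on the same eight points should reproduce the by-hand predictions above, except possibly where a synthetic x-value is exactly equidistant from two training points (tie-breaking may differ). The names `knn1` and `y_sklearn` are introduced here for illustration only." ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (not part of the graded scaffold):\n", "# compare the by-hand k = 1 predictions with sklearn's 1-nearest-neighbor model.\n", "knn1 = KNeighborsRegressor(n_neighbors=1)\n", "knn1.fit(x_true.reshape(-1, 1), y_true)\n", "y_sklearn = knn1.predict(x.reshape(-1, 1))\n", "\n", "# The maximum difference should be 0 (up to tie-breaking at exact midpoints)\n", "print(np.max(np.abs(y - y_sklearn)))\n" ] },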
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting the data" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Plot the synthetic data along with the predictions\n", "plt.plot(x, y, '-.')\n", "\n", "# Plot the original data using black x's\n", "plt.plot(x_true, y_true, 'kx')\n", "\n", "# Set the title and axis labels\n", "plt.title('TV vs Sales')\n", "plt.xlabel('TV budget in $1000')\n", "plt.ylabel('Sales in $1000')\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: KNN for $k\\ge1$ using sklearn" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Read the data from the file \"Advertising.csv\"\n", "data_filename = 'Advertising.csv'\n", "df = pd.read_csv(data_filename)\n", "\n", "# Set 'TV' as the predictor variable 'x'\n", "x = df[[___]]\n", "\n", "# Set 'Sales' as the response variable 'y'\n", "y = df[___]\n" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_shape) ###\n", "\n", "# Split the dataset into training and test sets, with 60% of the data\n", "# for training and 40% for testing, and with random_state=42\n", "x_train, x_test, y_train, y_test = train_test_split(___, ___, train_size=___, random_state=___)\n" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_nums) ###\n", "\n", "# Choose the minimum k value based on the instructions given above\n", "k_value_min = ___\n", "\n", "# Choose the maximum k value based on the instructions given above\n", "k_value_max = ___\n", "\n", "\n", "# Create a list of integer k values between k_value_min and k_value_max using linspace\n", "k_list = np.linspace(k_value_min, k_value_max, 70)\n" ] },
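{ "cell_type": "markdown", "metadata": {}, "source": [ "*(Optional, not graded.)* Before filling in the loop below, here is the generic `fit`/`predict` pattern of `KNeighborsRegressor`, shown on tiny made-up arrays (`toy_x`, `toy_y`, `toy_model`) rather than on the exercise variables. Note that sklearn expects the predictors to be 2-D, which is why 'TV' is selected with double brackets above." ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only (not part of the graded scaffold):\n", "# the fit/predict pattern used in the loop below, on toy data.\n", "toy_x = np.array([[1.0], [2.0], [3.0], [4.0]])   # predictors must be 2-D\n", "toy_y = np.array([1.0, 2.0, 2.0, 3.0])\n", "\n", "toy_model = KNeighborsRegressor(n_neighbors=2)\n", "toy_model.fit(toy_x, toy_y)\n", "\n", "# The prediction is the average of the 2 nearest responses -> [2.]\n", "print(toy_model.predict(np.array([[2.5]])))\n" ] },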
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Set the grid to plot the values\n", "fig, ax = plt.subplots(figsize=(10,6))\n", "\n", "# Variable used to alter the linewidth of each plot\n", "j = 0\n", "\n", "# Loop over all the k values\n", "for k_value in k_list:\n", "\n", "    # Create a kNN regression model\n", "    model = KNeighborsRegressor(n_neighbors=int(___))\n", "\n", "    # Fit the regression model on the training data\n", "    model.fit(___, ___)\n", "\n", "    # Use the trained model to predict on the test data\n", "    y_pred = model.predict(___)\n", "\n", "    # Helper code to plot the data along with the model predictions\n", "    colors = ['grey', 'r', 'b']\n", "    if k_value in [1, 10, 70]:\n", "        xvals = np.linspace(x.min(), x.max(), 100)\n", "        ypreds = model.predict(xvals)\n", "        ax.plot(xvals, ypreds, '-', label=f'k = {int(k_value)}', linewidth=j+2, color=colors[j])\n", "        j += 1\n", "\n", "ax.legend(loc='lower right', fontsize=20)\n", "ax.plot(x_train, y_train, 'x', label='train', color='k')\n", "ax.set_xlabel('TV budget in $1000', fontsize=20)\n", "ax.set_ylabel('Sales in $1000', fontsize=20)\n", "plt.tight_layout()\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### ⏸ In the plotting code above, re-run `ax.plot(x_train, y_train, 'x', label='train', color='k')` with `x_test` and `y_test` instead. Which k value do you think is best, and why?" ] },
{ "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Type your answer within the quotes given\n", "answer1 = '___'\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }