{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#
CS109A Introduction to Data Science \n",
"\n",
"\n",
"## Lab 3: plotting, K-NN Regression, Simple Linear Regression\n",
"\n",
"**Harvard University**
\n",
"**Fall 2019**
\n",
"**Instructors:** Pavlos Protopapas, Kevin Rader, and Chris Tanner
\n",
"\n",
"**Material prepared by**: David Sondak, Will Claybaugh, Pavlos Protopapas, and Eleni Kaxiras."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extended Edition\n",
"\n",
"Same as the one done in class with the following additions/clarifications:\n",
"\n",
"* I added another example to illustrate the difference between `.iloc` and `.loc` in `pandas` -- > [here](#iloc)\n",
"* I added some notes on why we are adding a constant in our linear regression model --> [here](#constant)\n",
"* How to run the solutions: Uncomment the following line and run the cell:\n",
"\n",
"```python\n",
"# %load solutions/knn_regression.py\n",
"```\n",
"This will bring up the code in the cell but WILL NOT RUN it. You need to run the cell again in order to actually run the code\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#RUN THIS CELL \n",
"import requests\n",
"from IPython.core.display import HTML\n",
"styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css\").text\n",
"HTML(styles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learning Goals\n",
"\n",
"By the end of this lab, you should be able to:\n",
"* Review `numpy` including 2-D arrays and understand array reshaping\n",
"* Use `matplotlib` to make plots\n",
"* Feel comfortable with simple linear regression\n",
"* Feel comfortable with $k$ nearest neighbors\n",
"\n",
"**This lab corresponds to lectures 4 and 5 and maps on to homework 2 and beyond.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"#### HIGHLIGHTS FROM PRE-LAB \n",
"\n",
"* [1 - Review of numpy](#first-bullet)\n",
"* [2 - Intro to matplotlib plus more ](#second-bullet)\n",
"\n",
"#### LAB 3 MATERIAL \n",
"\n",
"* [3 - Simple Linear Regression](#third-bullet)\n",
"* [4 - Building a model with `statsmodels` and `sklearn`](#fourth-bullet)\n",
"* [5 - Example: Simple linear regression with automobile data](#fifth-bullet)\n",
"* [6 - $k$Nearest Neighbors](#sixth-bullet)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import scipy as sp\n",
"import matplotlib as mpl\n",
"import matplotlib.cm as cm\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import time\n",
"pd.set_option('display.width', 500)\n",
"pd.set_option('display.max_columns', 100)\n",
"pd.set_option('display.notebook_repr_html', True)\n",
"#import seaborn as sns\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"# Displays the plots for us.\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use this as a variable to load solutions: %load PATHTOSOLUTIONS/exercise1.py. It will be substituted in the code\n",
"# so do not worry if it disappears after you run the cell.\n",
"PATHTOSOLUTIONS = 'solutions'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 1 - Review of the `numpy` Python library\n",
"\n",
"In lab1 we learned about the `numpy` library [(documentation)](http://www.numpy.org/) and its fast array structure, called the `numpy array`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import numpy\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make an array\n",
"my_array = np.array([1,4,9,16])\n",
"my_array"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f'Size of my array: {my_array.size}, or length of my array: {len(my_array)}')\n",
"print (f'Shape of my array: {my_array.shape}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Notice the way the shape appears in numpy arrays\n",
"\n",
"- For a 1D array, .shape returns a tuple with 1 element (n,)\n",
"- For a 2D array, .shape returns a tuple with 2 elements (n,m)\n",
"- For a 3D array, .shape returns a tuple with 3 elements (n,m,p)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# How to reshape a 1D array to a 2D\n",
"my_array.reshape(-1,2)"
]
},
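{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `-1` in `reshape` can be read as: work this dimension out for me. Below is a minimal sketch, assuming a small example array `a` that is not part of the lab. `numpy` infers the missing dimension from the total number of elements and raises a `ValueError` when the sizes do not divide evenly.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"a = np.arange(6)              # hypothetical example array: [0 1 2 3 4 5]\n",
"\n",
"col = a.reshape(-1, 1)        # 6 rows, 1 column (a column vector)\n",
"two_cols = a.reshape(-1, 2)   # 2 columns; the -1 is inferred as 3 rows\n",
"\n",
"print(col.shape, two_cols.shape)\n",
"\n",
"# 6 elements cannot be arranged into rows of 4, so this raises an error\n",
"try:\n",
"    a.reshape(-1, 4)\n",
"except ValueError as err:\n",
"    print('cannot reshape:', err)\n",
"```\n",
"\n",
"The column-vector form `reshape(-1, 1)` is the one we will need later for the regression inputs."
]
},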
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numpy arrays support the same operations as lists! Below we slice and iterate. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"array[2:4]:\", my_array[2:4]) # A slice of the array\n",
"\n",
"# Iterate over the array\n",
"for ele in my_array:\n",
" print(\"element:\", ele)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember `numpy` gains a lot of its efficiency from being **strongly typed** (all elements are of the same type, such as integer or floating point). If the elements of an array are of a different type, `numpy` will force them into the same type (the longest in terms of bytes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mixed = np.array([1, 2.3, 'eleni', True])\n",
"print(type(1), type(2.3), type('eleni'), type(True))\n",
"mixed # all elements will become strings"
]
},
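{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the coercion more directly, you can inspect the `dtype` of the resulting array. This is a small sketch with made-up arrays (the variable names are for illustration only); the exact integer width may depend on your platform, so the example checks the kind of type rather than its size.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"ints_and_floats = np.array([1, 2.5])     # int + float  -> floating point\n",
"bools_and_ints = np.array([True, 2])     # bool + int   -> integer\n",
"with_string = np.array([1, 2.5, 'a'])    # anything + str -> unicode string\n",
"\n",
"# dtype.kind is a one-letter code: 'f' float, 'i' signed integer, 'U' unicode string\n",
"print(ints_and_floats.dtype.kind)\n",
"print(bools_and_ints.dtype.kind)\n",
"print(with_string.dtype.kind)\n",
"```\n"
]
},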
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we push ahead to two-dimensional arrays and begin to dive into some of the deeper aspects of `numpy`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# create a 2d-array by handing a list of lists\n",
"my_array2d = np.array([ [1, 2, 3, 4], \n",
" [5, 6, 7, 8], \n",
" [9, 10, 11, 12] \n",
"])\n",
"\n",
"my_array2d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Array Slicing (a reminder...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numpy arrays can be sliced, and can be iterated over with loops. Below is a schematic illustrating slicing two-dimensional arrays. \n",
"\n",
"
\n",
" \n",
"Notice that the list slicing syntax still works! \n",
"`array[2:,3]` says \"in the array, get rows 2 through the end, column 3]\" \n",
"`array[3,:]` says \"in the array, get row 3, all columns\"."
]
},
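{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two slices described above can be tried on a concrete array. The sketch below assumes a hypothetical 4 x 4 array `arr` (not part of the lab) so that row 3 and column 3 both exist.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# hypothetical 4x4 example holding the numbers 0..15, row by row\n",
"arr = np.arange(16).reshape(4, 4)\n",
"\n",
"tail_of_col3 = arr[2:, 3]   # rows 2 to the end, column 3\n",
"row3 = arr[3, :]            # row 3, all columns\n",
"block = arr[:2, 1:3]        # rows 0-1, columns 1-2\n",
"\n",
"print(tail_of_col3)   # [11 15]\n",
"print(row3)           # [12 13 14 15]\n",
"print(block)          # [[1 2] [5 6]] shown on two lines\n",
"```\n"
]
},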
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Pandas Slicing (a reminder...)\n",
"\n",
"`.iloc` is by position (position is unique), `.loc` is by label (label is not unique)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import cast dataframe \n",
"cast = pd.read_csv('../data/cast.csv', encoding='utf_8')\n",
"cast.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get me rows 10 to 13 (python slicing style : exclusive of end) \n",
"cast.iloc[10:13]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get me columns 0 to 2 but all rows - use head()\n",
"cast.iloc[:, 0:2].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get me rows 10 to 13 AND only columns 0 to 2\n",
"cast.iloc[10:13, 0:2]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# COMPARE: get me rows 10 to 13 (pandas slicing style : inclusive of end)\n",
"cast.loc[10:13]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# give me columns 'year' and 'type' by label but only for rows 5 to 10\n",
"cast.loc[5:10,['year','type']]"
]
},
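{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do not have `cast.csv` at hand, the end-exclusive versus end-inclusive behaviour can be checked on a tiny frame. This is a sketch with a made-up dataframe `df` (its column names only mimic the ones above).\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# hypothetical frame with the default integer index 0..5\n",
"df = pd.DataFrame({'year': [2001, 2002, 2003, 2004, 2005, 2006],\n",
"                   'type': ['a', 'b', 'c', 'd', 'e', 'f']})\n",
"\n",
"by_position = df.iloc[1:3]   # positions 1 and 2 (end excluded)\n",
"by_label = df.loc[1:3]       # labels 1, 2 and 3 (end included)\n",
"\n",
"print(len(by_position), len(by_label))\n",
"```\n",
"\n",
"With the default index, labels and positions coincide, which is why the two calls look so similar; the difference in how the end of the slice is treated is the part to remember."
]
},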
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Another example of positioning with `.iloc` and `loc`\n",
"\n",
"Look at the following data frame. It is a bad example because we have duplicate values for the index but that is legal in pandas. It's just a bad practice and we are doing it to illustrate the difference between positioning with `.iloc` and `loc`. To keep rows unique, though, internally, `pandas` has its own index which in this dataframe runs from `0` to `2`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"index = ['A', 'Z', 'A']\n",
"famous = pd.DataFrame({'Elton': ['singer', 'Candle in the wind', 'male'],\n",
" 'Maraie': ['actress' , 'Do not know', 'female'],\n",
" 'num': np.random.randn(3)}, index=index)\n",
"famous"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# accessing elements by label can bring up duplicates!!\n",
"famous.loc['A'] # since we want all rows is the same as famous.loc['A',:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# accessing elements by position is unique - brings up only one row\n",
"famous.iloc[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 2 - Plotting with matplotlib and beyond\n",
"
\n",
"
\n",
"\n",
"`matplotlib` is a very powerful `python` library for making scientific plots. \n",
"\n",
"We will not focus too much on the internal aspects of `matplotlib` in today's lab. There are many excellent tutorials out there for `matplotlib`. For example,\n",
"* [`matplotlib` homepage](https://matplotlib.org/)\n",
"* [`matplotlib` tutorial](https://github.com/matplotlib/AnatomyOfMatplotlib)\n",
"\n",
"Conveying your findings convincingly is an absolutely crucial part of any analysis. Therefore, you must be able to write well and make compelling visuals. Creating informative visuals is an involved process and we won't cover that in this lab. However, part of creating informative data visualizations means generating *readable* figures. If people can't read your figures or have a difficult time interpreting them, they won't understand the results of your work. Here are some non-negotiable commandments for any plot:\n",
"* Label $x$ and $y$ axes\n",
"* Axes labels should be informative\n",
"* Axes labels should be large enough to read\n",
"* Make tick labels large enough\n",
"* Include a legend if necessary\n",
"* Include a title if necessary\n",
"* Use appropriate line widths\n",
"* Use different line styles for different lines on the plot\n",
"* Use different markers for different lines\n",
"\n",
"There are other important elements, but that list should get you started on your way.\n",
"\n",
"We will work with `matplotlib` and `seaborn` for plotting in this class. `matplotlib` is a very powerful `python` library for making scientific plots. `seaborn` is a little more specialized in that it was developed for statistical data visualization. We will cover some `seaborn` later in class. In the meantime you can look at the [seaborn documentation](https://seaborn.pydata.org)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's generate some data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's plot some functions\n",
"\n",
"We will use the following three functions to make some plots:\n",
"\n",
"* Logistic function:\n",
" \\begin{align*}\n",
" f\\left(z\\right) = \\dfrac{1}{1 + be^{-az}}\n",
" \\end{align*}\n",
" where $a$ and $b$ are parameters.\n",
"* Hyperbolic tangent:\n",
" \\begin{align*}\n",
" g\\left(z\\right) = b\\tanh\\left(az\\right) + c\n",
" \\end{align*}\n",
" where $a$, $b$, and $c$ are parameters.\n",
"* Rectified Linear Unit:\n",
" \\begin{align*}\n",
" h\\left(z\\right) = \n",
" \\left\\{\n",
" \\begin{array}{lr}\n",
" z, \\quad z > 0 \\\\\n",
" \\epsilon z, \\quad z\\leq 0\n",
" \\end{array}\n",
" \\right.\n",
" \\end{align*}\n",
" where $\\epsilon < 0$ is a small, positive parameter.\n",
"\n",
"You are given the code for the first two functions. Notice that $z$ is passed in as a `numpy` array and that the functions are returned as `numpy` arrays. Parameters are passed in as floats.\n",
"\n",
"You should write a function to compute the rectified linear unit. The input should be a `numpy` array for $z$ and a positive float for $\\epsilon$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def logistic(z: np.ndarray, a: float, b: float) -> np.ndarray:\n",
" \"\"\" Compute logistic function\n",
" Inputs:\n",
" a: exponential parameter\n",
" b: exponential prefactor\n",
" z: numpy array; domain\n",
" Outputs:\n",
" f: numpy array of floats, logistic function\n",
" \"\"\"\n",
" \n",
" den = 1.0 + b * np.exp(-a * z)\n",
" return 1.0 / den\n",
"\n",
"def stretch_tanh(z: np.ndarray, a: float, b: float, c: float) -> np.ndarray:\n",
" \"\"\" Compute stretched hyperbolic tangent\n",
" Inputs:\n",
" a: horizontal stretch parameter (a>1 implies a horizontal squish)\n",
" b: vertical stretch parameter\n",
" c: vertical shift parameter\n",
" z: numpy array; domain\n",
" Outputs:\n",
" g: numpy array of floats, stretched tanh\n",
" \"\"\"\n",
" return b * np.tanh(a * z) + c\n",
"\n",
"def relu(z: np.ndarray, eps: float = 0.01) -> np.ndarray:\n",
" \"\"\" Compute rectificed linear unit\n",
" Inputs:\n",
" eps: small positive parameter\n",
" z: numpy array; domain\n",
" Outputs:\n",
" h: numpy array; relu\n",
" \"\"\"\n",
" return np.fmax(z, eps * z)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's make some plots. First, let's just warm up and plot the logistic function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.linspace(-5.0, 5.0, 100) # Equally spaced grid of 100 pts between -5 and 5\n",
"\n",
"f = logistic(x, 1.0, 1.0) # Generate data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.plot(x, f)\n",
"plt.xlabel('x')\n",
"plt.ylabel('f')\n",
"plt.title('Logistic Function')\n",
"plt.grid(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Figures with subplots\n",
"\n",
"Let's start thinking about the plots as objects. We have the `figure` object which is like a matrix of smaller plots named `axes`. You can use array notation when handling it. "
]
},
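{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of the array notation, here is a minimal sketch with a 1 x 2 grid. The names `z` and `axes` are chosen for this example only. With a single row, `plt.subplots` returns a one-dimensional array of axes, so you index it with one number; for a grid with several rows and columns you would write `axes[i, j]` instead.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"z = np.linspace(-5.0, 5.0, 100)\n",
"\n",
"# a 1 x 2 grid: axes is a numpy array holding two Axes objects\n",
"fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n",
"\n",
"axes[0].plot(z, np.tanh(z))            # left panel\n",
"axes[0].set_title('tanh')\n",
"\n",
"axes[1].plot(z, np.maximum(z, 0.0))    # right panel\n",
"axes[1].set_title('relu')\n",
"\n",
"print(axes.shape)\n",
"```\n"
]
},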
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1,1) # Get figure and axes objects\n",
"\n",
"ax.plot(x, f) # Make a plot\n",
"\n",
"# Create some labels\n",
"ax.set_xlabel('x')\n",
"ax.set_ylabel('f')\n",
"ax.set_title('Logistic Function')\n",
"\n",
"# Grid\n",
"ax.grid(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow, it's *exactly* the same plot! Notice, however, the use of `ax.set_xlabel()` instead of `plt.xlabel()`. The difference is tiny, but you should be aware of it. I will use this plotting syntax from now on.\n",
"\n",
"What else do we need to do to make this figure better? Here are some options:\n",
"* Make labels bigger!\n",
"* Make line fatter\n",
"* Make tick mark labels bigger\n",
"* Make the grid less pronounced\n",
"* Make figure bigger\n",
"\n",
"Let's get to it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1,1, figsize=(10,6)) # Make figure bigger\n",
"\n",
"# Make line plot\n",
"ax.plot(x, f, lw=4)\n",
"\n",
"# Update ticklabel size\n",
"ax.tick_params(labelsize=24)\n",
"\n",
"# Make labels\n",
"ax.set_xlabel(r'$x$', fontsize=24) # Use TeX for mathematical rendering\n",
"ax.set_ylabel(r'$f(x)$', fontsize=24) # Use TeX for mathematical rendering\n",
"ax.set_title('Logistic Function', fontsize=24)\n",
"\n",
"ax.grid(True, lw=1.5, ls='--', alpha=0.75)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice:\n",
"* `lw` stands for `linewidth`. We could also write `ax.plot(x, f, linewidth=4)`\n",
"* `ls` stands for `linestyle`.\n",
"* `alpha` stands for transparency."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The only thing remaining to do is to change the $x$ limits. Clearly these should go from $-5$ to $5$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#fig.savefig('logistic.png')\n",
"\n",
"# Put this in a markdown cell and uncomment this to check what you saved.\n",
"# "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Resources\n",
"If you want to see all the styles available, please take a look at the documentation.\n",
"* [Line styles](https://matplotlib.org/2.0.1/api/lines_api.html#matplotlib.lines.Line2D.set_linestyle)\n",
"* [Marker styles](https://matplotlib.org/2.0.1/api/markers_api.html#module-matplotlib.markers)\n",
"* [Everything you could ever want](https://matplotlib.org/2.0.1/api/lines_api.html#matplotlib.lines.Line2D.set_marker)\n",
"\n",
"We haven't discussed it yet, but you can also put a legend on a figure. You'll do that in the next exercise. Here are some additional resources:\n",
"* [Legend](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html)\n",
"* [Grid](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.grid.html)\n",
"\n",
"`ax.legend(loc='best', fontsize=24);`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise
\n",
"\n",
"Do the following:\n",
"* Make a figure with the logistic function, hyperbolic tangent, and rectified linear unit.\n",
"* Use different line styles for each plot\n",
"* Put a legend on your figure\n",
"\n",
"Here's an example of a figure:\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# your code here\n",
"\n",
"# First get the data\n",
"f = logistic(x, 2.0, 1.0)\n",
"g = stretch_tanh(x, 2.0, 0.5, 0.5)\n",
"h = relu(x)\n",
"\n",
"fig, ax = plt.subplots(1,1, figsize=(10,6)) # Create figure object\n",
"\n",
"# Make actual plots\n",
"# (Notice the label argument!)\n",
"ax.plot(x, f, lw=4, ls='-', label=r'$L(x;1)$')\n",
"ax.plot(x, g, lw=4, ls='--', label=r'$\\tanh(2x)$')\n",
"ax.plot(x, h, lw=4, ls='-.', label=r'$relu(x; 0.01)$')\n",
"\n",
"# Make the tick labels readable\n",
"ax.tick_params(labelsize=24)\n",
"\n",
"# Set axes limits to make the scale nice\n",
"ax.set_xlim(x.min(), x.max())\n",
"ax.set_ylim(h.min(), 1.1)\n",
"\n",
"# Make readable labels\n",
"ax.set_xlabel(r'$x$', fontsize=24)\n",
"ax.set_ylabel(r'$h(x)$', fontsize=24)\n",
"ax.set_title('Activation Functions', fontsize=24)\n",
"\n",
"# Set up grid\n",
"ax.grid(True, lw=1.75, ls='--', alpha=0.75)\n",
"\n",
"# Put legend on figure\n",
"ax.legend(loc='best', fontsize=24);\n",
"\n",
"fig.savefig('../images/nice_plots.png')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise
\n",
"\n",
"These figures look nice in the plot and it makes sense for comparison. Now let's put the 3 different figures in separate plots.\n",
"\n",
"* Make a separate plot for each figure and line them up on the same row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/three_subplots.py\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise
\n",
"\n",
"* Make a grid of 2 x 3 separate plots, 3 will be empty. Just plot the functions and do not worry about cosmetics. We just want you ro see the functionality."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load solutions/six_subplots.py\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 3 - Simple Linear Regression\n",
"\n",
"Linear regression and its many extensions are a workhorse of the statistics and data science community, both in application and as a reference point for other models. Most of the major concepts in machine learning can be and often are discussed in terms of various linear regression models. Thus, this section will introduce you to building and fitting linear regression models and some of the process behind it, so that you can 1) fit models to data you encounter 2) experiment with different kinds of linear regression and observe their effects 3) see some of the technology that makes regression models work.\n",
"\n",
"\n",
"### Linear regression with a toy dataset\n",
"We first examine a toy problem, focusing our efforts on fitting a linear model to a small dataset with three observations. Each observation consists of one predictor $x_i$ and one response $y_i$ for $i = 1, 2, 3$,\n",
"\n",
"\\begin{align*}\n",
"(x , y) = \\{(x_1, y_1), (x_2, y_2), (x_3, y_3)\\}.\n",
"\\end{align*}\n",
"\n",
"To be very concrete, let's set the values of the predictors and responses.\n",
"\n",
"\\begin{equation*}\n",
"(x , y) = \\{(1, 2), (2, 2), (3, 4)\\}\n",
"\\end{equation*}\n",
"\n",
"There is no line of the form $\\beta_0 + \\beta_1 x = y$ that passes through all three observations, since the data are not collinear. Thus our aim is to find the line that best fits these observations in the *least-squares sense*, as discussed in lecture."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise (for home)
\n",
"\n",
"* Make two numpy arrays out of this data, x_train and y_train\n",
"* Check the dimentions of these arrays\n",
"* Try to reshape them into a different shape\n",
"* Make points into a very simple scatterplot\n",
"* Make a better scatterplot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# solution\n",
"x_train = np.array([1,2,3])\n",
"y_train = np.array([2,3,6])\n",
"type(x_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_train.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_train = x_train.reshape(3,1)\n",
"x_train.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/simple_scatterplot.py\n",
"# Make a simple scatterplot\n",
"plt.scatter(x_train,y_train)\n",
"\n",
"# check dimensions \n",
"print(x_train.shape,y_train.shape)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/nice_scatterplot.py\n",
"def nice_scatterplot(x, y, title):\n",
" # font size\n",
" f_size = 18\n",
" \n",
" # make the figure\n",
" fig, ax = plt.subplots(1,1, figsize=(8,5)) # Create figure object\n",
"\n",
" # set axes limits to make the scale nice\n",
" ax.set_xlim(np.min(x)-1, np.max(x) + 1)\n",
" ax.set_ylim(np.min(y)-1, np.max(y) + 1)\n",
"\n",
" # adjust size of tickmarks in axes\n",
" ax.tick_params(labelsize = f_size)\n",
" \n",
" # remove tick labels\n",
" ax.tick_params(labelbottom=False, bottom=False)\n",
" \n",
" # adjust size of axis label\n",
" ax.set_xlabel(r'$x$', fontsize = f_size)\n",
" ax.set_ylabel(r'$y$', fontsize = f_size)\n",
" \n",
" # set figure title label\n",
" ax.set_title(title, fontsize = f_size)\n",
"\n",
" # you may set up grid with this \n",
" ax.grid(True, lw=1.75, ls='--', alpha=0.15)\n",
"\n",
" # make actual plot (Notice the label argument!)\n",
" #ax.scatter(x, y, label=r'$my points$')\n",
" #ax.scatter(x, y, label='$my points$')\n",
" ax.scatter(x, y, label=r'$my\\,points$')\n",
" ax.legend(loc='best', fontsize = f_size);\n",
" \n",
" return ax\n",
"\n",
"nice_scatterplot(x_train, y_train, 'hello nice plot')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Formulae\n",
"Linear regression is special among the models we study because it can be solved explicitly. While most other models (and even some advanced versions of linear regression) must be solved itteratively, linear regression has a formula where you can simply plug in the data.\n",
"\n",
"For the single predictor case it is:\n",
" \\begin{align}\n",
" \\beta_1 &= \\frac{\\sum_{i=1}^n{(x_i-\\bar{x})(y_i-\\bar{y})}}{\\sum_{i=1}^n{(x_i-\\bar{x})^2}}\\\\\n",
" \\beta_0 &= \\bar{y} - \\beta_1\\bar{x}\\\n",
" \\end{align}\n",
" \n",
"Where $\\bar{y}$ and $\\bar{x}$ are the mean of the y values and the mean of the x values, respectively."
]
},
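{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before turning to `numpy`, it can help to plug the toy data $(1, 2), (2, 2), (3, 4)$ into these formulae by hand. The sketch below uses plain Python lists (the names `xs` and `ys` are chosen for this example only). Here $\\bar{x} = 2$ and $\\bar{y} = 8/3$, the numerator and the denominator both come out to $2$, so the slope is $1$ and the intercept is $8/3 - 2 = 2/3$.\n",
"\n",
"```python\n",
"xs = [1, 2, 3]\n",
"ys = [2, 2, 4]\n",
"n = len(xs)\n",
"\n",
"# sample means\n",
"x_bar = sum(xs) / n\n",
"y_bar = sum(ys) / n\n",
"\n",
"# the two sums in the formula for beta_1\n",
"numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))\n",
"denominator = sum((x - x_bar) ** 2 for x in xs)\n",
"\n",
"beta_1 = numerator / denominator   # slope\n",
"beta_0 = y_bar - beta_1 * x_bar    # intercept\n",
"\n",
"print(round(beta_0, 4), round(beta_1, 4))\n",
"```\n"
]
},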
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building a model from scratch\n",
"In this part, we will solve the equations for simple linear regression and find the best fit solution to our toy problem."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The snippets of code below implement the linear regression equations on the observed predictors and responses, which we'll call the training data set. Let's walk through the code.\n",
"\n",
"We have to reshape our arrrays to 2D. We will see later why."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise
\n",
"\n",
"* make an array with shape (2,3)\n",
"* reshape it to a size that you want"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#solution\n",
"xx = np.array([[1,2,3],[4,6,8]])\n",
"xxx = xx.reshape(-1,2)\n",
"xxx.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reshape to be a proper 2D array\n",
"x_train = x_train.reshape(x_train.shape[0], 1)\n",
"y_train = y_train.reshape(y_train.shape[0], 1)\n",
"\n",
"print(x_train.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# first, compute means\n",
"y_bar = np.mean(y_train)\n",
"x_bar = np.mean(x_train)\n",
"\n",
"# build the two terms\n",
"numerator = np.sum( (x_train - x_bar)*(y_train - y_bar) )\n",
"denominator = np.sum((x_train - x_bar)**2)\n",
"\n",
"print(numerator.shape, denominator.shape) #check shapes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Why the empty brackets? (The numerator and denominator are scalars, as expected.)"
]
},
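{
"cell_type": "markdown",
"metadata": {},
"source": [
"The empty brackets are the shape of a scalar: a value with zero dimensions has the empty tuple `()` as its shape. A small sketch, assuming a made-up column vector `col`, shows how `np.sum` without an `axis` argument collapses every dimension, whereas summing along one axis keeps the other.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"col = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)\n",
"\n",
"total = np.sum(col)                # sums over every axis -> a scalar\n",
"per_column = np.sum(col, axis=0)   # sums down the rows   -> shape (1,)\n",
"\n",
"print(col.shape, total.shape, per_column.shape)\n",
"```\n"
]
},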
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#slope beta1\n",
"beta_1 = numerator/denominator\n",
"\n",
"#intercept beta0\n",
"beta_0 = y_bar - beta_1*x_bar\n",
"\n",
"print(\"The best-fit line is {0:3.2f} + {1:3.2f} * x\".format(beta_0, beta_1))\n",
"print(f'The best fit is {beta_0}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise
\n",
"\n",
"Turn the code from the above cells into a function called `simple_linear_regression_fit`, that inputs the training data and returns `beta0` and `beta1`.\n",
"\n",
"To do this, copy and paste the code from the above cells below and adjust the code as needed, so that the training data becomes the input and the betas become the output.\n",
"\n",
"```python\n",
"def simple_linear_regression_fit(x_train: np.ndarray, y_train: np.ndarray) -> np.ndarray:\n",
" \n",
" return\n",
"```\n",
"\n",
"Check your function by calling it with the training data from above and printing out the beta values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/simple_linear_regression_fit.py\n",
"def simple_linear_regression_fit(x_train: np.ndarray, y_train: np.ndarray) -> np.ndarray:\n",
" \"\"\"\n",
" Inputs:\n",
" x_train: a (num observations by 1) array holding the values of the predictor variable\n",
" y_train: a (num observations by 1) array holding the values of the response variable\n",
"\n",
" Returns:\n",
" beta_vals: a (num_features by 1) array holding the intercept and slope coeficients\n",
" \"\"\"\n",
" \n",
" # Check input array sizes\n",
" if len(x_train.shape) < 2:\n",
" print(\"Reshaping features array.\")\n",
" x_train = x_train.reshape(x_train.shape[0], 1)\n",
"\n",
" if len(y_train.shape) < 2:\n",
" print(\"Reshaping observations array.\")\n",
" y_train = y_train.reshape(y_train.shape[0], 1)\n",
"\n",
" # first, compute means\n",
" y_bar = np.mean(y_train)\n",
" x_bar = np.mean(x_train)\n",
"\n",
" # build the two terms\n",
" numerator = np.sum( (x_train - x_bar)*(y_train - y_bar) )\n",
" denominator = np.sum((x_train - x_bar)**2)\n",
" \n",
" #slope beta1\n",
" beta_1 = numerator/denominator\n",
"\n",
" #intercept beta0\n",
" beta_0 = y_bar - beta_1*x_bar\n",
"\n",
" return np.array([beta_0,beta_1])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Let's run this function and see the coefficients"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_train = np.array([1 ,2, 3])\n",
"y_train = np.array([2, 2, 4])\n",
"\n",
"betas = simple_linear_regression_fit(x_train, y_train)\n",
"\n",
"beta_0 = betas[0]\n",
"beta_1 = betas[1]\n",
"\n",
"print(\"The best-fit line is {0:8.6f} + {1:8.6f} * x\".format(beta_0, beta_1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise
\n",
"\n",
"* Do the values of `beta0` and `beta1` seem reasonable?\n",
"* Plot the training data using a scatter plot.\n",
"* Plot the best fit line with `beta0` and `beta1` together with the training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/best_fit_scatterplot.py\n",
"fig_scat, ax_scat = plt.subplots(1,1, figsize=(10,6))\n",
"\n",
"# Plot best-fit line\n",
"x_train = np.array([[1, 2, 3]]).T\n",
"\n",
"best_fit = beta_0 + beta_1 * x_train\n",
"\n",
"ax_scat.scatter(x_train, y_train, s=300, label='Training Data')\n",
"ax_scat.plot(x_train, best_fit, ls='--', label='Best Fit Line')\n",
"\n",
"ax_scat.set_xlabel(r'$x_{train}$')\n",
"ax_scat.set_ylabel(r'$y$');\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values of `beta0` and `beta1` seem roughly reasonable. They capture the positive correlation. The line does appear to be trying to get as close as possible to all the points."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 4 - Building a model with `statsmodels` and `sklearn`\n",
"\n",
"Now that we can concretely fit the training data from scratch, let's learn two `python` packages to do it all for us:\n",
"* [statsmodels](http://www.statsmodels.org/stable/regression.html) and \n",
"* [scikit-learn (sklearn)](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).\n",
"\n",
"Our goal is to show how to implement simple linear regression with these packages. For an important sanity check, we compare the $\\beta$ values from `statsmodels` and `sklearn` to the $\\beta$ values that we found from above with our own implementation.\n",
"\n",
"For the purposes of this lab, `statsmodels` and `sklearn` do the same thing. More generally though, `statsmodels` tends to be easier for inference \\[finding the values of the slope and intercept and dicussing uncertainty in those values\\], whereas `sklearn` has machine-learning algorithms and is better for prediction \\[guessing y values for a given x value\\]. (Note that both packages make the same guesses, it's just a question of which activity they provide more support for.\n",
"\n",
"**Note:** `statsmodels` and `sklearn` are different packages! Unless we specify otherwise, you can use either one."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Why do we need to add a constant in our simple linear regression model? \n",
"\n",
"Let's say we a data set of two obsevations with one predictor and one response variable each. We would then have the following two equations if we run a simple linear regression model. $$y_1=\\beta_0 + \\beta_1*x_1$$ $$y_2=\\beta_0 + \\beta_1*x_2$$
For simplicity and calculation efficiency we want to \"absorb\" the constant $b_0$ into an array with $b_1$ so we have only multiplication. To do this we introduce the constant ${x}^0=1$
$$y_1=\\beta_0*{x_1}^0 + \\beta_1*x_1$$ $$y_2=\\beta_0 * {x_2}^0 + \\beta_1*x_2$$
That becomes: \n",
"$$y_1=\\beta_0*1 + \\beta_1*x_1$$ $$y_2=\\beta_0 * 1 + \\beta_1*x_2$$
\n",
" \n",
"In matrix notation: \n",
" \n",
"$$\n",
"\\left [\n",
"\\begin{array}{c}\n",
"y_1 \\\\ y_2 \\\\\n",
"\\end{array}\n",
"\\right] =\n",
"\\left [\n",
"\\begin{array}{cc}\n",
"1& x_1 \\\\ 1 & x_2 \\\\\n",
"\\end{array}\n",
"\\right] \n",
"\\cdot\n",
"\\left [\n",
"\\begin{array}{c}\n",
"\\beta_0 \\\\ \\beta_1 \\\\\n",
"\\end{array}\n",
"\\right]\n",
"$$\n",
"
\n",
" \n",
"`sklearn` adds the constant for us where in `statsmodels` we need to explicitly add it using `sm.add_constant`"
]
},
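{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the matrix form above, the snippet below builds the design matrix by hand and solves for the coefficients with `np.linalg.lstsq`. The toy values in `x` and `y` are made up for illustration and are not the lab's training data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# hypothetical toy data, chosen only for illustration\n",
"x = np.array([1.0, 2.0, 3.0])\n",
"y = np.array([2.0, 3.0, 6.0])\n",
"\n",
"# design matrix: a column of ones (the constant x^0 = 1) next to x\n",
"X = np.column_stack([np.ones_like(x), x])\n",
"\n",
"# least-squares solution of X @ betas = y\n",
"betas, *_ = np.linalg.lstsq(X, y, rcond=None)\n",
"beta_0, beta_1 = betas\n",
"\n",
"print(X)\n",
"print(f'beta_0 = {beta_0:8.6f} and beta_1 = {beta_1:8.6f}')\n",
"```\n",
"\n",
"Because the ones column multiplies $\\beta_0$, the intercept is estimated like any other coefficient."
]
},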
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is the code for `statsmodels`. `Statsmodels` does not by default include the column of ones in the $X$ matrix, so we include it manually with `sm.add_constant`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.api as sm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create the X matrix by appending a column of ones to x_train\n",
"X = sm.add_constant(x_train)\n",
"\n",
"# this is the same matrix as in our scratch problem!\n",
"print(X)\n",
"\n",
"# build the OLS model (ordinary least squares) from the training data\n",
"toyregr_sm = sm.OLS(y_train, X)\n",
"\n",
"# do the fit and save regression info (parameters, etc) in results_sm\n",
"results_sm = toyregr_sm.fit()\n",
"\n",
"# pull the beta parameters out from results_sm\n",
"beta0_sm = results_sm.params[0]\n",
"beta1_sm = results_sm.params[1]\n",
"\n",
"print(f'The regression coef from statsmodels are: beta_0 = {beta0_sm:8.6f} and beta_1 = {beta1_sm:8.6f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides the beta parameters, `results_sm` contains a ton of other potentially useful information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"print(results_sm.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's turn our attention to the `sklearn` library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import linear_model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build the least squares model\n",
"toyregr = linear_model.LinearRegression()\n",
"\n",
"# save regression info (parameters, etc) in results_skl\n",
"results = toyregr.fit(x_train, y_train)\n",
"\n",
"# pull the beta parameters out from results_skl\n",
"beta0_skl = toyregr.intercept_\n",
"beta1_skl = toyregr.coef_[0]\n",
"\n",
"print(\"The regression coefficients from the sklearn package are: beta_0 = {0:8.6f} and beta_1 = {1:8.6f}\".format(beta0_skl, beta1_skl))"
]
},
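{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect this back to the constant term, here is a small sketch on made-up toy data (not the lab's `x_train` and `y_train`). It suggests that `LinearRegression` with its default `fit_intercept=True` gives the same coefficients as fitting with `fit_intercept=False` on a matrix that already contains a column of ones.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# hypothetical toy data, chosen only for illustration\n",
"x = np.array([[1.0], [2.0], [3.0]])   # 2D predictor\n",
"y = np.array([2.0, 3.0, 6.0])\n",
"\n",
"# default: sklearn estimates the intercept itself\n",
"with_intercept = LinearRegression().fit(x, y)\n",
"\n",
"# alternative: add the column of ones ourselves and switch the intercept off\n",
"X_ones = np.hstack([np.ones((x.shape[0], 1)), x])\n",
"no_intercept = LinearRegression(fit_intercept=False).fit(X_ones, y)\n",
"\n",
"print(with_intercept.intercept_, with_intercept.coef_[0])\n",
"print(no_intercept.coef_)\n",
"```\n",
"\n",
"In the second fit, the first entry of `coef_` plays the role of $\\beta_0$."
]
},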
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should feel pretty good about ourselves now, and we're ready to move on to a real problem!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The `scikit-learn` library and the shape of things"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before diving into a \"real\" problem, let's discuss more of the details of `sklearn`.\n",
"\n",
"`Scikit-learn` is the main `Python` machine learning library. It consists of many learners which can learn models from data, as well as a lot of utility functions such as `train_test_split()`. \n",
"\n",
"Use the following to add the library into your code:\n",
"\n",
"```python\n",
"import sklearn \n",
"```\n",
"\n",
"In `scikit-learn`, an **estimator** is a Python object that implements the methods `fit(X, y)` and `predict(T)`\n",
"\n",
"Let's see the structure of `scikit-learn` needed to make these fits. `fit()` always takes two arguments:\n",
"```python\n",
"estimator.fit(Xtrain, ytrain)\n",
"```\n",
"We will consider two estimators in this lab: `LinearRegression` and `KNeighborsRegressor`.\n",
"\n",
"It is very important to understand that `Xtrain` must be in the form of a **2x2 array** with each row corresponding to one sample, and each column corresponding to the feature values for that sample.\n",
"\n",
"`ytrain` on the other hand is a simple array of responses. These are continuous for regression problems."
]
},
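{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shape requirement can be sketched as follows, using made-up numbers. Passing a 1D predictor to `fit()` raises a `ValueError`, while the same values reshaped into a single column are accepted.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# hypothetical toy data, chosen only for illustration\n",
"x_1d = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,): not accepted as X\n",
"y = np.array([1.5, 3.5, 5.5, 7.5])\n",
"\n",
"try:\n",
"    LinearRegression().fit(x_1d, y)\n",
"    raised = False\n",
"except ValueError:\n",
"    raised = True\n",
"print('1D X raised ValueError:', raised)\n",
"\n",
"# one row per sample, one column per feature\n",
"X_2d = x_1d.reshape(-1, 1)\n",
"print(X_2d.shape)\n",
"\n",
"model = LinearRegression().fit(X_2d, y)\n",
"print(model.predict(np.array([[5.0]])))\n",
"```\n",
"\n",
"The `-1` in `reshape(-1, 1)` lets `numpy` infer the number of rows, so the same call works for any number of samples."
]
},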
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Practice with `sklearn` and a real dataset\n",
"We begin by loading up the `mtcars` dataset. This data was extracted from the 1974 Motor Trend US magazine, and comprises of fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). We will load this data to a dataframe with 32 observations on 11 (numeric) variables. Here is an explanation of the features:\n",
"\n",
"- `mpg` is Miles/(US) gallon \n",
"- `cyl` is Number of cylinders, \n",
"- `disp` is\tDisplacement (cu.in.), \n",
"- `hp` is\tGross horsepower, \n",
"- `drat` is\tRear axle ratio, \n",
"- `wt` is the Weight (1000 lbs), \n",
"- `qsec` is 1/4 mile time,\n",
"- `vs` is Engine (0 = V-shaped, 1 = straight), \n",
"- `am` is Transmission (0 = automatic, 1 = manual), \n",
"- `gear` is the Number of forward gears, \n",
"- `carb` is\tNumber of carburetors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"#load mtcars\n",
"dfcars = pd.read_csv(\"../data/mtcars.csv\")\n",
"dfcars.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fix the column title \n",
"dfcars = dfcars.rename(columns={\"Unnamed: 0\":\"car name\"})\n",
"dfcars.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dfcars.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Searching for values: how many cars have 4 gears?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(dfcars[dfcars.gear == 4].drop_duplicates(subset='car name', keep='first'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's split the dataset into a training set and test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# split into training set and testing set\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"#set random_state to get the same split every time\n",
"traindf, testdf = train_test_split(dfcars, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# testing set is around 20% of the total data; training set is around 80%\n",
"print(\"Shape of full dataset is: {0}\".format(dfcars.shape))\n",
"print(\"Shape of training dataset is: {0}\".format(traindf.shape))\n",
"print(\"Shape of test dataset is: {0}\".format(testdf.shape))"
]
},
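{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of how the split behaves, the snippet below runs `train_test_split` twice on a made-up 32-row dataframe (a stand-in for `dfcars`, not the real data). With `test_size=0.2` the split is only approximately 80/20, and reusing the same `random_state` reproduces the same rows.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# hypothetical stand-in for dfcars: 32 rows, 2 columns\n",
"df = pd.DataFrame({'wt': np.linspace(1.5, 5.5, 32), 'mpg': np.linspace(34, 10, 32)})\n",
"\n",
"train_a, test_a = train_test_split(df, test_size=0.2, random_state=42)\n",
"train_b, test_b = train_test_split(df, test_size=0.2, random_state=42)\n",
"\n",
"print(train_a.shape, test_a.shape)\n",
"\n",
"# the same random_state gives the same rows in the test set\n",
"print(test_a.index.equals(test_b.index))\n",
"```\n",
"\n",
"Every row ends up in exactly one of the two sets, so the training and test indices never overlap."
]
},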
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have training and test data. We still need to select a predictor and a response from this dataset. Keep in mind that we need to choose the predictor and response from both the training and test set. You will do this in the exercises below. However, we provide some starter code for you to get things going."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"traindf.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the response variable that we're interested in\n",
"y_train = traindf.mpg\n",
"y_train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**\n",
"\n",
"Use slicing to get the same vector `y_train`\n",
"\n",
"----"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, notice the shape of `y_train`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_train.shape, type(y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Array reshape\n",
"This is a 1D array as should be the case with the **Y** array. Remember, `sklearn` requires a 2D array only for the predictor array. You will have to pay close attention to this in the exercises later. `Sklearn` doesn't care too much about the shape of `y_train`.\n",
"\n",
"The whole reason we went through that whole process was to show you how to reshape your data into the correct format.\n",
"\n",
"**IMPORTANT:** Remember that your response variable `ytrain` can be a vector but your predictor variable `xtrain` ***must*** be an array!"
]
},
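{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `pandas` there are at least two ways to get a single predictor into a 2D shape. The sketch below uses a tiny made-up dataframe as a stand-in for `traindf`; the column names match the lab but the three rows are only illustrative.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# hypothetical stand-in for traindf\n",
"df = pd.DataFrame({'wt': [2.62, 2.875, 2.32], 'mpg': [21.0, 21.0, 22.8]})\n",
"\n",
"y = df.mpg                               # Series, 1D\n",
"X_frame = df[['wt']]                     # double brackets keep a 2D DataFrame\n",
"X_array = df.wt.values.reshape(-1, 1)    # numpy array reshaped to one column\n",
"\n",
"print(y.shape, X_frame.shape, X_array.shape)\n",
"```\n",
"\n",
"Either `X_frame` or `X_array` has one row per sample and one column per feature."
]
},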
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 5 - Example: Simple linear regression with automobile data\n",
"We will now use `sklearn` to predict automobile mileage per gallon (mpg) and evaluate these predictions. We already loaded the data and split them into a training set and a test set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to choose the variables that we think will be good predictors for the dependent variable `mpg`. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise in pairs**\n",
"\n",
"* Pick one variable to use as a predictor for simple linear regression. Discuss your reasons with the person next to you. \n",
"* Justify your choice with some visualizations. \n",
"* Is there a second variable you'd like to use? We're not doing multiple linear regression here, but if we were using two predictors, which other variable would you include?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_wt = dfcars.wt\n",
"x_wt.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code here\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/cars_simple_EDA.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**\n",
"\n",
"* Use `sklearn` to fit the training data using simple linear regression.\n",
"* Use the model to make mpg predictions on the test set. \n",
"* Plot the data and the prediction. \n",
"* Print out the mean squared error for the training set and the test set and compare."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"dfcars = pd.read_csv(\"../data/mtcars.csv\")\n",
"dfcars = dfcars.rename(columns={\"Unnamed: 0\":\"name\"})\n",
"\n",
"dfcars.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"traindf, testdf = train_test_split(dfcars, test_size=0.2, random_state=42)\n",
"\n",
"y_train = np.array(traindf.mpg)\n",
"X_train = np.array(traindf.wt)\n",
"X_train = X_train.reshape(X_train.shape[0], 1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_test = np.array(testdf.mpg)\n",
"X_test = np.array(testdf.wt)\n",
"X_test = X_test.reshape(X_test.shape[0], 1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's take another look at our data\n",
"dfcars.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# And out train and test sets \n",
"y_train.shape, X_train.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_test.shape, X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#create linear model\n",
"regression = LinearRegression()\n",
"\n",
"#fit linear model\n",
"regression.fit(X_train, y_train)\n",
"\n",
"predicted_y = regression.predict(X_test)\n",
"\n",
"r2 = regression.score(X_test, y_test)\n",
"print(f'R^2 = {r2:.5}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(regression.score(X_train, y_train))\n",
"\n",
"print(mean_squared_error(predicted_y, y_test))\n",
"print(mean_squared_error(y_train, regression.predict(X_train)))\n",
"\n",
"print('Coefficients: \\n', regression.coef_[0], regression.intercept_)"
]
},
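{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of what the two metrics compute, checked against `sklearn.metrics` on made-up numbers. The mean squared error is the average squared residual, and $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"# hypothetical values, chosen only for illustration\n",
"y_true = np.array([3.0, 5.0, 7.0])\n",
"y_pred = np.array([2.0, 5.0, 9.0])\n",
"\n",
"mse_manual = np.mean((y_true - y_pred) ** 2)\n",
"\n",
"ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares\n",
"ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares\n",
"r2_manual = 1 - ss_res / ss_tot\n",
"\n",
"print(mse_manual, mean_squared_error(y_true, y_pred))\n",
"print(r2_manual, r2_score(y_true, y_pred))\n",
"```\n",
"\n",
"A lower mean squared error is better, while an $R^2$ closer to 1 is better."
]
},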
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1,1, figsize=(10,6))\n",
"ax.plot(y_test, predicted_y, 'o')\n",
"grid = np.linspace(np.min(dfcars.mpg), np.max(dfcars.mpg), 100)\n",
"ax.plot(grid, grid, color=\"black\") # 45 degree line\n",
"ax.set_xlabel(\"actual y\")\n",
"ax.set_ylabel(\"predicted y\")\n",
"\n",
"fig1, ax1 = plt.subplots(1,1, figsize=(10,6))\n",
"ax1.plot(dfcars.wt, dfcars.mpg, 'o')\n",
"xgrid = np.linspace(np.min(dfcars.wt), np.max(dfcars.wt), 100)\n",
"ax1.plot(xgrid, regression.predict(xgrid.reshape(100, 1)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## 6 - $k$-nearest neighbors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you're familiar with `sklearn`, you're ready to do a KNN regression. \n",
"\n",
"Sklearn's regressor is called `sklearn.neighbors.KNeighborsRegressor`. Its main parameter is the `number of nearest neighbors`. There are other parameters such as the distance metric (default for 2 order is the Euclidean distance). For a list of all the parameters see the [Sklearn kNN Regressor Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html).\n",
"\n",
"Let's use $5$ nearest neighbors."
]
},
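{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea concrete, here is a from-scratch sketch of $k$-NN regression for a single predictor. It is not how `sklearn` implements it internally; the helper `knn_predict` and the toy arrays are hypothetical and only illustrate the rule: find the $k$ training points closest to the query and average their responses.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def knn_predict(x_train, y_train, x_new, k):\n",
"    '''Predict at a single point x_new by averaging the k nearest responses.'''\n",
"    distances = np.abs(x_train - x_new)      # 1D predictor: absolute difference\n",
"    nearest = np.argsort(distances)[:k]      # indices of the k closest points\n",
"    return y_train[nearest].mean()\n",
"\n",
"# hypothetical toy data, chosen only for illustration\n",
"x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])\n",
"y_toy = np.array([10.0, 20.0, 40.0, 80.0, 160.0])\n",
"\n",
"print(knn_predict(x_toy, y_toy, 2.1, k=1))\n",
"print(knn_predict(x_toy, y_toy, 2.1, k=3))\n",
"```\n",
"\n",
"With `k=1` the prediction copies the response of the single closest training point; larger `k` averages over more neighbors and gives a smoother prediction."
]
},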
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import the library\n",
"from sklearn.neighbors import KNeighborsRegressor"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set number of neighbors\n",
"k = 5\n",
"knnreg = KNeighborsRegressor(n_neighbors=k)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fit the regressor - make sure your numpy arrays are the right shape\n",
"knnreg.fit(X_train, y_train)\n",
"\n",
"# Evaluate the outcome on the train set using R^2\n",
"r2_train = knnreg.score(X_train, y_train)\n",
"\n",
"# Print results\n",
"print(f'kNN model with {k} neighbors gives R^2 on the train set: {r2_train:.5}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"knnreg.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**\n",
"\n",
"Calculate and print the $R^{2}$ score on the test set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not so good? Lets vary the number of neighbors and see what we get."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make our lives easy by storing the different regressors in a dictionary\n",
"regdict = {}\n",
"\n",
"# Make our lives easier by entering the k values from a list\n",
"k_list = [1, 2, 4, 15]\n",
"\n",
"# Do a bunch of KNN regressions\n",
"for k in k_list:\n",
" knnreg = KNeighborsRegressor(n_neighbors=k)\n",
" knnreg.fit(X_train, y_train)\n",
" # Store the regressors in a dictionary\n",
" regdict[k] = knnreg \n",
"\n",
"# Print the dictionary to see what we have\n",
"regdict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's plot all the k values in same plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1,1, figsize=(10,6))\n",
"\n",
"ax.plot(dfcars.wt, dfcars.mpg, 'o', label=\"data\")\n",
"\n",
"xgrid = np.linspace(np.min(dfcars.wt), np.max(dfcars.wt), 100)\n",
"\n",
"# let's unpack the dictionary to its elements (items) which is the k and Regressor\n",
"for k, regressor in regdict.items():\n",
" predictions = regressor.predict(xgrid.reshape(-1,1)) \n",
" ax.plot(xgrid, predictions, label=\"{}-NN\".format(k))\n",
"\n",
"ax.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**\n",
"\n",
"Explain what you see in the graph. **Hint:** Notice how the $1$-NN goes through every point on the training set but utterly fails elsewhere. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets look at the scores on the training set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ks = range(1, 15) # Grid of k's\n",
"scores_train = [] # R2 scores\n",
"for k in ks:\n",
" # Create KNN model\n",
" knnreg = KNeighborsRegressor(n_neighbors=k) \n",
" \n",
" # Fit the model to training data\n",
" knnreg.fit(X_train, y_train) \n",
" \n",
" # Calculate R^2 score\n",
" score_train = knnreg.score(X_train, y_train) \n",
" scores_train.append(score_train)\n",
"\n",
"# Plot\n",
"fig, ax = plt.subplots(1,1, figsize=(12,8))\n",
"ax.plot(ks, scores_train,'o-')\n",
"ax.set_xlabel(r'$k$')\n",
"ax.set_ylabel(r'$R^{2}$')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**\n",
"\n",
"* Why do we get a perfect $R^2$ at k=1 for the training set?\n",
"* Make the same plot as above on the *test* set.\n",
"* What is the best $k$?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code here\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# %load solutions/knn_regression.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# solution to previous exercise\n",
"r2_test = knnreg.score(X_test, y_test)\n",
"print(f'kNN model with {k} neighbors gives R^2 on the test set: {r2_test:.5}')"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}