{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS109A Introduction to Data Science \n", "\n", "## Standard Section 4: Regularization and Model Selection\n", "\n", "**Harvard University**
\n", "**Fall 2019**
\n", "**Instructors**: Pavlos Protopapas, Kevin Rader, and Chris Tanner
\n", "**Section Leaders**: Marios Mattheakis, Abhimanyu (Abhi) Vasishth, Robbert (Rob) Struyven
\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#RUN THIS CELL \n", "import requests\n", "from IPython.display import HTML\n", "styles = requests.get(\"http://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css\").text\n", "HTML(styles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this section, our goal is to familiarize you with Regularization in Multiple Linear Regression and to start thinking about Model and Hyper-Parameter Selection. \n", "\n", "Specifically, we will:\n", "\n", "- Load in the King County House Price Dataset\n", "- Perform some basic EDA\n", "- Split the data up into a training, **validation**, and test set (we'll see why we need a validation set)\n", "- Scale the variables (by standardizing them) and see why we need to do this\n", "- Make our multiple & polynomial regression models (like we did in the previous section)\n", "- Learn what **regularization** is and how it can help\n", "- Understand **ridge** and **lasso** regression\n", "- Get an introduction to **cross-validation** using RidgeCV and LassoCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Data and Stats packages\n", "import numpy as np\n", "import pandas as pd\n", "# use the fully qualified option name; the short form 'max_columns' is ambiguous in recent pandas versions\n", "pd.set_option('display.max_columns', 200)\n", "\n", "# Visualization packages\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set()\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# EDA: House Prices Data From Kaggle\n", "\n", "For our dataset, we'll be using the house price dataset from [King County, WA](https://en.wikipedia.org/wiki/King_County,_Washington). 
The dataset is from [Kaggle](https://www.kaggle.com/harlfoxem/housesalesprediction). \n", "\n", "The task is to build a regression model to **predict the price**, based on different attributes. First, let's do some EDA." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(4000, 21)\n", "id int64\n", "date object\n", "price float64\n", "bedrooms int64\n", "bathrooms float64\n", "sqft_living int64\n", "sqft_lot int64\n", "floors float64\n", "waterfront int64\n", "view int64\n", "condition int64\n", "grade int64\n", "sqft_above int64\n", "sqft_basement int64\n", "yr_built int64\n", "yr_renovated int64\n", "zipcode int64\n", "lat float64\n", "long float64\n", "sqft_living15 int64\n", "sqft_lot15 int64\n", "dtype: object\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " id date price bedrooms bathrooms \\\n", "735 2591820310 20141006T000000 365000.0 4 2.25 \n", "2830 7974200820 20140821T000000 865000.0 5 3.00 \n", "4106 7701450110 20140815T000000 1038000.0 4 2.50 \n", "16218 9522300010 20150331T000000 1490000.0 3 3.50 \n", "19964 9510861140 20140714T000000 711000.0 3 2.50 \n", "\n", " sqft_living sqft_lot floors waterfront view condition grade \\\n", "735 2070 8893 2.0 0 0 4 8 \n", "2830 2900 6730 1.0 0 0 5 8 \n", "4106 3770 10893 2.0 0 2 3 11 \n", "16218 4560 14608 2.0 0 2 3 12 \n", "19964 2550 5376 2.0 0 0 3 9 \n", "\n", " sqft_above sqft_basement yr_built yr_renovated zipcode lat \\\n", "735 2070 0 1986 0 98058 47.4388 \n", "2830 1830 1070 1977 0 98115 47.6784 \n", "4106 3770 0 1997 0 98006 47.5646 \n", "16218 4560 0 1990 0 98034 47.6995 \n", "19964 2550 0 2004 0 98052 47.6647 \n", "\n", " long sqft_living15 sqft_lot15 \n", "735 -122.162 2390 7700 \n", "2830 -122.285 2370 6283 \n", "4106 -122.129 3710 9685 \n", "16218 -122.228 4050 14226 \n", "19964 -122.083 2250 4050 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the dataset \n", "house_df = pd.read_csv('../data/kc_house_data.csv')\n", "house_df = house_df.sample(frac=1, random_state=42)[0:4000]\n", "print(house_df.shape)\n", "print(house_df.dtypes)\n", "house_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's check for null values and look at the datatypes within the dataset." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 4000 entries, 735 to 3455\n", "Data columns (total 21 columns):\n", "id 4000 non-null int64\n", "date 4000 non-null object\n", "price 4000 non-null float64\n", "bedrooms 4000 non-null int64\n", "bathrooms 4000 non-null float64\n", "sqft_living 4000 non-null int64\n", "sqft_lot 4000 non-null int64\n", "floors 4000 non-null float64\n", "waterfront 4000 non-null int64\n", "view 4000 non-null int64\n", "condition 4000 non-null int64\n", "grade 4000 non-null int64\n", "sqft_above 4000 non-null int64\n", "sqft_basement 4000 non-null int64\n", "yr_built 4000 non-null int64\n", "yr_renovated 4000 non-null int64\n", "zipcode 4000 non-null int64\n", "lat 4000 non-null float64\n", "long 4000 non-null float64\n", "sqft_living15 4000 non-null int64\n", "sqft_lot15 4000 non-null int64\n", "dtypes: float64(5), int64(15), object(1)\n", "memory usage: 687.5+ KB\n" ] } ], "source": [ "house_df.info()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id price bedrooms bathrooms sqft_living \\\n", "count 4.000000e+03 4.000000e+03 4000.000000 4000.000000 4000.000000 \n", "mean 4.586542e+09 5.497522e+05 3.379250 2.116563 2096.645250 \n", "std 2.876700e+09 3.890505e+05 0.922568 0.783175 957.785141 \n", "min 1.000102e+06 8.250000e+04 0.000000 0.000000 384.000000 \n", "25% 2.126074e+09 3.249500e+05 3.000000 1.750000 1420.000000 \n", "50% 3.889350e+09 4.550000e+05 3.000000 2.250000 1920.000000 \n", "75% 7.334526e+09 6.541250e+05 4.000000 2.500000 2570.000000 \n", "max 9.842300e+09 5.570000e+06 11.000000 8.000000 13540.000000 \n", "\n", " sqft_lot floors waterfront view condition \\\n", "count 4.000000e+03 4000.000000 4000.000000 4000.000000 4000.000000 \n", "mean 1.616511e+04 1.475000 0.007750 0.232500 3.420750 \n", "std 5.120888e+04 0.530279 0.087703 0.768174 0.646393 \n", "min 5.720000e+02 1.000000 0.000000 0.000000 1.000000 \n", "25% 5.200000e+03 1.000000 0.000000 0.000000 3.000000 \n", "50% 7.675000e+03 1.000000 0.000000 0.000000 3.000000 \n", "75% 1.087125e+04 2.000000 0.000000 0.000000 4.000000 \n", "max 1.651359e+06 3.500000 1.000000 4.000000 5.000000 \n", "\n", " grade sqft_above sqft_basement yr_built yr_renovated \\\n", "count 4000.000000 4000.000000 4000.00000 4000.000000 4000.000000 \n", "mean 7.668250 1792.465000 304.18025 1970.564250 89.801500 \n", "std 1.194173 849.986192 455.26354 29.141872 413.760082 \n", "min 4.000000 384.000000 0.00000 1900.000000 0.000000 \n", "25% 7.000000 1180.000000 0.00000 1951.000000 0.000000 \n", "50% 7.000000 1550.000000 0.00000 1974.500000 0.000000 \n", "75% 8.000000 2250.000000 590.00000 1995.000000 0.000000 \n", "max 13.000000 9410.000000 4130.00000 2015.000000 2015.000000 \n", "\n", " zipcode lat long sqft_living15 sqft_lot15 \n", "count 4000.000000 4000.000000 4000.000000 4000.00000 4000.00000 \n", "mean 98078.035500 47.560091 -122.214060 1997.75900 12790.67800 \n", "std 54.073374 0.139070 0.141879 701.60987 26085.20301 \n", "min 98001.000000 
47.155900 -122.515000 620.00000 659.00000 \n", "25% 98033.000000 47.468175 -122.328000 1490.00000 5200.00000 \n", "50% 98065.000000 47.573800 -122.231000 1840.00000 7628.00000 \n", "75% 98118.000000 47.679100 -122.127000 2370.00000 10240.00000 \n", "max 98199.000000 47.777500 -121.315000 5790.00000 560617.00000 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "house_df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's choose a subset of columns here. **NOTE**: The way I'm selecting columns here is not principled and is just for convenience. In your homework assignments (and in real life), we expect you to choose columns more rigorously.\n", "\n", "1. `bedrooms`\n", "2. `bathrooms`\n", "3. `sqft_living`\n", "4. `sqft_lot`\n", "5. `floors`\n", "6. `sqft_above`\n", "7. `sqft_basement`\n", "8. `lat`\n", "9. `long`\n", "10. **`price`**: Our response variable" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "cols_of_interest = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement',\n", " 'lat', 'long', 'price']\n", "house_df = house_df[cols_of_interest]\n", "\n", "# Convert house price to 1000s of dollars\n", "house_df['price'] = house_df['price']/1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how the response variable (`price`) is distributed." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(12,5))\n", "ax.hist(house_df['price'], bins=100)\n", "ax.set_title('Histogram of house price (in 1000s of dollars)');" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# This takes a bit of time but is worth it!!\n", "# sns.pairplot(house_df);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train-Validation-Test Split\n", "\n", "Up until this point, we have only had a train-test split. Why are we introducing a validation set? What's the point?\n", "\n", "This is the general idea: \n", "\n", "1. **Training Set**: Data you have seen. You train different types of models, with various hyper-parameters and regularization parameters, on this data. \n", "\n", "\n", "2. **Validation Set**: Used to compare different models. We use this step to tune our hyper-parameters, i.e. find the optimal set of hyper-parameters (such as $k$ for k-NN, the degree of the polynomial for linear regression, or the regularization strength). Note that the $\\beta_i$ coefficients are not hyper-parameters: they are learned from the training set. Pick your best model here. \n", "\n", "\n", "\n", "3. **Test Set**: Using the best model from the previous step, simply report the score (e.g. $R^2$ score, MSE, or any metric that you care about) of that model on your test set. **DON'T TUNE YOUR PARAMETERS HERE!** Why, I hear you ask? Because we want to know how our model might do on data it hasn't seen before. We don't have access to this data (because it may not exist yet) but the test set, which we haven't seen or touched so far, is a good way to mimic this new data. \n", "\n", "Let's do 60% train, 20% validation, 20% test for this dataset." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train Set: 60.00%\n", "Validation Set: 20.00%\n", "Test Set: 20.00%\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# first split the data into a train-test split and don't touch the test set yet\n", "train_df, test_df = train_test_split(house_df, test_size=0.2, random_state=42)\n", "\n", "# next, split the training set into a train-validation split\n", "# the test-size is 0.25 since we are splitting 80% of the data into 20% and 60% overall\n", "train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)\n", "\n", "print('Train Set: {0:0.2f}%'.format(100*train_df.size/house_df.size))\n", "print('Validation Set: {0:0.2f}%'.format(100*val_df.size/house_df.size))\n", "print('Test Set: {0:0.2f}%'.format(100*test_df.size/house_df.size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modeling\n", "\n", "In the [last section](https://github.com/Harvard-IACS/2019-CS109A/tree/master/content/sections/section3), we went over the mechanics of Multiple Linear Regression and created models that had interaction terms and polynomial terms. Specifically, we dealt with the following sorts of models. \n", "\n", "$$\n", "y = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\dots + \\beta_M x_M\n", "$$\n", "\n", "Let's adopt a similar process here and get a few different models." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Design Matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From our model setup in the equation in the previous section, we obtain the following: \n", "\n", "$$\n", "Y = \\begin{bmatrix}\n", "y_1 \\\\\n", "y_2 \\\\\n", "\\vdots \\\\\n", "y_n\n", "\\end{bmatrix}, \\quad X = \\begin{bmatrix}\n", "x_{1,1} & x_{1,2} & \\dots & x_{1,M} \\\\\n", "x_{2,1} & x_{2,2} & \\dots & x_{2,M} \\\\\n", "\\vdots & \\vdots & \\ddots & \\vdots \\\\\n", "x_{n,1} & x_{n,2} & \\dots & x_{n,M} \\\\\n", "\\end{bmatrix}, \\quad \\beta = \\begin{bmatrix}\n", "\\beta_1 \\\\\n", "\\beta_2 \\\\\n", "\\vdots \\\\\n", "\\beta_M\n", "\\end{bmatrix}, \\quad \\epsilon = \\begin{bmatrix}\n", "\\epsilon_1 \\\\\n", "\\epsilon_2 \\\\\n", "\\vdots \\\\\n", "\\epsilon_n\n", "\\end{bmatrix},\n", "$$\n", "\n", "$X$ is an n$\\times$M matrix: this is our **design matrix**, $\\beta$ is an M-dimensional vector (an M$\\times$1 matrix), and $Y$ is an n-dimensional vector (an n$\\times$1 matrix). In addition, we know that $\\epsilon$ is an n-dimensional vector (an n$\\times$1 matrix)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2400, 9)\n", "(2400,)\n" ] } ], "source": [ "# the design matrix holds only the predictors, so drop the response (price)\n", "X = train_df[cols_of_interest].drop(columns='price')\n", "y = train_df['price']\n", "print(X.shape)\n", "print(y.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scaling our Design Matrix\n", "\n", "### Warm-Up Exercise\n", "\n", "Warm-Up Exercise: for which of the following do the units of the predictors matter (e.g., trip length in minutes vs seconds; temperature in F or C)? A similar question would be: for which of these models do the magnitudes of values taken by different predictors matter? 
\n", "\n", "(We will go over Ridge and Lasso Regression in greater detail later)\n", "\n", "- k-NN (Nearest Neighbors regression)\n", "- Linear regression\n", "- Lasso regression\n", "- Ridge regression\n", "\n", "**Solutions**\n", "\n", "- k-NN: **yes**. Scaling changes the distance metric, which determines what \"neighbor\" means.\n", "- Linear regression: **no**. Multiplying a predictor by $c$ just divides its coefficient by $c$, so the predictions are unchanged.\n", "- Lasso: **yes**. If a coefficient is divided by $c$, its penalty term is also divided by $c$, so the penalized fit changes.\n", "- Ridge: **yes**. Same as Lasso, except that the penalty term is divided by $c^2$.\n", "\n", "### Standard Scaler (Standardization)\n", " \n", "[Here's](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) the scikit-learn implementation of the standard scaler. What is it doing though? Hint: you may have seen this in STAT 110 or another statistics course multiple times.\n", "\n", "$$\n", "z = \\frac{x-\\mu}{\\sigma}\n", "$$\n", "\n", "In the above setup: \n", "\n", "- $z$ is the standardized variable\n", "- $x$ is the variable before standardization\n", "- $\\mu$ is the mean of the variable before standardization\n", "- $\\sigma$ is the standard deviation of the variable before standardization\n", "\n", "Let's see an example of how this works:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x z_manual z_sklearn\n", "count 4000.000000 4.000000e+03 4.000000e+03\n", "mean 2096.645250 -2.775558e-17 -4.096723e-17\n", "std 957.785141 1.000000e+00 1.000125e+00\n", "min 384.000000 -1.788131e+00 -1.788355e+00\n", "25% 1420.000000 -7.064687e-01 -7.065571e-01\n", "50% 1920.000000 -1.844310e-01 -1.844540e-01\n", "75% 2570.000000 4.942181e-01 4.942799e-01\n", "max 13540.000000 1.194773e+01 1.194922e+01" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "x = house_df['sqft_living']\n", "mu = x.mean()\n", "sigma = x.std()\n", "z = (x-mu)/sigma\n", "\n", "# reshaping x to be a n by 1 matrix since that's how scikit learn likes data for standardization\n", "x_reshaped = np.array(x).reshape(-1,1)\n", "z_sklearn = StandardScaler().fit_transform(x_reshaped)\n", "\n", "# Plotting the histogram of the variable before and after standardization\n", "fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(24,5))\n", "ax = ax.ravel()\n", "\n", "ax[0].hist(x, bins=100)\n", "ax[0].set_title('Histogram of sqft_living before standardization')\n", "\n", "ax[1].hist(z, bins=100)\n", "ax[1].set_title('Manually standardizing sqft_living')\n", "\n", "ax[2].hist(z_sklearn, bins=100)\n", "ax[2].set_title('Standardizing sqft_living using scikit learn');\n", "\n", "# making things a dataframe to check if they work\n", "# (z_manual and z_sklearn differ slightly because pandas' .std() uses ddof=1 while StandardScaler uses ddof=0)\n", "pd.DataFrame({'x': x, 'z_manual': z, 'z_sklearn': z_sklearn.flatten()}).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Min-Max Scaler (Normalization)\n", "\n", "[Here's](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) the scikit-learn implementation of the min-max scaler. What is it doing though? \n", "\n", "$$\n", "x_{new} = \\frac{x-x_{min}}{x_{max}-x_{min}}\n", "$$\n", "\n", "In the above setup: \n", "\n", "- $x_{new}$ is the normalized variable\n", "- $x$ is the variable before normalization\n", "- $x_{max}$ is the max value of the variable before normalization\n", "- $x_{min}$ is the min value of the variable before normalization\n", "\n", "Let's see an example of how this works:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x x_new_manual x_new_sklearn\n", "count 4000.000000 4000.000000 4000.000000\n", "mean 2096.645250 0.130180 0.130180\n", "std 957.785141 0.072802 0.072802\n", "min 384.000000 0.000000 0.000000\n", "25% 1420.000000 0.078747 0.078747\n", "50% 1920.000000 0.116753 0.116753\n", "75% 2570.000000 0.166160 0.166160\n", "max 13540.000000 1.000000 1.000000" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "x = house_df['sqft_living']\n", "x_new = (x-x.min())/(x.max()-x.min())\n", "\n", "# reshaping x to be a n by 1 matrix since that's how scikit learn likes data for normalization\n", "x_reshaped = np.array(x).reshape(-1,1)\n", "x_new_sklearn = MinMaxScaler().fit_transform(x_reshaped)\n", "\n", "# Plotting the histogram of the variable before and after normalization\n", "fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(24,5))\n", "ax = ax.ravel()\n", "\n", "ax[0].hist(x, bins=100)\n", "ax[0].set_title('Histogram of sqft_living before normalization')\n", "\n", "ax[1].hist(x_new, bins=100)\n", "ax[1].set_title('Manually normalizing sqft_living')\n", "\n", "ax[2].hist(x_new_sklearn, bins=100)\n", "ax[2].set_title('Normalizing sqft_living using scikit learn');\n", "\n", "# making things a dataframe to check if they work\n", "pd.DataFrame({'x': x, 'x_new_manual': x_new, 'x_new_sklearn': x_new_sklearn.flatten()}).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The million dollar question**\n", "\n", "Should I standardize or normalize my data? [This](https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc), [this](https://medium.com/@swethalakshmanan14/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff) and [this](https://stackoverflow.com/questions/32108179/linear-regression-normalization-vs-standardization) are useful resources that I highly recommend. But in a nutshell, what they say is the following: \n", "\n", "**Pros of Normalization**\n", "\n", "1. Normalization (which rescales your data to the range 0 to 1) is widely used in image processing and computer vision, where pixel intensities are non-negative and are typically scaled from a 0-255 scale to a 0-1 range for a lot of different algorithms. \n", "2. Normalization is also very useful in neural networks (which we will see later in the course) as it tends to help the algorithms converge faster.\n", "3. Normalization is useful when your data does not have a discernible distribution and you are not making assumptions about your data's distribution.\n", "\n", "**Pros of Standardization**\n", "\n", "1. Standardization maintains outliers (do you see why?) whereas normalization makes outliers less obvious. In applications where outliers are useful, standardization should be done.\n", "2. Standardization is useful when you assume your data comes from a Gaussian distribution (or something that is approximately Gaussian). \n", "\n", "**Some General Advice**\n", "\n", "1. We learn parameters for standardization ($\\mu$ and $\\sigma$) and for normalization ($x_{min}$ and $x_{max}$). Make sure these parameters are learned on the training set, i.e., use the training set parameters even when normalizing/standardizing the test set. In sklearn terms, fit your scaler on the training set and use the scaler to transform your test set and validation set (**don't re-fit your scaler on test set data!**).\n", "2. The point of standardization and normalization is to make your variables take on a more manageable scale. You should ideally standardize or normalize all your variables at the same time. \n", "3. Standardization and normalization are not always needed and are not something you have to do automatically on every data science homework!! Do so sparingly and try to justify why this is needed.\n", "\n", "**Interpreting Coefficients**\n", "\n", "A great quote from [here](https://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia):\n", "\n", "> [Standardization] makes it so the intercept term is interpreted as the expected value of 𝑌𝑖 when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of 𝑌𝑖 when the predictors are set to 0, which may not be a realistic or interpretable situation (e.g. what if the predictors were height and weight?)\n", "\n", "### Standardizing our Design Matrix" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " bedrooms bathrooms sqft_living sqft_lot floors \\\n", "count 2.400000e+03 2.400000e+03 2.400000e+03 2.400000e+03 2.400000e+03 \n", "mean 2.250977e-16 4.503342e-17 1.471971e-16 -2.406640e-17 2.680263e-16 \n", "std 1.000208e+00 1.000208e+00 1.000208e+00 1.000208e+00 1.000208e+00 \n", "min -3.618993e+00 -2.677207e+00 -1.766429e+00 -3.364203e-01 -8.897850e-01 \n", "25% -4.009185e-01 -4.598398e-01 -7.087691e-01 -2.324570e-01 -8.897850e-01 \n", "50% -4.009185e-01 1.736938e-01 -1.933403e-01 -1.774091e-01 -8.897850e-01 \n", "75% 6.717731e-01 4.904606e-01 4.973342e-01 -1.033061e-01 9.975186e-01 \n", "max 7.107923e+00 7.459330e+00 1.179553e+01 1.945618e+01 3.828474e+00 \n", "\n", " sqft_above sqft_basement lat long \n", "count 2.400000e+03 2.400000e+03 2.400000e+03 2.400000e+03 \n", "mean -8.234154e-18 -1.709281e-16 4.928733e-14 4.897231e-14 \n", "std 1.000208e+00 1.000208e+00 1.000208e+00 1.000208e+00 \n", "min -1.636285e+00 -6.704685e-01 -2.937091e+00 -2.084576e+00 \n", "25% -7.089826e-01 -6.704685e-01 -6.732889e-01 -8.086270e-01 \n", "50% -2.895998e-01 -6.704685e-01 8.468878e-02 -1.278830e-01 \n", "75% 5.375162e-01 6.315842e-01 8.607566e-01 6.455277e-01 \n", "max 8.878574e+00 8.291994e+00 1.560846e+00 6.062967e+00 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " bedrooms bathrooms sqft_living sqft_lot floors \\\n", "count 800.000000 800.000000 800.000000 800.000000 800.000000 \n", "mean 0.018772 0.041444 0.024401 0.016506 0.026737 \n", "std 0.982683 0.997594 0.989079 1.074079 0.991645 \n", "min -2.546302 -1.726907 -1.626232 -0.328715 -0.889785 \n", "25% -0.400918 -0.459840 -0.677843 -0.234254 -0.889785 \n", "50% -0.400918 0.173694 -0.172723 -0.177521 0.053867 \n", "75% 0.671773 0.490461 0.487026 -0.113533 0.997519 \n", "max 8.180614 4.925195 7.321611 21.716593 2.884822 \n", "\n", " sqft_above sqft_basement lat long \n", "count 800.000000 800.000000 800.000000 800.000000 \n", "mean 0.044415 -0.031370 -0.056059 0.016900 \n", "std 0.993807 0.999638 1.008010 1.028649 \n", "min -1.477851 -0.670469 -2.693960 -2.141602 \n", "25% -0.685684 -0.670469 -0.737509 -0.815755 \n", "50% -0.266301 -0.670469 0.031504 -0.088678 \n", "75% 0.595764 0.550206 0.817340 0.597412 \n", "max 5.139078 5.839795 1.554333 6.369480 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " bedrooms bathrooms sqft_living sqft_lot floors \\\n", "count 800.000000 800.000000 800.000000 800.000000 800.000000 \n", "mean 0.010727 -0.018346 -0.029080 0.052808 0.006684 \n", "std 0.965422 0.963162 0.946367 1.569619 1.012587 \n", "min -3.618993 -2.677207 -1.657158 -0.335005 -0.889785 \n", "25% -0.400918 -0.459840 -0.701038 -0.229160 -0.889785 \n", "50% -0.400918 0.173694 -0.183032 -0.174051 -0.889785 \n", "75% 0.671773 0.490461 0.417443 -0.100683 0.997519 \n", "max 4.962540 3.658128 5.785633 36.746809 3.828474 \n", "\n", " sqft_above sqft_basement lat long \n", "count 800.000000 800.000000 800.000000 800.000000 \n", "mean -0.021866 -0.020484 -0.005631 0.000897 \n", "std 0.955805 0.938793 1.022830 1.028155 \n", "min -1.524449 -0.670469 -2.656332 -1.778063 \n", "25% -0.708983 -0.670469 -0.624084 -0.789024 \n", "50% -0.295425 -0.670469 0.174054 -0.174216 \n", "75% 0.479269 0.588182 0.842124 0.604540 \n", "max 5.010933 4.016921 1.544926 6.412249 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement',\n", " 'lat', 'long']\n", "\n", "X_train = train_df[features]\n", "y_train = np.array(train_df['price']).reshape(-1,1)\n", "\n", "X_val = val_df[features]\n", "y_val = np.array(val_df['price']).reshape(-1,1)\n", "\n", "X_test = test_df[features]\n", "y_test = np.array(test_df['price']).reshape(-1,1)\n", "\n", "scaler = StandardScaler().fit(X_train)\n", "\n", "# This converts our matrices into numpy matrices\n", "X_train_t = scaler.transform(X_train)\n", "X_val_t = scaler.transform(X_val)\n", "X_test_t = scaler.transform(X_test)\n", "\n", "# Making the numpy matrices pandas dataframes\n", "X_train_df = pd.DataFrame(X_train_t, columns=features)\n", "X_val_df = pd.DataFrame(X_val_t, columns=features)\n", "X_test_df = pd.DataFrame(X_test_t, columns=features)\n", "\n", "display(X_train_df.describe())\n", 
"display(X_val_df.describe())\n", "display(X_test_df.describe())" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler().fit(y_train)\n", "y_train = scaler.transform(y_train)\n", "y_val = scaler.transform(y_val)\n", "y_test = scaler.transform(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One-Degree Polynomial Model" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: y R-squared: 0.586
Model: OLS Adj. R-squared: 0.584
Method: Least Squares F-statistic: 422.3
Date: Mon, 07 Oct 2019 Prob (F-statistic): 0.00
Time: 15:13:28 Log-Likelihood: -2348.5
No. Observations: 2400 AIC: 4715.
Df Residuals: 2391 BIC: 4767.
Df Model: 8
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const -5.145e-15 0.013 -3.91e-13 1.000 -0.026 0.026
bedrooms -0.1592 0.017 -9.505 0.000 -0.192 -0.126
bathrooms 0.0422 0.022 1.914 0.056 -0.001 0.085
sqft_living 0.4011 0.011 36.238 0.000 0.379 0.423
sqft_lot -0.0058 0.014 -0.420 0.675 -0.033 0.021
floors -0.0470 0.017 -2.690 0.007 -0.081 -0.013
sqft_above 0.3866 0.013 30.254 0.000 0.362 0.412
sqft_basement 0.1242 0.014 8.651 0.000 0.096 0.152
lat 0.2414 0.013 17.983 0.000 0.215 0.268
long -0.1388 0.014 -9.605 0.000 -0.167 -0.110
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 1646.401 Durbin-Watson: 2.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 52596.394
Skew: 2.797 Prob(JB): 0.00
Kurtosis: 25.241 Cond. No. 2.95e+19


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.63e-36. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.586\n", "Model: OLS Adj. R-squared: 0.584\n", "Method: Least Squares F-statistic: 422.3\n", "Date: Mon, 07 Oct 2019 Prob (F-statistic): 0.00\n", "Time: 15:13:28 Log-Likelihood: -2348.5\n", "No. Observations: 2400 AIC: 4715.\n", "Df Residuals: 2391 BIC: 4767.\n", "Df Model: 8 \n", "Covariance Type: nonrobust \n", "=================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "---------------------------------------------------------------------------------\n", "const -5.145e-15 0.013 -3.91e-13 1.000 -0.026 0.026\n", "bedrooms -0.1592 0.017 -9.505 0.000 -0.192 -0.126\n", "bathrooms 0.0422 0.022 1.914 0.056 -0.001 0.085\n", "sqft_living 0.4011 0.011 36.238 0.000 0.379 0.423\n", "sqft_lot -0.0058 0.014 -0.420 0.675 -0.033 0.021\n", "floors -0.0470 0.017 -2.690 0.007 -0.081 -0.013\n", "sqft_above 0.3866 0.013 30.254 0.000 0.362 0.412\n", "sqft_basement 0.1242 0.014 8.651 0.000 0.096 0.152\n", "lat 0.2414 0.013 17.983 0.000 0.215 0.268\n", "long -0.1388 0.014 -9.605 0.000 -0.167 -0.110\n", "==============================================================================\n", "Omnibus: 1646.401 Durbin-Watson: 2.009\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 52596.394\n", "Skew: 2.797 Prob(JB): 0.00\n", "Kurtosis: 25.241 Cond. No. 2.95e+19\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The smallest eigenvalue is 9.63e-36. 
This might indicate that there are\n", "strong multicollinearity problems or that the design matrix is singular.\n", "\"\"\"" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import statsmodels.api as sm\n", "from statsmodels.regression.linear_model import OLS\n", "\n", "model_1 = OLS(np.array(y_train).reshape(-1,1), sm.add_constant(X_train_df)).fit()\n", "model_1.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Two-Degree Polynomial Model" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2400, 9) (2400, 18)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " bedrooms bathrooms sqft_living sqft_lot floors sqft_above \\\n", "0 -0.400918 -0.459840 -0.533523 -0.184294 -0.889785 -0.243002 \n", "1 -0.400918 1.123994 0.919986 -0.129729 0.997519 1.399581 \n", "2 0.671773 0.490461 -0.049020 -0.167446 -0.889785 -0.860426 \n", "3 -0.400918 0.490461 -0.121180 -0.035583 -0.889785 0.222979 \n", "4 -0.400918 0.490461 -0.327352 -0.187215 -0.889785 -0.452693 \n", "\n", " sqft_basement lat long bedrooms^2 bathrooms^2 sqft_living^2 \\\n", "0 -0.670469 -0.261919 -1.179294 -0.462425 -0.498149 -0.435619 \n", "1 -0.670469 0.525365 0.289117 -0.462425 0.962623 0.551247 \n", "2 1.499619 0.720739 0.545733 0.533184 0.286055 -0.174327 \n", "3 -0.670469 0.066599 -0.088678 -0.462425 0.286055 -0.217531 \n", "4 0.154165 -1.411729 0.232092 -0.462425 0.286055 -0.332701 \n", "\n", " sqft_lot^2 floors^2 sqft_above^2 sqft_basement^2 lat^2 long^2 \n", "0 -0.081332 -0.820725 -0.317640 -0.429442 -0.263451 1.180094 \n", "1 -0.079773 0.882097 1.104202 -0.429442 0.524670 -0.289785 \n", "2 -0.080898 -0.820725 -0.625213 0.965746 0.720531 -0.546402 \n", "3 -0.076044 -0.820725 -0.003426 -0.429442 0.065197 0.088151 \n", "4 -0.081403 -0.820725 -0.436000 -0.227977 -1.411246 -0.232748 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def add_square_terms(df):\n", " df = df.copy()\n", " cols = df.columns.copy()\n", " for col in cols:\n", " df['{}^2'.format(col)] = df[col]**2\n", " return df\n", "\n", "X_train_df_2 = add_square_terms(X_train)\n", "X_val_df_2 = add_square_terms(X_val)\n", "\n", "# Standardizing our added coefficients\n", "cols = X_train_df_2.columns\n", "scaler = StandardScaler().fit(X_train_df_2)\n", "X_train_df_2 = pd.DataFrame(scaler.transform(X_train_df_2), columns=cols)\n", "X_val_df_2 = pd.DataFrame(scaler.transform(X_val_df_2), columns=cols)\n", "\n", "print(X_train_df.shape, X_train_df_2.shape)\n", "\n", "# Also check using the describe() function that the mean and standard 
deviations are the way we want them\n", "X_train_df_2.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: y R-squared: 0.612
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 220.8
Date: Mon, 07 Oct 2019 Prob (F-statistic): 0.00
Time: 15:13:28 Log-Likelihood: -2269.9
No. Observations: 2400 AIC: 4576.
Df Residuals: 2382 BIC: 4680.
Df Model: 17
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const -6.175e-12 0.013 -4.84e-10 1.000 -0.025 0.025
bedrooms -0.1271 0.058 -2.186 0.029 -0.241 -0.013
bathrooms 0.1537 0.060 2.569 0.010 0.036 0.271
sqft_living 0.3406 0.026 12.895 0.000 0.289 0.392
sqft_lot -0.0278 0.030 -0.920 0.358 -0.087 0.032
floors -0.1006 0.087 -1.151 0.250 -0.272 0.071
sqft_above 0.2460 0.036 6.809 0.000 0.175 0.317
sqft_basement 0.2587 0.033 7.758 0.000 0.193 0.324
lat 83.5852 8.613 9.705 0.000 66.696 100.474
long -7.0103 16.124 -0.435 0.664 -38.628 24.608
bedrooms^2 -0.0117 0.057 -0.207 0.836 -0.123 0.099
bathrooms^2 -0.1395 0.061 -2.293 0.022 -0.259 -0.020
sqft_living^2 0.2606 0.104 2.498 0.013 0.056 0.465
sqft_lot^2 0.0395 0.029 1.366 0.172 -0.017 0.096
floors^2 0.0449 0.083 0.539 0.590 -0.118 0.208
sqft_above^2 0.0384 0.105 0.366 0.714 -0.167 0.244
sqft_basement^2 -0.2640 0.049 -5.424 0.000 -0.359 -0.169
lat^2 -83.3483 8.612 -9.678 0.000 -100.237 -66.460
long^2 -6.8786 16.124 -0.427 0.670 -38.498 24.741
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 1594.128 Durbin-Watson: 2.011
Prob(Omnibus): 0.000 Jarque-Bera (JB): 41401.592
Skew: 2.739 Prob(JB): 0.00
Kurtosis: 22.596 Cond. No. 2.09e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.68e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.612\n", "Model: OLS Adj. R-squared: 0.609\n", "Method: Least Squares F-statistic: 220.8\n", "Date: Mon, 07 Oct 2019 Prob (F-statistic): 0.00\n", "Time: 15:13:28 Log-Likelihood: -2269.9\n", "No. Observations: 2400 AIC: 4576.\n", "Df Residuals: 2382 BIC: 4680.\n", "Df Model: 17 \n", "Covariance Type: nonrobust \n", "===================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------\n", "const -6.175e-12 0.013 -4.84e-10 1.000 -0.025 0.025\n", "bedrooms -0.1271 0.058 -2.186 0.029 -0.241 -0.013\n", "bathrooms 0.1537 0.060 2.569 0.010 0.036 0.271\n", "sqft_living 0.3406 0.026 12.895 0.000 0.289 0.392\n", "sqft_lot -0.0278 0.030 -0.920 0.358 -0.087 0.032\n", "floors -0.1006 0.087 -1.151 0.250 -0.272 0.071\n", "sqft_above 0.2460 0.036 6.809 0.000 0.175 0.317\n", "sqft_basement 0.2587 0.033 7.758 0.000 0.193 0.324\n", "lat 83.5852 8.613 9.705 0.000 66.696 100.474\n", "long -7.0103 16.124 -0.435 0.664 -38.628 24.608\n", "bedrooms^2 -0.0117 0.057 -0.207 0.836 -0.123 0.099\n", "bathrooms^2 -0.1395 0.061 -2.293 0.022 -0.259 -0.020\n", "sqft_living^2 0.2606 0.104 2.498 0.013 0.056 0.465\n", "sqft_lot^2 0.0395 0.029 1.366 0.172 -0.017 0.096\n", "floors^2 0.0449 0.083 0.539 0.590 -0.118 0.208\n", "sqft_above^2 0.0384 0.105 0.366 0.714 -0.167 0.244\n", "sqft_basement^2 -0.2640 0.049 -5.424 0.000 -0.359 -0.169\n", "lat^2 -83.3483 8.612 -9.678 0.000 -100.237 -66.460\n", "long^2 -6.8786 16.124 -0.427 0.670 -38.498 24.741\n", "==============================================================================\n", "Omnibus: 1594.128 Durbin-Watson: 2.011\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 
41401.592\n", "Skew: 2.739 Prob(JB): 0.00\n", "Kurtosis: 22.596 Cond. No. 2.09e+16\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The smallest eigenvalue is 3.68e-29. This might indicate that there are\n", "strong multicollinearity problems or that the design matrix is singular.\n", "\"\"\"" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_2 = OLS(np.array(y_train).reshape(-1,1), sm.add_constant(X_train_df_2)).fit()\n", "model_2.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Three-Degree Polynomial Model" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2400, 9) (2400, 27)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " bedrooms bathrooms sqft_living sqft_lot floors sqft_above \\\n", "0 -0.400918 -0.459840 -0.533523 -0.184294 -0.889785 -0.243002 \n", "1 -0.400918 1.123994 0.919986 -0.129729 0.997519 1.399581 \n", "2 0.671773 0.490461 -0.049020 -0.167446 -0.889785 -0.860426 \n", "3 -0.400918 0.490461 -0.121180 -0.035583 -0.889785 0.222979 \n", "4 -0.400918 0.490461 -0.327352 -0.187215 -0.889785 -0.452693 \n", "\n", " sqft_basement lat long bedrooms^2 bedrooms^3 bathrooms^2 \\\n", "0 -0.670469 -0.261919 -1.179294 -0.462425 -0.433878 -0.498149 \n", "1 -0.670469 0.525365 0.289117 -0.462425 -0.433878 0.962623 \n", "2 1.499619 0.720739 0.545733 0.533184 0.342544 0.286055 \n", "3 -0.670469 0.066599 -0.088678 -0.462425 -0.433878 0.286055 \n", "4 0.154165 -1.411729 0.232092 -0.462425 -0.433878 0.286055 \n", "\n", " bathrooms^3 sqft_living^2 sqft_living^3 sqft_lot^2 sqft_lot^3 \\\n", "0 -0.388130 -0.435619 -0.219568 -0.081332 -0.052800 \n", "1 0.609667 0.551247 0.166351 -0.079773 -0.052774 \n", "2 0.085193 -0.174327 -0.140462 -0.080898 -0.052793 \n", "3 0.085193 -0.217531 -0.154904 -0.076044 -0.052686 \n", "4 0.085193 -0.332701 -0.190853 -0.081403 -0.052801 \n", "\n", " floors^2 floors^3 sqft_above^2 sqft_above^3 sqft_basement^2 \\\n", "0 -0.820725 -0.716812 -0.317640 -0.259891 -0.429442 \n", "1 0.882097 0.708772 1.104202 0.616150 -0.429442 \n", "2 -0.820725 -0.716812 -0.625213 -0.367027 0.965746 \n", "3 -0.820725 -0.716812 -0.003426 -0.113103 -0.429442 \n", "4 -0.820725 -0.716812 -0.436000 -0.306038 -0.227977 \n", "\n", " sqft_basement^3 lat^2 lat^3 long^2 long^3 \n", "0 -0.212933 -0.263451 -0.264982 1.180094 -1.180893 \n", "1 -0.212933 0.524670 0.523971 -0.289785 0.290452 \n", "2 0.341578 0.720531 0.720318 -0.546402 0.547071 \n", "3 -0.212933 0.065197 0.063793 0.088151 -0.087625 \n", "4 -0.182506 -1.411246 -1.410757 -0.232748 0.233404 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# generalizing our function from 
above\n", "def add_square_and_cube_terms(df):\n", "    df = df.copy()\n", "    cols = df.columns.copy()\n", "    for col in cols:\n", "        df['{}^2'.format(col)] = df[col]**2\n", "        df['{}^3'.format(col)] = df[col]**3\n", "    return df\n", "\n", "X_train_df_3 = add_square_and_cube_terms(X_train)\n", "X_val_df_3 = add_square_and_cube_terms(X_val)\n", "\n", "# Standardizing the original, squared, and cubed features\n", "cols = X_train_df_3.columns\n", "scaler = StandardScaler().fit(X_train_df_3)\n", "X_train_df_3 = pd.DataFrame(scaler.transform(X_train_df_3), columns=cols)\n", "X_val_df_3 = pd.DataFrame(scaler.transform(X_val_df_3), columns=cols)\n", "\n", "print(X_train_df.shape, X_train_df_3.shape)\n", "\n", "# Also use describe() to check that each column has mean ~0 and standard deviation ~1\n", "X_train_df_3.head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: y R-squared: 0.698
Model: OLS Adj. R-squared: 0.695
Method: Least Squares F-statistic: 211.2
Date: Mon, 07 Oct 2019 Prob (F-statistic): 0.00
Time: 15:13:28 Log-Likelihood: -1967.6
No. Observations: 2400 AIC: 3989.
Df Residuals: 2373 BIC: 4145.
Df Model: 26
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const 6.291e-09 0.011 5.58e-07 1.000 -0.022 0.022
bedrooms 0.2510 0.120 2.094 0.036 0.016 0.486
bathrooms -0.2267 0.120 -1.893 0.058 -0.461 0.008
sqft_living 0.2066 0.057 3.622 0.000 0.095 0.319
sqft_lot 0.0520 0.047 1.098 0.272 -0.041 0.145
floors 1.1766 0.624 1.886 0.059 -0.047 2.400
sqft_above 0.4473 0.078 5.715 0.000 0.294 0.601
sqft_basement -0.3982 0.060 -6.640 0.000 -0.516 -0.281
lat -6.976e+04 3592.874 -19.417 0.000 -7.68e+04 -6.27e+04
long 3.138e+04 8572.131 3.661 0.000 1.46e+04 4.82e+04
bedrooms^2 -0.5999 0.218 -2.757 0.006 -1.027 -0.173
bedrooms^3 0.2824 0.111 2.535 0.011 0.064 0.501
bathrooms^2 0.8180 0.216 3.779 0.000 0.394 1.243
bathrooms^3 -0.6577 0.119 -5.531 0.000 -0.891 -0.425
sqft_living^2 2.2922 0.273 8.394 0.000 1.757 2.828
sqft_living^3 -1.2107 0.219 -5.535 0.000 -1.640 -0.782
sqft_lot^2 -0.1480 0.123 -1.207 0.227 -0.388 0.092
sqft_lot^3 0.1319 0.088 1.491 0.136 -0.042 0.305
floors^2 -2.3337 1.147 -2.034 0.042 -4.583 -0.084
floors^3 1.1113 0.543 2.048 0.041 0.047 2.175
sqft_above^2 -2.1768 0.351 -6.194 0.000 -2.866 -1.488
sqft_above^3 1.3407 0.228 5.876 0.000 0.893 1.788
sqft_basement^2 0.0363 0.149 0.243 0.808 -0.256 0.328
sqft_basement^3 -0.1946 0.126 -1.546 0.122 -0.441 0.052
lat^2 1.397e+05 7188.240 19.429 0.000 1.26e+05 1.54e+05
lat^3 -6.99e+04 3595.377 -19.441 0.000 -7.69e+04 -6.28e+04
long^2 6.284e+04 1.72e+04 3.663 0.000 2.92e+04 9.65e+04
long^3 3.145e+04 8583.990 3.664 0.000 1.46e+04 4.83e+04
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 1337.688 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 29677.038
Skew: 2.170 Prob(JB): 0.00
Kurtosis: 19.672 Cond. No. 2.25e+15


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.47e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.698\n", "Model: OLS Adj. R-squared: 0.695\n", "Method: Least Squares F-statistic: 211.2\n", "Date: Mon, 07 Oct 2019 Prob (F-statistic): 0.00\n", "Time: 15:13:28 Log-Likelihood: -1967.6\n", "No. Observations: 2400 AIC: 3989.\n", "Df Residuals: 2373 BIC: 4145.\n", "Df Model: 26 \n", "Covariance Type: nonrobust \n", "===================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------\n", "const 6.291e-09 0.011 5.58e-07 1.000 -0.022 0.022\n", "bedrooms 0.2510 0.120 2.094 0.036 0.016 0.486\n", "bathrooms -0.2267 0.120 -1.893 0.058 -0.461 0.008\n", "sqft_living 0.2066 0.057 3.622 0.000 0.095 0.319\n", "sqft_lot 0.0520 0.047 1.098 0.272 -0.041 0.145\n", "floors 1.1766 0.624 1.886 0.059 -0.047 2.400\n", "sqft_above 0.4473 0.078 5.715 0.000 0.294 0.601\n", "sqft_basement -0.3982 0.060 -6.640 0.000 -0.516 -0.281\n", "lat -6.976e+04 3592.874 -19.417 0.000 -7.68e+04 -6.27e+04\n", "long 3.138e+04 8572.131 3.661 0.000 1.46e+04 4.82e+04\n", "bedrooms^2 -0.5999 0.218 -2.757 0.006 -1.027 -0.173\n", "bedrooms^3 0.2824 0.111 2.535 0.011 0.064 0.501\n", "bathrooms^2 0.8180 0.216 3.779 0.000 0.394 1.243\n", "bathrooms^3 -0.6577 0.119 -5.531 0.000 -0.891 -0.425\n", "sqft_living^2 2.2922 0.273 8.394 0.000 1.757 2.828\n", "sqft_living^3 -1.2107 0.219 -5.535 0.000 -1.640 -0.782\n", "sqft_lot^2 -0.1480 0.123 -1.207 0.227 -0.388 0.092\n", "sqft_lot^3 0.1319 0.088 1.491 0.136 -0.042 0.305\n", "floors^2 -2.3337 1.147 -2.034 0.042 -4.583 -0.084\n", "floors^3 1.1113 0.543 2.048 0.041 0.047 2.175\n", "sqft_above^2 -2.1768 0.351 -6.194 0.000 -2.866 -1.488\n", "sqft_above^3 1.3407 0.228 5.876 0.000 0.893 
1.788\n", "sqft_basement^2 0.0363 0.149 0.243 0.808 -0.256 0.328\n", "sqft_basement^3 -0.1946 0.126 -1.546 0.122 -0.441 0.052\n", "lat^2 1.397e+05 7188.240 19.429 0.000 1.26e+05 1.54e+05\n", "lat^3 -6.99e+04 3595.377 -19.441 0.000 -7.69e+04 -6.28e+04\n", "long^2 6.284e+04 1.72e+04 3.663 0.000 2.92e+04 9.65e+04\n", "long^3 3.145e+04 8583.990 3.664 0.000 1.46e+04 4.83e+04\n", "==============================================================================\n", "Omnibus: 1337.688 Durbin-Watson: 1.993\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 29677.038\n", "Skew: 2.170 Prob(JB): 0.00\n", "Kurtosis: 19.672 Cond. No. 2.25e+15\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The smallest eigenvalue is 4.47e-27. This might indicate that there are\n", "strong multicollinearity problems or that the design matrix is singular.\n", "\"\"\"" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_3 = OLS(np.array(y_train).reshape(-1,1), sm.add_constant(X_train_df_3)).fit()\n", "model_3.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## N-Degree Polynomial Model" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2400, 9) (2400, 72)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " bedrooms bathrooms sqft_living sqft_lot floors sqft_above \\\n", "0 -0.400918 -0.459840 -0.533523 -0.184294 -0.889785 -0.243002 \n", "1 -0.400918 1.123994 0.919986 -0.129729 0.997519 1.399581 \n", "2 0.671773 0.490461 -0.049020 -0.167446 -0.889785 -0.860426 \n", "3 -0.400918 0.490461 -0.121180 -0.035583 -0.889785 0.222979 \n", "4 -0.400918 0.490461 -0.327352 -0.187215 -0.889785 -0.452693 \n", "\n", " sqft_basement lat long bedrooms^2 bedrooms^3 bedrooms^4 \\\n", "0 -0.670469 -0.261919 -1.179294 -0.462425 -0.433878 -0.339413 \n", "1 -0.670469 0.525365 0.289117 -0.462425 -0.433878 -0.339413 \n", "2 1.499619 0.720739 0.545733 0.533184 0.342544 0.161719 \n", "3 -0.670469 0.066599 -0.088678 -0.462425 -0.433878 -0.339413 \n", "4 0.154165 -1.411729 0.232092 -0.462425 -0.433878 -0.339413 \n", "\n", " bedrooms^5 bedrooms^6 bedrooms^7 bedrooms^8 bathrooms^2 bathrooms^3 \\\n", "0 -0.231621 -0.149138 -0.097431 -0.067607 -0.498149 -0.388130 \n", "1 -0.231621 -0.149138 -0.097431 -0.067607 0.962623 0.609667 \n", "2 0.040880 -0.017513 -0.038115 -0.041916 0.286055 0.085193 \n", "3 -0.231621 -0.149138 -0.097431 -0.067607 0.286055 0.085193 \n", "4 -0.231621 -0.149138 -0.097431 -0.067607 0.286055 0.085193 \n", "\n", " bathrooms^4 bathrooms^5 bathrooms^6 bathrooms^7 bathrooms^8 \\\n", "0 -0.241435 -0.141594 -0.088630 -0.062342 -0.049025 \n", "1 0.282197 0.093936 0.009656 -0.022846 -0.033483 \n", "2 -0.024414 -0.057144 -0.058395 -0.051989 -0.045573 \n", "3 -0.024414 -0.057144 -0.058395 -0.051989 -0.045573 \n", "4 -0.024414 -0.057144 -0.058395 -0.051989 -0.045573 \n", "\n", " sqft_living^2 sqft_living^3 sqft_living^4 sqft_living^5 sqft_living^6 \\\n", "0 -0.435619 -0.219568 -0.091216 -0.246556 -0.658169 \n", "1 0.551247 0.166351 0.012318 0.119550 -1.053598 \n", "2 -0.174327 -0.140462 -0.075159 -0.204441 0.001085 \n", "3 -0.217531 -0.154904 -0.078379 -0.213665 0.890215 \n", "4 -0.332701 -0.190853 -0.085868 -0.233738 -1.094200 \n", "\n", " sqft_living^7 sqft_living^8 
sqft_lot^2 sqft_lot^3 sqft_lot^4 \\\n", "0 -1.501069 1.377053 -0.081332 -0.052800 -0.116437 \n", "1 -1.021276 0.389612 -0.079773 -0.052774 -0.111868 \n", "2 0.687907 -0.472558 -0.080898 -0.052793 -0.115420 \n", "3 -1.628144 0.193797 -0.076044 -0.052686 -0.090616 \n", "4 0.691654 1.372554 -0.081403 -0.052801 -0.116585 \n", "\n", " sqft_lot^5 sqft_lot^6 sqft_lot^7 sqft_lot^8 floors^2 floors^3 \\\n", "0 0.432539 0.123517 0.488042 0.758862 -0.820725 -0.716812 \n", "1 -0.810279 0.823815 0.199480 1.000871 0.882097 0.708772 \n", "2 -0.618477 -1.404683 0.374544 -0.388685 -0.820725 -0.716812 \n", "3 -1.003240 -1.014952 1.084197 -0.263787 -0.820725 -0.716812 \n", "4 0.054276 -0.846603 0.175618 -1.669502 -0.820725 -0.716812 \n", "\n", " floors^4 floors^5 floors^6 floors^7 floors^8 sqft_above^2 \\\n", "0 -0.591427 -0.470475 -0.371825 -0.298830 -0.247064 -0.317640 \n", "1 0.505638 0.316005 0.166200 0.059351 -0.012432 1.104202 \n", "2 -0.591427 -0.470475 -0.371825 -0.298830 -0.247064 -0.625213 \n", "3 -0.591427 -0.470475 -0.371825 -0.298830 -0.247064 -0.003426 \n", "4 -0.591427 -0.470475 -0.371825 -0.298830 -0.247064 -0.436000 \n", "\n", " sqft_above^3 sqft_above^4 sqft_above^5 sqft_above^6 sqft_above^7 \\\n", "0 -0.259891 -0.157166 -0.178065 -0.799711 -1.507797 \n", "1 0.616150 0.231521 0.288260 -1.225165 -1.032551 \n", "2 -0.367027 -0.183625 -0.195507 0.103163 1.058180 \n", "3 -0.113103 -0.108971 -0.136170 0.866240 -1.633667 \n", "4 -0.306038 -0.169774 -0.187159 1.424508 1.642941 \n", "\n", " sqft_above^8 sqft_basement^2 sqft_basement^3 sqft_basement^4 \\\n", "0 1.422035 -0.429442 -0.212933 -0.096048 \n", "1 0.409224 -0.429442 -0.212933 -0.096048 \n", "2 0.032349 0.965746 0.341578 0.061803 \n", "3 0.208379 -0.429442 -0.212933 -0.096048 \n", "4 -0.565807 -0.227977 -0.182506 -0.092756 \n", "\n", " sqft_basement^5 sqft_basement^6 sqft_basement^7 sqft_basement^8 \\\n", "0 -0.050491 -0.161975 -0.055312 0.108540 \n", "1 -0.050491 -0.161975 -0.055312 0.108540 \n", "2 -0.010518 
0.609624 1.172741 0.714587 \n", "3 -0.050491 -0.161975 -0.055312 0.108540 \n", "4 -0.050174 -0.159652 0.307214 -2.293748 \n", "\n", " lat^2 lat^3 lat^4 lat^5 lat^6 lat^7 lat^8 \\\n", "0 -0.263451 -0.264982 -0.266511 -0.268039 -0.269565 -0.271089 -0.272612 \n", "1 0.524670 0.523971 0.523268 0.522562 0.521851 0.521137 0.520419 \n", "2 0.720531 0.720318 0.720101 0.719880 0.719655 0.719425 0.719191 \n", "3 0.065197 0.063793 0.062389 0.060984 0.059577 0.058170 0.056761 \n", "4 -1.411246 -1.410757 -1.410261 -1.409758 -1.409249 -1.408733 -1.408211 \n", "\n", " long^2 long^3 long^4 long^5 long^6 long^7 long^8 \n", "0 1.180094 -1.180893 1.181692 -1.182490 1.183286 -1.184082 1.184877 \n", "1 -0.289785 0.290452 -0.291118 0.291784 -0.292450 0.293116 -0.293781 \n", "2 -0.546402 0.547071 -0.547739 0.548406 -0.549072 0.549737 -0.550401 \n", "3 0.088151 -0.087625 0.087098 -0.086570 0.086042 -0.085514 0.084986 \n", "4 -0.232748 0.233404 -0.234061 0.234716 -0.235372 0.236027 -0.236682 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generalizing our function from above: add powers 2..N of every column\n", "def add_higher_order_polynomial_terms(df, N=7):\n", "    df = df.copy()\n", "    cols = df.columns.copy()\n", "    for col in cols:\n", "        for i in range(2, N+1):\n", "            df['{}^{}'.format(col, i)] = df[col]**i\n", "    return df\n", "\n", "N = 8\n", "X_train_df_N = add_higher_order_polynomial_terms(X_train,N)\n", "X_val_df_N = add_higher_order_polynomial_terms(X_val,N)\n", "\n", "# Standardizing the original and added polynomial features (scaler is fit on the training set only)\n", "cols = X_train_df_N.columns\n", "scaler = StandardScaler().fit(X_train_df_N)\n", "X_train_df_N = pd.DataFrame(scaler.transform(X_train_df_N), columns=cols)\n", "X_val_df_N = pd.DataFrame(scaler.transform(X_val_df_N), columns=cols)\n", "\n", "print(X_train_df.shape, X_train_df_N.shape)\n", "\n", "# Also check with describe() that the means and standard deviations are what we expect\n", "X_train_df_N.head()" ] }, { "cell_type": "code", 
"execution_count": 21, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: y R-squared: 0.738
Model: OLS Adj. R-squared: 0.732
Method: Least Squares F-statistic: 106.4
Date: Mon, 07 Oct 2019 Prob (F-statistic): 0.00
Time: 15:13:28 Log-Likelihood: -1796.1
No. Observations: 2400 AIC: 3718.
Df Residuals: 2337 BIC: 4083.
Df Model: 62
Covariance Type: nonrobust
\n", "
coef std err t P>|t| [0.025 0.975]
const 0.0790 0.044 1.788 0.074 -0.008 0.166
bedrooms 0.7485 1.667 0.449 0.653 -2.520 4.017
bathrooms 4.0292 2.090 1.928 0.054 -0.069 8.127
sqft_living -1.825e+10 8.44e+09 -2.161 0.031 -3.48e+10 -1.69e+09
sqft_lot 0.1034 0.047 2.200 0.028 0.011 0.196
floors 5.427e+09 2.53e+09 2.146 0.032 4.67e+08 1.04e+10
sqft_above 1.615e+10 7.47e+09 2.161 0.031 1.49e+09 3.08e+10
sqft_basement 8.668e+09 4.01e+09 2.161 0.031 8.02e+08 1.65e+10
lat -6.58e+11 1.11e+11 -5.907 0.000 -8.76e+11 -4.4e+11
long 1.891e+11 1.71e+11 1.105 0.269 -1.47e+11 5.25e+11
bedrooms^2 -13.8231 16.496 -0.838 0.402 -46.172 18.526
bedrooms^3 85.7910 75.252 1.140 0.254 -61.777 233.359
bedrooms^4 -286.3415 209.397 -1.367 0.172 -696.964 124.281
bedrooms^5 585.7759 382.024 1.533 0.125 -163.365 1334.917
bedrooms^6 -724.1268 437.651 -1.655 0.098 -1582.351 134.097
bedrooms^7 487.2274 279.519 1.743 0.081 -60.903 1035.358
bedrooms^8 -135.4299 74.938 -1.807 0.071 -282.382 11.522
bathrooms^2 -50.0310 17.008 -2.942 0.003 -83.384 -16.678
bathrooms^3 271.5388 78.495 3.459 0.001 117.612 425.465
bathrooms^4 -902.0249 244.292 -3.692 0.000 -1381.076 -422.974
bathrooms^5 1860.8234 493.746 3.769 0.000 892.598 2829.048
bathrooms^6 -2254.0226 602.565 -3.741 0.000 -3435.640 -1072.405
bathrooms^7 1451.5753 399.258 3.636 0.000 668.638 2234.513
bathrooms^8 -381.7826 109.773 -3.478 0.001 -597.045 -166.520
sqft_living^2 3.2337 1.155 2.800 0.005 0.969 5.499
sqft_living^3 -4.1768 2.059 -2.029 0.043 -8.214 -0.139
sqft_living^4 3.5216 1.694 2.079 0.038 0.199 6.844
sqft_living^5 -0.0344 0.018 -1.869 0.062 -0.071 0.002
sqft_living^6 0.0243 0.013 1.815 0.070 -0.002 0.051
sqft_living^7 0.0117 0.014 0.856 0.392 -0.015 0.039
sqft_living^8 -0.0216 0.014 -1.575 0.115 -0.049 0.005
sqft_lot^2 -0.2332 0.121 -1.924 0.054 -0.471 0.004
sqft_lot^3 0.1763 0.087 2.023 0.043 0.005 0.347
sqft_lot^4 -0.0156 0.012 -1.337 0.181 -0.038 0.007
sqft_lot^5 0.0011 0.011 0.101 0.920 -0.020 0.022
sqft_lot^6 0.0004 0.011 0.037 0.970 -0.021 0.021
sqft_lot^7 -0.0104 0.011 -0.965 0.335 -0.031 0.011
sqft_lot^8 0.0107 0.011 0.993 0.321 -0.010 0.032
floors^2 -1.676e+10 7.8e+09 -2.147 0.032 -3.21e+10 -1.45e+09
floors^3 1.529e+10 7.1e+09 2.155 0.031 1.38e+09 2.92e+10
floors^4 4.149e+09 2.27e+09 1.830 0.067 -2.97e+08 8.59e+09
floors^5 -1.176e+10 5.92e+09 -1.986 0.047 -2.34e+10 -1.49e+08
floors^6 -3.924e+09 1.94e+09 -2.024 0.043 -7.73e+09 -1.22e+08
floors^7 1.239e+10 5.63e+09 2.200 0.028 1.34e+09 2.34e+10
floors^8 -4.878e+09 2.24e+09 -2.181 0.029 -9.27e+09 -4.91e+08
sqft_above^2 -1.5414 1.287 -1.198 0.231 -4.065 0.982
sqft_above^3 0.7910 1.763 0.449 0.654 -2.666 4.248
sqft_above^4 0.1489 1.039 0.143 0.886 -1.889 2.187
sqft_above^5 0.0015 0.017 0.090 0.928 -0.031 0.034
sqft_above^6 -0.0184 0.013 -1.368 0.171 -0.045 0.008
sqft_above^7 -0.0146 0.014 -1.066 0.287 -0.041 0.012
sqft_above^8 -0.0068 0.014 -0.496 0.620 -0.034 0.020
sqft_basement^2 1.9238 0.798 2.412 0.016 0.360 3.488
sqft_basement^3 -6.6967 2.368 -2.828 0.005 -11.341 -2.053
sqft_basement^4 12.2596 4.045 3.031 0.002 4.327 20.192
sqft_basement^5 -8.5928 2.694 -3.190 0.001 -13.875 -3.311
sqft_basement^6 0.0144 0.012 1.199 0.231 -0.009 0.038
sqft_basement^7 0.0116 0.011 1.059 0.290 -0.010 0.033
sqft_basement^8 -0.0086 0.011 -0.783 0.433 -0.030 0.013
lat^2 2.588e+12 5.62e+11 4.601 0.000 1.49e+12 3.69e+12
lat^3 -3.378e+12 1.13e+12 -2.991 0.003 -5.59e+12 -1.16e+12
lat^4 1.107e+12 1.04e+12 1.067 0.286 -9.28e+11 3.14e+12
lat^5 6.175e+11 2.07e+11 2.986 0.003 2.12e+11 1.02e+12
lat^6 2.786e+11 4.35e+11 0.640 0.522 -5.75e+11 1.13e+12
lat^7 -8.725e+11 3.57e+11 -2.447 0.014 -1.57e+12 -1.73e+11
lat^8 3.168e+11 8.68e+10 3.651 0.000 1.47e+11 4.87e+11
long^2 4.566e+11 5.54e+11 0.824 0.410 -6.3e+11 1.54e+12
long^3 -1.043e+11 7.18e+11 -0.145 0.885 -1.51e+12 1.3e+12
long^4 -7.335e+11 1.2e+12 -0.613 0.540 -3.08e+12 1.61e+12
long^5 6.222e+11 1.23e+12 0.507 0.612 -1.78e+12 3.03e+12
long^6 2.347e+12 1.36e+12 1.725 0.085 -3.2e+11 5.01e+12
long^7 1.831e+12 1.01e+12 1.809 0.071 -1.54e+11 3.82e+12
long^8 4.676e+11 2.65e+11 1.764 0.078 -5.22e+10 9.87e+11
\n", "
Omnibus: 1365.789 Durbin-Watson: 2.050
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31897.875
Skew: 2.217 Prob(JB): 0.00
Kurtosis: 20.301 Cond. No. 1.47e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.77e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 0.738\n", "Model: OLS Adj. R-squared: 0.732\n", "Method: Least Squares F-statistic: 106.4\n", "Date: Mon, 07 Oct 2019 Prob (F-statistic): 0.00\n", "Time: 15:13:28 Log-Likelihood: -1796.1\n", "No. Observations: 2400 AIC: 3718.\n", "Df Residuals: 2337 BIC: 4083.\n", "Df Model: 62 \n", "Covariance Type: nonrobust \n", "===================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------\n", "const 0.0790 0.044 1.788 0.074 -0.008 0.166\n", "bedrooms 0.7485 1.667 0.449 0.653 -2.520 4.017\n", "bathrooms 4.0292 2.090 1.928 0.054 -0.069 8.127\n", "sqft_living -1.825e+10 8.44e+09 -2.161 0.031 -3.48e+10 -1.69e+09\n", "sqft_lot 0.1034 0.047 2.200 0.028 0.011 0.196\n", "floors 5.427e+09 2.53e+09 2.146 0.032 4.67e+08 1.04e+10\n", "sqft_above 1.615e+10 7.47e+09 2.161 0.031 1.49e+09 3.08e+10\n", "sqft_basement 8.668e+09 4.01e+09 2.161 0.031 8.02e+08 1.65e+10\n", "lat -6.58e+11 1.11e+11 -5.907 0.000 -8.76e+11 -4.4e+11\n", "long 1.891e+11 1.71e+11 1.105 0.269 -1.47e+11 5.25e+11\n", "bedrooms^2 -13.8231 16.496 -0.838 0.402 -46.172 18.526\n", "bedrooms^3 85.7910 75.252 1.140 0.254 -61.777 233.359\n", "bedrooms^4 -286.3415 209.397 -1.367 0.172 -696.964 124.281\n", "bedrooms^5 585.7759 382.024 1.533 0.125 -163.365 1334.917\n", "bedrooms^6 -724.1268 437.651 -1.655 0.098 -1582.351 134.097\n", "bedrooms^7 487.2274 279.519 1.743 0.081 -60.903 1035.358\n", "bedrooms^8 -135.4299 74.938 -1.807 0.071 -282.382 11.522\n", "bathrooms^2 -50.0310 17.008 -2.942 0.003 -83.384 -16.678\n", "bathrooms^3 271.5388 78.495 3.459 0.001 117.612 425.465\n", "bathrooms^4 -902.0249 244.292 -3.692 0.000 -1381.076 
-422.974\n", "bathrooms^5 1860.8234 493.746 3.769 0.000 892.598 2829.048\n", "bathrooms^6 -2254.0226 602.565 -3.741 0.000 -3435.640 -1072.405\n", "bathrooms^7 1451.5753 399.258 3.636 0.000 668.638 2234.513\n", "bathrooms^8 -381.7826 109.773 -3.478 0.001 -597.045 -166.520\n", "sqft_living^2 3.2337 1.155 2.800 0.005 0.969 5.499\n", "sqft_living^3 -4.1768 2.059 -2.029 0.043 -8.214 -0.139\n", "sqft_living^4 3.5216 1.694 2.079 0.038 0.199 6.844\n", "sqft_living^5 -0.0344 0.018 -1.869 0.062 -0.071 0.002\n", "sqft_living^6 0.0243 0.013 1.815 0.070 -0.002 0.051\n", "sqft_living^7 0.0117 0.014 0.856 0.392 -0.015 0.039\n", "sqft_living^8 -0.0216 0.014 -1.575 0.115 -0.049 0.005\n", "sqft_lot^2 -0.2332 0.121 -1.924 0.054 -0.471 0.004\n", "sqft_lot^3 0.1763 0.087 2.023 0.043 0.005 0.347\n", "sqft_lot^4 -0.0156 0.012 -1.337 0.181 -0.038 0.007\n", "sqft_lot^5 0.0011 0.011 0.101 0.920 -0.020 0.022\n", "sqft_lot^6 0.0004 0.011 0.037 0.970 -0.021 0.021\n", "sqft_lot^7 -0.0104 0.011 -0.965 0.335 -0.031 0.011\n", "sqft_lot^8 0.0107 0.011 0.993 0.321 -0.010 0.032\n", "floors^2 -1.676e+10 7.8e+09 -2.147 0.032 -3.21e+10 -1.45e+09\n", "floors^3 1.529e+10 7.1e+09 2.155 0.031 1.38e+09 2.92e+10\n", "floors^4 4.149e+09 2.27e+09 1.830 0.067 -2.97e+08 8.59e+09\n", "floors^5 -1.176e+10 5.92e+09 -1.986 0.047 -2.34e+10 -1.49e+08\n", "floors^6 -3.924e+09 1.94e+09 -2.024 0.043 -7.73e+09 -1.22e+08\n", "floors^7 1.239e+10 5.63e+09 2.200 0.028 1.34e+09 2.34e+10\n", "floors^8 -4.878e+09 2.24e+09 -2.181 0.029 -9.27e+09 -4.91e+08\n", "sqft_above^2 -1.5414 1.287 -1.198 0.231 -4.065 0.982\n", "sqft_above^3 0.7910 1.763 0.449 0.654 -2.666 4.248\n", "sqft_above^4 0.1489 1.039 0.143 0.886 -1.889 2.187\n", "sqft_above^5 0.0015 0.017 0.090 0.928 -0.031 0.034\n", "sqft_above^6 -0.0184 0.013 -1.368 0.171 -0.045 0.008\n", "sqft_above^7 -0.0146 0.014 -1.066 0.287 -0.041 0.012\n", "sqft_above^8 -0.0068 0.014 -0.496 0.620 -0.034 0.020\n", "sqft_basement^2 1.9238 0.798 2.412 0.016 0.360 3.488\n", "sqft_basement^3 
-6.6967 2.368 -2.828 0.005 -11.341 -2.053\n", "sqft_basement^4 12.2596 4.045 3.031 0.002 4.327 20.192\n", "sqft_basement^5 -8.5928 2.694 -3.190 0.001 -13.875 -3.311\n", "sqft_basement^6 0.0144 0.012 1.199 0.231 -0.009 0.038\n", "sqft_basement^7 0.0116 0.011 1.059 0.290 -0.010 0.033\n", "sqft_basement^8 -0.0086 0.011 -0.783 0.433 -0.030 0.013\n", "lat^2 2.588e+12 5.62e+11 4.601 0.000 1.49e+12 3.69e+12\n", "lat^3 -3.378e+12 1.13e+12 -2.991 0.003 -5.59e+12 -1.16e+12\n", "lat^4 1.107e+12 1.04e+12 1.067 0.286 -9.28e+11 3.14e+12\n", "lat^5 6.175e+11 2.07e+11 2.986 0.003 2.12e+11 1.02e+12\n", "lat^6 2.786e+11 4.35e+11 0.640 0.522 -5.75e+11 1.13e+12\n", "lat^7 -8.725e+11 3.57e+11 -2.447 0.014 -1.57e+12 -1.73e+11\n", "lat^8 3.168e+11 8.68e+10 3.651 0.000 1.47e+11 4.87e+11\n", "long^2 4.566e+11 5.54e+11 0.824 0.410 -6.3e+11 1.54e+12\n", "long^3 -1.043e+11 7.18e+11 -0.145 0.885 -1.51e+12 1.3e+12\n", "long^4 -7.335e+11 1.2e+12 -0.613 0.540 -3.08e+12 1.61e+12\n", "long^5 6.222e+11 1.23e+12 0.507 0.612 -1.78e+12 3.03e+12\n", "long^6 2.347e+12 1.36e+12 1.725 0.085 -3.2e+11 5.01e+12\n", "long^7 1.831e+12 1.01e+12 1.809 0.071 -1.54e+11 3.82e+12\n", "long^8 4.676e+11 2.65e+11 1.764 0.078 -5.22e+10 9.87e+11\n", "==============================================================================\n", "Omnibus: 1365.789 Durbin-Watson: 2.050\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 31897.875\n", "Skew: 2.217 Prob(JB): 0.00\n", "Kurtosis: 20.301 Cond. No. 1.47e+16\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The smallest eigenvalue is 1.77e-28. 
This might indicate that there are\n", "strong multicollinearity problems or that the design matrix is singular.\n", "\"\"\"" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_N = OLS(np.array(y_train).reshape(-1,1), sm.add_constant(X_train_df_N)).fit()\n", "model_N.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also create a model with interaction terms or any other higher-order polynomial terms of your choice. \n", "**Note:** Can you see why a function that takes in a dataframe and a degree and creates polynomial terms up to that degree is useful? This is what we have you do in your homework!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Regularization\n", "\n", "## What is Regularization and why should I care?\n", "\n", "When we have a lot of predictors, we need to worry about overfitting. Let's check this out:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import r2_score\n", "\n", "\n", "x = [1,2,3,N]\n", "models = [model_1, model_2, model_3, model_N]\n", "X_trains = [X_train_df, X_train_df_2, X_train_df_3, X_train_df_N]\n", "X_vals = [X_val_df, X_val_df_2, X_val_df_3, X_val_df_N]\n", "\n", "r2_train = []\n", "r2_val = []\n", "\n", "for i,model in enumerate(models):\n", "    y_pred_tra = model.predict(sm.add_constant(X_trains[i]))\n", "    y_pred_val = model.predict(sm.add_constant(X_vals[i]))\n", "    r2_train.append(r2_score(y_train, y_pred_tra))\n", "    r2_val.append(r2_score(y_val, y_pred_val))\n", "\n", "fig, ax = plt.subplots(figsize=(8,6))\n", "ax.plot(x, r2_train, 'o-', label=r'Training $R^2$')\n", "ax.plot(x, r2_val, 'o-', label=r'Validation $R^2$')\n", "ax.set_xlabel('Degree of polynomial')\n", "ax.set_ylabel(r'$R^2$ score')\n", "ax.set_title(r'$R^2$ score vs polynomial degree')\n", "ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We notice a big gap between the training and validation $R^2$ scores, which suggests that we are overfitting. **Introducing: regularization.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What about Multicollinearity?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's seemingly a lot of multicollinearity in the data. Take a look at this warning that we got when showing the summaries for our polynomial models: \n", "\n", "> This might indicate that there are strong multicollinearity problems or that the design matrix is singular.\n", "\n", "What is [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity)? Why do we have it in our dataset? Why is this a problem? \n", "\n", "Does regularization help solve the issue of multicollinearity? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What does Regularization help with?\n", "\n", "Our most recent models have some very large coefficient values (on the order of $10^{12}$ for the degree-8 model), and these coefficient estimates also have very high variance. 
We can also clearly see some overfitting to the training set. To reduce the magnitude of our coefficients, we can add a penalty term to the loss function that penalizes extreme coefficient values. Specifically, regularization helps us: \n", "\n", "1. Avoid overfitting by shrinking the coefficients of features that have weak predictive power.\n", "2. Discourage the use of a model that is too complex.\n", "\n", "\n", "\n", "### Big Idea: Reduce Variance by Increasing Bias\n", "\n", "Image Source: [here](https://www.cse.wustl.edu/~m.neumann/sp2016/cse517/lecturenotes/lecturenote12.html)\n", "\n", "\n", "\n", "## Ridge Regression\n", "\n", "Ridge Regression is one such form of regularization. In practice, the ridge estimator reduces the complexity of the model by shrinking the coefficients, but it doesn’t set them exactly to zero. We control the amount of regularization using a parameter $\\lambda$. **NOTE**: sklearn's [ridge regression package](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) represents this $\\lambda$ using a parameter `alpha`. In Ridge Regression, the penalty term is proportional to the squared L2-norm of the coefficients; that is, we add $\\lambda \\sum_j \\beta_j^2$ to the least-squares loss. \n", "\n", "\n", "\n", "## Lasso Regression\n", "\n", "Lasso Regression is another form of regularization. Again, we control the amount of regularization using a parameter $\\lambda$. **NOTE**: sklearn's [lasso regression package](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) represents this $\\lambda$ using a parameter `alpha`. In Lasso Regression, the penalty term is proportional to the L1-norm of the coefficients; that is, we add $\\lambda \\sum_j |\\beta_j|$ to the least-squares loss. \n", "\n", "\n", "\n", "### Some Differences between Ridge and Lasso Regression\n", "\n", "1. Lasso regression tends to produce estimates that are exactly zero for a number of model parameters (we say that Lasso solutions are **sparse**), so we can consider it a method for variable selection.\n", "2. 
In Ridge Regression, the penalty term is proportional to the squared L2-norm of the coefficients, whereas in Lasso Regression the penalty term is proportional to the L1-norm of the coefficients.\n", "3. Ridge Regression has a closed-form solution; Lasso Regression does not, so we have to solve it iteratively. In the sklearn package for Lasso regression, there is a parameter called `max_iter` that sets the maximum number of iterations the solver performs. \n", "\n", "### Why Standardizing Variables was not a waste of time\n", "\n", "Both ridge and lasso regression penalize the size of the coefficient associated with each variable. However, the size of a coefficient depends on the scale of its variable, so variables measured on different scales would be penalized unequally. It is therefore necessary to standardize the variables before fitting. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's use Ridge and Lasso to regularize our degree N polynomial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**: Play around with different values of `alpha`. Notice the new validation $R^2$ value and also the range of values that the coefficients take in the plot." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R squared score for our original OLS model: -1.8608470610311345\n", "R squared score for Ridge with alpha=100: 0.5869651490827923\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.linear_model import Ridge\n", "\n", "# some values you can try out: 0.01, 0.1, 0.5, 1, 5, 10, 20, 40, 100, 200, 500, 1000, 10000\n", "alpha = 100\n", "ridge_model = Ridge(alpha=alpha).fit(X_train_df_N, y_train)\n", "\n", "print('R squared score for our original OLS model: {}'.format(r2_val[-1]))\n", "print('R squared score for Ridge with alpha={}: {}'.format(alpha, ridge_model.score(X_val_df_N,y_val)))\n", "\n", "fig, ax = plt.subplots(figsize=(18,8), ncols=2)\n", "ax = ax.ravel()\n", "ax[0].hist(model_N.params, bins=10, alpha=0.5)\n", "ax[0].set_title('Histogram of coefficient values for Original model with N: {}'.format(N))\n", "ax[0].set_xlabel('Coefficient values')\n", "ax[0].set_ylabel('Frequency')\n", "\n", "ax[1].hist(ridge_model.coef_.flatten(), bins=20, alpha=0.5)\n", "ax[1].set_title('Histogram of coefficient values for Ridge Model with alpha: {}'.format(alpha))\n", "ax[1].set_xlabel('Coefficient values')\n", "ax[1].set_ylabel('Frequency');" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R squared score for our original OLS model: -1.8608470610311345\n", "R squared score for Lasso with alpha=0.01: 0.5975930359800542\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAABB4AAAHwCAYAAAAB0KxmAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nOzde5wlZ10n/s9MMswkk0HI0G6CEVjFfAEVwi2wXAS5aRQMKogGVoIK6griuuC6EgQR1lX5AaKoLBBgZaP8DOAiNyMBdLnLJcr1EfkFFsgYZyeoyYRJJun5/VE1pNPT3XO6Z54+fXm/X6+80udMnarvearqnOd86qmqLYcOHQoAAABAD1unXQAAAACwcQkeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8LDBVdWhqrrNvOfOr6q3jH8/r6p+/Cjz+NWqOrdnnb1U1VlV9fmq+mhV3WEVlveWqjp//PuyqrrVEtN+Q1W9q3dNCyz36+u/83IeUVVfrKoPV9VJxzCfnVX1wqr6+6r6RFX9XVW9YLF5VtVtq+r9E8z3bVV1l2Oo64h9q4eqekZVvaZXPVX1A1X10vHv76+q541/T7SdVNWDq2q2qh4+7/nfq6rnTvD6u1TVe8f95eNV9T3LfQ/A6tGv0K9YYLnrpl+xWrVOoqruMO5Pf7XAv71mJd/rc7eXJaZ5cFV9cpnlHn7t1/fvufvxWO8zVjLPOfM+6vutqvdU1WOOZTnz5ndyVV1UVZ+pqlZVj17pdGPb/N7xqm0jOnHaBTBdrbVfnWCyhyT5dO9aOvmBJO9urf3Uai+4tXbWUSa5dZKzV6OWKfnRJK9orT1/pTOoqhOTvDPJB5Kc1Vq7tqpOTvIbSf6iqh7SWrth7mtaa1ckud/R5t1a+76V1rWRtNbenOTN48N7Jzl1BbO5Pslrq+qurbX/u8zX/n6SC1trF1bV3ZO8p6p2z1+vwPqgX9GPfsWx9yvWoANJqqpu31r7YoYHO5Pcf7plLWze/r2e9+PDnpvkmtbanavqdkk+UFUfaa19edLpquqMJC9Jck6SV69m8euN4GGTG4+kfrK19sKq+rUkP5jhR8S+JOcn+aEk90ry21V1Y5J3JXlZkrOSHEry9iS/0lq7oaq+L8lvJrkxyWVJHpbkAUkenOQnk+xM8i9JHpnkD5J8W5LdSa5Ocl5rrVXVe5J8NMl9k3xjkv+e5LQkDxpf/yOttU8s8D6eneTHktyQ5O+TPDXJQ5P8hyQnVNVJrbXHz3vNDUn+W4YPip3j+3jjmBR/vd7W2ndX1U+O89o6ts1TW2ufrarbJnltktsm+eJY8+H5H0oy01r7v1X1X5I8cazvc2PbvjrJSVV1WZJ7Zvix/NtJTh7XwQWttXcsVM+cZTwlyaNaa48aH98pyaVJbjcu76eT3CLDj8n/1lr7g3lt8J4kv9dau3j+46q6c5LfGdfRCUleOv44PGWs/duSzI7r66dba7Nz5vvMJI9O8rWq+oYkv5LkReM6uTHJh5L8x9ba1VX1hfHxXcd18KY5JT42ydbW2i8efmIMH34hyceT/GBV/U2S/53kM0nuML7vv2ytnTKGFH+YYXv654xfkK2188flPibJKUlekOT/S/IdSbaN7+d9VXVmhu19V5LTM2zXj2utHcgiqurA+F4fNs77ueP7+M4kV4zra39VPTALr+9tSV6a5OFJ/inJlRn2m4xt+TvjvLZlWNfPXOxHelW9JMnVrbVnV9Xp4/If0lp7d1U9IcmjMuzDj0ny60l+JsP+8i8ZttPTq+qtGbanGzLsp59ZYFH/kGEdvnqc5/w6fiDJzywS9pyQobOcDO28aNsCa59+hX5F1na/YlFV9chxvrcY2/214/fngvWN7bpg3WM7/vxY25UZ1u/fL7DYG5O8Psnjk/zX8bkfSvK/kvynObUtOL+jbC8LtvcS7/+yJP+ptXZpVf3Y+N5u3Vr7WlW
9cnx/90nyySRfy8334yS5Xw0jTv/NOM15rbX985Zx1H7VuH0+NsO+cfskX0nyxPHAUpKcO24Pp2U4OPXksc1/Jcm5SU7KsG0/o7X2prGN3pbk++bM47AfTHJekrTW/k9V/WWSH8mwbU063U8meU+GPuZKDt5sGk612BzeXcPwvMvGD5XnzZ+gqr45yS8kuXdr7V5JLklyn9bay5J8JMOPmzdl+EG0L8MPn3sluVuSZ1TV7iR/lOQJYyL/7iTfNGcR357kweOX2zlJ/rm19u9aa2cm+ZsMX+iH3aG1dv8kT0jyW0neM9b0jiRPW6D2J43zvHdr7a4ZPuxe01r7nxl+dL5+fudgdEKSa1tr98zw4XFhVc3Mr7eqHpThy/aBrbW7jzUd/hJ7WZIPtta+PcMXwp0WqO8HMnQI/l1r7TuSXD6+3ycl+drYXrdKcnGSp4/v4YlJXldV/3aB9pvrj5M8oKpOGx8/KWPHI8mTM3zI3j3J48a6JzKONLg4yS+P7fOgDOv5vhk+fHeNdd97fMm3zH19a+23MxxFf3Fr7ZlJLsjwpXi38b+tGTpDh32ytXbnBToH90vy1/Pra60dytAResD41BlJfn3cnvbMmfTZGQLWO2XosN59kbd8nyT/z9hWr85NX/5PztDxuG+SOyb5t0m+f5F5HLY9yT+21s7O0Bl4ZYZ96y5JviHDF+buLL6+/0OSM8fpH56hs3fYi5N8dFwnd09ymyS/mMW9McO+kSTfm+Qfx3kmw1G7NxyesLX2ody0vzxrfPpbxhq/M8N6WGoY5c8nObOqnjr/H1prb15ihMnPJfkvVfXlDB2InzXaAdY8/Qr9ivXar1isvi0Zfug/cdw27pvhu+k2S9S34PNV9ZAkv5Tku1trd0tyUZI/G5exkP+R5N/PefzEJK+ZU9tS81twezlKey9mfp/hq0keOC7n+3LTNpoF9uNk2D8flqEPc0aGAGW+SftVD8qw7d4lQ+Dx0jn/titD//DOY733r6rbj8t+8Li9Pyvj51Jr7YrW2lkLhA5J8s1JvjTn8ZfH2ieerrX2a62138sQPrEEwcPm8N3jDnfW+OG40DDIryT52yQfq6oXJrmstfZnC0x3Tobk+lBr7boMX8DnJPmuJJ9urf1tkrTWXpvkX+e87u9aa/86/tvFSV5TVU+rqt/JcOTilDnTvnH8/+fH/79jzuOFksRzkrx6Tqr6O0keWlW3WKgx5vm9saa/S/KJ8X3crN4MH4h3TPL+sYP1W0luXVWnZviQe804j3/IcORmvocl+dPW2lfH6X6xtfaCedPcJ8k/jD/+0lr7VJL3ZWib+fV8XWvt6gzt9YSqOiFDYv6q1to1GY4AfX9V/XqGD+BT5r9+CWcm+dYMnabLkvxVhk7H3ZO8N8m3j0cxfjnJS8b3vpRzkvxha+1gG45g/G5u+nJLhhELi9m2yPPbMxwdS4YjPh9YYJrvy9Aes2P7vXaReX2xtXbZ+PfHctN29p+T7K2qX8pwNO22mawdD/+g/3yST7TWvjK+78vHeS+1vh+W5KLW2vXjNv0/58z3kUl+elwnH80wpPY7l6jjvUnOqKp/k6ET8fwkDx/3jQdlOAKwlA/PWbeXZc6RlPnGWn8syQuq6juOMt8kSVXtyHCk5/zW2hkZ9r+Xjz9YgLVLv2Jx+hULW0v9iiOMBzQeleSeVfWcDEeyt2Q4cr5YfYs9/70Zwqm947xfk+FH+R0WWfZHk9xYVfccv/92tdbmXoNhqfkttr0s1d6LeVOSc8ag4YFjGzw8Qwjz+dbaPy7x2iT5s9bata21GzOEdQv1GSbtV13Sbhoh8ookc6//9PrW2o2ttWszjPb5xjacpvLjSR5fVf8twwjOSbbPrbmpL5kM6/zGY5iOJQgeSJKMH9oPypCg70vy4qpaKMmev+NtzfDD8IYMO+Fcc5O/aw7/UVU/m+RVSa7NkNr+8bzXXjevtoNHKf+EBWo6cYF6FjL3yOrW3PQhcs2c509
I8kdzOlj3yHBU5qvjcucuZ6EjtTfMra+qblVHXpBq/ns4XM/hH93XZHGvyPBh+71JPtNau7yG880uyzBE7b0ZjgwsZH79hztVJ2QYfjm3Y3nfDB2xyzN0mH4jyS2TvLOqjhhef5T3N/e9LfX+3pfku6rqZp9V4+PvSnL4IpLXtYWPks/fLhf7kvjanL/ntskfJ3lKhuGLL84QSkyyXc3dhhfafo+2vhfbpk5I8tg56+Q+uflRvZsZ9+u3ZAhg7pNhWzk9wxDG948dyaXMrX3+trLQ8j6WIdz44yQ7jjLvZDi15eTW2lvG138wyafGWoF1TL/i66/Tr7ipnrXQr1hQDddV+HiGdfGxJM/M8B24ZbH6lqh7obbfksUPpCTj6J4MIx/+aN6/LTW/xbaXRdt7sQLacMrRLTKMiPxckj9P8ojx8cVL1H7YJH2GSftVi+1HCy6nqu6R4QDULTOMrvrNReY73//JEH4cdtsMoxlWOh1LEDyQJKmqu2VIJz/TWvuNDB8Gh4eN3ZCbPiz/IslTq2pLVW3P8OHxlxl+IJ5ZVXcd5/fDGYb5zf+gTIbU8jWttVclaRkS5hOOofx3JPmJ8UsjGYaa/fV45ORoDl+Z9x4ZhqcdcWXhDO/5x2o4Rz4ZUtRL5yz7KeM8bpdk/pDFZBg+/kNVdcvx8XMzDI+/IcN5olsyfFjeqarOHuf17Rl+WL/naG9g/LG2JcMRp1eMT98ryd4MPwIvyXCUIuPRi7n2jtOmhjs83PXwbDOcR/mE8d++OcP2cc+xg/fqDGn0fx7b5x5HKfMdSX62qraNocHPZdhujubiJPuTvKTGK1iP///dDJ2Kow2hfGuSJ1XV1hqu93BeFt4mF/M9SZ7XWnv9+Pg+ObZt9bCl1vfbk/x4Ve0YRwQ8bs7r/iLJf5yz/705SwQPozdmGKL5idba9RmOhvxG5pxmMcfcfX2lXpjhlI4nTDDtPyT5hqq6X5JU1bdmOMXk48dYAzBl+hX6FWu0X7GYb8vwo/WC1tqfZxgZsj1Dey5Y3xJ1vyPJjx4+zWY8dWdfhu+8xbwuw0GBx2UIz+a/18Xmt9j2smh7H6Ud3pThOiWXtNY+m+EU0cfnplFDc62kzzBpv+qhVXX41KqfyRCCLOW7knyktfaiDPvcoxeZ73z/Kze13xkZwraF7noy6XQsQfBAkmQcyvj/JvlIVX0kyU/kpnPH35zkN6rqiRm+fL8xw/DBT2T4YHtBa+2qDMOs/0dVfSzDB8sNGY4+zPfCDMPF/y7DULiPZUiMV+pVGb6EP1xVn8nwob/QuZcLuf9Y74UZLm7z1fkTtNYOJ6d/OdZ8XpIfGofl/VySu4zLfVWGowHzX/+2DF9M76uqT2S4GM6zMlyL4MMZjvAeyvCF87vjNBcleVJb+EJEC3lFhvMNDw9jvSRDEtsyXHTxdhk6A/Pb+flJHlHDbZWel/F6CuMP1HOT/NT4ni9J8uzW2vsynIt4QpJPV9VHM3wpvTRLe36GH6OXjfVsS/L0o72pcRTDIzKEDB8d6/zY+PjhExy1+o0MFyv8RIZt5J+y8Da5mF9J8qZxnbw8w5fZsWyrSZI23PlhsfX98gznTX5yXN7lc1768xmGfX4iyeFhvEc7x/adGZL5wx2yv8hw4aeFvsTfleR7qup3V/C2knx9uOqPZ7i4W5Kv37LziNM6Wmv/nOEc2d8Z2+HiJE9prX1+/rTA+qJfoV+RNdivGH1vVV0z578vZ/hOfUuSz45t/6gMFwu84xL1Lfh8a+0vMwRt76qqT2W4ZsMj25yLZc7XWvvK+D4+N277c/9tqfktuL0cpb2X8qYMgdnhPsNfJtnTWvvSAtPO3Y8nNWm/6stJ/mh8X3fIcL2YpfxxktuM0386Qz/x1KraVcNt1i+r4SKT8z0nySlju74zwzUrPp8kVfXKqvqZo03H5LYcOrScg3+wsDF1vyDJc9tw14F7ZDjafNvxi3TNqTlXh552LfRRVT+a5F9ba28
bj4i8IUOK/wdHeSkAU6RfAZtTDXe1eExr7ZHTroXjy4gHjovxAkXXJ/mbGi5i8/IMt6hak50DNo1PJnnWuE1+MsPtJF853ZIAOBr9CoCNxYgHAAAAoBsjHgAAAIBuTpx2AcuwPcPVkPfEfVMBYK4TMtwq9W8y79aBHHf6IwBwpCX7IuspeLh3hisVAwALe2CS9067iA1OfwQAFrdgX2Q9BQ97kuSrX92f2dn1d12K3btPyb5910y7jDVHuxxJmyxMuxxJmyxsM7bL1q1bcutb70zG70q6WpP9kc243a9F1sPaYD2sDdbD2rBa6+FofZH1FDzcmCSzs4fW1Bf9cqzXunvTLkfSJgvTLkfSJgvbxO1i6H9/a7Y/stbq2aysh7XBelgbrIe1YZXXw4J9EReXBAAAALoRPAAAAADdCB4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHRzYs+ZV9Wjkjwnyc4kl7TWnl5VD0vyoiQnJXl9a+2CnjUAAAAA09NtxENVfUuSP0zy6CR3TXKPqjonyYVJzk1y5yT3Hp8DAAAANqCep1r8YIYRDV9urR1M8rgk1yb5XGvt8tbaDUlel+SxHWsAAAAApqjnqRZ3THJ9Vb05ye2SvCXJp5LsmTPNniRnLGemu3efctwKXG0zM7umXcKapF2OpE0Wpl2OpE0Wpl0AANaOnsHDiUm+K8mDk1yT5M1Jvpbk0JxptiSZXc5M9+27JrOzh44+4RozM7Mre/dePe0y1hztciRtsjDtciRtsrDN2C5bt25Z18E8ALCx9Qwe/jHJO1tre5Okqt6U4bSKG+dMc1qSKzrWAAAAAExRz+DhLUleW1W3SnJ1knOSXJzkl6vqjkkuT3JehotNAgAAABtQt4tLttY+lOS3krw3yaeTfDHJHyQ5P8kbxuc+myGMAAAAADagniMe0lq7MEeOaLg0yd16LhcAAABYG3reThMAAADY5LqOeAA2rhuTXHdwWTelOSZXXnVtDqxgedu3bc0JHeoBYHNYzvfdpN9VvpuAzUbwAKzIdQdnc8kHv7Bqy9u5c3v2779u2a97xH3vkJO3GdwFwMos5/tu0u8q303AZuMTDwAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANDNidMuAACgp6p6VJLnJNmZ5JLW2tOr6mFJXpTkpCSvb61dMM0aAWAjM+IBANiwqupbkvxhkkcnuWuSe1TVOUkuTHJukjsnuff4HADQgeABANjIfjDDiIYvt9YOJnlckmuTfK61dnlr7YYkr0vy2GkWCQAbmVMtAICN7I5Jrq+qNye5XZK3JPlUkj1zptmT5IzlzHT37lOOW4HHy8zMrmmXsCFdedW12blz+8TTTzLtjh3bMnPqycdSFkdhf1gbrIe1YS2sB8EDALCRnZjku5I8OMk1Sd6c5GtJDs2ZZkuS2eXMdN++azI7e+joE66SmZld2bv36mmXsSEdODib/fuvm2janTu3TzTtgQMHra+O7A9rg/WwNqzWeti6dcuSobzgAQDYyP4xyTtba3uTpKrelOG0ihvnTHNakiumUBsAbAqCBwBgI3tLktdW1a2SXJ3knCQXJ/nlqrpjksuTnJfhYpMAQAcuLgkAbFittQ8l+a0k703y6SRfTPIHSc5P8obxuc9mCCMAgA6MeAAANrTW2oU5ckTDpUnuNoVyAGDTMeIBAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADoRvAAAAAAdCN4AAAAALoRPAAAAADdCB4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3Qg
eAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADo5sSeM6+qdyf5xiQHx6d+Osm3JrkgybYkL2mtvaxnDQAAAMD0dAseqmpLkjOT3L61dsP43Dcl+ZMk90xyXZL3V9W7W2uf7lUHAAAAMD09RzzU+P9Lqmp3klckuTrJu1prVyVJVV2c5DFJntexDgAAAGBKegYPt05yaZKnZTit4j1JXp9kz5xp9iQ5ezkz3b37lONU3uqbmdk17RLWJO1ypPXQJldedW127ty+qstcyfJ27NiWmVNP7lDN2rAetpVp0C4AAGtHt+ChtfaBJB84/LiqXpXkRUmeP2eyLUlmlzPfffuuyezsoeNS42qamdmVvXuvnnYZa452OdJ6aZMDB2ezf/91q7a8nTu3r2h5Bw4cXBftuRLrZVtZbZuxXbZu3bKug3kAYGPrdleLqnpAVT10zlNbknwhyelznjstyRW9agAAAACmq+epFrdK8ryqul+GUy2emOQJSV5XVTNJ9if54SRP6VgDAAAAMEXdRjy01t6S5K1JPp7ko0kubK29L8mzkrw7yWVJLmqtfbhXDQAAAMB09RzxkNbas5M8e95zFyW5qOdyAQAAgLWh24gHAAAAAMEDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABANydOuwAAgJ6q6t1JvjHJwfGpn07yrUkuSLItyUtaay+bUnkAsOEJHgCADauqtiQ5M8ntW2s3jM99U5I/SXLPJNcleX9Vvbu19unpVQoAG5fgAQDYyGr8/yVVtTvJK5JcneRdrbWrkqSqLk7ymCTPm06JALCxCR4AgI3s1kkuTfK0DKdVvCfJ65PsmTPNniRnL2emu3efcpzKO35mZnZNu4QN6cqrrs3Ondsnnn6SaXfs2JaZU08+lrI4CvvD2mA9rA1rYT0IHgCADau19oEkHzj8uKpeleRFSZ4/Z7ItSWaXM999+67J7Oyh41Lj8TAzsyt791497TI2pAMHZ7N//3UTTbtz5/aJpj1w4KD11ZH9YW2wHtaG1VoPW7duWTKUd1cLAGDDqqoHVNVD5zy1JckXkpw+57nTklyxmnUBwGZixAMAsJHdKsnzqup+GU61eGKSJyR5XVXNJNmf5IeTPGV6JQLAxmbEAwCwYbXW3pLkrUk+nuSjSS5srb0vybOSvDvJZUkuaq19eHpVAsDGZsQDALChtdaeneTZ8567KMlF06kIADYXIx4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADoRvAAAAAAdCN4AAAAALoRPAAAAADdCB4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANDNib0XUFUvTHKb1tr5VXVWklcmuWWSv07yM621G3rXAAAAAExH1xEPVfXQJE+c89Trkjy1tXZmki1Jntxz+QAAAMB0dQsequrUJC9I8l/Hx7dPclJr7YPjJK9J8theywcAAACmr+eIh5cneVaSr46Pb5tkz5x/35PkjI7LBwAAAKasyzUequqnknyptXZpVZ0/Pr01yaE5k21JMrvcee/efcqxFzglMzO7pl3CmqRdjrQe2uTKq67Nzp3bV3WZK1nejh3bMnPqyR2qWRv
Ww7YyDdoFAGDt6HVxycclOb2qLktyapJTMoQOp8+Z5rQkVyx3xvv2XZPZ2UNHn3CNmZnZlb17r552GWuOdjnSemmTAwdns3//dau2vJ07t69oeQcOHFwX7bkS62VbWW2bsV22bt2yroN5AGBj63KqRWvt4a2172itnZXkV5O8ubX2pCQHqur+42T/PsnbeywfAAAAWBu63tViAY9P8uKq+myGURAvXeXlAwAAAKuo16kWX9dae02GO1iktfa3Sc7uvUwAAABgbVjtEQ8AAADAJiJ4AAAAALoRPAAAAADdCB4AAACAbgQPAAAAQDfd72oBADBtVfXCJLdprZ1fVWcleWWSWyb56yQ/01q7YaoFAsAGZsQDALChVdVDkzxxzlOvS/LU1tqZSbYkefJUCgOATULwAABsWFV1apIXJPmv4+PbJzmptfbBcZLXJHnsdKoDgM3BqRYAwEb28iTPSvLN4+PbJtkz59/3JDljuTPdvfuUY6/sOJuZ2TXtEjakK6+6Njt3bp94+kmm3bFjW2ZOPflYyuIo7A9rg/WwNqyF9SB4AAA2pKr6qSRfaq1dWlXnj09vTXJozmRbkswud9779l2T2dlDR59wlczM7MrevVdPu4wN6cDB2ezff91E0+7cuX2iaQ8cOGh9dWR/WBush7VhtdbD1q1blgzlBQ8AwEb1uCSnV9VlSU5NckqG0OH0OdOcluSKKdQGAJuGazwAABtSa+3hrbXvaK2dleRXk7y5tfakJAeq6v7jZP8+ydunViQAbAKCBwBgs3l8khdX1WczjIJ46ZTrAYANzakWAMCG11p7TYY7WKS19rdJzp5mPQCwmRjxAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADoRvAAAAAAdCN4AAAAALqZKHioqqdV1S17FwMAsBj9EQBYnyYd8XDXJH9fVa+sqnv1LAgAYBH6IwCwDk0UPLTWnpzk25J8JMnvV9XfVNVPVNWOrtUBAIz0RwBgfZr4Gg+ttauT/LuzyGgAAB73SURBVGmSi5LsTvJzSVpVPapTbQAAN6M/AgDrz6TXeHhoVb0+yd8nuVOSR7fW7pnkIUle3rE+AIAk+iMAsF6dOOF0L0vy+0me0lr7l8NPttY+X1Wv6FIZAMDN6Y8AwDq0nItL7mut/UtVnVZVv1BVW5OktfacfuUBAHyd/ggArEOTBg+/l+SR49+zSR6Y5CVdKgIAWJj+CACsQ5MGD/drrf1YkrTW/inJY5N8d7eqAACOpD8CAOvQpMHDtqq6xZzHk14bAgDgeNEfAYB1aNIv7Lcm+Yuq+qMkh5KcNz4HALBa9EcAYB2aNHh4Zob7ZJ+b5IYkb4zbVgEAq0t/BADWoYmCh9bajUleOv4HALDq9EcAYH2aKHioqkdnuGr0rZNsOfx8a+2WneoCALgZ/REAWJ8mPdXiN5P8YpKPZTinEgBgtemPAMA6NGnw8M+ttTd2rQQAYGn6IwCwDk16O80PVdU5XSsBAFia/ggArEOTjnj4viRPrarrk1yf4bzKQ86pBABWkf4IAKxDkwYPD+1aBQDA0emPAMA6NNGpFq21Lya5d5InJ9mb5H7jcwAAq0J/BADWp4mCh6r65SQ/m+RHkpyU5DlV9eyehQEAzKU/AgDr06QXl/zRDOdV7m+t7Uty3yTndasKAOBI+iMAsA5NGjwcbK1dd/hBa+2fkxzsUxIAwIL0RwBgHZr04pJfqqrvT3KoqrYneUYS51QCAKtJfwQA1qFJg4enJvmjJHdNsj/JB5M8vldRAAAL0B8BgHVoouChtXZFkodW1clJTmitXd23LACAm9MfAYD1aaLgoap+cd7jJElr7UUdagIAOIL+CACsT5OeavGdc/6+RZIHJbn0+JcDALAo/REAWIc
mPdXiSXMfV9Vtk7yqS0UAAAvQHwGA9WnS22nezHiO5R2ObykAAJPTHwGA9WEl13jYkuReSf5pgtc9L8ljkhxK8qrW2ouq6mFJXpTkpCSvb61dsOyqAYBNZ6X9EQBgulZyjYdDSf5Pkmcu9YKqelCSh2S45dW2JJ+uqkuTXJjhnMwvJXlrVZ3TWnv7cgsHADadZfdHAIDpW9E1HiZ8zV9V1Xe31m6oqm8al3WrJJ9rrV2eJFX1uiSPTSJ4AACWtJL+CAAwfZOeavHuDEcWFtRae8gizx+sql9L8owkf5rktkn2zJlkT5IzJq42ye7dpyxn8jVlZmbXtEtYk7TLkdZDm1x51bXZuXP7qi5zJcvbsWNbZk49uUM1a8N62FamQbtsTCvtjwAA0zXpqRYfSXKXJP89yfVJfnx87Z8c7YWttedU1W8m+fMkZ+bmHYYtSWaXU/C+fddkdnbRPseaNTOzK3v3Xj3tMtYc7XKk9dImBw7OZv/+61ZteTt3bl/R8g4cOLgu2nMl1su2sto2Y7ts3bplXQfzy7Di/ggAMD2TBg8PSPKA1tqNSVJVf5Hkg621Nyz2gqq6U5IdrbXLWmvXVtUbM1xo8sY5k52W5IqVlQ4AbDLL7o8AANM3afAwk2RHkv3j411JjjZ2+VuS/FpVPSDDKIdzk7w8yW9X1R2TXJ7kvAwXmwQAOJqV9EcAgCmbNHi4KMkHx1ELW5L8SJLfWeoFrbW3VdXZST6eYZTDG1prf1JVe5O8IUPH4W1JLl5p8QDAprLs/ggAMH2T3tXiV6vq4xluj/m1JD/dWvurCV733CTPnffcpUnutuxKAYBNbaX9EQBgurYuY9qvJPlkkmdnuKATAMBq0x8BgHVmouChqp6U5NVJfinJNyT5X1X15J6FAQDMtdL+SFU9r6o+XVWfqqpfHJ97WFX9XVV9rqqe37dyANjcJh3x8LQk/y7Jv7bW/inJPZP8QreqAACOtOz+SFU9KMOpGXdNcq8kT6uqu2W4uPW5Se6c5N5VdU7PwgFgM5s0eLixtfavhx+01r6U5IY+JQEALGjZ/ZHxGhDf3Vq7Ick3Zri+1a2SfK61dvn4/OuSPLZf2QCwuU16V4urquqsDLfFTFU9PslV3aoCADjSivojrbWDVfVrSZ6R5E+T3DbJnjmT7ElyxnIK2b37lOVMvipmZnZNu4QN6cqrrs3Ondsnnn6SaXfs2JaZU90Jtif7w9pgPawNa2E9TBo8PD3DbS+/tar2ZLiS9LndqgIAONKK+yOttedU1W8m+fMkZ2YML0Zbkswup5B9+67J7Oyho0+4SmZmdmXv3qunXcaGdODgbPbvv26iaXfu3D7RtAcOHLS+OrI/rA3Ww9qwWuth69YtS4bykwYPJ2e4BeaZSU5I0lprB4+9PACAiS27P1JVd0qyo7V2WWvt2qp6Y5LHJLlxzmSnJbmiU80AsOlNGjz8z9banZN8pmcxAABLWEl/5FuS/FpVPSDDKIdzk7w8yW9X1R2TXJ7kvAwXmwQAOpg0ePi7qjovyXuTXHP4ydaa6zwAAKtl2f2R1trbqursJB/PMMrhDa21P6mqvUnekGRHkrdlOIUDAOhg0uDh3Bx5tedDGYY5AgCshhX1R1prz03y3HnPXZrhtA0AoLOJgofW2o7ehQAALEV/BADWp61L/WNV/fc5f9+mfzkAADenPwIA69uSwUOSe835+5KehQAALEJ/BADWsaMFD1sW+RsAYLXojwDAOna04GGuQ92qAACYjP4IAKwzR7u45NaqunWGowsnzPk7idtpAgCrQn8EANaxowUP35nk/+amL/d9c/7N7TQBgNWgPwIA69iSwUNrbTmnYgAAHHf6IwCwvvkiBwAAALoRPAAAAADdCB4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOh
G8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADoRvAAAAAAdCN4AAAAALoRPAAAAADdCB4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuTuw586p6TpIfGR++tbX2S1X1sCQvSnJSkte31i7oWQMAAAAwPd1GPIwBwyOS3D3JWUnuWVU/luTCJOcmuXOSe1fVOb1qAAAAAKar56kWe5L8p9ba9a21g0k+k+TMJJ9rrV3eWrshyeuSPLZjDQAAAMAUdTvVorX2qcN/V9W3ZTjl4nczBBKH7UlyxnLmu3v3KcelvmmYmdk17RLWJO1ypPXQJldedW127ty+qstcyfJ27NiWmVNP7lDN2rAetpVp0C4AAGtH12s8JElVfXuStyZ5ZpIbMox6OGxLktnlzG/fvmsyO3vo+BW4SmZmdmXv3qunXcaao12OtF7a5MDB2ezff92qLW/nzu0rWt6BAwfXRXuuxHrZVlbbZmyXrVu3rOtgHgDY2Lre1aKq7p/k0iS/3Fp7bZIvJzl9ziSnJbmiZw0AAADA9HQb8VBV35zkz5I8rrX2rvHpDw3/VHdMcnmS8zJcbBIAAADYgHqeavGMJDuSvKiqDj/3h0nOT/KG8d/eluTijjUAAJuYW3sDwPT1vLjk05M8fZF/vluv5QIAJEfc2vtQkneMt/b+zSQPSvKlJG+tqnNaa2+fXqUAsLF1vcYDAMAUubU3AKwB3e9qAQAwDb1u7Z2szdt7u41sH8u9ffQk0270Wz2vBfaHtcF6WBvWwnoQPAAAG9rxvrV3svZu770ZbyO7WpZz++hJb/28kW/1vBbYH9YG62FtWK31cLRbezvVAgDYsNzaGwCmz4gHAGBDcmtvAFgbBA8AwEbl1t4AsAYIHgCADcmtvQFgbXCNBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADoRvAAAAAAdCN4AAAAALoRPAAAAADdCB4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADoRvAAAAAAdCN4AAAAALoRPAAAAADdCB4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3ZzYewFVdcsk70/yyNbaF6rqYUlelOSkJK9vrV3QuwYAAABgOrqOeKiq+yR5b5Izx8cnJbkwyblJ7pzk3lV1Ts8aAAAAgOnpfarFk5P8XJIrxsdnJ/lca+3y1toNSV6X5LGdawAAAACmpOupFq21n0qSqjr81G2T7JkzyZ4kZyxnnrt3n3JcapuGmZld0y5hTdIuR1oPbXLlVddm587tq7rMlSxvx45tmTn15A7VrA3rYVuZBu3CXE77BIDp6n6Nh3m2Jjk05/GWJLPLmcG+fddkdvbQ0SdcY2ZmdmXv3qunXcaao12OtF7a5MDB2ezff92qLW/nzu0rWt6BAwfXRXuuxHrZVlbbZmyXrVu3rOtgvqfxtM9X5MjTPh+U5EtJ3lpV57TW3j69KgFgY1vtu1p8Ocnpcx6flptOwwAAON6c9gkAU7baIx4+lKSq6o5JLk9yXoa
jDgAAx12P0z6TtXnqp1OM+ljuqYWTTLvRTwNcC+wPa4P1sDashfWwqsFDa+1AVZ2f5A1JdiR5W5KLV7MGAGBTO+bTPpO1d+rnZjzFaLUs59TCSU8L3MinAa4F9oe1wXpYG1ZrPRzttM9VCR5aa3eY8/elSe62GssFAJjHaZ8AsMpW+1QLAIBpctonAKyy1b64JADA1LTWDiQ5P8Npn59O8tk47RMAujLiAQDY8Jz2CQDTY8QDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADoRvAAAAAAdCN4AAAAALoRPAAAAADdCB4AAACAbgQPAAAAQDeCBwAAAKAbwQMAAADQjeABAAAA6EbwAAAAAHRz4rQLAOhp69Ytufbg7LTLWNL2bVtzwrSLAACATgQPwIZ2/Q2zedeHvzjtMpb0iPveISdvMwANAICNSU8XAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0I3gAQAAAOhG8AAAAAB0I3gAAAAAuhE8AAAAAN0IHgAAAIBuBA8AAABAN4IHAAAAoBvBAwAAANCN4AEAAADoRvAAAAAAdCN4AAAAALo5cdoFAEe6Mcl1B2enXcaSDk27AAAAYF0QPMAadN3B2VzywS9Mu4wlPeTs20+7BAAAYB1wqgUAAADQjeABAAAA6EbwAAAAAHQjeAAAAAC6ETwAAAAA3birBQAbxo1Jrrzq2hxY47ej3b5ta06YdhEAAKtE8ADAhnHdwdm872Nfyf791027lCU94r53yMnbDDoEjp8bM3wGHk/rISRd6H0fawC9Ht43rDeCBwAAWOeuOzibSz74heM6z/UQki70vnfu3H5MAfR6eN+w3tijAAAAgG4EDwAAAEA3ggcAAACgG8EDAAAA0M2mv7hkjysAL+RYrq7ryrqwsW3duiXXruDzYbVvG3mLbVtz/Rq/TeWhaRcAAMARNn3w0OMKwAs5lqvrurIubGzX3zCbd334i8t+3bFetXu5HnL27VdU52p6yNm3n3YJsCLHeiBkfhDZ46BFj4M1PQLN4z3PHoHmSgPnpQheYWNb6WfwUgeqVvMA96YPHgAApu1YD4TMDyJ7HLTocbCmR6B5vOfZI9BcaeC8FMErbGwr/Qxe6kDVah7gdhgdAAAA6GYqIx6q6rwkFyTZluQlrbWXTaMOAGDz0h8BgNWx6iMequqbkrwgyQOSnJXkKVV1l9WuAwDYvPRHAGD1TGPEw8OSvKu1dlWSVNXFSR6T5HlHed0JyXAxnuPphBO25JSTtx3XeS7k5B3bsuXQyi4idMIJW477+15LNvJ7W6nV2i6PxYknbF3VGle6D612nSux0hqP5XNlJdZLW+5c5XZZieP9uT5nXm6CNLkN1R+Z/3nQo+/Q47upx+fK8Z7ncuY36efyenjfyfrogy60XR7r9+N6eN/rhXY8flb6GbzU/nA8t/Wj9UW2HDq0utfArar/kmRna+2C8fFPJTm7tfaUo7z0AUn+d+/6AGAde2CS9067iPVAfwQAuliwLzKNEQ9bc/M7/mxJMkkk+TcZ3sSeDHcTAQAGJyQ5PcN3JZPRHwGA42fJvsg0gocvZ/jCPuy0JFdM8Lrr4igOACzm89MuYJ3RHwGA42vRvsg0god3JnluVc0k2Z/kh5McbVgjAMDxpD8CAKtk1e9q0Vr7SpJnJXl3ksuSXNRa+/Bq1wEAbF76IwCwelb94pIAAADA5rHqIx4AAACAzUPwAAAAAHQjeAAAAAC6ETwAAAAA3QgeAAAAgG5OnHYBm0VVPTDJS5LcIsnlSZ7YWvvqdKuavqq6f5IXZ2iXfUl+orX2xelWtXZU1a8nubG19txp1zItVXVekguSbEvyktbay6Zc0ppRVbdM8v4kj2ytfWHK5UxdVT0nyY+MD9/aWvuladYDvVTV7ZL/v70
7DZKrKsM4/g8BFEQRFYEgiAp5wLgkREHZl5QfCBhFERFBQDYNImrcQbEKEUVB3AATgUhcQUFQZDPiAkYrAkZZHhPLja1Ko4KoSDDxwzkTmmFmenC6b7eZ51dF0ff26dvv9M3t8/bZLguApwMGDrZ9/6AymwHnA5sCK4E5thc2Heuaql3dJGkqMA94EvBD4FjbDzUe6BpuFOdhFvAhYAIl/z48+XfnjTZXkzQT+IztZzUZ33gxiutBwLnARsA9wGubvB4y4qE55wOH2H4+cCvwzh7H0y++BBxpe2p9/Kkex9MXJG0o6QvAO3odSy9J2hz4MLALMBU4WtJzextVf5C0I/BjYHKvY+kHkmYALwOmUf6tTJf0yt5GFdE1nwM+Z3tbYDFw0hBlTgcur/XrQcCXJU1sMMY11ijrpgXAcbYnU370HtVslGu+duehNs6fDcy0/UJgCXByD0Jdo402V5O0CfBxyvUQHTaK62ECcBlwWr0ebgLe02SMaXhozna2b5W0DrA5MO5bWyU9DjjR9pK6awmwZQ9D6iezgKXAJ3odSI/NABba/ovtfwAXA6/ucUz94ihgNnBXrwPpE3cD77D9oO0VwG3k+yTWQDWP2I3yfQhwAXDAEEUvAb5cHy8DHg9s0O34xokR6yZJzwTWs72o7rqAoc9RjE27HGEdYLbtO+t28szuGG2uNo8y+iS6o9152B74h+0r6/apQKOjiDPVoiG2V0h6PnAtsAJ4X49D6jnb/6b0CCBpLUor9KW9jKlf2P4igKSTexxKr02i/KAccDewQ49i6Su2jwQoo+bC9i0DjyVtQ5lysXPvIoromqcB97UM278beMbgQra/0bI5B7jJ9r0NxDcetKubhnr+UecoxmzE82B7OaUBDknrUXp3P91kgONE21xN0vHAjcAiolvanYetgXvqiOpplA6atzQXXhoeOk7SAZQ1C1rdbnuG7V8Cm0g6BvgasFPjAfbISJ+LpHWB+ZR/j6c2HlwPjfS59CKePrQWsKplewJlrnLEkCRNAb4DvNP20l7HEzEWw9QRS3nk9yKM8L0o6QTgGGD3zkY3rrWrm1J3NWNUn7OkDSkNEL+wPb+h2MaTEc+DpOcBrwL2Jg1w3dTuelgb2APYzfbiuo7cGcBhTQWYhocOs30RcFHrPkmPl/QK2wO9+QsYZ0Poh/pcACRtQJlvtByYVYdIjxvDfS6x2h3Ari3bm5KpBTGMuljtN4ATbH+11/FEjNUwOcU6wHJJE23/B9iMYb4XJX0MmElJNO/odrzjSLu66Q7KeRnu+eiMtjlCXWT1KmAh8LbmQhtX2p2HAyjXw2LKYvKTJP3IdutrYuzanYd7gKW2F9ftr/DwlL1GZI2HZqwAPitpet1+DWVRuCiNMMuAA+vUi4hW1wJ7S9pY0vqUFvMr27wmxiFJW1Cmar0ujQ6xJqsN9D8CDqy7DgW+O7hcHemwJ7BzGh06bsS6qd6d64HaGApwCEOcoxizEc9DXUz1cuDrtk+wPXikUHRGu+vhg7Yn14Vu9wHuSqNDV7TLmW8ANpb0wrq9H/DzJgPMiIcG2P6PpAOBz9cvwTuBI3scVs9JmkZZRPFW4MY6V/0u2/v0NLDoG7bvlPR+4PuUVvJ5tn/W47CiP82hLJ53Rsu6F+fYPqd3IUV0zZuB+ZJOBP5AuWsFko6lzPP9YP3vPuC6lmtiH9vpeR+j4eomSVcAH6g9igcDc+udFW4kd+3quHbnAdiCsqDe2pIGFtlbPLBGUnTGKK+H6LLRnId6t6+5kp5AGSFxSJMxTli1Ko1/EREREREREdEdmWoREREREREREV2ThoeIiIiIiIiI6Jo0PERERERERERE16ThISIiIiIiIiK6Jne1iIiIaEhd4f4GYF/bvxtF+S8CC21fULd3Bs6krFi9HDii3rovIiIiom+l4SGij0jaCvgN8MuW3ROAs2yfN8Zjfxu42PYFkm4G9rD9t2HKbghcYnuvsbzn/xDjYcCrbe/
b5PtGNEHSjsBcYPIoyk4CzgX2Bha2PPUl4OW2l0g6gnKLvlldCDcixrHkI8lHIjotDQ8R/edftqcObEjaHPiVpMW2l3TiDVqPP4yNgB068V4RsdpRwGzgwoEdkg4FTqBMffw5MNv2A8DBwLcooxoGyj4OOLHle2AJ8JZmQo+IcSj5SER0TBoeIvqc7TslLQUmS9oeeCPwBOBe23tKeiPwZsoPl+XAcbZvrz2m84FJwO+Bpw8cU9IqYGPbf5b0XuANwEPAUuAw4HxgvdoTMR3YCTgdWB94kPLj58raI/CIeFre42hgP9v71e1tge8BW9b3O4YyXPwpwGm2z279uyVdB3zG9sWDtyVtB5wFPBWYCHzK9nmSNqixbwOspPyQO8b2yv/pw4/oINtHAkii/n8KpTFiJ9sPSPoIMAc4xfbptcwuLa//N7Cg7l8LOBm4tME/ISLGseQjyUcixiKLS0b0OUkvBbYGflp3TaEMS9xT0u6USnNX29OAjwGX1HKfBRbZngIcD2w7xLFfTqnYX2r7ecBvgeOAw3m4p+PJwMXAW22/oL7fAknPGhzPoMN/BdhF0qZ1+3BqAkH5sbVPjfnAGvdoP4+1azzvsT0d2B2YI+klwCuBJ9a4X1xf8uzRHjuiYXtSktJFNamexRDX6WCS1qVMuVgbOLWrEUZEVMlHHhVz8pGIxyAjHiL6z0DLPpRr9M/Awbb/WHtKl9i+rz4/k5IE3DDQiwpsJOkpwAxK7ym2l0lqnSc+YAZwke2/1nJvh9VzOwfsCCyz/dNa5hZJ1wN7AKsGxbOa7b9L+ibweklnUoaO72r7fkn7AjMlbQNMBTZ4DJ/PZOA5wHktf/N6wDTgSuDU2htxDfBJ28sew7EjmjQR+Lrt4wFqD9mI9XItcxmlN3GW7RVdjzIixqvkIyNLPhLxGKThIaL/PGJO5RDub3k8EbjQ9rth9fDrScBfKZXwhJayDw1xrIdqOerrn0zpUWg1sbVMtRawDmWY4/0Mby7weeA24Dbbv5X0DOAndf+PKb0FQy3eNDj+dVviuXfQvNNN6r4HJG1NSUL2Aq6VdLTty0eIMaJXrqP0jp0C/Ak4m7KY28kjvGYBsAw4NkN2I6LLko8UyUciOiBTLSL+v10FHCRps7p9LGXeIpTW9qMBJG1JGdY92LXA/vUWf1B+8LydkgBMlDSBUilvK2mHeqwpwG6UH00jsr2IUll/gFLpA7yI8iPrFOBqaiUvaeKgl/+plkXSc4EXDBwW+Jek19fntgB+BUyX9CbK8Mmra/JzFbB9uzgjesH2L4APUe5acQsliT1tuPKSplGmY+wM3CjpZklXNBFrREQbyUeSj0SMKCMeIv6P2b5a0keBayStBO4D9re9StJs4HxJtwF3ADcP8foraiV6fR0meAtlvuM/gZ/V7V2BA4BPS1qfskjS4bZ/LWmnUYQ5FziJhxfBuxo4glJhrwR+QKnUtx70ulOA+ZJmArcDP6wxPyhpFnCWpHdRejpOsn19HRK6B3CrpH8Cf6DcbjCib9jequXxPGDeCGUPa3l8E4/sdYuI6AvJR5KPRLQzYdWqwSOWIiIiIiIiIiI6I1MtIiIiIiIiIqJr0vAQEREREREREV2ThoeIiIiIiIiI6Jo0PERERERERERE16ThISIiIiIiIiK6Jg0PEREREREREdE1aXiIiIiIiIiIiK75L6uKUhivTWngAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.linear_model import Lasso\n", "\n", "# some values you can try out: 0.00001, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 20\n", "alpha = 0.01\n", "lasso_model = Lasso(alpha=alpha, max_iter = 1000).fit(X_train_df_N, y_train)\n", "\n", "print('R squared score for our original OLS model: {}'.format(r2_val[-1]))\n", "print('R squared score for Lasso with alpha={}: {}'.format(alpha, lasso_model.score(X_val_df_N,y_val)))\n", "\n", "fig, ax = plt.subplots(figsize=(18,8), ncols=2)\n", "ax = ax.ravel()\n", "ax[0].hist(model_N.params, bins=10, alpha=0.5)\n", "ax[0].set_title('Histogram of predictor values for Original model with N: {}'.format(N))\n", "ax[0].set_xlabel('Predictor values')\n", "ax[0].set_ylabel('Frequency')\n", "\n", "ax[1].hist(lasso_model.coef_.flatten(), bins=20, alpha=0.5)\n", "ax[1].set_title('Histogram of predictor values for Lasso Model with alpha: {}'.format(alpha))\n", "ax[1].set_xlabel('Predictor values')\n", "ax[1].set_ylabel('Frequency');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Selection and Cross-Validation\n", "\n", "Here's our current setup so far: \n", "\n", "\n", "\n", "So we try out 10,000 different models on our validation set and pick the one that's the best? No! **Since we could also be overfitting the validation set!** \n", "\n", "One solution to the problems raised by using a single validation set is to evaluate each model on multiple validation sets and average the validation performance. 
This is the essence of cross-validation!\n", "\n", "\n", "\n", "Image source: [here](https://medium.com/@sebastiannorena/some-model-tuning-methods-bfef3e6544f0)\n", "\n", "Let's give this a try using [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html) and [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html):" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R^2 score for our original OLS model: -1.8608470610311345\n", "\n", "Best alpha for ridge: 1000.0\n", "R^2 score for Ridge with alpha=1000.0: 0.5779474940635888\n", "\n", "Best alpha for lasso: 0.01\n", "R squared score for Lasso with alpha=0.01: 0.5975930359800542\n" ] } ], "source": [ "from sklearn.linear_model import RidgeCV\n", "from sklearn.linear_model import LassoCV\n", "\n", "# Candidate regularization strengths to cross-validate over\n", "alphas = (0.001, 0.01, 0.1, 10, 100, 1000, 10000)\n", "\n", "# Cross-validate over the candidate alphas. Note that k is defined but never passed to the\n", "# estimators, so RidgeCV and LassoCV each use their default cross-validation scheme here;\n", "# pass cv=k to both to run k-fold cross-validation explicitly.\n", "k = 4\n", "fitted_ridge = RidgeCV(alphas=alphas).fit(X_train_df_N, y_train)\n", "fitted_lasso = LassoCV(alphas=alphas).fit(X_train_df_N, y_train)\n", "\n", "print('R^2 score for our original OLS model: {}\\n'.format(r2_val[-1]))\n", "\n", "# alpha_ holds the alpha selected by cross-validation; scores are computed on the validation set\n", "ridge_a = fitted_ridge.alpha_\n", "print('Best alpha for ridge: {}'.format(ridge_a))\n", "print('R^2 score for Ridge with alpha={}: {}\\n'.format(ridge_a, fitted_ridge.score(X_val_df_N,y_val)))\n", "\n", "lasso_a = fitted_lasso.alpha_\n", "print('Best alpha for lasso: {}'.format(lasso_a))\n", "print('R squared score for Lasso with alpha={}: {}'.format(lasso_a, fitted_lasso.score(X_val_df_N,y_val)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at the coefficients of our CV models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Final Step:** report the score on the test set for the model you have chosen to be the best." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------\n", "### End of Standard Section\n", "---------------" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }