{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"Exercise: Bias Variance Tradeoff\n",
" \n",
"## Description :\n",
"The aim of this exercise is to understand **bias variance tradeoff**. For this, you will fit a polynomial regression model with different degrees on the same data and plot them as given below.\n",
"\n",
"
\n",
"\n",
"## Data Description:\n",
"\n",
"## Instructions:\n",
"\n",
"- Read the file `noisypopulation.csv` as a Pandas dataframe.\n",
"- Assign the response and predictor variables appropriately as mentioned in the scaffold.\n",
"- Perform sampling on the dataset to get a subset.\n",
"- For each sampled version fo the dataset:\n",
" - For degree of the chosen degree value:\n",
" - Compute the polynomial features for the training\n",
" - Fit the model on the given data\n",
" - Select a set of random points in the data to predict the model\n",
" - Store the predicted values as a list\n",
"- Plot the predicted values along with the random data points and true function as given above.\n",
"\n",
"\n",
"## Hints: \n",
"\n",
"FUNCTION SIGNATURE:\n",
"gen(degree, number of samples, number of points, x, y)\n",
"\n",
"sklearn.PolynomialFeatures()\n",
"Generates polynomial and interaction features\n",
"\n",
"sklearn.LinearRegression()\n",
"LinearRegression fits a linear model\n",
"\n",
"sklearn.fit()\n",
"Fits the linear model to the training data\n",
"\n",
"sklearn.predict()\n",
"Predict using the linear model.\n",
"\n",
"Note: This exercise is **auto-graded and you can try multiple attempts.**"
]
},
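{
"cell_type": "markdown",
"metadata": {},
"source": [
"The hints above name `PolynomialFeatures` and `LinearRegression`. As a minimal sketch of how the two fit together, the snippet below assumes a made-up, noise-free quadratic toy dataset; `x_toy` and `y_toy` are hypothetical names and are not part of the exercise data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# Hypothetical toy data (an assumption for illustration): y = 1 + 2x + 3x^2, no noise\n",
"x_toy = np.linspace(0, 1, 5)\n",
"y_toy = 1 + 2 * x_toy + 3 * x_toy ** 2\n",
"\n",
"# Degree-2 features: the columns are [1, x, x^2]\n",
"features = PolynomialFeatures(degree=2).fit_transform(x_toy.reshape(-1, 1))\n",
"\n",
"# The constant column already plays the role of the intercept,\n",
"# so the model is created with fit_intercept=False\n",
"model = LinearRegression(fit_intercept=False).fit(features, y_toy)\n",
"y_hat = model.predict(features)\n",
"```\n",
"\n",
"Because the feature matrix already contains a column of ones, leaving `fit_intercept` at its default would add a second, redundant constant term."
]
},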
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Import necessary libraries\n",
"%matplotlib inline\n",
"import scipy as sp\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib as mpl\n",
"import matplotlib.cm as cm\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.preprocessing import PolynomialFeatures\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Helper function to define plot characteristics\n",
"def make_plot():\n",
" fig, axes=plt.subplots(figsize=(20,8), nrows=1, ncols=2);\n",
" axes[0].set_ylabel(\"$p_R$\", fontsize=18)\n",
" axes[0].set_xlabel(\"$x$\", fontsize=18)\n",
" axes[1].set_xlabel(\"$x$\", fontsize=18)\n",
" axes[1].set_yticklabels([])\n",
" axes[0].set_ylim([0,1])\n",
" axes[1].set_ylim([0,1])\n",
" axes[0].set_xlim([0,1])\n",
" axes[1].set_xlim([0,1])\n",
" plt.tight_layout();\n",
" return axes\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Reading the file into a dataframe\n",
"df = pd.read_csv(\"noisypopulation.csv\")\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"###edTest(get_data)###\n",
"\n",
"# Set column x is the predictor and column y is the response variable.\n",
"# Column f is the true function of the given data\n",
"# Select the values of the columns\n",
"\n",
"x = df.___\n",
"f = df.___\n",
"y = df.___\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Function to compute the Polynomial Features for the data x \n",
"# for the given degree d\n",
"def polyshape(d, x):\n",
" return PolynomialFeatures(___).fit_transform(___.reshape(-1,1))\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Function to fit a Linear Regression model \n",
"def make_predict_with_model(x, y, x_pred):\n",
" \n",
" # Create a Linear Regression model with fit_intercept as False\n",
" lreg = ___\n",
" \n",
" # Fit the model to the data x and y got parameters to the function\n",
" lreg.fit(___, ___)\n",
" \n",
" # Predict on the x_pred data got as a parameter to this function\n",
" y_pred = lreg.predict(___)\n",
"\n",
" # Return the linear model and the prediction on the test data\n",
" return lreg, y_pred\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Function to perform sampling and fit the data, with the following parameters\n",
"\n",
"# degree is the maximum degree of the model\n",
"# num_sample is the number of samples\n",
"# size is the number of random points selected from the data for each sample\n",
"# x is the predictor variable\n",
"# y is the response variable\n",
"\n",
"def gen(degree, num_sample, size, x, y):\n",
" \n",
" # Create 2 lists to store the prediction and model\n",
" predicted_values, linear_models =[], []\n",
" \n",
" # Loop over the number of samples\n",
" for i in range(num_sample):\n",
" \n",
" # Helper code to call the make_predict_with_model function to fit on the data\n",
" indexes=np.sort(np.random.choice(x.shape[0], size=size, replace=False))\n",
" \n",
" # lreg and y_pred hold the model and predicted values for the current sample\n",
" lreg, y_pred = make_predict_with_model(polyshape(degree, x[indexes]), y[indexes], polyshape(degree, x))\n",
" \n",
" # Append the model and predicted values to the appropriate lists\n",
" predicted_values.append(___)\n",
" linear_models.append(___)\n",
" \n",
" # Return the 2 lists, one for predicted values and one for the model\n",
" return predicted_values, linear_models\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# Call the function gen() twice with x and y as the \n",
"# predictor and response variable respectively\n",
"\n",
"# Set the number of samples to 200 and the number of points as 30\n",
"# Store the return values in appropriate variables\n",
"\n",
"# Get results for degree 1\n",
"predicted_1, model_1 = gen(___);\n",
"\n",
"# Get results for degree 100\n",
"predicted_100, model_100 = gen(___);\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to plot the data\n",
"indexes = np.sort(np.random.choice(x.shape[0], size=30, replace=False))\n",
"\n",
"plt.figure(figsize = (12,8))\n",
"axes=make_plot()\n",
"\n",
"# Plot for Degree 1\n",
"axes[0].plot(x,f,label=\"f\", color='darkblue',linewidth=4)\n",
"axes[0].plot(x, y, '.', label=\"Population y\", color='#009193',markersize=8)\n",
"axes[0].plot(x[indexes], y[indexes], 's', color='black', label=\"Data y\")\n",
"\n",
"for i,p in enumerate(predicted_1[:-1]):\n",
" axes[0].plot(x,p,alpha=0.03,color='#FF9300')\n",
"axes[0].plot(x, predicted_1[-1], alpha=0.3,color='#FF9300',label=\"Degree 1 from different samples\")\n",
"\n",
"\n",
"# Plot for Degree 100\n",
"axes[1].plot(x,f,label=\"f\", color='darkblue',linewidth=4)\n",
"axes[1].plot(x, y, '.', label=\"Population y\", color='#009193',markersize=8)\n",
"axes[1].plot(x[indexes], y[indexes], 's', color='black', label=\"Data y\")\n",
"\n",
"\n",
"for i,p in enumerate(predicted_100[:-1]):\n",
" axes[1].plot(x,p,alpha=0.03,color='#FF9300')\n",
"axes[1].plot(x,predicted_100[-1],alpha=0.2,color='#FF9300',label=\"Degree 100 from different samples\")\n",
"\n",
"axes[0].legend(loc='best')\n",
"axes[1].legend(loc='best')\n",
"\n",
"plt.show();\n"
]
},
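{
"cell_type": "markdown",
"metadata": {},
"source": [
"The spread of the fitted curves in the plots above can also be summarized numerically. The following is a minimal sketch, not part of the graded exercise: it assumes a synthetic population (`x_pop`, `f_pop`, `y_pop`) and uses NumPy's polynomial fitting in place of the scikit-learn pipeline. All names, the noise level and the degrees are hypothetical choices made for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from numpy.polynomial import polynomial as P\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Hypothetical population (an assumption): a smooth true function plus noise\n",
"x_pop = np.linspace(0, 1, 200)\n",
"f_pop = 0.5 + 0.2 * np.sin(2 * np.pi * x_pop)\n",
"y_pop = f_pop + rng.normal(0, 0.1, size=x_pop.shape)\n",
"\n",
"def fit_many(degree, num_sample=200, size=30):\n",
"    # Fit one polynomial per random subsample and predict on the full grid\n",
"    preds = np.empty((num_sample, x_pop.size))\n",
"    for i in range(num_sample):\n",
"        idx = rng.choice(x_pop.size, size=size, replace=False)\n",
"        coefs = P.polyfit(x_pop[idx], y_pop[idx], degree)\n",
"        preds[i] = P.polyval(x_pop, coefs)\n",
"    return preds\n",
"\n",
"def bias2_and_variance(preds):\n",
"    # Squared bias: gap between the average fit and the true function\n",
"    bias2 = np.mean((preds.mean(axis=0) - f_pop) ** 2)\n",
"    # Variance: spread of the individual fits around their average\n",
"    variance = np.mean(preds.var(axis=0))\n",
"    return bias2, variance\n",
"\n",
"bias2_low, var_low = bias2_and_variance(fit_many(degree=1))\n",
"bias2_high, var_high = bias2_and_variance(fit_many(degree=5))\n",
"```\n",
"\n",
"Comparing `var_low` with `var_high`, and `bias2_low` with `bias2_high`, gives a numeric counterpart to the two panels of the plot."
]
},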
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ⏸ Does changing the degree from 100 to 10 reduce variance? Why or why not?\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_chow1) ###\n",
"# Submit an answer choice as a string below \n",
"answer1 = '___'\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}