{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Title\n", "\n", "**Exercise: 1 - Bias-Variance Tradeoff**\n", "\n", "# Description\n", "\n", "The aim of this exercise is to understand the **bias-variance tradeoff**. To do so, you will fit polynomial regression models of different degrees to the same data and plot them as shown below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instructions:\n", "- Read the file noisypopulation.csv into a pandas dataframe.\n", "- Assign the response and predictor variables appropriately.\n", "- Perform the bootstrap operation on the dataset.\n", "- For each bootstrap:\n", "    - For the chosen degree value:\n", "        - Compute the polynomial features\n", "        - Fit the model on the given data\n", "        - Select a set of random points in the data to predict with the model\n", "        - Store the predicted values as a list\n", "- Plot the predicted values along with the random data points and the true function, as shown above.\n", "\n", "\n", "# Hints:\n", "\n", "sklearn.preprocessing.PolynomialFeatures() : Generates polynomial and interaction features\n", "\n", "sklearn.linear_model.LinearRegression() : Fits a linear model\n", "\n", "LinearRegression.fit() : Fits the linear model to the training data\n", "\n", "LinearRegression.predict() : Predicts using the linear model\n", "\n", "Note: This exercise is **auto-graded, and you can make multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "#Importing libraries\n", "%matplotlib inline\n", "import numpy as np\n", "import scipy as sp\n", "import matplotlib as mpl\n", "import matplotlib.cm as cm\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import PolynomialFeatures" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { 
"collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "#Helper function to set plot characteristics\n", "\n", "def make_plot():\n", " fig, axes=plt.subplots(figsize=(20,8), nrows=1, ncols=2);\n", " axes[0].set_ylabel(\"$p_R$\", fontsize=18)\n", " axes[0].set_xlabel(\"$x$\", fontsize=18)\n", " axes[1].set_xlabel(\"$x$\", fontsize=18)\n", " axes[1].set_yticklabels([])\n", " axes[0].set_ylim([0,1])\n", " axes[1].set_ylim([0,1])\n", " axes[0].set_xlim([0,1])\n", " axes[1].set_xlim([0,1])\n", " plt.tight_layout();\n", " return axes" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>f</th><th>x</th><th>y</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>0.047790</td><td>0.00</td><td>0.011307</td></tr>\n", "<tr><th>1</th><td>0.051199</td><td>0.01</td><td>0.010000</td></tr>\n", "<tr><th>2</th><td>0.054799</td><td>0.02</td><td>0.007237</td></tr>\n", "<tr><th>3</th><td>0.058596</td><td>0.03</td><td>0.000056</td></tr>\n", "<tr><th>4</th><td>0.062597</td><td>0.04</td><td>0.010000</td></tr>\n", "</tbody>\n", "</table>\n", "
" ], "text/plain": [ " f x y\n", "0 0.047790 0.00 0.011307\n", "1 0.051199 0.01 0.010000\n", "2 0.054799 0.02 0.007237\n", "3 0.058596 0.03 0.000056\n", "4 0.062597 0.04 0.010000" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reading the file into a dataframe\n", "\n", "df = pd.read_csv(\"noisypopulation.csv\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "### edTest(test_data) ###\n", "\n", "# Set the predictor and response variable\n", "# Column x is the predictor and column y is the response variable.\n", "# Column f is the true function of the given data\n", "# Select the values of the columns\n", "\n", "x = df.___\n", "y = df.___\n", "f = df.___" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "### edTest(test_poly) ###\n", "# Function to compute the Polynomial Features for the data x for the given degree d\n", "def polyshape(d, x):\n", " return PolynomialFeatures(___).fit_transform(___.reshape(-1,1))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "### edTest(test_linear) ###\n", "#Function to fit a Linear Regression model \n", "def make_predict_with_model(x, y, x_pred):\n", " \n", " #Create a Linear Regression model with fit_intercept as False\n", " lreg = ___\n", " \n", " #Fit the model to the data x and y\n", " lreg.fit(___, ___)\n", " \n", " #Predict on the x_pred data\n", " y_pred = lreg.predict(___)\n", " return lreg, y_pred" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "#Function to perform bootstrap and fit the data\n", "\n", "# degree is the maximum degree of the 
model\n", "# num_boot is the number of bootstraps\n", "# size is the number of random points selected from the data for each bootstrap\n", "# x is the predictor variable\n", "# y is the response variable\n", "\n", "def gen(degree, num_boot, size, x, y):\n", "    \n", "    #Create 2 lists to store the predictions and the models\n", "    predicted_values, linear_models = [], []\n", "    \n", "    #Run the loop once per bootstrap\n", "    for i in range(num_boot):\n", "        \n", "        #Helper code: randomly select `size` points (sorted indexes, sampled without replacement) for the current bootstrap\n", "        indexes=np.sort(np.random.choice(x.shape[0], size=size, replace=False))\n", "        \n", "        #lreg and y_pred hold the model and predicted values of the current bootstrap\n", "        lreg, y_pred = make_predict_with_model(polyshape(degree, x[indexes]), y[indexes], polyshape(degree, x))\n", "        \n", "        #Append the model and predicted values into the appropriate lists\n", "        predicted_values.append(___)\n", "        linear_models.append(___)\n", "    \n", "    #Return the 2 lists\n", "    return predicted_values, linear_models" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "### edTest(test_gen) ###\n", "# Call the function gen twice with x and y as the predictor and response variable respectively\n", "# The number of bootstraps should be 200 and the number of samples from the dataset should be 30\n", "# Store the return values in appropriate variables\n", "\n", "# Get results for degree 1\n", "predicted_1, model_1 = gen(___);\n", "\n", "# Get results for degree 100\n", "predicted_100, model_100 = gen(___);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Helper code to plot the data\n", "\n", "indexes=np.sort(np.random.choice(x.shape[0], size=30, replace=False))\n", "\n", "plt.figure(figsize = (12,8))\n", "axes=make_plot()\n", "\n", "#Plot for Degree 1\n", "axes[0].plot(x,f,label=\"f\", color='darkblue',linewidth=4)\n", "axes[0].plot(x, y, '.', label=\"Population 
y\", color='#009193',markersize=8)\n", "axes[0].plot(x[indexes], y[indexes], 's', color='black', label=\"Data y\")\n", "\n", "for i,p in enumerate(predicted_1[:-1]):\n", "    axes[0].plot(x,p,alpha=0.03,color='#FF9300')\n", "axes[0].plot(x, predicted_1[-1], alpha=0.3,color='#FF9300',label=\"Degree 1 from different samples\")\n", "\n", "\n", "#Plot for Degree 100\n", "axes[1].plot(x,f,label=\"f\", color='darkblue',linewidth=4)\n", "axes[1].plot(x, y, '.', label=\"Population y\", color='#009193',markersize=8)\n", "axes[1].plot(x[indexes], y[indexes], 's', color='black', label=\"Data y\")\n", "\n", "\n", "for i,p in enumerate(predicted_100[:-1]):\n", "    axes[1].plot(x,p,alpha=0.03,color='#FF9300')\n", "axes[1].plot(x,predicted_100[-1],alpha=0.2,color='#FF9300',label=\"Degree 100 from different samples\")\n", "\n", "axes[0].legend(loc='best')\n", "axes[1].legend(loc='best')\n", "\n", "#edgecolor='black', linewidth=3, facecolor='green',\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### After you mark the exercise, run the code again, but this time with degree 10 instead of 100. Do you see a decrease in variance? Why are the edges still so erratic?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also check the values of the coefficients for each of your runs. Do you see a pattern?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your answer here" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }