{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Bias Variance Tradeoff\n", " \n", "## Description :\n", "The aim of this exercise is to understand the **bias-variance tradeoff**. For this, you will fit polynomial regression models of different degrees to repeated samples of the same data and plot them as given below.\n", "\n", "\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "\n", "- Read the file `noisypopulation.csv` as a Pandas dataframe.\n", "- Assign the response and predictor variables appropriately as mentioned in the scaffold.\n", "- Perform sampling on the dataset to get a subset.\n", "- For each sampled version of the dataset:\n", " - For the chosen degree value:\n", " - Select a random set of points from the data\n", " - Compute the polynomial features for the selected points\n", " - Fit the model on the selected points\n", " - Predict on all values of x with the fitted model\n", " - Store the predicted values in a list\n", "- Plot the predicted values along with the random data points and true function as given above.\n", "\n", "\n", "## Hints: \n", "\n", "FUNCTION SIGNATURE:\n", "gen(degree, number of samples, number of points, x, y)\n", "\n", "sklearn.preprocessing.PolynomialFeatures()\n", "Generates polynomial and interaction features\n", "\n", "sklearn.linear_model.LinearRegression()\n", "LinearRegression fits a linear model\n", "\n", "LinearRegression.fit()\n", "Fits the linear model to the training data\n", "\n", "LinearRegression.predict()\n", "Predicts using the linear model\n", "\n", "Note: This exercise is **auto-graded and you can make multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import necessary libraries\n", "%matplotlib inline\n", "import scipy as sp\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib as mpl\n", "import matplotlib.cm as cm\n", "import matplotlib.pyplot as plt\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import 
PolynomialFeatures\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Helper function to define plot characteristics\n", "def make_plot():\n", " fig, axes=plt.subplots(figsize=(20,8), nrows=1, ncols=2)\n", " axes[0].set_ylabel(\"$p_R$\", fontsize=18)\n", " axes[0].set_xlabel(\"$x$\", fontsize=18)\n", " axes[1].set_xlabel(\"$x$\", fontsize=18)\n", " axes[1].set_yticklabels([])\n", " axes[0].set_ylim([0,1])\n", " axes[1].set_ylim([0,1])\n", " axes[0].set_xlim([0,1])\n", " axes[1].set_xlim([0,1])\n", " plt.tight_layout()\n", " return axes\n", " " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Read the file into a dataframe\n", "df = pd.read_csv(\"noisypopulation.csv\")\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "###edTest(get_data)###\n", "\n", "# Column x is the predictor and column y is the response variable.\n", "# Column f is the true function of the given data\n", "# Select the values of the columns\n", "\n", "x = df.___\n", "f = df.___\n", "y = df.___\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Function to compute the polynomial features for the data x \n", "# for the given degree d\n", "def polyshape(d, x):\n", " return PolynomialFeatures(___).fit_transform(___.reshape(-1,1))\n", " " ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Function to fit a Linear Regression model and predict with it\n", "def make_predict_with_model(x, y, x_pred):\n", " \n", " # Create a Linear Regression model with fit_intercept as False\n", " lreg = ___\n", " \n", " # Fit the model to the data x and y passed as parameters to this function\n", " lreg.fit(___, ___)\n", " \n", " # Predict on the x_pred data passed as a parameter to this function\n", " y_pred = 
lreg.predict(___)\n", "\n", " # Return the linear model and the prediction on the test data\n", " return lreg, y_pred\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Function to perform sampling and fit the data, with the following parameters\n", "\n", "# degree is the maximum degree of the model\n", "# num_sample is the number of samples\n", "# size is the number of random points selected from the data for each sample\n", "# x is the predictor variable\n", "# y is the response variable\n", "\n", "def gen(degree, num_sample, size, x, y):\n", " \n", " # Create 2 lists to store the prediction and model\n", " predicted_values, linear_models =[], []\n", " \n", " # Loop over the number of samples\n", " for i in range(num_sample):\n", " \n", " # Helper code to call the make_predict_with_model function to fit on the data\n", " indexes=np.sort(np.random.choice(x.shape[0], size=size, replace=False))\n", " \n", " # lreg and y_pred hold the model and predicted values for the current sample\n", " lreg, y_pred = make_predict_with_model(polyshape(degree, x[indexes]), y[indexes], polyshape(degree, x))\n", " \n", " # Append the model and predicted values to the appropriate lists\n", " predicted_values.append(___)\n", " linear_models.append(___)\n", " \n", " # Return the 2 lists, one for predicted values and one for the model\n", " return predicted_values, linear_models\n", " " ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "# Call the function gen() twice with x and y as the \n", "# predictor and response variable respectively\n", "\n", "# Set the number of samples to 200 and the number of points as 30\n", "# Store the return values in appropriate variables\n", "\n", "# Get results for degree 1\n", "predicted_1, model_1 = gen(___);\n", "\n", "# Get results for degree 100\n", "predicted_100, model_100 = gen(___);\n" ] }, { "cell_type": "code", "execution_count": 0, 
"metadata": {}, "outputs": [], "source": [ "# Helper code to plot the data\n", "indexes = np.sort(np.random.choice(x.shape[0], size=30, replace=False))\n", "\n", "plt.figure(figsize = (12,8))\n", "axes=make_plot()\n", "\n", "# Plot for Degree 1\n", "axes[0].plot(x,f,label=\"f\", color='darkblue',linewidth=4)\n", "axes[0].plot(x, y, '.', label=\"Population y\", color='#009193',markersize=8)\n", "axes[0].plot(x[indexes], y[indexes], 's', color='black', label=\"Data y\")\n", "\n", "for i,p in enumerate(predicted_1[:-1]):\n", " axes[0].plot(x,p,alpha=0.03,color='#FF9300')\n", "axes[0].plot(x, predicted_1[-1], alpha=0.3,color='#FF9300',label=\"Degree 1 from different samples\")\n", "\n", "\n", "# Plot for Degree 100\n", "axes[1].plot(x,f,label=\"f\", color='darkblue',linewidth=4)\n", "axes[1].plot(x, y, '.', label=\"Population y\", color='#009193',markersize=8)\n", "axes[1].plot(x[indexes], y[indexes], 's', color='black', label=\"Data y\")\n", "\n", "\n", "for i,p in enumerate(predicted_100[:-1]):\n", " axes[1].plot(x,p,alpha=0.03,color='#FF9300')\n", "axes[1].plot(x,predicted_100[-1],alpha=0.2,color='#FF9300',label=\"Degree 100 from different samples\")\n", "\n", "axes[0].legend(loc='best')\n", "axes[1].legend(loc='best')\n", "\n", "plt.show();\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ⏸ Does changing the degree from 100 to 10 reduce variance? Why or why not?\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Submit an answer choice as a string below \n", "answer1 = '___'\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }