{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Title :\n",
    "Exercise: Bias Variance Tradeoff\n",
    "    \n",
    "## Description :\n",
    "The aim of this exercise is to understand **bias variance tradeoff**. For this, you will fit a polynomial regression model with different degrees on the same data and plot them as given below.\n",
    "\n",
    "<img src=\"../fig/fig1.png\" style=\"width: 500px;\">\n",
    "\n",
    "## Data Description:\n",
    "\n",
    "## Instructions:\n",
    "\n",
    "- Read the file `noisypopulation.csv` as a Pandas dataframe.\n",
    "- Assign the response and predictor variables appropriately as mentioned in the scaffold.\n",
    "- Perform sampling on the dataset to get a subset.\n",
    "- For each sampled version fo the dataset:\n",
    "    - For degree of the chosen degree value:\n",
    "        - Compute the polynomial features for the training\n",
    "        - Fit the model on the given data\n",
    "        - Select a set of random points in the data to predict the model\n",
    "        - Store the predicted values as a list\n",
    "- Plot the predicted values along with the random data points and true function as given above.\n",
    "\n",
    "\n",
    "## Hints: \n",
    "\n",
    "FUNCTION SIGNATURE:\n",
    "gen(degree, number of samples, number of points, x, y)\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html\" target=\"_blank\">sklearn.PolynomialFeatures()</a>\n",
    "Generates polynomial and interaction features\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\" target=\"_blank\">sklearn.LinearRegression()</a>\n",
    "LinearRegression fits a linear model\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit\" target=\"_blank\">sklearn.fit()</a>\n",
    "Fits the linear model to the training data\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict\" target=\"_blank\">sklearn.predict()</a>\n",
    "Predict using the linear model.\n",
    "\n",
    "Note: This exercise is **auto-graded and you can try multiple attempts.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#Import necessary libraries\n",
    "%matplotlib inline\n",
    "import scipy as sp\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib as mpl\n",
    "import matplotlib.cm as cm\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.preprocessing import PolynomialFeatures\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Helper function to define plot characteristics\n",
    "def make_plot():\n",
    "    fig, axes=plt.subplots(figsize=(20,8), nrows=1, ncols=2);\n",
    "    axes[0].set_ylabel(\"$p_R$\", fontsize=18)\n",
    "    axes[0].set_xlabel(\"$x$\", fontsize=18)\n",
    "    axes[1].set_xlabel(\"$x$\", fontsize=18)\n",
    "    axes[1].set_yticklabels([])\n",
    "    axes[0].set_ylim([0,1])\n",
    "    axes[1].set_ylim([0,1])\n",
    "    axes[0].set_xlim([0,1])\n",
    "    axes[1].set_xlim([0,1])\n",
    "    plt.tight_layout();\n",
    "    return axes\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Reading the file into a dataframe\n",
    "df = pd.read_csv(\"noisypopulation.csv\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "###edTest(get_data)###\n",
    "\n",
    "# Set column x is the predictor and column y is the response variable.\n",
    "# Column f is the true function of the given data\n",
    "# Select the values of the columns\n",
    "\n",
    "x = df.___\n",
    "f = df.___\n",
    "y = df.___\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Function to compute the Polynomial Features for the data x \n",
    "# for the given degree d\n",
    "def polyshape(d, x):\n",
    "    return PolynomialFeatures(___).fit_transform(___.reshape(-1,1))\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Function to fit a Linear Regression model \n",
    "def make_predict_with_model(x, y, x_pred):\n",
    "    \n",
    "    # Create a Linear Regression model with fit_intercept as False\n",
    "    lreg = ___\n",
    "    \n",
    "    # Fit the model to the data x and y got parameters to the function\n",
    "    lreg.fit(___, ___)\n",
    "    \n",
    "    # Predict on the x_pred data got as a parameter to this function\n",
    "    y_pred = lreg.predict(___)\n",
    "\n",
    "    # Return the linear model and the prediction on the test data\n",
    "    return lreg, y_pred\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Function to perform sampling and fit the data, with the following parameters\n",
    "\n",
    "# degree is the maximum degree of the model\n",
    "# num_sample is the number of samples\n",
    "# size is the number of random points selected from the data for each sample\n",
    "# x is the predictor variable\n",
    "# y is the response variable\n",
    "\n",
    "def gen(degree, num_sample, size, x, y):\n",
    "    \n",
    "    # Create 2 lists to store the prediction and model\n",
    "    predicted_values, linear_models =[], []\n",
    "    \n",
    "    # Loop over the number of samples\n",
    "    for i in range(num_sample):\n",
    "        \n",
    "        # Helper code to call the make_predict_with_model function to fit on the data\n",
    "        indexes=np.sort(np.random.choice(x.shape[0], size=size, replace=False))\n",
    "        \n",
    "        # lreg and y_pred hold the model and predicted values for the current sample\n",
    "        lreg, y_pred = make_predict_with_model(polyshape(degree, x[indexes]), y[indexes], polyshape(degree, x))\n",
    "        \n",
    "        # Append the model and predicted values to the appropriate lists\n",
    "        predicted_values.append(___)\n",
    "        linear_models.append(___)\n",
    "    \n",
    "    # Return the 2 lists, one for predicted values and one for the model\n",
    "    return predicted_values, linear_models\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Call the function gen() twice with x and y as the \n",
    "# predictor and response variable respectively\n",
    "\n",
    "# Set the number of samples to 200 and the number of points as 30\n",
    "# Store the return values in appropriate variables\n",
    "\n",
    "# Get results for degree 1\n",
    "predicted_1, model_1 = gen(___);\n",
    "\n",
    "# Get results for degree 100\n",
    "predicted_100, model_100 = gen(___);\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper code to plot the data\n",
    "indexes = np.sort(np.random.choice(x.shape[0], size=30, replace=False))\n",
    "\n",
    "plt.figure(figsize = (12,8))\n",
    "axes=make_plot()\n",
    "\n",
    "# Plot for Degree 1\n",
    "axes[0].plot(x,f,label=\"f\", color='darkblue',linewidth=4)\n",
    "axes[0].plot(x, y, '.', label=\"Population y\", color='#009193',markersize=8)\n",
    "axes[0].plot(x[indexes], y[indexes], 's', color='black', label=\"Data y\")\n",
    "\n",
    "for i,p in enumerate(predicted_1[:-1]):\n",
    "    axes[0].plot(x,p,alpha=0.03,color='#FF9300')\n",
    "axes[0].plot(x, predicted_1[-1], alpha=0.3,color='#FF9300',label=\"Degree 1 from different samples\")\n",
    "\n",
    "\n",
    "# Plot for Degree 100\n",
    "axes[1].plot(x,f,label=\"f\", color='darkblue',linewidth=4)\n",
    "axes[1].plot(x, y, '.', label=\"Population y\", color='#009193',markersize=8)\n",
    "axes[1].plot(x[indexes], y[indexes], 's', color='black', label=\"Data y\")\n",
    "\n",
    "\n",
    "for i,p in enumerate(predicted_100[:-1]):\n",
    "    axes[1].plot(x,p,alpha=0.03,color='#FF9300')\n",
    "axes[1].plot(x,predicted_100[-1],alpha=0.2,color='#FF9300',label=\"Degree 100 from different samples\")\n",
    "\n",
    "axes[0].legend(loc='best')\n",
    "axes[1].legend(loc='best')\n",
    "\n",
    "plt.show();\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ⏸ Does changing the degree from 100 to 10 reduce variance? Why or why not?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_chow1) ###\n",
    "# Submit an answer choice as a string below \n",
    "answer1 = '___'\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}