{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Title\n",
    "\n",
    "**Exercise: B.1 - Best Degree of Polynomial with Train and Validation sets**\n",
    "\n",
    "# Description\n",
    "The aim of this exercise is to find the **best degree** of polynomial based on the MSE values. Further, plot the train and validation error graphs as shown below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../img/image2.png\" style=\"width: 500px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Instructions:\n",
    "- Read the dataset and split into train and validation sets\n",
    "- Select a max degree value for the polynomial model\n",
    "- Fit a polynomial regression model for each degree to the training data and predict on the validation data\n",
    "- Compute the train and validation error as MSE values and store in separate lists.\n",
    "- Find out the best degree of the model.\n",
    "- Plot the train and validation errors for each degree.\n",
    "\n",
    "\n",
    "# Hints:\n",
    "\n",
    "<a href=\"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html\" target=\"_blank\">pd.read_csv(filename)</a> : Returns a pandas dataframe containing the data and labels from the file data\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\" target=\"_blank\">sklearn.train_test_split()</a> : Splits the data into random train and test subsets.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html\" target=\"_blank\">sklearn.PolynomialFeatures()</a> : Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html\" target=\"_blank\">sklearn.fit_transform()</a> : Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\" target=\"_blank\">sklearn.LinearRegression()</a> : LinearRegression fits a linear model\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit\" target=\"_blank\">sklearn.fit()</a> : Fits the linear model to the training data\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict\" target=\"_blank\">sklearn.predict()</a> : Predict using the linear model.\n",
    "\n",
    "<a href=\"https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html\" target=\"_blank\">plt.subplots()</a> : Create a figure and a set of subplots\n",
    "\n",
    "<a href=\"https://docs.python.org/3/library/operator.html\" target=\"_blank\">operator.itemgetter()</a> : Return a callable object that fetches item from its operand\n",
    "\n",
    "<a href=\"https://docs.python.org/3.3/library/functions.html#zip\" target=\"_blank\">zip()</a> : Makes an iterator that aggregates elements from each of the iterables.\n",
    "\n",
    "**Note: This exercise is auto-graded and you can try multiple attempts.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import libraries\n",
    "%matplotlib inline\n",
    "import operator\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Read the file \"dataset.csv\" as a dataframe\n",
    "\n",
    "filename = \"dataset.csv\"\n",
    "\n",
    "df = pd.read_csv(filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assign the values to the predictor and response variables\n",
    "\n",
    "x = df[['x']].___\n",
    "y = df.y.___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train-validation split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "### edTest(test_random) ###\n",
    "\n",
    "#Split the dataset into train and validation sets with 75% Training set and 25% validation set. \n",
    "#Set random_state=1\n",
    "\n",
    "x_train, x_val, y_train, y_val = train_test_split(___)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing the train and validation error in terms of MSE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_regression) ###\n",
    "# To iterate over the range, select the maximum degree of the polynomial\n",
    "maxdeg = ___\n",
    "\n",
    "# Create two empty lists to store training and validation MSEs\n",
    "training_error, validation_error = [],[]\n",
    "\n",
    "#Run a for loop through the degrees of the polynomial, fit linear regression, predict y values and calculate the training and testing errors and update it to the list\n",
    "for d in range(maxdeg):\n",
    "    \n",
    "    #Compute the polynomial features for the train and validation sets\n",
    "    x_poly_train = PolynomialFeatures(d).fit_transform(___)\n",
    "    x_poly_val = PolynomialFeatures(d).fit_transform(___)\n",
    "    \n",
    "    lreg = LinearRegression()\n",
    "    lreg.fit(x_poly_train, y_train)\n",
    "    \n",
    "    y_train_pred = lreg.predict(___)\n",
    "    y_val_pred = lreg.predict(___)\n",
    "    \n",
    "    #Compute the train and validation MSE\n",
    "    \n",
    "    training_error.append(mean_squared_error(___))\n",
    "    validation_error.append(mean_squared_error(___))\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding the best degree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_best_degree) ###\n",
    "\n",
    "#The best degree is the model with the lowest validation error\n",
    "\n",
    "min_mse = min(validation_error)\n",
    "\n",
    "best_degree = validation_error.index(min_mse)\n",
    "\n",
    "print(\"The best degree of the model is\",best_degree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting the error graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Plot the errors as a function of increasing d value to visualise the training and testing errors\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "\n",
    "#Plot the training error with labels\n",
    "\n",
    "ax.plot(___)\n",
    "\n",
    "#Plot the validation error with labels\n",
    "\n",
    "ax.plot(___)\n",
    "\n",
    "# Set the plot labels and legends\n",
    "\n",
    "ax.set_xlabel('Degree of Polynomial')\n",
    "ax.set_ylabel('Mean Squared Error')\n",
    "ax.legend(loc = 'best')\n",
    "ax.set_yscale('log')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Once you have marked your exercise, run again with Random_state = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Do you see any change in the results with change in the random state? If so, what do you think is the reason behind it?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Your answer here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}