{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Title\n",
    "\n",
    "**Exercise: B.2 - Best Degree of Polynomial using Cross-validation**\n",
    "\n",
    "# Description\n",
    "The aim of this exercise is to find the **best degree** of polynomial based on the MSE values. Further, plot the train and cross-validation error graphs as shown below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../img/image3.png\" style=\"width: 500px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Instructions:\n",
    "- Read the dataset and split into train and validation sets\n",
    "- Select a max degree value for the polynomial model\n",
    "- For each degree:\n",
    "    - Perform k-fold cross validation\n",
    "    - Fit a polynomial regression model for each degree to the training data and predict on the validation data\n",
    "- Compute the train, validation and cross-validation error as MSE values and store in separate lists.\n",
    "- Print the best degree of the model for both validation and cross-validation approaches.\n",
    "- Plot the train and cross-validation errors for each degree.\n",
    "\n",
    "# Hints:\n",
    "\n",
    "<a href=\"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html\" target=\"_blank\">pd.read_csv(filename)</a> : Returns a pandas dataframe containing the data and labels from the file data\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\" target=\"_blank\">sklearn.train_test_split()</a> : Splits the data into random train and test subsets.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html\" target=\"_blank\">sklearn.PolynomialFeatures()</a> : Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree\n",
    "\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html\" target=\"_blank\">sklearn.cross_validate()</a> : Evaluate metric(s) by cross-validation and also record fit/score times.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html\" target=\"_blank\">sklearn.fit_transform()</a> : Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\" target=\"_blank\">sklearn.LinearRegression()</a> : LinearRegression fits a linear model\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit\" target=\"_blank\">sklearn.fit()</a> : Fits the linear model to the training data\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict\" target=\"_blank\">sklearn.predict()</a> : Predict using the linear model.\n",
    "\n",
    "<a href=\"https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html\" target=\"_blank\">plt.subplots()</a> : Create a figure and a set of subplots\n",
    "\n",
    "<a href=\"https://docs.python.org/3/library/operator.html\" target=\"_blank\">operator.itemgetter()</a> : Return a callable object that fetches item from its operand\n",
    "\n",
    "<a href=\"https://docs.python.org/3.3/library/functions.html#zip\" target=\"_blank\">zip()</a> : Makes an iterator that aggregates elements from each of the iterables.\n",
    "\n",
    "**Note: This exercise is auto-graded and you can try multiple attempts.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "#import libraries\n",
    "%matplotlib inline\n",
    "import operator\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "#Read the file \"dataset.csv\" as a dataframe\n",
    "\n",
    "filename = \"dataset.csv\"\n",
    "\n",
    "df = pd.read_csv(filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# Assign the values to the predictor and response variables\n",
    "\n",
    "x = df[['x']].values\n",
    "y = df.y.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train-validation split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "### edTest(test_random) ###\n",
    "\n",
    "#Split the data into train and validation sets with 75% for training and with a random_state=1\n",
    "x_train, x_val, y_train, y_val = train_test_split(___)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing the MSE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_regression) ###\n",
    "\n",
    "# To iterate over the range, select the maximum degree of the polynomial\n",
    "maxdeg = 10\n",
    "\n",
    "# Create three empty lists to store training, validation and cross-validation MSEs\n",
    "training_error, validation_error, cross_validation_error = [],[],[]\n",
    "\n",
    "#Run a for loop through the degrees of the polynomial, fit linear regression, predict y values and calculate the training and testing errors and update it to the list\n",
    "for d in range(___):\n",
    "    \n",
    "    #Compute the polynomial features for the entire data, train data and validation data\n",
    "    x_poly_train = PolynomialFeatures(___).fit_transform(___)\n",
    "    x_poly_val = PolynomialFeatures(___).fit_transform(___)\n",
    "    x_poly = PolynomialFeatures(___).fit_transform(___)\n",
    "\n",
    "    #Get a Linear Regression object\n",
    "    lreg = LinearRegression()\n",
    "    \n",
    "    #Perform cross-validation on the entire data with 10 folds and get the mse_scores\n",
    "    mse_score = cross_validate(___)\n",
    "    \n",
    "    #Fit model on the training set\n",
    "    lreg.fit(___)\n",
    "\n",
    "    #Predict of the training and validation set\n",
    "    y_train_pred = lreg.predict(___)\n",
    "    y_val_pred = lreg.predict(___)\n",
    "    \n",
    "    #Compute the train and validation MSE\n",
    "    training_error.append(mean_squared_error(___))\n",
    "    validation_error.append(mean_squared_error(___))\n",
    "    \n",
    "    #Compute the mean of the cross validation error and store in list \n",
    "    #Remember to take into account the sign of the MSE metric returned by the cross_validate function \n",
    "    cross_validation_error.append(___)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding the best degree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_best_degree) ###\n",
    "\n",
    "#The best degree with the lowest validation error\n",
    "min_mse = min(___)\n",
    "best_degree = validation_error.index(___)\n",
    "\n",
    "\n",
    "#The best degree with the lowest cross-validation error\n",
    "min_cross_val_mse = min(___)\n",
    "best_cross_val_degree = cross_validation_error.index(___)\n",
    "\n",
    "\n",
    "print(\"The best degree of the model using validation is\",best_degree)\n",
    "print(\"The best degree of the model using cross-validation is\",best_cross_val_degree)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting the error graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the errors as a function of increasing d value to visualise the training and validation errors\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "\n",
    "#Plot the training error with labels\n",
    "ax.plot(range(maxdeg), training_error, label = 'Training error')\n",
    "\n",
    "#Plot the cross-validation error with labels\n",
    "ax.plot(range(maxdeg), cross_validation_error, label = 'Cross-Validation error')\n",
    "\n",
    "# Set the plot labels and legends\n",
    "\n",
    "ax.set_xlabel('Degree of Polynomial')\n",
    "ax.set_ylabel('Mean Squared Error')\n",
    "ax.legend(loc = 'best')\n",
    "ax.set_yscale('log')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Once you have marked your exercise, run again with Random_state = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Do you see any change in the results with change in the random state? If so, what do you think is the reason behind it?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Your answer here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}