{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Title :\n",
    "Exercise: Best Degree of Polynomial using Cross-validation\n",
    "\n",
    "## Description :\n",
    "The aim of this exercise is to find the **best degree** of polynomial based on the MSE values. Further, plot the train and cross-validation error graphs as shown below.\n",
    "\n",
    "<img src=\"../fig/fig1.png\" style=\"width: 500px;\">\n",
    "\n",
    "## Data Description:\n",
    "\n",
    "## Instructions:\n",
    "\n",
    "- Read the dataset and split into train and validation sets.\n",
    "- Select a max degree value for the polynomial model.\n",
    "- For each degree:\n",
    "    - Perform k-fold cross validation\n",
    "    - Fit a polynomial regression model for each degree on the training data and predict on the validation data\n",
    "- Compute the train, validation and cross-validation error as MSE values and store them in separate lists.\n",
    "- Print the best degree of the model for both validation and cross-validation approaches.\n",
    "- Plot the train and cross-validation errors for each degree.\n",
    "\n",
    "## Hints: \n",
    "\n",
    "<a href=\"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html\" target=\"_blank\"></a>pd.read_csv(filename)</a>\n",
    "Returns a pandas dataframe containing the data and labels from the file data.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\" target=\"_blank\">sklearn.train_test_split()</a>\n",
    "Splits the data into random train and test subsets.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html\" target=\"_blank\">sklearn.PolynomialFeatures()</a>\n",
    "Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html\" target=\"_blank\">sklearn.cross_validate()</a>\n",
    "Evaluate metric(s) by cross-validation and also record fit/score times.\n",
    "\n",
    "<img src=\"../fig/fig2.png\" style=\"width: 500px;\">\n",
    "\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html\" target=\"_blank\">sklearn.fit_transform()</a>\n",
    "Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html\" target=\"_blank\">sklearn.LinearRegression(fit_intercept=False)</a>\n",
    "LinearRegression fits a linear model.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit\" target=\"_blank\">sklearn.fit()</a>\n",
    "Fits the linear model to the training data.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict\" target=\"_blank\">sklearn.predict()</a>\n",
    "Predict using the linear model.\n",
    "\n",
    "<a href=\"https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.subplots.html\" target=\"_blank\">plt.subplots()</a>\n",
    "Create a figure and a set of subplots.\n",
    "\n",
    "<a href=\"https://docs.python.org/3/library/operator.html\" target=\"_blank\">operator.itemgetter()</a>\n",
    "Return a callable object that fetches item from its operand.\n",
    "\n",
    "<a href=\"https://docs.python.org/3.3/library/functions.html#zip\" target=\"_blank\">zip()</a>\n",
    "Makes an iterator that aggregates elements from each of the iterables.\n",
    "\n",
    "**Note:** This exercise is auto-graded and you can try multiple attempts."
   ]
  },
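  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the `PolynomialFeatures` hint above, the snippet below expands a made-up two-row predictor `x_demo` (an illustrative assumption, not the exercise data). With the default `include_bias=True` the first output column is all ones, which is presumably why the hint for `LinearRegression` lists `fit_intercept=False`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "# Hypothetical two-row predictor, shaped (n_samples, 1) like the predictor array used later in this notebook\n",
    "x_demo = np.array([[2.0], [3.0]])\n",
    "\n",
    "# Degree 3 on a single feature yields the columns [1, x, x^2, x^3]\n",
    "x_demo_poly = PolynomialFeatures(degree=3).fit_transform(x_demo)\n",
    "print(x_demo_poly)\n",
    "```"
   ]
  },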
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "%matplotlib inline\n",
    "import operator\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import PolynomialFeatures\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Read the file \"dataset.csv\" as a Pandas dataframe \n",
    "df = pd.read_csv(\"dataset.csv\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Assign the values of column x as the predictor\n",
    "x = df[['x']].values\n",
    "\n",
    "# Assign the values of column y as the response variable\n",
    "y = df.y.values\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train-validation split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "### edTest(test_random) ###\n",
    "# Split the data into train and validation sets with 75% for training \n",
    "# and with a random_state=1\n",
    "x_train, x_val, y_train, y_val = train_test_split(___)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing the MSE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_regression) ###\n",
    "\n",
    "# To iterate over the range, select the maximum degree of the polynomial\n",
    "maxdeg = 10\n",
    "\n",
    "# Create three empty lists to store training, validation and cross-validation MSEs\n",
    "training_error, validation_error, cross_validation_error = [],[],[]\n",
    "\n",
    "# Loop through the degrees of the polynomial\n",
    "for d in range(___):\n",
    "    \n",
    "    # Compute the polynomial features for the entire data\n",
    "    x_poly = PolynomialFeatures(___).fit_transform(___)\n",
    "\n",
    "    # Compute the polynomial features for the train data\n",
    "    x_poly_train = PolynomialFeatures(___).fit_transform(___)\n",
    "\n",
    "    # Compute the polynomial features for the validation data\n",
    "    x_poly_val = PolynomialFeatures(___).fit_transform(___)\n",
    "\n",
    "    # Initialize a Linear Regression object\n",
    "    lreg = LinearRegression()\n",
    "  \n",
    "    # Fit model on the training set\n",
    "    lreg.fit(___)\n",
    "\n",
    "    # Predict on the training data\n",
    "    y_train_pred = lreg.predict(___)\n",
    "\n",
    "    # Predict on the validation set\n",
    "    y_val_pred = lreg.predict(___)\n",
    "    \n",
    "    # Compute the mse on the train data\n",
    "    training_error.append(mean_squared_error(___))\n",
    "\n",
    "    # Compute the mse on the validation data\n",
    "    validation_error.append(mean_squared_error(___))\n",
    "    \n",
    "    # Perform cross-validation on the entire data with 10 folds and \n",
    "    # get the mse_scores\n",
    "    mse_score = cross_validate(___)\n",
    "\n",
    "    # Compute the mean of the cross validation error and store in list \n",
    "    # Remember to take into account the sign of the MSE metric returned by the cross_validate function \n",
    "    cross_validation_error.append(___)\n",
    "    "
   ]
  },
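  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sign convention mentioned in the comments above can be seen on synthetic data. This is a sketch under stated assumptions: `x_demo`, `y_demo`, the degree of 2 and the 5 folds are illustrative choices, not the values this exercise asks for.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "# Synthetic quadratic data with noise (an assumption for illustration only)\n",
    "rng = np.random.default_rng(0)\n",
    "x_demo = rng.uniform(-3, 3, size=(60, 1))\n",
    "y_demo = 1.0 + 2.0 * x_demo[:, 0] - 0.5 * x_demo[:, 0] ** 2 + rng.normal(0, 0.3, size=60)\n",
    "\n",
    "# Expand the single predictor into the columns [1, x, x^2]\n",
    "x_demo_poly = PolynomialFeatures(degree=2).fit_transform(x_demo)\n",
    "\n",
    "# scikit-learn scorers treat higher as better, so the MSE scorer returns negated values\n",
    "cv_results = cross_validate(\n",
    "    LinearRegression(),\n",
    "    x_demo_poly,\n",
    "    y_demo,\n",
    "    cv=5,\n",
    "    scoring='neg_mean_squared_error',\n",
    "    return_train_score=True,\n",
    ")\n",
    "\n",
    "# Flip the sign to recover ordinary, non-negative MSE values\n",
    "cv_mse = -np.mean(cv_results['test_score'])\n",
    "train_mse = -np.mean(cv_results['train_score'])\n",
    "print(cv_mse, train_mse)\n",
    "```\n",
    "\n",
    "Every entry of `cv_results['test_score']` is zero or negative here, so forgetting the sign flip would make the worst model look like the best one when taking a minimum."
   ]
  },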
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding the best degree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_best_degree) ###\n",
    "\n",
    "# Get the best degree associated with the lowest validation error\n",
    "min_mse = min(___)\n",
    "best_degree = validation_error.index(___)\n",
    "\n",
    "\n",
    "# Get the best degree associated with the lowest cross-validation error\n",
    "min_cross_val_mse = min(___)\n",
    "best_cross_val_degree = cross_validation_error.index(___)\n",
    "\n",
    "# Print the values\n",
    "print(\"The best degree of the model using validation is\",best_degree)\n",
    "print(\"The best degree of the model using cross-validation is\",best_cross_val_degree)\n"
   ]
  },
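  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The hints list `operator.itemgetter()` and `zip()`, which give an alternative to the `min()` plus `index()` pattern above. A minimal sketch, using a made-up error list `demo_errors` rather than results from this exercise:\n",
    "\n",
    "```python\n",
    "import operator\n",
    "\n",
    "# Toy MSE values where position i holds the error for degree i (made-up numbers)\n",
    "demo_errors = [9.1, 4.2, 1.3, 1.7, 5.8]\n",
    "\n",
    "# Approach 1: find the smallest error, then look up its position\n",
    "best_by_index = demo_errors.index(min(demo_errors))\n",
    "\n",
    "# Approach 2: pair each degree with its error and pick the pair with the smallest error\n",
    "degree_error_pairs = zip(range(len(demo_errors)), demo_errors)\n",
    "best_by_itemgetter, lowest_error = min(degree_error_pairs, key=operator.itemgetter(1))\n",
    "\n",
    "print(best_by_index, best_by_itemgetter, lowest_error)\n",
    "```\n",
    "\n",
    "Both approaches return the first occurrence of the minimum, so they agree as long as position i in the list corresponds to degree i."
   ]
  },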
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plotting the error graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the errors as a function of increasing d value to visualise the training and validation errors\n",
    "fig, ax = plt.subplots(1,2, figsize=(16,8))\n",
    "\n",
    "# Plot the training error with labels\n",
    "ax[0].plot(range(maxdeg), np.log(training_error), label = 'Training error', linewidth=3, color='#FF7E79', alpha=0.4)\n",
    "\n",
    "# Plot the validation error with labels\n",
    "ax[0].plot(range(maxdeg), np.log(validation_error), label = 'Validation error', linewidth=3, color=\"#007D66\", alpha=0.4)\n",
    "\n",
    "# Plot the training error with labels\n",
    "ax[1].plot(range(maxdeg), np.log(training_error), label = 'Training error', linewidth=3, color='#FF7E79', alpha=0.4)\n",
    "\n",
    "# Plot the cross-validation error with labels\n",
    "ax[1].plot(range(maxdeg), np.log(cross_validation_error), label = 'Cross-Validation error', linewidth=3, color=\"#007D66\", alpha=0.4)\n",
    "\n",
    "# Set the plot labels and legends\n",
    "ax[0].set_xlabel('Degree of Polynomial', fontsize=12)\n",
    "ax[0].set_ylabel('Log Mean Squared Error', fontsize=12)\n",
    "ax[0].set_title(\"Log of validation error as a function of degree\")\n",
    "\n",
    "ax[1].set_xlabel('Degree of Polynomial', fontsize=12)\n",
    "ax[1].set_ylabel('Log Mean Squared Error', fontsize=12)\n",
    "ax[1].set_title(\"Log of CV error as a function of degree\")\n",
    "\n",
    "ax[0].legend()\n",
    "ax[1].legend()\n",
    "plt.show();\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "⏸ If you run the exercise with a random state of 0, do you notice any change? What conclusion can you draw from this experiment?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_chow1) ###\n",
    "# Submit an answer choice as a string below \n",
    "answer1 = '___'\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}