{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Title\n",
    "\n",
    "**Exercise: 1 - Regularization with Cross-validation**\n",
    "\n",
    "# Description\n",
    "\n",
    "The aim of this exercise is to understand regularization with cross-validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../img/image.png\" style=\"width: 500px;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Instructions:\n",
    "- Initialising the required parameters for this exercise. This can be viewed in the scaffold.\n",
    "- Read the data file `polynomial50.csv` and assign the predictor and response variables.\n",
    "- Use the helper code to visualise the data.\n",
    "- Define a function `reg_with_validation` that performs Ridge regularization by taking a random_state parameter.\n",
    "    - Split the data into train and validation sets by specifying the random_state.\n",
    "    - Compute the polynomial features for the train and validation sets.\n",
    "    - Run a loop for the alpha values. Within the loop:\n",
    "        - Initialise the Ridge regression model with the specified alpha.\n",
    "        - Fit the model on the training data and predict and on the train and validation set.\n",
    "        - Compute the MSE of the train and validation prediction.\n",
    "        - Store these values in lists.\n",
    "- Run `reg_with_validation` for varying random states and plot a graph that depicts the best alpha value and the best MSE. The graph will be similar to the one given above.\n",
    "- Define a function `reg_with_cross_validation` that performs Ridge regularization with cross-validation by taking a random_state parameter.\n",
    "    - Sample the data using the specified random state.\n",
    "    - Assign the predictor and response variables using the sampled data.\n",
    "    - Run a loop for the alpha values. Within the loop:\n",
    "        - Initialise the Ridge regression model with the specified alpha.\n",
    "        - Fit the model on the entire data and using cross-validation with 5 folds.\n",
    "        - Get the train and validation MSEs by taking their mean.\n",
    "        - Store these values in lists.\n",
    "- Run `reg_with_cross_validation` for varying random states and plot a graph that depicts the best alpha value and the best MSE.\n",
    "- Use the helper code given to print your best MSEs in the case of simple validation and cross-validation for different random states.\n",
    "\n",
    "\n",
    "# Hints:\n",
    "\n",
    "<a href=\"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html\" target=\"_blank\">df.sample()</a> : Returns a random sample of items from an axis of the object.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html\" target=\"_blank\">sklearn.cross_validate()</a> : Evaluate metrics by cross-validation and also record fit/score times.\n",
    "\n",
    "<a href=\"https://numpy.org/doc/stable/reference/generated/numpy.mean.html\" target=\"_blank\">np.mean()</a> : Compute the arithmetic mean along the specified axis.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html\" target=\"_blank\">sklearn.RidgeRegression()</a> : Linear least squares with l2 regularization.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge.fit\" target=\"_blank\">sklearn.fit()</a> : Fit Ridge egression model.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge.predict\" target=\"_blank\">sklearn.predict()</a> : Predict using the linear model.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html\" target=\"_blank\">sklearn.mean_squared_error()</a> : Mean squared error regression loss\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html\" target=\"_blank\">sklearn.PolynomialFeatures()</a> : Generate polynomial and interaction features.\n",
    "\n",
    "<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures.fit_transform\" target=\"_blank\">sklearn.fit_transform()</a> : Fit to data, then transform it."
   ]
  },
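  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the quantities named in the instructions can be written out as follows. This is a sketch under assumptions rather than part of the original exercise: the symbols $\\Phi$ (the matrix of polynomial features), $\\beta$ (the coefficient vector) and $n$ (the number of observations in the set being scored) are notation introduced here, and the objective is the standard ridge objective as documented for scikit-learn's `Ridge`.\n",
    "\n",
    "$$\\hat{\\beta}_{\\alpha} = \\underset{\\beta}{\\arg\\min} \\; \\lVert y - \\Phi\\beta \\rVert_2^2 + \\alpha \\lVert \\beta \\rVert_2^2$$\n",
    "\n",
    "$$\\mathrm{MSE}(\\alpha) = \\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i(\\alpha)\\right)^2, \\qquad \\alpha^{*} = \\underset{\\alpha \\in \\mathrm{alphas}}{\\arg\\min} \\; \\mathrm{MSE}_{\\mathrm{val}}(\\alpha)$$\n",
    "\n",
    "Larger values of $\\alpha$ shrink the coefficients more strongly, which typically raises the training MSE while it may lower the validation MSE for a high-degree polynomial."
   ]
  },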
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import required libraries\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from prettytable import PrettyTable\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.model_selection import cross_validate\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialising required parameters\n",
    "\n",
    "# The list of random states\n",
    "ran_state = [0, 10, 21, 42, 66, 109, 310, 1969]\n",
    "\n",
    "# The list of alpha for regularization\n",
    "alphas = [1e-7,1e-5, 1e-3, 0.01, 0.1, 1]\n",
    "\n",
    "# The degree of the polynomial\n",
    "degree= 30\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the file 'polynomial50.csv' as a dataframe\n",
    "df = pd.read_csv('polynomial50.csv')\n",
    "\n",
    "# Assign the values of the 'x' column as the predictor\n",
    "x = df[['x']].values\n",
    "\n",
    "# Assign the values of the 'y' column as the response\n",
    "y = df['y'].values\n",
    "\n",
    "# Also assign the true value of the function (column 'f') to the variable f \n",
    "f = df['f'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the helper code below to visualise the distribution of the x, y values & also the value of the true function f\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "\n",
    "# Plot x vs y values\n",
    "ax.plot(x,y, 'o', label = 'Observed values',markersize=10 ,color = 'Darkblue')\n",
    "\n",
    "# Plot x vs true function value\n",
    "ax.plot(x,f, 'k-', label = 'True function',linewidth=4,color ='#9FC131FF')\n",
    "\n",
    "ax.legend(loc = 'best');\n",
    "ax.set_xlabel('Predictor - $X$',fontsize=16)\n",
    "ax.set_ylabel('Response - $Y$',fontsize=16)\n",
    "ax.set_title('Predictor vs Response plot',fontsize=16);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to perform regularization with simple validation\n",
    "def reg_with_validation(rs):\n",
    "    \n",
    "    # Split the data into train and validation sets with train size as 80% and random_state as\n",
    "    x_train, x_val, y_train, y_val = train_test_split(x,y, train_size = 0.8, random_state=rs)\n",
    "\n",
    "    # Create two lists for training and validation error\n",
    "    training_error, validation_error = [],[]\n",
    "\n",
    "    # Compute the polynomial features train and validation sets\n",
    "    x_poly_train = ___\n",
    "    x_poly_val= ___\n",
    "\n",
    "    # Run a loop for all alpha values\n",
    "    for alpha in alphas:\n",
    "\n",
    "        # Initialise a Ridge regression model by specifying the alpha and with fit_intercept=False\n",
    "        ridge_reg = ___\n",
    "        \n",
    "        # Fit on the modified training data\n",
    "        ___\n",
    "\n",
    "        # Predict on the training set \n",
    "        y_train_pred = ___\n",
    "        \n",
    "        # Predict on the validation set \n",
    "        y_val_pred = ___\n",
    "        \n",
    "        # Compute the training and validation mean squared errors\n",
    "        mse_train = ___\n",
    "        mse_val = ___\n",
    "\n",
    "        # Append the MSEs to their respective lists \n",
    "        training_error.append(mse_train)\n",
    "        validation_error.append(mse_val)\n",
    "    \n",
    "    # Return the train and validation MSE\n",
    "    return training_error, validation_error\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_validation) ###\n",
    "# Initialise a list to store the best alpha using simple validation for varying random states\n",
    "best_alpha = []\n",
    "\n",
    "# Run a loop for different random_states\n",
    "for i in range(len(ran_state)):\n",
    "    \n",
    "    # Get the train and validation error by calling the function reg_with_validation\n",
    "    training_error, validation_error = ___\n",
    "\n",
    "    # Get the best mse from the validation_error list\n",
    "    best_mse  = ___\n",
    "    \n",
    "    # Get the best alpha value based on the best mse\n",
    "    best_parameter = ___\n",
    "    \n",
    "    # Append the best alpha to the list\n",
    "    best_alpha.append(best_parameter)\n",
    "    \n",
    "    # Use the helper code given below to plot the graphs\n",
    "    fig, ax = plt.subplots(figsize = (6,4))\n",
    "    \n",
    "    # Plot the training errors for each alpha value\n",
    "    ax.plot(alphas,training_error,'s--', label = 'Training error',color = 'Darkblue',linewidth=2)\n",
    "    \n",
    "    # Plot the validation errors for each alpha value\n",
    "    ax.plot(alphas,validation_error,'s-', label = 'Validation error',color ='#9FC131FF',linewidth=2 )\n",
    "\n",
    "    # Draw a vertical line at the best parameter\n",
    "    ax.axvline(best_parameter, 0, 0.75, color = 'r', label = f'Min validation error at alpha = {best_parameter}')\n",
    "\n",
    "    ax.set_xlabel('Value of Alpha',fontsize=15)\n",
    "    ax.set_ylabel('Mean Squared Error',fontsize=15)\n",
    "    ax.set_ylim([0,0.010])\n",
    "    ax.legend(loc = 'best',fontsize=12)\n",
    "    bm = round(best_mse, 5)\n",
    "    ax.set_title(f'Best alpha is {best_parameter} with MSE {bm}',fontsize=16)\n",
    "    ax.set_xscale('log')\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
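  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cross-validated error computed in the next step can be sketched as below. This assumes $K = 5$ folds, as stated in the instructions; $s_k$ is notation introduced here for the score that `cross_validate` reports for fold $k$ when `scoring='neg_mean_squared_error'`.\n",
    "\n",
    "$$\\mathrm{MSE}_k(\\alpha) = -s_k(\\alpha), \\qquad \\mathrm{MSE}_{\\mathrm{CV}}(\\alpha) = \\frac{1}{K}\\sum_{k=1}^{K}\\mathrm{MSE}_k(\\alpha)$$\n",
    "\n",
    "The sign flip is needed because scikit-learn scorers follow a higher-is-better convention, so the reported values are negated MSEs. Because every observation is used for validation exactly once, this average depends less on a single lucky or unlucky split than the simple validation MSE does."
   ]
  },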
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to perform regularization with cross validation\n",
    "def reg_with_cross_validation(rs):\n",
    "    \n",
    "    # Sample your data to get different splits using the random state\n",
    "    df_new = ___\n",
    "    \n",
    "    # Assign the values of the 'x' column as the predictor from your sampled dataframe\n",
    "    x = df_new[['x']].values\n",
    "\n",
    "    # Assign the values of the 'y' column as the response from your sampled dataframe\n",
    "    y = df_new['y'].values\n",
    "\n",
    "    # Create two lists for training and validation error\n",
    "    training_error, validation_error = [],[]\n",
    "\n",
    "    # Compute the polynomial features on the entire data\n",
    "    x_poly = ___\n",
    "\n",
    "    # Run a loop for all alpha values\n",
    "    for alpha in alphas:\n",
    "\n",
    "        # Initialise a Ridge regression model by specifying the alpha value and with fit_intercept=False\n",
    "        ridge_reg = ___\n",
    "        \n",
    "        # Perform cross validation on the modified data with neg_mean_squared_error as the scoring parameter and cv=5\n",
    "        # Remember to get the train_score\n",
    "        ridge_cv = ___\n",
    "\n",
    "        # Compute the training and validation errors got after cross validation\n",
    "        mse_train = ___\n",
    "        mse_val = ___\n",
    "        \n",
    "        # Append the MSEs to their respective lists \n",
    "        training_error.append(mse_train)\n",
    "        validation_error.append(mse_val)\n",
    "    \n",
    "    # Return the train and validation MSE\n",
    "    return training_error, validation_error\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_cross_validation) ###\n",
    "# Initialise a list to store the best alpha using cross validation for varying random states\n",
    "best_cv_alpha = []\n",
    "\n",
    "# Run a loop for different random_states\n",
    "for i in range(len(ran_state)):\n",
    "    \n",
    "    # Get the train and validation error by calling the function reg_with_cross_validation\n",
    "    training_error, validation_error = ___\n",
    "    \n",
    "    # Get the best mse from the validation_error list\n",
    "    best_mse  = ___\n",
    "    \n",
    "    # Get the best alpha value based on the best mse\n",
    "    best_parameter = ___\n",
    "    \n",
    "    # Append the best alpha to the list\n",
    "    best_cv_alpha.append(best_parameter)\n",
    "    \n",
    "    # Use the helper code given below to plot the graphs\n",
    "    fig, ax = plt.subplots(figsize = (6,4))\n",
    "    \n",
    "    # Plot the training errors for each alpha value\n",
    "    ax.plot(alphas,training_error,'s--', label = 'Training error',color = 'Darkblue',linewidth=2)\n",
    "    \n",
    "    # Plot the validation errors for each alpha value\n",
    "    ax.plot(alphas,validation_error,'s-', label = 'Validation error',color ='#9FC131FF',linewidth=2 )\n",
    "\n",
    "    # Draw a vertical line at the best parameter\n",
    "    ax.axvline(best_parameter, 0, 0.75, color = 'r', label = f'Min validation error at alpha = {best_parameter}')\n",
    "\n",
    "    ax.set_xlabel('Value of Alpha',fontsize=15)\n",
    "    ax.set_ylabel('Mean Squared Error',fontsize=15)\n",
    "    ax.legend(loc = 'best',fontsize=12)\n",
    "    bm = round(best_mse, 5)\n",
    "    ax.set_title(f'Best alpha is {best_parameter} with MSE {bm}',fontsize=16)\n",
    "    ax.set_xscale('log')\n",
    "    plt.tight_layout()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the helper code below to print your findings\n",
    "pt = PrettyTable()\n",
    "\n",
    "pt.field_names = [\"Random State\", \"Best Alpha with Validation\", \"Best Alpha with Cross-Validation\"]\n",
    "\n",
    "for i in range(6):\n",
    "    pt.add_row([ran_state[i], best_alpha[i], best_cv_alpha[i]])\n",
    "\n",
    "print(pt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**What can you infer about cross-validation based on the previous analysis?**\n",
    "\n",
    "**After marking, change the random states and alpha values. Run the code again. Comment on the results of regularization with simple validation and cross-validation.**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}