{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Regularization with Cross-validation\n", "\n", "## Description :\n", "The aim of this exercise is to understand regularization with cross-validation.\n", "\n", "\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "\n", "- Initialise the required parameters for this exercise. These can be viewed in the scaffold.\n", "- Read the data file `polynomial50.csv` and assign the predictor and response variables.\n", "- Use the helper code to visualise the data.\n", "- Define a function `reg_with_validation` that performs Ridge regularization, taking a `random_state` parameter.\n", "    - Split the data into train and validation sets by specifying the `random_state`.\n", "    - Compute the polynomial features for the train and validation sets.\n", "    - Run a loop for the alpha values. Within the loop:\n", "        - Initialise the Ridge regression model with the specified alpha.\n", "        - Fit the model on the training data and predict on the train and validation sets.\n", "        - Compute the MSE of the train and validation predictions.\n", "        - Store these values in lists.\n", "- Run `reg_with_validation` for varying random states and plot a graph that depicts the best alpha value and the best MSE. The graph will be similar to the one given above.\n", "- Define a function `reg_with_cross_validation` that performs Ridge regularization with cross-validation, taking a `random_state` parameter.\n", "    - Sample the data using the specified random state.\n", "    - Assign the predictor and response variables using the sampled data.\n", "    - Run a loop for the alpha values. Within the loop:\n", "        - Initialise the Ridge regression model with the specified alpha.\n", "        - Fit the model on the entire data using cross-validation with 5 folds.\n", "        - Get the train and validation MSEs by taking the mean of the scores across the folds (the scores are negative MSEs, so flip the sign).\n", "        - Store these values in lists.\n", "- Run `reg_with_cross_validation` for varying random states and plot a graph that depicts the best alpha value and the best MSE.\n", "- Use the helper code given to print the best alpha values found with simple validation and with cross-validation for different random states.\n", "\n", "## Hints: \n", "\n", "df.sample()\n", "Returns a random sample of items from an axis of the object.\n", "\n", "sklearn.model_selection.cross_validate()\n", "Evaluate metric(s) by cross-validation and also record fit/score times.\n", "\n", "np.mean()\n", "Compute the arithmetic mean along the specified axis.\n", "\n", "sklearn.linear_model.Ridge()\n", "Linear least squares with l2 regularization.\n", "\n", "Ridge.fit()\n", "Fit Ridge regression model.\n", "\n", "Ridge.predict()\n", "Predict using the linear model.\n", "\n", "sklearn.metrics.mean_squared_error()\n", "Mean squared error regression loss.\n", "\n", "sklearn.preprocessing.PolynomialFeatures()\n", "Generate polynomial and interaction features.\n", "\n", "PolynomialFeatures.fit_transform()\n", "Fit to data, then transform it." 
] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from prettytable import PrettyTable\n", "from sklearn.linear_model import Ridge\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import cross_validate\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import PolynomialFeatures\n", "%matplotlib inline\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Initialise the required parameters\n", "\n", "# The list of random states\n", "ran_state = [0, 10, 21, 42, 66, 109, 310, 1969]\n", "\n", "# The list of alpha values for regularization\n", "alphas = [1e-7, 1e-5, 1e-3, 0.01, 0.1, 1]\n", "\n", "# The degree of the polynomial\n", "degree = 30\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Read the file 'polynomial50.csv' as a dataframe\n", "df = pd.read_csv('polynomial50.csv')\n", "\n", "# Assign the values of the 'x' column as the predictor\n", "x = df[['x']].values\n", "\n", "# Assign the values of the 'y' column as the response\n", "y = df['y'].values\n", "\n", "# Also assign the true value of the function (column 'f') to the variable f \n", "f = df['f'].values\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code below to visualise the distribution of the x, y values & also the value of the true function f\n", "fig, ax = plt.subplots()\n", "\n", "# Plot x vs y values\n", "ax.plot(x,y, 'o', label = 'Observed values',markersize=10 ,color = 'Darkblue')\n", "\n", "# Plot x vs true function value (line colour is set by the color keyword only)\n", "ax.plot(x,f, '-', label = 'True function',linewidth=4,color ='#9FC131FF')\n", "\n", "ax.legend(loc = 'best')\n", "ax.set_xlabel('Predictor - $X$',fontsize=16)\n", "ax.set_ylabel('Response - 
$Y$',fontsize=16)\n", "ax.set_title('Predictor vs Response plot',fontsize=16)\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Function to perform regularization with simple validation\n", "def reg_with_validation(rs):\n", "\n", "    # Split the data into train and validation sets with train size \n", "    # as 80% and random_state as the value given as the function parameter\n", "    x_train, x_val, y_train, y_val = train_test_split(x,y, train_size = 0.8, random_state=rs)\n", "\n", "    # Create two lists for training and validation error\n", "    training_error, validation_error = [],[]\n", "\n", "    # Compute the polynomial features for the train and validation sets\n", "    x_poly_train = ___\n", "    x_poly_val= ___\n", "\n", "    # Run a loop for all alpha values\n", "    for alpha in alphas:\n", "\n", "        # Initialise a Ridge regression model by specifying the current\n", "        # alpha and with fit_intercept=False\n", "        ridge_reg = ___\n", "\n", "        # Fit on the modified training data\n", "        ___\n", "\n", "        # Predict on the training set \n", "        y_train_pred = ___\n", "\n", "        # Predict on the validation set \n", "        y_val_pred = ___\n", "\n", "        # Compute the training and validation mean squared errors\n", "        mse_train = ___\n", "        mse_val = ___\n", "\n", "        # Append the MSEs to their respective lists \n", "        training_error.append(mse_train)\n", "        validation_error.append(mse_val)\n", "\n", "    # Return the train and validation MSE\n", "    return training_error, validation_error\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_validation) ###\n", "# Initialise a list to store the best alpha using simple validation for varying random states\n", "best_alpha = []\n", "\n", "# Run a loop for different random_states\n", "for i in range(len(ran_state)):\n", "\n", "    # Get the train and validation error by calling the \n", "    # function reg_with_validation\n", "    training_error, validation_error = ___\n", "\n", "    # Get the best mse from the validation_error list\n", "    best_mse = ___\n", "\n", "    # Get the best alpha value based on the best mse\n", "    best_parameter = ___\n", "\n", "    # Append the best alpha to the list\n", "    best_alpha.append(best_parameter)\n", "\n", "    # Use the helper code given below to plot the graphs\n", "    fig, ax = plt.subplots(figsize = (6,4))\n", "\n", "    # Plot the training errors for each alpha value\n", "    ax.plot(alphas,training_error,'s--', label = 'Training error',color = 'Darkblue',linewidth=2)\n", "\n", "    # Plot the validation errors for each alpha value\n", "    ax.plot(alphas,validation_error,'s-', label = 'Validation error',color ='#9FC131FF',linewidth=2 )\n", "\n", "    # Draw a vertical line at the best parameter\n", "    ax.axvline(best_parameter, 0, 0.5, color = 'r', label = f'Min validation error at alpha = {best_parameter}')\n", "\n", "    ax.set_xlabel('Value of Alpha',fontsize=15)\n", "    ax.set_ylabel('Mean Squared Error',fontsize=15)\n", "    ax.set_ylim([0,0.010])\n", "    ax.legend(loc = 'upper left',fontsize=16)\n", "    bm = round(best_mse, 5)\n", "    ax.set_title(f'Best alpha is {best_parameter} with mse {bm}',fontsize=16)\n", "    ax.set_xscale('log')\n", "    plt.tight_layout()\n", "    plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Function to perform regularization with cross validation\n", "def reg_with_cross_validation(rs):\n", "\n", "    # Sample the data to get different splits using the random state\n", "    df_new = ___\n", "\n", "    # Assign the values of the 'x' column as the predictor from your sampled dataframe\n", "    x = df_new[['x']].values\n", "\n", "    # Assign the values of the 'y' column as the response from your sampled dataframe\n", "    y = df_new['y'].values\n", "\n", "    # Create two lists for training and validation error\n", "    training_error, validation_error = [],[]\n", "\n", "    # Compute the polynomial features on the entire data\n", "    x_poly = ___\n", "\n", "    # Run a loop for all alpha values\n", "    for alpha in alphas:\n", "\n", "        # Initialise a Ridge regression model by specifying the alpha value and with fit_intercept=False\n", "        ridge_reg = ___\n", "\n", "        # Perform cross validation on the modified data with neg_mean_squared_error as the scoring parameter and cv=5\n", "        # Set return_train_score to True\n", "        ridge_cv = ___\n", "\n", "        # Compute the training and validation errors obtained from cross validation\n", "        mse_train = ___\n", "        mse_val = ___\n", "\n", "        # Append the MSEs to their respective lists \n", "        training_error.append(mse_train)\n", "        validation_error.append(mse_val)\n", "\n", "    # Return the train and validation MSE\n", "    return training_error, validation_error\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_cross_validation) ###\n", "# Initialise a list to store the best alpha using cross validation for varying random states\n", "best_cv_alpha = []\n", "\n", "# Run a loop for different random_states\n", "for i in range(len(ran_state)):\n", "\n", "    # Get the train and validation error by calling the function reg_with_cross_validation\n", "    training_error, validation_error = ___\n", "\n", "    # Get the best mse from the validation_error list\n", "    best_mse = ___\n", "\n", "    # Get the best alpha value based on the best mse\n", "    best_parameter = ___\n", "\n", "    # Append the best alpha to the list\n", "    best_cv_alpha.append(___)\n", "\n", "    # Use the helper code given below to plot the graphs\n", "    fig, ax = plt.subplots(figsize = (6,4))\n", "\n", "    # Plot the training errors for each alpha value\n", "    ax.plot(alphas,training_error,'s--', label = 'Training error',color = 'Darkblue',linewidth=2)\n", "\n", "    # Plot the validation errors for each alpha value\n", "    ax.plot(alphas,validation_error,'s-', label = 'Validation error',color ='#9FC131FF',linewidth=2 )\n", "\n", "    # Draw a vertical line at the best parameter\n", "    ax.axvline(best_parameter, 0, 0.5, color = 'r', label = f'Min validation error at alpha = {best_parameter}')\n", "\n", "    ax.set_xlabel('Value of Alpha',fontsize=15)\n", "    ax.set_ylabel('Mean Squared Error',fontsize=15)\n", "    ax.legend(loc = 'upper left',fontsize=16)\n", "    bm = round(best_mse, 5)\n", "    ax.set_title(f'Best alpha is {best_parameter} with mse {bm}',fontsize=16)\n", "    ax.set_xscale('log')\n", "    plt.tight_layout()\n", "    plt.show()\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to print your findings\n", "pt = PrettyTable()\n", "\n", "pt.field_names = [\"Random State\", \"Best Alpha with Validation\", \"Best Alpha with Cross-Validation\"]\n", "\n", "# Add one row per random state\n", "for i in range(len(ran_state)):\n", "    pt.add_row([ran_state[i], best_alpha[i], best_cv_alpha[i]])\n", "print(pt)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### ⏸ Comment on the results of regularization with simple validation and cross-validation after changing the random state and alpha values." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Submit an answer choice as a string below \n", "answer1 = '___'\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }