{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title:\n", "Exercise: Best Degree of Polynomial using Cross-validation\n", "\n", "## Description:\n", "The aim of this exercise is to find the **best degree** of the polynomial model based on the MSE values, and then to plot the train and cross-validation error curves.\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "\n", "- Read the dataset and split it into train and validation sets.\n", "- Select a maximum degree for the polynomial model.\n", "- For each degree:\n", "    - Perform k-fold cross-validation on the entire data.\n", "    - Fit a polynomial regression model of that degree on the training data and predict on the validation data.\n", "- Compute the train, validation and cross-validation errors as MSE values and store them in separate lists.\n", "- Print the best degree of the model for both the validation and cross-validation approaches.\n", "- Plot the train and cross-validation errors for each degree.\n", "\n", "## Hints:\n", "\n", "pd.read_csv(filename)\n", "Returns a pandas dataframe containing the data and labels from the file.\n", "\n", "sklearn.model_selection.train_test_split()\n", "Splits the data into random train and test subsets.\n", "\n", "sklearn.preprocessing.PolynomialFeatures()\n", "Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.\n", "\n", "sklearn.model_selection.cross_validate()\n", "Evaluates metric(s) by cross-validation and also records fit/score times.\n", "\n", "PolynomialFeatures.fit_transform()\n", "Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.\n", "\n", "sklearn.linear_model.LinearRegression(fit_intercept=False)\n", "LinearRegression fits a linear model.\n", "\n", "LinearRegression.fit()\n", "Fits the linear model to the training data.\n", "\n", "LinearRegression.predict()\n", "Predicts using the linear model.\n", "\n", "plt.subplots()\n", "Create a figure and a set of 
subplots.\n", "\n", "operator.itemgetter()\n", "Return a callable object that fetches item from its operand.\n", "\n", "zip()\n", "Makes an iterator that aggregates elements from each of the iterables.\n", "\n", "**Note:** This exercise is auto-graded and you can try multiple attempts." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import necessary libraries\n", "%matplotlib inline\n", "import operator\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import cross_validate\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import PolynomialFeatures\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading the dataset" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Read the file \"dataset.csv\" as a Pandas dataframe \n", "df = pd.read_csv(\"dataset.csv\")\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Assign the values of column x as the predictor\n", "x = df[['x']].values\n", "\n", "# Assign the values of column y as the response variable\n", "y = df.y.values\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train-validation split" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": true }, "outputs": [], "source": [ "### edTest(test_random) ###\n", "# Split the data into train and validation sets with 75% for training \n", "# and with a random_state=1\n", "x_train, x_val, y_train, y_val = train_test_split(___)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing the MSE" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], 
"source": [ "### edTest(test_regression) ###\n", "\n", "# To iterate over the range, select the maximum degree of the polynomial\n", "maxdeg = 10\n", "\n", "# Create three empty lists to store training, validation and cross-validation MSEs\n", "training_error, validation_error, cross_validation_error = [],[],[]\n", "\n", "# Loop through the degrees of the polynomial\n", "for d in range(___):\n", " \n", " # Compute the polynomial features for the entire data\n", " x_poly = PolynomialFeatures(___).fit_transform(___)\n", "\n", " # Compute the polynomial features for the train data\n", " x_poly_train = PolynomialFeatures(___).fit_transform(___)\n", "\n", " # Compute the polynomial features for the validation data\n", " x_poly_val = PolynomialFeatures(___).fit_transform(___)\n", "\n", " # Initialize a Linear Regression object\n", " lreg = LinearRegression()\n", " \n", " # Fit model on the training set\n", " lreg.fit(___)\n", "\n", " # Predict on the training data\n", " y_train_pred = lreg.predict(___)\n", "\n", " # Predict on the validation set\n", " y_val_pred = lreg.predict(___)\n", " \n", " # Compute the mse on the train data\n", " training_error.append(mean_squared_error(___))\n", "\n", " # Compute the mse on the validation data\n", " validation_error.append(mean_squared_error(___))\n", " \n", " # Perform cross-validation on the entire data with 10 folds and \n", " # get the mse_scores\n", " mse_score = cross_validate(___)\n", "\n", " # Compute the mean of the cross validation error and store in list \n", " # Remember to take into account the sign of the MSE metric returned by the cross_validate function \n", " cross_validation_error.append(___)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding the best degree" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_best_degree) ###\n", "\n", "# Get the best degree associated with the lowest validation error\n", "min_mse = 
min(___)\n", "best_degree = validation_error.index(___)\n", "\n", "\n", "# Get the best degree associated with the lowest cross-validation error\n", "min_cross_val_mse = min(___)\n", "best_cross_val_degree = cross_validation_error.index(___)\n", "\n", "# Print the values\n", "print(\"The best degree of the model using validation is\",best_degree)\n", "print(\"The best degree of the model using cross-validation is\",best_cross_val_degree)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting the error graph" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Plot the errors as a function of increasing d value to visualise the training and validation errors\n", "fig, ax = plt.subplots(1,2, figsize=(16,8))\n", "\n", "# Plot the training error with labels\n", "ax[0].plot(range(maxdeg), np.log(training_error), label = 'Training error', linewidth=3, color='#FF7E79', alpha=0.4)\n", "\n", "# Plot the validation error with labels\n", "ax[0].plot(range(maxdeg), np.log(validation_error), label = 'Validation error', linewidth=3, color=\"#007D66\", alpha=0.4)\n", "\n", "# Plot the training error with labels\n", "ax[1].plot(range(maxdeg), np.log(training_error), label = 'Training error', linewidth=3, color='#FF7E79', alpha=0.4)\n", "\n", "# Plot the cross-validation error with labels\n", "ax[1].plot(range(maxdeg), np.log(cross_validation_error), label = 'Cross-Validation error', linewidth=3, color=\"#007D66\", alpha=0.4)\n", "\n", "# Set the plot labels and legends\n", "ax[0].set_xlabel('Degree of Polynomial', fontsize=12)\n", "ax[0].set_ylabel('Log Mean Squared Error', fontsize=12)\n", "ax[0].set_title(\"Log of validation error as a function of degree\")\n", "\n", "ax[1].set_xlabel('Degree of Polynomial', fontsize=12)\n", "ax[1].set_ylabel('Log Mean Squared Error', fontsize=12)\n", "ax[1].set_title(\"Log of CV error as a function of degree\")\n", "\n", "ax[0].legend()\n", "ax[1].legend()\n", "plt.show();\n" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ If you run the exercise with a random state of 0, do you notice any change? What conclusion can you draw from this experiment?\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Submit an answer choice as a string below \n", "answer1 = '___'\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }