{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Title\n", "\n", "**Exercise: B.1 - Best Degree of Polynomial with Train and Validation sets**\n", "\n", "# Description\n", "The aim of this exercise is to find the **best degree** of polynomial based on the MSE values. Further, plot the train and validation error graphs as shown below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instructions:\n", "- Read the dataset and split into train and validation sets\n", "- Select a max degree value for the polynomial model\n", "- Fit a polynomial regression model for each degree to the training data and predict on the validation data\n", "- Compute the train and validation error as MSE values and store in separate lists.\n", "- Find out the best degree of the model.\n", "- Plot the train and validation errors for each degree.\n", "\n", "\n", "# Hints:\n", "\n", "pd.read_csv(filename) : Returns a pandas dataframe containing the data and labels from the file data\n", "\n", "sklearn.train_test_split() : Splits the data into random train and test subsets.\n", "\n", "sklearn.PolynomialFeatures() : Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree\n", "\n", "sklearn.fit_transform() : Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X\n", "\n", "sklearn.LinearRegression() : LinearRegression fits a linear model\n", "\n", "sklearn.fit() : Fits the linear model to the training data\n", "\n", "sklearn.predict() : Predict using the linear model.\n", "\n", "plt.subplots() : Create a figure and a set of subplots\n", "\n", "operator.itemgetter() : Return a callable object that fetches item from its operand\n", "\n", "zip() : Makes an iterator that aggregates elements from each of the iterables.\n", "\n", "**Note: This exercise is auto-graded and 
you can make multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Import libraries\n", "%matplotlib inline\n", "import operator\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading the dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Read the file \"dataset.csv\" as a dataframe\n", "\n", "filename = \"dataset.csv\"\n", "\n", "df = pd.read_csv(filename)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assign the values to the predictor and response variables\n", "\n", "x = df[['x']].___\n", "y = df.y.___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train-validation split" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "### edTest(test_random) ###\n", "\n", "# Split the dataset into train and validation sets (75% training, 25% validation).
\n", "#Set random_state=1\n", "\n", "x_train, x_val, y_train, y_val = train_test_split(___)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing the train and validation error in terms of MSE" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "### edTest(test_regression) ###\n", "# Select the maximum degree of the polynomial; range(maxdeg) iterates over degrees 0 to maxdeg-1\n", "maxdeg = ___\n", "\n", "# Create two empty lists to store training and validation MSEs\n", "training_error, validation_error = [],[]\n", "\n", "# Loop over the polynomial degrees: fit a linear regression, predict y values, compute the training and validation MSEs, and append them to the lists\n", "for d in range(maxdeg):\n", " \n", " #Compute the polynomial features for the train and validation sets\n", " x_poly_train = PolynomialFeatures(d).fit_transform(___)\n", " x_poly_val = PolynomialFeatures(d).fit_transform(___)\n", " \n", " lreg = LinearRegression()\n", " lreg.fit(x_poly_train, y_train)\n", " \n", " y_train_pred = lreg.predict(___)\n", " y_val_pred = lreg.predict(___)\n", " \n", " #Compute the train and validation MSE\n", " \n", " training_error.append(mean_squared_error(___))\n", " validation_error.append(mean_squared_error(___))\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding the best degree" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### edTest(test_best_degree) ###\n", "\n", "# The best degree is the one with the lowest validation error.\n", "# The loop starts at degree 0, so the list index equals the degree.\n", "\n", "min_mse = min(validation_error)\n", "\n", "best_degree = validation_error.index(min_mse)\n", "\n", "print(\"The best degree of the model is\",best_degree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting the error graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], 
"source": [ "# Plot the training and validation errors as a function of the polynomial degree d\n", "\n", "fig, ax = plt.subplots()\n", "\n", "#Plot the training error with labels\n", "\n", "ax.plot(___)\n", "\n", "#Plot the validation error with labels\n", "\n", "ax.plot(___)\n", "\n", "# Set the plot labels and legends\n", "\n", "ax.set_xlabel('Degree of Polynomial')\n", "ax.set_ylabel('Mean Squared Error')\n", "ax.legend(loc = 'best')\n", "ax.set_yscale('log')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Once your exercise has been marked, run it again with random_state = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Do you see any change in the results when the random state changes? If so, what do you think is the reason?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Your answer here" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }