{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"Exercise: Best Degree of Polynomial using Cross-validation\n",
"\n",
"## Description :\n",
"The aim of this exercise is to find the **best degree** of polynomial based on the MSE values. Further, plot the train and cross-validation error graphs as shown below.\n",
"\n",
"
\n",
"\n",
"## Data Description:\n",
"\n",
"## Instructions:\n",
"\n",
"- Read the dataset and split into train and validation sets.\n",
"- Select a max degree value for the polynomial model.\n",
"- For each degree:\n",
" - Perform k-fold cross validation\n",
" - Fit a polynomial regression model for each degree on the training data and predict on the validation data\n",
"- Compute the train, validation and cross-validation error as MSE values and store them in separate lists.\n",
"- Print the best degree of the model for both validation and cross-validation approaches.\n",
"- Plot the train and cross-validation errors for each degree.\n",
"\n",
"## Hints: \n",
"\n",
"pd.read_csv(filename)\n",
"Returns a pandas dataframe containing the data and labels from the file data.\n",
"\n",
"sklearn.train_test_split()\n",
"Splits the data into random train and test subsets.\n",
"\n",
"sklearn.PolynomialFeatures()\n",
"Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.\n",
"\n",
"sklearn.cross_validate()\n",
"Evaluate metric(s) by cross-validation and also record fit/score times.\n",
"\n",
"
\n",
"\n",
"\n",
"sklearn.fit_transform()\n",
"Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.\n",
"\n",
"sklearn.LinearRegression(fit_intercept=False)\n",
"LinearRegression fits a linear model.\n",
"\n",
"sklearn.fit()\n",
"Fits the linear model to the training data.\n",
"\n",
"sklearn.predict()\n",
"Predict using the linear model.\n",
"\n",
"plt.subplots()\n",
"Create a figure and a set of subplots.\n",
"\n",
"operator.itemgetter()\n",
"Return a callable object that fetches item from its operand.\n",
"\n",
"zip()\n",
"Makes an iterator that aggregates elements from each of the iterables.\n",
"\n",
"**Note:** This exercise is auto-graded and you can try multiple attempts."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"%matplotlib inline\n",
"import operator\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_validate\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import PolynomialFeatures\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading the dataset"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Read the file \"dataset.csv\" as a Pandas dataframe \n",
"df = pd.read_csv(\"dataset.csv\")\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Assign the values of column x as the predictor\n",
"x = df[['x']].values\n",
"\n",
"# Assign the values of column y as the response variable\n",
"y = df.y.values\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train-validation split"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"### edTest(test_random) ###\n",
"# Split the data into train and validation sets with 75% for training \n",
"# and with a random_state=1\n",
"x_train, x_val, y_train, y_val = train_test_split(___)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing the MSE"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_regression) ###\n",
"\n",
"# To iterate over the range, select the maximum degree of the polynomial\n",
"maxdeg = 10\n",
"\n",
"# Create three empty lists to store training, validation and cross-validation MSEs\n",
"training_error, validation_error, cross_validation_error = [],[],[]\n",
"\n",
"# Loop through the degrees of the polynomial\n",
"for d in range(___):\n",
" \n",
" # Compute the polynomial features for the entire data\n",
" x_poly = PolynomialFeatures(___).fit_transform(___)\n",
"\n",
" # Compute the polynomial features for the train data\n",
" x_poly_train = PolynomialFeatures(___).fit_transform(___)\n",
"\n",
" # Compute the polynomial features for the validation data\n",
" x_poly_val = PolynomialFeatures(___).fit_transform(___)\n",
"\n",
" # Initialize a Linear Regression object\n",
" lreg = LinearRegression()\n",
" \n",
" # Fit model on the training set\n",
" lreg.fit(___)\n",
"\n",
" # Predict on the training data\n",
" y_train_pred = lreg.predict(___)\n",
"\n",
" # Predict on the validation set\n",
" y_val_pred = lreg.predict(___)\n",
" \n",
" # Compute the mse on the train data\n",
" training_error.append(mean_squared_error(___))\n",
"\n",
" # Compute the mse on the validation data\n",
" validation_error.append(mean_squared_error(___))\n",
" \n",
" # Perform cross-validation on the entire data with 10 folds and \n",
" # get the mse_scores\n",
" mse_score = cross_validate(___)\n",
"\n",
" # Compute the mean of the cross validation error and store in list \n",
" # Remember to take into account the sign of the MSE metric returned by the cross_validate function \n",
" cross_validation_error.append(___)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding the best degree"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_best_degree) ###\n",
"\n",
"# Get the best degree associated with the lowest validation error\n",
"min_mse = min(___)\n",
"best_degree = validation_error.index(___)\n",
"\n",
"\n",
"# Get the best degree associated with the lowest cross-validation error\n",
"min_cross_val_mse = min(___)\n",
"best_cross_val_degree = cross_validation_error.index(___)\n",
"\n",
"# Print the values\n",
"print(\"The best degree of the model using validation is\",best_degree)\n",
"print(\"The best degree of the model using cross-validation is\",best_cross_val_degree)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting the error graph"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Plot the errors as a function of increasing d value to visualise the training and validation errors\n",
"fig, ax = plt.subplots(1,2, figsize=(16,8))\n",
"\n",
"# Plot the training error with labels\n",
"ax[0].plot(range(maxdeg), np.log(training_error), label = 'Training error', linewidth=3, color='#FF7E79', alpha=0.4)\n",
"\n",
"# Plot the validation error with labels\n",
"ax[0].plot(range(maxdeg), np.log(validation_error), label = 'Validation error', linewidth=3, color=\"#007D66\", alpha=0.4)\n",
"\n",
"# Plot the training error with labels\n",
"ax[1].plot(range(maxdeg), np.log(training_error), label = 'Training error', linewidth=3, color='#FF7E79', alpha=0.4)\n",
"\n",
"# Plot the cross-validation error with labels\n",
"ax[1].plot(range(maxdeg), np.log(cross_validation_error), label = 'Cross-Validation error', linewidth=3, color=\"#007D66\", alpha=0.4)\n",
"\n",
"# Set the plot labels and legends\n",
"ax[0].set_xlabel('Degree of Polynomial', fontsize=12)\n",
"ax[0].set_ylabel('Log Mean Squared Error', fontsize=12)\n",
"ax[0].set_title(\"Log of validation error as a function of degree\")\n",
"\n",
"ax[1].set_xlabel('Degree of Polynomial', fontsize=12)\n",
"ax[1].set_ylabel('Log Mean Squared Error', fontsize=12)\n",
"ax[1].set_title(\"Log of CV error as a function of degree\")\n",
"\n",
"ax[0].legend()\n",
"ax[1].legend()\n",
"plt.show();\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸ If you run the exercise with a random state of 0, do you notice any change? What conclusion can you draw from this experiment?\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_chow1) ###\n",
"# Submit an answer choice as a string below \n",
"answer1 = '___'\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}