{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Title\n",
"\n",
"**Exercise: B.1 - Best Degree of Polynomial with Train and Validation sets**\n",
"\n",
"# Description\n",
"The aim of this exercise is to find the **best degree** of polynomial based on the MSE values. Further, plot the train and validation error graphs as shown below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instructions:\n",
"- Read the dataset and split into train and validation sets\n",
"- Select a max degree value for the polynomial model\n",
"- Fit a polynomial regression model for each degree to the training data and predict on the validation data\n",
"- Compute the train and validation error as MSE values and store in separate lists.\n",
"- Find out the best degree of the model.\n",
"- Plot the train and validation errors for each degree.\n",
"\n",
"\n",
"# Hints:\n",
"\n",
"pd.read_csv(filename) : Returns a pandas dataframe containing the data and labels from the file data\n",
"\n",
"sklearn.train_test_split() : Splits the data into random train and test subsets.\n",
"\n",
"sklearn.PolynomialFeatures() : Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree\n",
"\n",
"sklearn.fit_transform() : Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X\n",
"\n",
"sklearn.LinearRegression() : LinearRegression fits a linear model\n",
"\n",
"sklearn.fit() : Fits the linear model to the training data\n",
"\n",
"sklearn.predict() : Predict using the linear model.\n",
"\n",
"plt.subplots() : Create a figure and a set of subplots\n",
"\n",
"operator.itemgetter() : Return a callable object that fetches item from its operand\n",
"\n",
"zip() : Makes an iterator that aggregates elements from each of the iterables.\n",
"\n",
"**Note: This exercise is auto-graded and you can try multiple attempts.**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#import libraries\n",
"%matplotlib inline\n",
"import operator\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading the dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"#Read the file \"dataset.csv\" as a dataframe\n",
"\n",
"filename = \"dataset.csv\"\n",
"\n",
"df = pd.read_csv(filename)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assign the values to the predictor and response variables\n",
"\n",
"x = df[['x']].___\n",
"y = df.y.___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train-validation split"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"### edTest(test_random) ###\n",
"\n",
"#Split the dataset into train and validation sets with 75% Training set and 25% validation set. \n",
"#Set random_state=1\n",
"\n",
"x_train, x_val, y_train, y_val = train_test_split(___)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing the train and validation error in terms of MSE"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_regression) ###\n",
"# To iterate over the range, select the maximum degree of the polynomial\n",
"maxdeg = ___\n",
"\n",
"# Create two empty lists to store training and validation MSEs\n",
"training_error, validation_error = [],[]\n",
"\n",
"#Run a for loop through the degrees of the polynomial, fit linear regression, predict y values and calculate the training and testing errors and update it to the list\n",
"for d in range(maxdeg):\n",
" \n",
" #Compute the polynomial features for the train and validation sets\n",
" x_poly_train = PolynomialFeatures(d).fit_transform(___)\n",
" x_poly_val = PolynomialFeatures(d).fit_transform(___)\n",
" \n",
" lreg = LinearRegression()\n",
" lreg.fit(x_poly_train, y_train)\n",
" \n",
" y_train_pred = lreg.predict(___)\n",
" y_val_pred = lreg.predict(___)\n",
" \n",
" #Compute the train and validation MSE\n",
" \n",
" training_error.append(mean_squared_error(___))\n",
" validation_error.append(mean_squared_error(___))\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding the best degree"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_best_degree) ###\n",
"\n",
"#The best degree is the model with the lowest validation error\n",
"\n",
"min_mse = min(validation_error)\n",
"\n",
"best_degree = validation_error.index(min_mse)\n",
"\n",
"print(\"The best degree of the model is\",best_degree)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting the error graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"# Plot the errors as a function of increasing d value to visualise the training and testing errors\n",
"\n",
"fig, ax = plt.subplots()\n",
"\n",
"#Plot the training error with labels\n",
"\n",
"ax.plot(___)\n",
"\n",
"#Plot the validation error with labels\n",
"\n",
"ax.plot(___)\n",
"\n",
"# Set the plot labels and legends\n",
"\n",
"ax.set_xlabel('Degree of Polynomial')\n",
"ax.set_ylabel('Mean Squared Error')\n",
"ax.legend(loc = 'best')\n",
"ax.set_yscale('log')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Once you have marked your exercise, run again with Random_state = 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Do you see any change in the results with change in the random state? If so, what do you think is the reason behind it?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Your answer here"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}