{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Linear and Polynomial Regression with Residual Analysis\n", "\n", "## Description :\n", "The goal of this exercise is to fit linear and polynomial regression models to the given data. Plot the fitted curves of both models along with the data and observe what the residuals tell us about the two fits.\n", "\n", "\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "- Read the poly.csv file into a dataframe.\n", "- Split the data into train and test subsets.\n", "- Fit a linear regression model on the training data, using the `LinearRegression()` object from the sklearn library.\n", "- Guess the degree of the polynomial that would best fit the data.\n", "- Fit a polynomial regression model on the computed polynomial features using the `LinearRegression()` object from the sklearn library.\n", "- Plot the linear and polynomial model predictions along with the test data.\n", "- Compute the polynomial and linear model residuals using the formula $\\epsilon_i = y_i - \\hat{y}_i$\n", "- Plot the histogram of the residuals and comment on your choice of the polynomial degree. 
\n", "\n", "## Hints: \n", "\n", "pd.read_csv(filename)\n", "Returns a pandas dataframe containing the data and labels from the file data.\n", "\n", "pd.DataFrame.head()\n", "Returns the first n rows of the dataframe (5 by default).\n", "\n", "sklearn.model_selection.train_test_split()\n", "Splits the data into random train and test subsets.\n", "\n", "plt.subplots()\n", "Creates a figure and a set of subplots.\n", "\n", "sklearn.preprocessing.PolynomialFeatures()\n", "Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.\n", "\n", "sklearn.preprocessing.PolynomialFeatures.fit_transform()\n", "Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.\n", "\n", "sklearn.linear_model.LinearRegression\n", "LinearRegression fits a linear model.\n", "\n", "sklearn.linear_model.LinearRegression.fit()\n", "Fits the linear model to the training data.\n", "\n", "sklearn.linear_model.LinearRegression.predict()\n", "Predicts using the linear model.\n", "\n", "plt.plot()\n", "Plots x versus y as lines and/or markers.\n", "\n", "plt.axvline()\n", "Adds a vertical line across the axes.\n", "\n", "ax.hist()\n", "Plots a histogram.\n", "\n", "**Note:** This exercise is auto-graded and you can make multiple attempts. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import PolynomialFeatures\n", "%matplotlib inline\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>x</th><th>y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>-3.292157</td><td>-46.916988</td></tr>\n",
"    <tr><th>1</th><td>0.799528</td><td>-3.941553</td></tr>\n",
"    <tr><th>2</th><td>-0.936214</td><td>-2.800522</td></tr>\n",
"    <tr><th>3</th><td>-4.722680</td><td>-103.030914</td></tr>\n",
"    <tr><th>4</th><td>-3.602674</td><td>-54.020819</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " x y\n", "0 -3.292157 -46.916988\n", "1 0.799528 -3.941553\n", "2 -0.936214 -2.800522\n", "3 -4.722680 -103.030914\n", "4 -3.602674 -54.020819" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read the data from 'poly.csv' into a Pandas dataframe\n", "df = pd.read_csv('poly.csv')\n", "\n", "# Take a quick look at the dataframe\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Get the column values for x & y as numpy arrays\n", "x = df[['x']].values\n", "y = df['y'].values\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Helper code to plot x & y to visually inspect the data\n", "fig, ax = plt.subplots()\n", "ax.plot(x,y,'x')\n", "ax.set_xlabel('$x$ values')\n", "ax.set_ylabel('$y$ values')\n", "ax.set_title('$y$ vs $x$')\n", "plt.show();\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Split the data into train and test sets\n", "# Set the train size to 0.8 and random state to 22\n", "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=22)\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Initialize a linear model\n", "model = LinearRegression()\n", "\n", "# Fit the model on the train data\n", "model.fit(x_train, y_train)\n", "\n", "# Get the predictions on the test data using the trained model\n", "y_lin_pred = model.predict(x_test)\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "### edTest(test_deg) ###\n", "\n", "# Guess the correct polynomial degree based on the above graph\n", "guess_degree = 4\n", "\n", "# Generate polynomial features on the train data\n", "x_poly_train= PolynomialFeatures(degree=guess_degree).fit_transform(x_train)\n", "\n", "# Generate polynomial features on the test data\n", "x_poly_test= PolynomialFeatures(degree=guess_degree).fit_transform(x_test)\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# Initialize a model to perform polynomial regression\n", "polymodel = LinearRegression(fit_intercept=False)\n", "\n", "# Fit the model on the polynomial transformed train data\n", "polymodel.fit(x_poly_train, y_train)\n", "\n", "# Predict on the entire polynomial transformed test data\n", "y_poly_pred = polymodel.predict(x_poly_test)\n" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ 
"# Helper code to visualise the results\n", "idx = np.argsort(x_test[:,0])\n", "x_test = x_test[idx]\n", "\n", "# Use the above index to get the appropriate predicted values for y_test\n", "# y_test values corresponding to sorted test data\n", "y_test = y_test[idx]\n", "\n", "# Linear predicted values \n", "y_lin_pred = y_lin_pred[idx]\n", "\n", "# Non-linear predicted values\n", "y_poly_pred= y_poly_pred[idx]\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# First plot the test x & y values using plt.scatter\n", "plt.scatter(x_test, y_test, s=10, label=\"Test Data\")\n", "\n", "# Plot the linear regression fit curve\n", "plt.plot(x_test, y_lin_pred, label=\"Linear fit\", color='k')\n", "\n", "# Plot the polynomial regression fit curve\n", "plt.plot(x_test, y_poly_pred, label=\"Polynomial fit\", color='red', alpha=0.6)\n", "\n", "# Assigning labels to the axes\n", "plt.xlabel(\"x values\")\n", "plt.ylabel(\"y values\")\n", "plt.legend()\n", "plt.show();\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "### edTest(test_poly_predictions) ###\n", "# Calculate the residual values for the polynomial model\n", "poly_residuals = y_test - y_poly_pred\n" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "### edTest(test_linear_predictions) ###\n", "# Calculate the residual values for the linear model\n", "lin_residuals = y_test - y_lin_pred\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Helper code to plot the residual values\n", "# Plot the histograms of the residuals for the two cases\n", "\n", "# Distribution of residuals\n", "fig, ax = plt.subplots(1,2, figsize = (10,4))\n", "bins = np.linspace(-20,20,20)\n", "ax[0].set_xlabel('Residuals')\n", "ax[0].set_ylabel('Frequency')\n", "\n", "# Plot the histogram for the polynomial regression residuals\n", "ax[0].hist(poly_residuals, bins, label = \"poly_residuals\", color='#B2D7D0', alpha=0.6)\n", "\n", "# Plot the histogram for the linear regression residuals\n", "ax[0].hist(lin_residuals, bins, label = \"lin_residuals\", color='#EFAEA4', alpha=0.6)\n", "\n", "ax[0].legend(loc = 'upper left')\n", "\n", "# Residuals plotted against the predicted values\n", "ax[1].scatter(y_poly_pred, poly_residuals, s=10, color='#B2D7D0', label='Polynomial predictions')\n", "ax[1].scatter(y_lin_pred, lin_residuals, s = 10, color='#EFAEA4', label='Linear predictions' )\n", "ax[1].set_xlim(-75,75)\n", "ax[1].set_xlabel('Predicted values')\n", "ax[1].set_ylabel('Residuals')\n", "ax[1].legend(loc = 'upper left')\n", "\n", "fig.suptitle('Residual Analysis (Linear vs Polynomial)')\n", "plt.show();\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ Do you think that polynomial degree is appropriate? Experiment with a polynomial of degree 2 and comment on what you observe for the residuals." 
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Type your answer within the quotes given\n", "answer1 = '2 is better'\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }