{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Title\n", "\n", "**Exercise: Regression with Bagging**\n", "\n", "# Description\n", "\n", "The aim of this exercise is to understand bagging regression by fitting a bagged ensemble of decision trees and comparing a single estimator with the averaged model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instructions:\n", "- Read the dataset airquality.csv as a pandas dataframe.\n", "- Take a quick look at the dataset.\n", "- Split the data into train and test sets.\n", "- Specify the number of bootstraps as 30 and a maximum depth of 3.\n", "- Define a Bagging Regression model that uses a decision tree as its base estimator.\n", "- Fit the model on the train data.\n", "- Use the helper code to predict using the mean model and the individual estimators. The plot will look similar to the one given above.\n", "- Predict on the test data using the first estimator and the mean model.\n", "- Compute and display the test MSEs.\n", "\n", "# Hints:\n", "\n", "sklearn.model_selection.train_test_split() : Split arrays or matrices into random train and test subsets.\n", "\n", "BaggingRegressor() : Returns a Bagging regressor instance.\n", "\n", "DecisionTreeRegressor() : A decision tree regressor.\n", "\n", "BaggingRegressor().estimators_ : A list of the fitted base estimators, available after calling fit(). Use this to access any of the estimators.\n", "\n", "sklearn.metrics.mean_squared_error() : Mean squared error regression loss." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "\n", "import numpy as np\n", "from numpy import mean\n", "from numpy import std\n", "from sklearn.datasets import make_regression\n", "from sklearn.ensemble import BaggingRegressor\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import itertools\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import train_test_split\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Read the dataset, using the first column as the index\n", "df = pd.read_csv(\"airquality.csv\", index_col=0)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Take a quick look at the data\n", "df.head(10)\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# We will only use Ozone as the predictor in this exercise. Drop the rows where Ozone is missing (NaN)\n", "df = df[df.Ozone.notna()]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Assign Ozone as the predictor variable \"x\" and Temp as the response variable \"y\"\n", "x = df[['Ozone']].values\n", "y = df['Temp']" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Split the data into train and test sets with train size as 0.8 and random_state as 102\n", "x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=102)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bagging Regressor" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>Ozone</th><th>Solar.R</th><th>Wind</th><th>Temp</th><th>Month</th><th>Day</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>1</th><td>41.0</td><td>190.0</td><td>7.4</td><td>67</td><td>5</td><td>1</td></tr>\n",
" <tr><th>2</th><td>36.0</td><td>118.0</td><td>8.0</td><td>72</td><td>5</td><td>2</td></tr>\n",
" <tr><th>3</th><td>12.0</td><td>149.0</td><td>12.6</td><td>74</td><td>5</td><td>3</td></tr>\n",
" <tr><th>4</th><td>18.0</td><td>313.0</td><td>11.5</td><td>62</td><td>5</td><td>4</td></tr>\n",
" <tr><th>6</th><td>28.0</td><td>NaN</td><td>14.9</td><td>66</td><td>5</td><td>6</td></tr>\n",
" <tr><th>7</th><td>23.0</td><td>299.0</td><td>8.6</td><td>65</td><td>5</td><td>7</td></tr>\n",
" <tr><th>8</th><td>19.0</td><td>99.0</td><td>13.8</td><td>59</td><td>5</td><td>8</td></tr>\n",
" <tr><th>9</th><td>8.0</td><td>19.0</td><td>20.1</td><td>61</td><td>5</td><td>9</td></tr>\n",
" <tr><th>11</th><td>7.0</td><td>NaN</td><td>6.9</td><td>74</td><td>5</td><td>11</td></tr>\n",
" <tr><th>12</th><td>16.0</td><td>256.0</td><td>9.7</td><td>69</td><td>5</td><td>12</td></tr>\n",
" <tr><th>13</th><td>11.0</td><td>290.0</td><td>9.2</td><td>66</td><td>5</td><td>13</td></tr>\n",
" <tr><th>14</th><td>14.0</td><td>274.0</td><td>10.9</td><td>68</td><td>5</td><td>14</td></tr>\n",
" <tr><th>15</th><td>18.0</td><td>65.0</td><td>13.2</td><td>58</td><td>5</td><td>15</td></tr>\n",
" <tr><th>16</th><td>14.0</td><td>334.0</td><td>11.5</td><td>64</td><td>5</td><td>16</td></tr>\n",
" <tr><th>17</th><td>34.0</td><td>307.0</td><td>12.0</td><td>66</td><td>5</td><td>17</td></tr>\n",
" <tr><th>18</th><td>6.0</td><td>78.0</td><td>18.4</td><td>57</td><td>5</td><td>18</td></tr>\n",
" <tr><th>19</th><td>30.0</td><td>322.0</td><td>11.5</td><td>68</td><td>5</td><td>19</td></tr>\n",
" <tr><th>20</th><td>11.0</td><td>44.0</td><td>9.7</td><td>62</td><td>5</td><td>20</td></tr>\n",
" <tr><th>21</th><td>1.0</td><td>8.0</td><td>9.7</td><td>59</td><td>5</td><td>21</td></tr>\n",
" <tr><th>22</th><td>11.0</td><td>320.0</td><td>16.6</td><td>73</td><td>5</td><td>22</td></tr>\n",
" <tr><th>23</th><td>4.0</td><td>25.0</td><td>9.7</td><td>61</td><td>5</td><td>23</td></tr>\n",
" <tr><th>24</th><td>32.0</td><td>92.0</td><td>12.0</td><td>61</td><td>5</td><td>24</td></tr>\n",
" <tr><th>28</th><td>23.0</td><td>13.0</td><td>12.0</td><td>67</td><td>5</td><td>28</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Ozone Solar.R Wind Temp Month Day\n", "1 41.0 190.0 7.4 67 5 1\n", "2 36.0 118.0 8.0 72 5 2\n", "3 12.0 149.0 12.6 74 5 3\n", "4 18.0 313.0 11.5 62 5 4\n", "6 28.0 NaN 14.9 66 5 6\n", "7 23.0 299.0 8.6 65 5 7\n", "8 19.0 99.0 13.8 59 5 8\n", "9 8.0 19.0 20.1 61 5 9\n", "11 7.0 NaN 6.9 74 5 11\n", "12 16.0 256.0 9.7 69 5 12\n", "13 11.0 290.0 9.2 66 5 13\n", "14 14.0 274.0 10.9 68 5 14\n", "15 18.0 65.0 13.2 58 5 15\n", "16 14.0 334.0 11.5 64 5 16\n", "17 34.0 307.0 12.0 66 5 17\n", "18 6.0 78.0 18.4 57 5 18\n", "19 30.0 322.0 11.5 68 5 19\n", "20 11.0 44.0 9.7 62 5 20\n", "21 1.0 8.0 9.7 59 5 21\n", "22 11.0 320.0 16.6 73 5 22\n", "23 4.0 25.0 9.7 61 5 23\n", "24 32.0 92.0 12.0 61 5 24\n", "28 23.0 13.0 12.0 67 5 28" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "# Specify the number of bootstraps as 30\n", "num_bootstraps = 30\n", "\n", "# Specify the maximum depth of the decision tree as 3\n", "max_depth = 3\n", "\n", "# Define the Bagging Regressor model\n", "# Use a decision tree as the base estimator, with its depth set to max_depth\n", "# Set the number of estimators to the num_bootstraps value\n", "model = ___\n", " \n", "\n", "# Fit the model on the train data\n", "___\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Helper code to plot the predictions of the individual estimators and the mean model\n", "plt.figure(figsize=(10,8))\n", "\n", "xrange = np.linspace(x.min(),x.max(),80).reshape(-1,1)\n", "plt.plot(x_train,y_train,'o',color='#EFAEA4', markersize=6, label=\"Train Data\")\n", "plt.plot(x_test,y_test,'o',color='#F6345E', markersize=6, label=\"Test Data\")\n", "\n", "plt.xlim()\n", "for i in model.estimators_:\n", "    y_pred1 = i.predict(xrange)\n", "    plt.plot(xrange,y_pred1,alpha=0.5,linewidth=0.5,color = '#ABCCE3')\n", "# Plot the last estimator once more, only to add a single legend entry for the individual estimators\n", "plt.plot(xrange,y_pred1,alpha=0.6,linewidth=1,color = '#ABCCE3',label=\"Prediction of Individual Estimators\")\n", 
"\n", "# Plot the prediction of the bagged model, i.e. the mean of the individual estimators\n", "y_pred = model.predict(xrange)\n", "plt.plot(xrange,y_pred,alpha=0.7,linewidth=3,color='#50AEA4', label='Model Prediction')\n", "plt.xlabel(\"Ozone\", fontsize=16)\n", "plt.ylabel(\"Temperature\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "plt.legend(loc='best',fontsize=12)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Compute the test MSE of the prediction of the first estimator\n", "y_pred1 = ___\n", "print(\"The test MSE of one estimator in the model is\", round(mean_squared_error(y_test,y_pred1),2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mse) ###\n", "# Compute the test MSE of the model prediction\n", "y_pred = ___\n", "print(\"The test MSE of the model is\",round(mean_squared_error(y_test,y_pred),2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mindchow 🍲\n", "\n", "After marking, go back and change the number of bootstraps and the maximum depth of the tree.\n", "\n", "\n", "- Do you see any relation between them?\n", "\n", "- How does the variance change as the maximum depth changes?\n", "\n", "- How does the variance change as the number of bootstraps changes?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Your answer here*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }