{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "\n", "Exercise: Boosting Classification\n", "\n", "## Description :\n", "\n", "The aim of this exercise to understand classification using boosting by plotting the decision boundary after each stump. Your plot may resemble the image below:\n", "\n", "\n", "\n", "## Instructions:\n", "\n", "- Read the dataset `boostingclassifier.csv` as pandas dataframe and take a quick look.\n", "- All columns except `landtype` are predictors. `landtype` is the response variable.\n", "- Define the AdaBoost classifier from scratch within the function `AdaBoost_scratch`:\n", " - Recall the AdaBoost algorithm from the slides:\n", "\n", "\n", " \n", " - Remember, we can derive the learning rate, $$\\lambda^{(i)}λ(i)$$ , for our iith estimator, $T^{(i)}T(i)$, analytically. \n", " \n", "\n", "\n", " - Note: In the exercise we call $$\\lambda^{(i)}λ(i)$$ the 'estimator weight.' This is because SKLearn's Adaboost implementation has a learning_rate parameter which refers to a global hyperparameter.\n", "- Call the `AdaBoost_scratch` function with the predictor and response variables for 9 stumps.\n", "- Use the helper code provided to visualize the classification decision boundary for the 9 stumps.\n", "\n", "## Hints: \n", "\n", "DecisionTreeClassifier()\n", "A decision tree classifier.\n", "\n", "sklearn.fit()\n", "Builds a model from the training set.\n", "\n", "np.average()\n", "Computes the weighted average along the specified axis.\n", "\n", "np.mean()\n", "Computes the arithmetic mean along the specified axis.\n", "\n", "np.log()\n", "Natural logarithm, element-wise.\n", "\n", "np.exp()\n", "Calculates the exponential of all elements in the input array.\n", "\n", "sklearn.AdaBoostClassifier()\n", "An AdaBoost classifier.\n", "\n", "**Note:** This exercise is **auto-graded and you can make multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from helper import plot_decision_boundary\n", "from matplotlib.colors import ListedColormap\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import AdaBoostClassifier\n", "%matplotlib inline\n", "sns.set_style('white')\n" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Read the dataset as a pandas dataframe\n", "df = pd.read_csv(\"boostingclassifier.csv\")\n", "\n", "# Read the columns latitude and longitude as the predictor variables\n", "X = df[['latitude','longitude']].values\n", "\n", "# Landtype is the response variable\n", "y = df['landtype'].values" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "### edTest(test_response) ###\n", "# update the class labels to appropriate values for AdaBoost\n", "y = ___" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# AdaBoost algorithm implementation from scratch\n", "\n", "def AdaBoost_scratch(X, y, M=10):\n", " '''\n", " X: data matrix of predictors\n", " y: response variable\n", " M: number of estimators (e.g., 'stumps')\n", " '''\n", "\n", " # Initialization of utility variables\n", " N = len(y)\n", " estimator_list = []\n", " y_predict_list = []\n", " estimator_error_list = []\n", " estimator_weight_list = []\n", " 
"    sample_weight_list = []\n", "\n", "    # Initialize the sample weights uniformly\n", "    sample_weight = np.ones(N) / N\n", "\n", "    # Store a copy of the sample weights to a list\n", "    # Q: why do we want to use .copy() here? The implementation will make it clear.\n", "    sample_weight_list.append(sample_weight.copy())\n", "\n", "    # Fit each boosted stump\n", "    # Q: Why might we prefer the variable name '_' here over something like 'm'?\n", "    for _ in range(M):\n", "        # Instantiate a Decision Tree classifier for our stump\n", "        # Note: our stumps should have only a single split\n", "        estimator = ___\n", "\n", "        # Fit the stump on the entire data using the sample_weight variable\n", "        # Hint: check the estimator's documentation for how to use sample weights\n", "        estimator.fit(___)\n", "\n", "        # Predict on the entire data\n", "        y_predict = estimator.predict(X)\n", "\n", "        # Create a binary vector representing the misclassifications\n", "        incorrect = ___\n", "\n", "        # Compute the error as the weighted average of the\n", "        # 'incorrect' vector above using the sample weights\n", "        # Hint: np.average() makes this very simple\n", "        estimator_error = ___\n", "\n", "        # Compute the estimator weight using the estimator error\n", "        # Note: The estimator weight here is referred to as the 'learning rate' in the slides\n", "        estimator_weight = ___\n", "\n", "        # Update the sample weights (un-normalized!)\n", "        # Note: Make use of the '*=' assignment statement\n", "        sample_weight *= ___\n", "\n", "        # Renormalize the sample weights\n", "        # Note: Make use of the '/=' assignment statement\n", "        sample_weight /= ___\n", "\n", "        # Save the iteration values\n", "        estimator_list.append(estimator)\n", "        y_predict_list.append(y_predict.copy())\n", "        estimator_error_list.append(estimator_error.copy())\n", "        estimator_weight_list.append(estimator_weight.copy())\n", "        sample_weight_list.append(sample_weight.copy())\n", "\n", "    # Convert to numpy arrays for convenience\n", "    estimator_list = np.asarray(estimator_list)\n", "    y_predict_list = np.asarray(y_predict_list)\n", "    estimator_error_list = np.asarray(estimator_error_list)\n", "    estimator_weight_list = np.asarray(estimator_weight_list)\n", "    sample_weight_list = np.asarray(sample_weight_list)\n", "\n", "    # Compute the final predictions as the sign of the weighted vote of the stumps\n", "    # Q: Why do we want to use np.sign() here?\n", "    preds = (np.array([np.sign((y_predict_list[:, point] * \\\n", "             estimator_weight_list).sum()) for point in range(N)]))\n", "\n", "    # Return the estimators, estimator weights, sample weights, and predictions\n", "    return estimator_list, estimator_weight_list, sample_weight_list, preds\n" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "### edTest(test_adaboost) ###\n", "# Call the AdaBoost function to perform boosting classification\n", "estimator_list, estimator_weight_list, sample_weight_list, preds = \\\n", "    AdaBoost_scratch(X, y, M=9)\n", "\n", "# Calculate the model's accuracy from the predictions returned above\n", "accuracy = ___\n", "print(f'accuracy: {accuracy:.3f}')" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Helper code to plot the decision boundary of each AdaBoost stump\n", "fig = plt.figure(figsize=(16, 16))\n", "for m in range(9):\n", "    fig.add_subplot(3, 3, m+1)\n", "    s_weights = (sample_weight_list[m, :] / sample_weight_list[m, :].sum()) * 300\n", "    plot_decision_boundary(estimator_list[m], X, y, N=50, scatter_weights=s_weights, counter=m)\n", "    plt.tight_layout()\n"
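] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optional sanity check (not part of the graded exercise): once `AdaBoost_scratch` is filled in, we can recompute the weighted-vote prediction using only the first $m$ stumps and watch the training accuracy evolve as stumps are added. The sketch below assumes the labels in `y` have already been converted to $\\pm 1$ as required above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: staged training accuracy of the scratch ensemble\n", "# Assumes AdaBoost_scratch has been completed and y contains +/-1 labels\n", "stump_preds = np.array([est.predict(X) for est in estimator_list])\n", "for m in range(1, len(estimator_list) + 1):\n", "    # Weighted vote of the first m stumps, thresholded by sign\n", "    staged = np.sign((estimator_weight_list[:m, None] * stump_preds[:m]).sum(axis=0))\n", "    print(f'stumps: {m}, accuracy: {np.mean(staged == y):.3f}')\n"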
] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "# Use sklearn's AdaBoostClassifier to take a look at the final decision boundary\n", "\n", "# Initialize the model with a single-split decision tree as the base model, same as above\n", "# Use SAMME as the algorithm and 9 estimators\n", "boost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),\n", "                           algorithm='SAMME', n_estimators=9)\n", "\n", "# Fit on the entire data\n", "boost.fit(X, y)\n", "\n", "# Call the plot_decision_boundary function to plot the decision boundary of the model\n", "plot_decision_boundary(boost, X, y, N=50)\n", "\n", "plt.title('AdaBoost Decision Boundary', fontsize=16)\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ How does the `n_estimators` parameter affect the model?" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Type your answer within the quotes given\n", "answer1 = '___'\n" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }