{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"\n",
"Exercise: Boosting Classification\n",
"\n",
"## Description :\n",
"\n",
"The aim of this exercise to understand classification using boosting by plotting the decision boundary after each stump. Your plot may resemble the image below:\n",
"\n",
"
\n",
"\n",
"## Instructions:\n",
"\n",
"- Read the dataset `boostingclassifier.csv` as pandas dataframe and take a quick look.\n",
"- All columns except `landtype` are predictors. `landtype` is the response variable.\n",
"- Define the AdaBoost classifier from scratch within the function `AdaBoost_scratch`:\n",
" - Recall the AdaBoost algorithm from the slides:\n",
"\n",
"
\n",
" \n",
" - Remember, we can derive the learning rate, $$\\lambda^{(i)}λ(i)$$ , for our iith estimator, $T^{(i)}T(i)$, analytically. \n",
" \n",
"
\n",
"\n",
" - Note: In the exercise we call $$\\lambda^{(i)}λ(i)$$ the 'estimator weight.' This is because SKLearn's Adaboost implementation has a learning_rate parameter which refers to a global hyperparameter.\n",
"- Call the `AdaBoost_scratch` function with the predictor and response variables for 9 stumps.\n",
"- Use the helper code provided to visualize the classification decision boundary for the 9 stumps.\n",
"\n",
"## Hints: \n",
"\n",
"DecisionTreeClassifier()\n",
"A decision tree classifier.\n",
"\n",
"sklearn.fit()\n",
"Builds a model from the training set.\n",
"\n",
"np.average()\n",
"Computes the weighted average along the specified axis.\n",
"\n",
"np.mean()\n",
"Computes the arithmetic mean along the specified axis.\n",
"\n",
"np.log()\n",
"Natural logarithm, element-wise.\n",
"\n",
"np.exp()\n",
"Calculates the exponential of all elements in the input array.\n",
"\n",
"sklearn.AdaBoostClassifier()\n",
"An AdaBoost classifier.\n",
"\n",
"**Note:** This exercise is **auto-graded and you can make multiple attempts.**"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from helper import plot_decision_boundary\n",
"from matplotlib.colors import ListedColormap\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import AdaBoostClassifier\n",
"%matplotlib inline\n",
"sns.set_style('white')\n"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Read the dataset as a pandas dataframe\n",
"df = pd.read_csv(\"boostingclassifier.csv\")\n",
"\n",
"# Read the columns latitude and longitude as the predictor variables\n",
"X = df[['latitude','longitude']].values\n",
"\n",
"# Landtype is the response variable\n",
"y = df['landtype'].values"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_response) ###\n",
"# update the class labels to appropriate values for AdaBoost\n",
"y = ___"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# AdaBoost algorithm implementation from scratch\n",
"\n",
"def AdaBoost_scratch(X, y, M=10):\n",
" '''\n",
" X: data matrix of predictors\n",
" y: response variable\n",
" M: number of estimators (e.g., 'stumps')\n",
" '''\n",
"\n",
" # Initialization of utility variables\n",
" N = len(y)\n",
" estimator_list = []\n",
" y_predict_list = []\n",
" estimator_error_list = []\n",
" estimator_weight_list = []\n",
" sample_weight_list = []\n",
"\n",
" # Initialize the sample weights\n",
" sample_weight = np.ones(N) / N\n",
" \n",
" # Store a copy of the sample weights to a list\n",
" # Q: why do we want to use .copy() here? The implementation will make it clear.\n",
" sample_weight_list.append(sample_weight.copy())\n",
"\n",
" # Fit each boosted stump\n",
" # Q: Why might we prefer the variable name '_' here over something like 'm'?\n",
" for _ in range(M): \n",
" # Instantiate a Decision Tree classifier for our stump\n",
" # Note: our stumps should have only a single split\n",
" estimator = ___\n",
" \n",
" # Fit the stump on the entire data with using the sample_weight variable\n",
" # Hint: check the estimator's documentation for how to use sample weights\n",
" estimator.fit(___)\n",
" \n",
" # Predict on the entire data\n",
" y_predict = estimator.predict(X)\n",
"\n",
" # Create a binary vector representing the misclassifications\n",
" incorrect = ___\n",
"\n",
" # Compute the error as the weighted average of the \n",
" # 'incorrect' vector above using the sample weights\n",
" # Hint: np.average() makes this very simple\n",
" estimator_error = ___\n",
" \n",
" # Compute the estimator weight using the estimator error\n",
" # Note: The estimator weight here is refered to as the 'learning rate' in the slides\n",
" estimator_weight = ___\n",
"\n",
" # Update the sample weights (un-normalized!)\n",
" # Note: Make use of the '*=' assignment statement\n",
" sample_weight *= ___\n",
"\n",
" # Renormalize the sample weights\n",
" # Note: Make use of the '/=' assignment statement\n",
" sample_weight /= ___\n",
"\n",
" # Save the iteration values\n",
" estimator_list.append(estimator)\n",
" y_predict_list.append(y_predict.copy())\n",
" estimator_error_list.append(estimator_error.copy())\n",
" estimator_weight_list.append(estimator_weight.copy())\n",
" sample_weight_list.append(sample_weight.copy())\n",
" \n",
"\n",
" # Convert to numpy array for convenience \n",
" estimator_list = np.asarray(estimator_list)\n",
" y_predict_list = np.asarray(y_predict_list)\n",
" estimator_error_list = np.asarray(estimator_error_list)\n",
" estimator_weight_list = np.asarray(estimator_weight_list)\n",
" sample_weight_list = np.asarray(sample_weight_list)\n",
"\n",
" # Compute the predictions\n",
" # Q: Why do we want to use np.sign() here?\n",
" preds = (np.array([np.sign((y_predict_list[:,point] * \\\n",
" estimator_weight_list).sum()) for point in range(N)]))\n",
" \n",
" # Return the model, estimated weights and sample weights\n",
" return estimator_list, estimator_weight_list, sample_weight_list, preds\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_adaboost) ###\n",
"# Call the AdaBoost function to perform boosting classification\n",
"estimator_list, estimator_weight_list, sample_weight_list, preds = \\\n",
"AdaBoost_scratch(X,y, M=9)\n",
"\n",
"# Calculate the model's accuracy from the predictions returned above\n",
"accuracy = ___\n",
"print(f'accuracy: {accuracy:.3f}')"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Helper code to plot the AdaBoost Decision Boundary stumps\n",
"fig = plt.figure(figsize = (16,16))\n",
"for m in range(0, 9):\n",
" fig.add_subplot(3,3,m+1)\n",
" s_weights = (sample_weight_list[m,:] / sample_weight_list[m,:].sum() ) * 300\n",
" plot_decision_boundary(estimator_list[m], X,y,N = 50, scatter_weights =s_weights,counter=m)\n",
" plt.tight_layout()\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [],
"source": [
"# Use sklearn's AdaBoostClassifier to take a look at the final decision boundary \n",
"\n",
"# Initialise the model with Decision Tree classifier as the base model same as above\n",
"# Use SAMME as the algorithm and 9 estimators\n",
"boost = AdaBoostClassifier( base_estimator = DecisionTreeClassifier(max_depth = 1), \n",
" algorithm = 'SAMME', n_estimators=9)\n",
"\n",
"# Fit on the entire data\n",
"boost.fit(X,y)\n",
"\n",
"# Call the plot_decision_boundary function to plot the decision boundary of the model \n",
"plot_decision_boundary(boost, X,y, N = 50)\n",
"\n",
"plt.title('AdaBoost Decision Boundary', fontsize=16)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸ How does the `num_estimators` affect the model?"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_chow1) ###\n",
"# Type your answer within in the quotes given\n",
"answer1 = '___'\n"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}