{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "\n", "Exercise: Boosting Classification\n", "\n", "## Description :\n", "\n", "The aim of this exercise to understand classification using boosting by plotting the decision boundary after each stump. Your plot may resemble the image below:\n", "\n", "\n", "\n", "## Instructions:\n", "\n", "- Read the dataset `boostingclassifier.csv` as pandas dataframe and take a quick look.\n", "- All columns except `landtype` are predictors. `landtype` is the response variable.\n", "- Define the AdaBoost classifier from scratch within the function `AdaBoost_scratch`:\n", " - Recall the AdaBoost algorithm from the slides:\n", "\n", "\n", " \n", " - Remember, we can derive the learning rate, $$\\lambda^{(i)}λ(i)$$ , for our iith estimator, $T^{(i)}T(i)$, analytically. \n", " \n", "\n", "\n", " - Note: In the exercise we call $$\\lambda^{(i)}λ(i)$$ the 'estimator weight.' This is because SKLearn's Adaboost implementation has a learning_rate parameter which refers to a global hyperparameter.\n", "- Call the `AdaBoost_scratch` function with the predictor and response variables for 9 stumps.\n", "- Use the helper code provided to visualize the classification decision boundary for the 9 stumps.\n", "\n", "## Hints: \n", "\n", "DecisionTreeClassifier()\n", "A decision tree classifier.\n", "\n", "sklearn.fit()\n", "Builds a model from the training set.\n", "\n", "np.average()\n", "Computes the weighted average along the specified axis.\n", "\n", "np.mean()\n", "Computes the arithmetic mean along the specified axis.\n", "\n", "np.log()\n", "Natural logarithm, element-wise.\n", "\n", "np.exp()\n", "Calculates the exponential of all elements in the input array.\n", "\n", "sklearn.AdaBoostClassifier()\n", "An AdaBoost classifier.\n", "\n", "**Note:** This exercise is **auto-graded and you can make multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from helper import plot_decision_boundary\n", "from matplotlib.colors import ListedColormap\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import AdaBoostClassifier\n", "%matplotlib inline\n", "sns.set_style('white')\n" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Read the dataset as a pandas dataframe\n", "df = pd.read_csv(\"boostingclassifier.csv\")\n", "\n", "# Read the columns latitude and longitude as the predictor variables\n", "X = df[['latitude','longitude']].values\n", "\n", "# Landtype is the response variable\n", "y = df['landtype'].values" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "### edTest(test_response) ###\n", "# update the class labels to appropriate values for AdaBoost\n", "y = ___" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# AdaBoost algorithm implementation from scratch\n", "\n", "def AdaBoost_scratch(X, y, M=10):\n", " '''\n", " X: data matrix of predictors\n", " y: response variable\n", " M: number of estimators (e.g., 'stumps')\n", " '''\n", "\n", " # Initialization of utility variables\n", " N = len(y)\n", " estimator_list = []\n", " y_predict_list = []\n", " estimator_error_list = []\n", " estimator_weight_list = []\n", " 
"    sample_weight_list = []\n", "\n", "    # Initialize the sample weights uniformly\n", "    sample_weight = np.ones(N) / N\n", "\n", "    # Store a copy of the sample weights to a list\n", "    # Q: why do we want to use .copy() here? The implementation will make it clear.\n", "    sample_weight_list.append(sample_weight.copy())\n", "\n", "    # Fit each boosted stump\n", "    # Q: Why might we prefer the variable name '_' here over something like 'm'?\n", "    for _ in range(M):\n", "        # Instantiate a Decision Tree classifier for our stump\n", "        # Note: our stumps should have only a single split\n", "        estimator = ___\n", "\n", "        # Fit the stump on the entire data using the sample_weight variable\n", "        # Hint: check the estimator's documentation for how to use sample weights\n", "        estimator.fit(___)\n", "\n", "        # Predict on the entire data\n", "        y_predict = estimator.predict(X)\n", "\n", "        # Create a binary vector representing the misclassifications\n", "        incorrect = ___\n", "\n", "        # Compute the error as the weighted average of the\n", "        # 'incorrect' vector above using the sample weights\n", "        # Hint: np.average() makes this very simple\n", "        estimator_error = ___\n", "\n", "        # Compute the estimator weight using the estimator error\n", "        # Note: The estimator weight here is referred to as the 'learning rate' in the slides\n", "        estimator_weight = ___\n", "\n", "        # Update the sample weights (un-normalized!)\n", "        # Note: Make use of the '*=' assignment statement\n", "        sample_weight *= ___\n", "\n", "        # Renormalize the sample weights\n", "        # Note: Make use of the '/=' assignment statement\n", "        sample_weight /= ___\n", "\n", "        # Save the iteration values\n", "        estimator_list.append(estimator)\n", "        y_predict_list.append(y_predict.copy())\n", "        estimator_error_list.append(estimator_error.copy())\n", "        estimator_weight_list.append(estimator_weight.copy())\n", "        sample_weight_list.append(sample_weight.copy())\n", "\n", "    # Convert to numpy arrays for convenience\n", "    estimator_list = np.asarray(estimator_list)\n", "    y_predict_list = np.asarray(y_predict_list)\n", "    estimator_error_list = np.asarray(estimator_error_list)\n", "    estimator_weight_list = np.asarray(estimator_weight_list)\n", "    sample_weight_list = np.asarray(sample_weight_list)\n", "\n", "    # Compute the final predictions as the sign of the weighted vote of the stumps\n", "    # Q: Why do we want to use np.sign() here?\n", "    preds = (np.array([np.sign((y_predict_list[:, point] * \\\n", "             estimator_weight_list).sum()) for point in range(N)]))\n", "\n", "    # Return the estimators, estimator weights, sample weights, and predictions\n", "    return estimator_list, estimator_weight_list, sample_weight_list, preds\n" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "### edTest(test_adaboost) ###\n", "# Call the AdaBoost function to perform boosting classification\n", "estimator_list, estimator_weight_list, sample_weight_list, preds = \\\n", "    AdaBoost_scratch(X, y, M=9)\n", "\n", "# Calculate the model's accuracy from the predictions returned above\n", "accuracy = ___\n", "print(f'accuracy: {accuracy:.3f}')" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Helper code to plot the decision boundary of each AdaBoost stump\n", "fig = plt.figure(figsize=(16, 16))\n", "for m in range(9):\n", "    fig.add_subplot(3, 3, m+1)\n", "    s_weights = (sample_weight_list[m, :] / sample_weight_list[m, :].sum()) * 300\n", "    plot_decision_boundary(estimator_list[m], X, y, N=50, scatter_weights=s_weights, counter=m)\n", "    plt.tight_layout()\n"
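] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optional sanity check (not part of the graded exercise): once `AdaBoost_scratch` is filled in, we can recompute the weighted-vote prediction using only the first $m$ stumps and watch the training accuracy evolve as stumps are added. The sketch below assumes the labels in `y` have already been converted to $\\pm 1$ as required above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: staged training accuracy of the scratch ensemble\n", "# Assumes AdaBoost_scratch has been completed and y contains +/-1 labels\n", "stump_preds = np.array([est.predict(X) for est in estimator_list])\n", "for m in range(1, len(estimator_list) + 1):\n", "    # Weighted vote of the first m stumps, thresholded by sign\n", "    staged = np.sign((estimator_weight_list[:m, None] * stump_preds[:m]).sum(axis=0))\n", "    print(f'stumps: {m}, accuracy: {np.mean(staged == y):.3f}')\n"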
] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "# Use sklearn's AdaBoostClassifier to take a look at the final decision boundary\n", "\n", "# Initialize the model with a single-split decision tree as the base model, same as above\n", "# Use SAMME as the algorithm and 9 estimators\n", "boost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),\n", "                           algorithm='SAMME', n_estimators=9)\n", "\n", "# Fit on the entire data\n", "boost.fit(X, y)\n", "\n", "# Call the plot_decision_boundary function to plot the decision boundary of the model\n", "plot_decision_boundary(boost, X, y, N=50)\n", "\n", "plt.title('AdaBoost Decision Boundary', fontsize=16)\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ How does the `n_estimators` parameter affect the model?" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Type your answer within the quotes given\n", "answer1 = '___'\n" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }