{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Title\n",
"\n",
"**Exercise - Decision Tree Classification**\n",
"\n",
"# Description\n",
"\n",
"The goal of the exercise is to get comfortable using decision trees for classification in `sklearn`. Eventually, you will produce a plot similar to the one given below:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instructions: \n",
"\n",
"We are trying to predict the winner of the 2016 Presidential election (Trump vs. Clinton) in each county in the US. To do this, we will consider several predictors including `minority`: the percentage of residents that are minorities and `bachelor`: the percentage of resident adults with a bachelor's degree (or higher). We will perform the following tasks\n",
"\n",
"Read and explore the data set\n",
"- Fit, visualize, and interpret a tree with 1 predictor\n",
"- Fit, visualize, and interpret a tree with 2 predictors\n",
"- Fit, visualize, interpret, and CV a best `max_depth` for a tree with many predictors\n",
"\n",
"# Hints:\n",
"\n",
"sklearn.DecisionTreeClassifier() : Generates a Logistic Regression classifier\n",
"\n",
"sklearn.score() : Accuracy classification score.\n",
"\n",
"matplotlib.contourf() : Accuracy classification score\n",
"\n",
"**Note: This exercise is auto-graded and you can try multiple attempts.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import sklearn as sk\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn import tree\n",
"\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"pd.set_option('display.width', 100)\n",
"pd.set_option('display.max_columns', 20)\n",
"plt.rcParams[\"figure.figsize\"] = (12,8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 0: Reading and Exploring the data \n",
"\n",
"We will be using the `county_election` dataset (provided separately as train and test versions for you) to model the outcome of the 2016 presidential election (Did Trump or Clinton win each county?) from various predictors.\n",
"\n",
"We start by reading in the datasets for you and visualizing the main predictors for now: `minority`:\n",
"\n",
"**Important note: use the training dataset for all exploratory analysis and model fitting. Only use the test dataset to evaluate and compare models.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elect_train = pd.read_csv(\"data/county_election_train.csv\")\n",
"elect_test = pd.read_csv(\"data/county_election_test.csv\")\n",
"elect_train.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# let's create the response variable and summarize it\n",
"\n",
"y_train = 1*(elect_train['trump']>elect_train['clinton'])\n",
"y_test = 1*(elect_test['trump']>elect_test['clinton'])\n",
"\n",
"print(\"The proportion of counties that favored Trump over Clinton in 2016 was:\",'%.4g' % np.mean(y_train) )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the main predictor's distribution via boxplots: and consider what the log-transformed version of it looks like:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,6])\n",
"\n",
"ax1.boxplot([elect_train.loc[y_train==0]['minority'],\n",
" elect_train.loc[y_train==1]['minority']],\n",
" labels=(\"Clinton\",\"Trump\"))\n",
"ax1.set_ylabel(\"Proportion of residents that are minorities\")\n",
"\n",
"ax2.boxplot([np.log(elect_train.loc[y_train==0]['minority']),\n",
" np.log(elect_train.loc[y_train==1]['minority'])],\n",
" labels=(\"Clinton\",\"Trump\"))\n",
"ax2.set_ylabel(\"Proportion of residents that are minorities\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q0.1** How would you describe the distribution of the variable `minority`? What issues does this create in logistic regression, $k$-NN, and Decision Trees? How can these issues be fixed? Which of the two versions of 'minority' would be a better choice to use as a predictor for inference? For prediction?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Decision Trees\n",
"\n",
"We could use a simple Decision Tree regressor to predict votergap. That's not the aim of this lab, so we'll run a few of these models without any cross-validation or 'regularization' just to illustrate what is going on.\n",
"\n",
"This is what you ought to keep in mind about decision trees.\n",
"\n",
"from the docs:\n",
"```\n",
"max_depth : int or None, optional (default=None)\n",
"The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.\n",
"min_samples_split : int, float, optional (default=2)\n",
"```\n",
"\n",
"- The deeper the tree, the more prone you are to overfitting.\n",
"- The smaller `min_samples_split`, the more the overfitting. One may use `min_samples_leaf` instead. More samples per leaf, the higher the bias (aka, simpler, underfit model).\n",
"\n",
"Below we fit 2 decision treees that limit the `max_depth`: a single split, and one with depth of 3 (resulting in 8 leaves)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elect_train['logminority'] = np.log(elect_train['minority'])\n",
"elect_test['logminority'] = np.log(elect_test['minority'])\n",
"\n",
"dummy_x = np.arange(np.min(elect_train['minority']),np.max(elect_train['minority']),0.01)\n",
"\n",
"plt.plot(elect_train['minority'],y_train,'.')\n",
"\n",
"for i in [1,3]:\n",
" dtree = DecisionTreeClassifier(max_depth=i)\n",
" dtree.fit(elect_train[['minority']],y_train)\n",
" plt.plot(dummy_x , dtree.predict(dummy_x.reshape(-1,1)), label=(\"Classifications, max depth =\"+str(i)), alpha=0.5, lw=4)\n",
" plt.plot(dummy_x , dtree.predict_proba(dummy_x.reshape(-1,1))[:,1], label=(\"Probabilities, max depth =\"+str(i)), alpha=0.5, lw=2)\n",
"\n",
"plt.legend();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the actual decision tree can be *printed out* using [sklearn.tree.plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import tree\n",
"\n",
"plt.figure(figsize=(16,8))\n",
"tree.plot_tree(dtree, filled=True)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.1** Interpret the printed out tree above: how does it match the scatterplot visualization of the tree?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.2** Play around with the various arguments to define the complexity of the decision tree: `max_depth`,`min_samples_split`, and `min_samples_leaf` (do 1 at a time for now, you can use multiple of these arguments). Roughly, at what point do these start to overfit?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.plot(elect_train['minority'],y_train,'.')\n",
"\n",
"for i in [1,30,100]:\n",
" dtree = DecisionTreeClassifier(min_samples_leaf=i)\n",
" dtree.fit(elect_train[['minority']],y_train)\n",
" plt.plot(dummy_x , dtree.predict(dummy_x.reshape(-1,1)), label=(\"min leaf size =\"+str(i)), alpha=0.8, lw=4)\n",
"\n",
"plt.legend();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tree.plot_tree(dtree, filled=True)\n",
"plt.show()"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"*your answer here*\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take this to the 2-dimensional feature/predictor set: also include `bachelor`\" the proportion of residents with at least a bachelor's degree. Let's start by visualizing the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(elect_train['minority'][y_train==1], elect_train['bachelor'][y_train==1],marker=\".\",color=\"green\",label=\"Trump\")\n",
"plt.scatter(elect_train['minority'][y_train==0], elect_train['bachelor'][y_train==0],marker=\".\",color=\"purple\",label=\"Clinton\")\n",
"plt.xlabel(\"minority\")\n",
"plt.ylabel(\"bachelor\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.3** Based on the scatterplot above, does there appear to be good separability between the two classes? If you were to create a single box around the points to separate the 2 classes, where would you draw the box (a decision tree with `max_depth=2`?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.4** Create two decision tree classifiers below: one with `max_depth=2` and one with `max_depth=10`?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_dtrees) ###\n",
"\n",
"dtree2 = ___\n",
"dtree10 = ___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot the decision boundaries for these two trees (code provided for you below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x1_min, x1_max = elect_train['minority'].min() - 1, elect_train['minority'].max() + 1\n",
"x2_min, x2_max = elect_train['bachelor'].min() - 1, elect_train['bachelor'].max() + 1\n",
"x1x, x2x = np.meshgrid(np.arange(x1_min, x1_max, 0.1),\n",
" np.arange(x2_min, x2_max, 0.1))\n",
"\n",
"yhat2 = dtree2.predict(np.c_[x1x.ravel(), x2x.ravel()]).reshape(x1x.shape)\n",
"yhat10 = dtree10.predict(np.c_[x1x.ravel(), x2x.ravel()]).reshape(x1x.shape)\n",
"\n",
"\n",
"fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,6])\n",
"\n",
"ax1.contourf(x1x, x2x, yhat2, alpha=0.2,cmap=\"PiYG\");\n",
"ax1.scatter(elect_train['minority'][y_train==1], elect_train['bachelor'][y_train==1],marker=\".\",color=\"green\",label=\"Trump\")\n",
"ax1.scatter(elect_train['minority'][y_train==0], elect_train['bachelor'][y_train==0],marker=\".\",color=\"purple\",label=\"Clinton\")\n",
"\n",
"ax1.set_xlabel(\"minority\")\n",
"ax1.set_ylabel(\"bachelor\")\n",
"ax1.set_title(\"Decision Tree with max_depth=2\")\n",
"ax1.legend()\n",
"\n",
"ax2.contourf(x1x, x2x, yhat10, alpha=0.2,cmap=\"PiYG\");\n",
"ax2.scatter(elect_train['minority'][y_train==1], elect_train['bachelor'][y_train==1],marker=\".\",color=\"green\",label=\"Trump\")\n",
"ax2.scatter(elect_train['minority'][y_train==0], elect_train['bachelor'][y_train==0],marker=\".\",color=\"purple\",label=\"Clinton\")\n",
"\n",
"ax2.set_xlabel(\"minority\")\n",
"ax2.set_ylabel(\"bachelor\")\n",
"ax2.set_title(\"Decision Tree with max_depth=10\")\n",
"ax2.legend()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.4** How do these trees compare? Is there clear over or under fitting for either of these tree?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"*your answer here*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.5** A larger `X_train` feature set is defined below with 8 predictors. Fit a decision tree with `max_depth = 15` to this feature set and calculate the accuracy score on both the train and test sets. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_dtree15) ###\n",
"\n",
"X_train = elect_train[['minority', 'density','hispanic','obesity','female','income','bachelor','inactivity']]\n",
"X_test = elect_test[['minority', 'density','hispanic','obesity','female','income','bachelor','inactivity']]\n",
"\n",
"dtree15 = ___\n",
"\n",
"dtree15_train_acc = ___\n",
"dtree15_test_acc = ___\n",
"print(\"Train accuracy =\", float('%.4g' % dtree15_train_acc),\"\\n Test accuracy =\",float('%.4g' % dtree15_test_acc))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two plots are provided for you below to aid in interpreting this model (well, you have to fix the second one):\n",
"\n",
"1. The `feature_importances_` the measures the total improvement (reduction) of the cost/loss/criterion every time that feature defines a split. Note: the default is `criterion='gini`.\n",
"\n",
"2. A \"predicted probability plot\" to get a very rough idea as to what the model is saying about how the chances of a county voting for Trump in 2016 were related to `minority`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.Series(dtree15.feature_importances_,index=list(X_train)).sort_values().plot(kind=\"barh\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.6** Fix the spaghetti plot below so that it is at least a little interpretable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_spaghetti) ###\n",
"\n",
"###Fix this spaghetti plot! Use `np.argsort`\n",
"\n",
"phat15 = dtree15.predict_proba(X_train)[:,1]\n",
"order = ___\n",
"\n",
"minority_sorted = ___\n",
"phat15_sorted = ___\n",
"\n",
"\n",
"plt.scatter(X_train['minority'],y_train)\n",
"plt.plot(minority_sorted,phat15_sorted,alpha=0.5)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Q1.7** Perform 5-fold cross-validation to determine what the best `max_depth` would be for a single regression tree using the entire `X_train` feature set defined below. Visualize the results with mean +/- 2 sd's across the validation sets. Interpret the result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(109)\n",
"\n",
"depths = list(range(1, 21))\n",
"train_scores = []\n",
"cvmeans = []\n",
"cvstds = []\n",
"cv_scores = []\n",
"for depth in depths:\n",
" dtree = DecisionTreeClassifier(max_depth=___)\n",
" # Perform 5-fold cross validation and store results\n",
" train_scores.append(dtree.fit(___,___).score(___,___))\n",
" scores = cross_val_score(estimator=___, X=___, y=___, cv=___)\n",
" cvmeans.append(scores.mean())\n",
" cvstds.append(scores.std())\n",
"\n",
"cvmeans = np.array(cvmeans)\n",
"cvstds = np.array(cvstds)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot means and shade the 2*SD interval\n",
"plt.plot(depths, cvmeans, '*-', label=\"Mean CV\")\n",
"plt.fill_between(depths, cvmeans - 2*cvstds, cvmeans + 2*cvstds, alpha=0.3)\n",
"ylim = plt.ylim()\n",
"plt.plot(depths, train_scores, '-+', label=\"Train\")\n",
"plt.legend()\n",
"plt.ylabel(\"Accuracy\")\n",
"plt.xlabel(\"Max Depth\")\n",
"plt.xticks(depths);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*you answer here*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}