Title

Exercise - Decision Tree Classification

Description

The goal of the exercise is to get comfortable using decision trees for classification in sklearn. Eventually, you will produce a plot similar to the one given below:

Instructions:

We are trying to predict the winner of the 2016 Presidential election (Trump vs. Clinton) in each county in the US. To do this, we will consider several predictors including minority: the percentage of residents that are minorities and bachelor: the percentage of resident adults with a bachelor's degree (or higher). We will perform the following tasks:

  • Read and explore the data set
  • Fit, visualize, and interpret a tree with 1 predictor
  • Fit, visualize, and interpret a tree with 2 predictors
  • Fit, visualize, interpret, and CV a best max_depth for a tree with many predictors

Hints:

sklearn.tree.DecisionTreeClassifier() : Generates a decision tree classifier

sklearn.score() : Accuracy classification score.

matplotlib.pyplot.contourf() : Plots filled contours
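
As a minimal sketch of how the first two hints fit together, the snippet below fits a single-split tree and reports its accuracy. It uses a small synthetic dataset rather than the election data, and the names X_toy, y_toy, and toy_tree are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic one-feature data (an assumption for illustration only):
# the class is 1 whenever the feature exceeds 0.5
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 1))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

# Fit a depth-1 tree (a single split) and compute its accuracy
toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(X_toy, y_toy)
toy_acc = toy_tree.score(X_toy, y_toy)  # mean accuracy on the given data
print("Training accuracy:", toy_acc)
```

Because the synthetic classes are separated by one threshold, a single split is enough here; real data will rarely be this clean.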

Note: This exercise is auto-graded and you can try multiple attempts.

In [ ]:
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

from sklearn.model_selection import cross_val_score

pd.set_option('display.width', 100)
pd.set_option('display.max_columns', 20)
plt.rcParams["figure.figsize"] = (12,8)

Part 0: Reading and Exploring the data

We will be using the county_election dataset (provided separately as train and test versions for you) to model the outcome of the 2016 presidential election (Did Trump or Clinton win each county?) from various predictors.

We start by reading in the datasets for you and visualizing the main predictor for now: minority.

Important note: use the training dataset for all exploratory analysis and model fitting. Only use the test dataset to evaluate and compare models.

In [ ]:
elect_train = pd.read_csv("data/county_election_train.csv")
elect_test = pd.read_csv("data/county_election_test.csv")
elect_train.head()
In [ ]:
# let's create the response variable and summarize it

y_train = 1*(elect_train['trump']>elect_train['clinton'])
y_test = 1*(elect_test['trump']>elect_test['clinton'])

print("The proportion of counties that favored Trump over Clinton in 2016 was:",'%.4g' % np.mean(y_train) )

Let's look at the main predictor's distribution via boxplots, and consider what the log-transformed version of it looks like:

In [ ]:
fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,6])

ax1.boxplot([elect_train.loc[y_train==0]['minority'],
             elect_train.loc[y_train==1]['minority']],
             labels=("Clinton","Trump"))
ax1.set_ylabel("Proportion of residents that are minorities")

ax2.boxplot([np.log(elect_train.loc[y_train==0]['minority']),
             np.log(elect_train.loc[y_train==1]['minority'])],
             labels=("Clinton","Trump"))
ax2.set_ylabel("Log of proportion of residents that are minorities")
plt.show()

Q0.1 How would you describe the distribution of the variable minority? What issues does this create in logistic regression, $k$-NN, and Decision Trees? How can these issues be fixed? Which of the two versions of 'minority' would be a better choice to use as a predictor for inference? For prediction?

your answer here


Part 1: Decision Trees

We could use a simple Decision Tree regressor to predict votergap. That's not the aim of this lab, so we'll run a few of these models without any cross-validation or 'regularization' just to illustrate what is going on.

This is what you ought to keep in mind about decision trees.

from the docs:

max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node.

  • The deeper the tree, the more prone you are to overfitting.
  • The smaller min_samples_split, the greater the risk of overfitting. One may use min_samples_leaf instead: the more samples required per leaf, the higher the bias (i.e., a simpler, possibly underfit model).
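
The effect of these arguments on tree size can be sketched as follows. This is an illustration on synthetic noisy data (not the election data); the names X_demo, y_demo, settings, and summary are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): one feature,
# with about 20% of the labels flipped at random to add noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 1))
y_demo = (X_demo[:, 0] > 0).astype(int)
flip = rng.random(300) < 0.2
y_demo[flip] = 1 - y_demo[flip]

# Compare an unrestricted tree with two restricted ones
settings = [{"max_depth": None}, {"max_depth": 3}, {"min_samples_leaf": 30}]
summary = []
for params in settings:
    model = DecisionTreeClassifier(random_state=0, **params).fit(X_demo, y_demo)
    summary.append((params, model.get_depth(), model.get_n_leaves(),
                    model.score(X_demo, y_demo)))
    print(params, "depth:", model.get_depth(),
          "leaves:", model.get_n_leaves(),
          "train accuracy:", round(model.score(X_demo, y_demo), 3))
```

The unrestricted tree keeps splitting until every leaf is pure, so it memorizes the flipped labels; the restricted trees cannot, which is the bias/variance trade-off described above.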

Below we fit 2 decision trees that limit the max_depth: one with a single split, and one with a depth of 3 (resulting in at most 8 leaves).

In [ ]:
elect_train['logminority'] = np.log(elect_train['minority'])
elect_test['logminority'] = np.log(elect_test['minority'])

dummy_x = np.arange(np.min(elect_train['minority']),np.max(elect_train['minority']),0.01)

plt.plot(elect_train['minority'],y_train,'.')

for i in [1,3]:
    dtree = DecisionTreeClassifier(max_depth=i)
    dtree.fit(elect_train[['minority']],y_train)
    plt.plot(dummy_x , dtree.predict(dummy_x.reshape(-1,1)), label=("Classifications,  max depth ="+str(i)), alpha=0.5, lw=4)
    plt.plot(dummy_x , dtree.predict_proba(dummy_x.reshape(-1,1))[:,1], label=("Probabilities, max depth ="+str(i)), alpha=0.5, lw=2)

plt.legend();

And the actual decision tree can be printed out using sklearn.tree.plot_tree:

In [ ]:
from sklearn import tree

plt.figure(figsize=(16,8))
tree.plot_tree(dtree, filled=True)
plt.show()

Q1.1 Interpret the printed out tree above: how does it match the scatterplot visualization of the tree?

your answer here

Q1.2 Play around with the various arguments that define the complexity of the decision tree: max_depth, min_samples_split, and min_samples_leaf (change one at a time for now; you can also combine several of these arguments). Roughly, at what point do these start to overfit?

In [ ]:
plt.plot(elect_train['minority'],y_train,'.')

for i in [1,30,100]:
    dtree = DecisionTreeClassifier(min_samples_leaf=i)
    dtree.fit(elect_train[['minority']],y_train)
    plt.plot(dummy_x , dtree.predict(dummy_x.reshape(-1,1)), label=("min leaf size ="+str(i)), alpha=0.8, lw=4)

plt.legend();
In [ ]:
tree.plot_tree(dtree, filled=True)
plt.show()
*your answer here*

Let's take this to the 2-dimensional feature/predictor set: also include bachelor, the proportion of residents with at least a bachelor's degree. Let's start by visualizing the data:

In [ ]:
plt.scatter(elect_train['minority'][y_train==1], elect_train['bachelor'][y_train==1],marker=".",color="green",label="Trump")
plt.scatter(elect_train['minority'][y_train==0], elect_train['bachelor'][y_train==0],marker=".",color="purple",label="Clinton")
plt.xlabel("minority")
plt.ylabel("bachelor")
plt.legend()
plt.show()

Q1.3 Based on the scatterplot above, does there appear to be good separability between the two classes? If you were to create a single box around the points to separate the 2 classes (a decision tree with max_depth=2), where would you draw the box?

your answer here

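Q1.4 Create two decision tree classifiers below: one with max_depth=2 and one with max_depth=10. Fit both on the two predictors minority and bachelor (in that order), since the plotting code that follows predicts on that pair.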

In [ ]:
### edTest(test_dtrees) ###

dtree2 = ___
dtree10 = ___

Let's plot the decision boundaries for these two trees (code provided for you below).

In [ ]:
x1_min, x1_max = elect_train['minority'].min() - 1, elect_train['minority'].max() + 1
x2_min, x2_max = elect_train['bachelor'].min() - 1, elect_train['bachelor'].max() + 1
x1x, x2x = np.meshgrid(np.arange(x1_min, x1_max, 0.1),
                     np.arange(x2_min, x2_max, 0.1))

yhat2 = dtree2.predict(np.c_[x1x.ravel(), x2x.ravel()]).reshape(x1x.shape)
yhat10 = dtree10.predict(np.c_[x1x.ravel(), x2x.ravel()]).reshape(x1x.shape)


fig, (ax1,ax2) = plt.subplots(1,2, figsize=[15,6])

ax1.contourf(x1x, x2x, yhat2, alpha=0.2,cmap="PiYG");
ax1.scatter(elect_train['minority'][y_train==1], elect_train['bachelor'][y_train==1],marker=".",color="green",label="Trump")
ax1.scatter(elect_train['minority'][y_train==0], elect_train['bachelor'][y_train==0],marker=".",color="purple",label="Clinton")

ax1.set_xlabel("minority")
ax1.set_ylabel("bachelor")
ax1.set_title("Decision Tree with max_depth=2")
ax1.legend()

ax2.contourf(x1x, x2x, yhat10, alpha=0.2,cmap="PiYG");
ax2.scatter(elect_train['minority'][y_train==1], elect_train['bachelor'][y_train==1],marker=".",color="green",label="Trump")
ax2.scatter(elect_train['minority'][y_train==0], elect_train['bachelor'][y_train==0],marker=".",color="purple",label="Clinton")

ax2.set_xlabel("minority")
ax2.set_ylabel("bachelor")
ax2.set_title("Decision Tree with max_depth=10")
ax2.legend()
plt.show()

Q1.4 How do these trees compare? Is there clear overfitting or underfitting for either of these trees?

In [ ]:
*your answer here*

Q1.5 A larger X_train feature set is defined below with 8 predictors. Fit a decision tree with max_depth = 15 to this feature set and calculate the accuracy score on both the train and test sets.

In [ ]:
### edTest(test_dtree15) ###

X_train = elect_train[['minority', 'density','hispanic','obesity','female','income','bachelor','inactivity']]
X_test = elect_test[['minority', 'density','hispanic','obesity','female','income','bachelor','inactivity']]

dtree15 = ___

dtree15_train_acc = ___
dtree15_test_acc = ___
print("Train accuracy =", float('%.4g' % dtree15_train_acc),"\n Test accuracy =",float('%.4g' % dtree15_test_acc))

Two plots are provided for you below to aid in interpreting this model (well, you have to fix the second one):

  1. The feature_importances_ attribute, which measures the total improvement (reduction) in the cost/loss/criterion over every split that the feature defines. Note: the default is criterion='gini'.

  2. A "predicted probability plot" to get a very rough idea as to what the model is saying about how the chances of a county voting for Trump in 2016 were related to minority.
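
To make the first item concrete, here is a small sketch of reading feature_importances_ from a fitted tree. It uses synthetic data with one informative column and one pure-noise column; the names X_syn, y_syn, syn_tree, and importances are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): the response depends
# only on the "signal" column; "noise" is unrelated to it
rng = np.random.default_rng(2)
X_syn = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y_syn = (X_syn["signal"] > 0).astype(int)

syn_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)

# One importance per column, labeled with the column names
importances = pd.Series(syn_tree.feature_importances_, index=X_syn.columns)
print(importances)
```

sklearn normalizes these values so that they sum to 1, which means they describe each feature's relative share of the impurity reduction rather than an absolute effect size.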

In [ ]:
pd.Series(dtree15.feature_importances_,index=list(X_train)).sort_values().plot(kind="barh");

Q1.6 Fix the spaghetti plot below so that it is at least a little interpretable.
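
A line plot connects points in the order they are given, so unsorted x-values produce the tangle. As a generic sketch of what np.argsort does (toy arrays only; x, p, and the sorted names are hypothetical):

```python
import numpy as np

# Toy paired arrays: p[i] belongs with x[i]
x = np.array([0.7, 0.1, 0.4])
p = np.array([0.9, 0.2, 0.5])

# argsort returns the indices that would sort x in ascending order
order = np.argsort(x)

# Indexing both arrays with the same order keeps the pairs aligned
x_sorted = x[order]
p_sorted = p[order]
print(order, x_sorted, p_sorted)
```

If x is a pandas Series rather than a NumPy array, positional indexing such as x.iloc[order] is the safer choice, because plain square brackets may be interpreted as index labels.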

In [ ]:
### edTest(test_spaghetti) ###

###Fix this spaghetti plot! Use `np.argsort`

phat15 = dtree15.predict_proba(X_train)[:,1]
order = ___

minority_sorted = ___
phat15_sorted = ___


plt.scatter(X_train['minority'],y_train)
plt.plot(minority_sorted,phat15_sorted,alpha=0.5)
plt.show()

Q1.7 Perform 5-fold cross-validation to determine what the best max_depth would be for a single classification tree using the entire X_train feature set defined above. Visualize the results with the mean +/- 2 SDs across the validation sets. Interpret the result.
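
As a rough sketch of the cross-validation pattern, the snippet below scores a few candidate depths on synthetic data (not the election data). The names X_cv, y_cv, candidate_depths, and cv_summary are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): four features,
# only the first one carries signal, plus some label noise
rng = np.random.default_rng(3)
X_cv = rng.normal(size=(400, 4))
y_cv = (X_cv[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

candidate_depths = [1, 3, 10]
cv_summary = {}
for d in candidate_depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=0)
    # cross_val_score refits a clone of the model on each of the 5 splits
    # and returns one validation accuracy per fold
    fold_scores = cross_val_score(model, X_cv, y_cv, cv=5)
    cv_summary[d] = (fold_scores.mean(), fold_scores.std())
    print("depth", d, "mean:", round(fold_scores.mean(), 3),
          "sd:", round(fold_scores.std(), 3))
```

Note that cross_val_score works on copies of the estimator, so the model passed in is not fitted afterwards; a separate fit call is needed to obtain a training score.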

In [ ]:
np.random.seed(109)

depths = list(range(1, 21))
train_scores = []
cvmeans = []
cvstds = []
cv_scores = []
for depth in depths:
    dtree = DecisionTreeClassifier(max_depth=___)
    # Perform 5-fold cross validation and store results
    train_scores.append(dtree.fit(___,___).score(___,___))
    scores = cross_val_score(estimator=___, X=___, y=___, cv=___)
    cvmeans.append(scores.mean())
    cvstds.append(scores.std())

cvmeans = np.array(cvmeans)
cvstds = np.array(cvstds)
In [ ]:
# plot means and shade the 2*SD interval
plt.plot(depths, cvmeans, '*-', label="Mean CV")
plt.fill_between(depths, cvmeans - 2*cvstds, cvmeans + 2*cvstds, alpha=0.3)
ylim = plt.ylim()
plt.plot(depths, train_scores, '-+', label="Train")
plt.legend()
plt.ylabel("Accuracy")
plt.xlabel("Max Depth")
plt.xticks(depths);

your answer here

In [ ]: