CS109A Introduction to Data Science

Lab 9: Decision Trees, Bagged Trees, Random Forests and Boosting - Solutions

Harvard University
Fall 2018
Instructors: Pavlos Protopapas and Kevin Rader
Lab Instructor: Kevin Rader (today at least)
Authors: Kevin Rader, Rahul Dave


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

We will look here into the practicalities of fitting regression trees, random forests, and boosted trees. These involve out-of-bag estimates and cross-validation, and how you might want to deal with hyperparameters in these models. Along the way we will play a little bit with different loss functions, so that you start thinking about what, in general, goes into cooking up a machine learning model.

In [2]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn.apionly as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Dataset

First, the data. We will be attempting to predict the 2016 presidential election results (at the county level), measured as 'votergap' = (trump - clinton) in percentage points, based mostly on demographic features of those counties. Let's take a quick peek at the data:

In [5]:
elect_df = pd.read_csv("data/county_level_election.csv")
elect_df.head()
Out[5]:
state fipscode county population hispanic minority female unemployed income nodegree bachelor inactivity obesity density cancer votergap trump clinton
0 Colorado 8117 Summit County 27239 15.173 4.918 45.996 2.5 68352 5.4 48.1 8.1 13.1 46.0 46.2 -27.632 31.530 59.162
1 Colorado 8037 Eagle County 53653 30.040 5.169 47.231 3.1 76661 10.1 47.3 9.4 11.8 31.0 47.1 -19.897 36.058 55.955
2 Idaho 16067 Minidoka County 19226 34.070 5.611 49.318 3.7 46332 24.1 11.8 18.3 34.2 80.0 61.8 54.148 71.135 16.987
3 Colorado 8113 San Miguel County 7558 10.154 4.747 46.808 3.7 59603 4.7 54.4 12.4 16.7 5.7 62.6 -44.769 23.892 68.662
4 Utah 49051 Wasatch County 21600 13.244 4.125 48.812 3.4 65207 9.5 34.4 13.9 23.0 257.8 68.3 25.357 50.471 25.114
In [6]:
from sklearn.model_selection import train_test_split
In [7]:
# split 80/20 train-test
X = elect_df[['population','hispanic','minority','female','unemployed','income','nodegree','bachelor','inactivity','obesity','density','cancer']]
response = elect_df['votergap']
Xtrain, Xtest, ytrain, ytest = train_test_split(X,response,test_size=0.2)
In [8]:
plt.hist(ytrain)
Xtrain.hist(column=['minority', 'population','hispanic','female']);

How would you describe these variables?

In [9]:
print(elect_df.shape)
print(Xtrain.shape)
print(Xtest.shape)
(3066, 18)
(2452, 12)
(614, 12)

General Trees

We could use a simple Decision Tree regressor to predict votergap. That's not the aim of this lab, so we'll run a few of these models without any cross-validation or 'regularization', just to illustrate what is going on.

This is what you ought to keep in mind about decision trees.

from the docs:

max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node.
  • The deeper the tree, the more prone you are to overfitting.
  • The smaller min_samples_split, the more the overfitting. One may use min_samples_leaf instead: the more samples required per leaf, the higher the bias.
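As an aside, here is a minimal sketch of my own (not part of the lab) showing what varying min_samples_leaf alone does on the 'minority' predictor: with a tiny leaf size the tree chases the noise, and larger leaves smooth the fit out (more bias, less variance). The names x_sk and y_sk are just for this sketch.

from sklearn.tree import DecisionTreeRegressor

# Sketch: effect of min_samples_leaf on a one-predictor tree.
x_sk = Xtrain['minority'].values
y_sk = ytrain.values
order = np.argsort(x_sk)
x_sk, y_sk = x_sk[order], y_sk[order]

plt.plot(x_sk, y_sk, '.')
for leaf in [1, 10, 50, 200]:
    tree = DecisionTreeRegressor(min_samples_leaf=leaf)
    tree.fit(x_sk.reshape(-1, 1), y_sk)
    plt.plot(x_sk, tree.predict(x_sk.reshape(-1, 1)), label="min_samples_leaf=%d" % leaf, lw=2)
plt.legend();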
In [10]:
from sklearn.tree import DecisionTreeRegressor
#x = np.arange(0, 2*np.pi, 0.1)
#y = np.sin(x) + 0.1*np.random.normal(size=x.shape[0])
x = Xtrain['minority'].values
o = np.argsort(x)
x = x[o]
y = ytrain.values
y = y[o]
In [11]:
plt.plot(x,y, '.');
In [12]:
plt.plot(np.log(x),y, '.');

**Exercise 0**
Which of the two versions of 'minority' would be a better choice to use as a predictor for inference? For prediction?

In [13]:
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [1,2]:
    dtree = DecisionTreeRegressor(max_depth=i)
    dtree.fit(xx, y)
    plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=1-i/10, lw=4)
plt.legend();
In [14]:
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [500,200,100,20]:
    dtree = DecisionTreeRegressor(min_samples_split=i)
    dtree.fit(xx, y)
    plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=0.8, lw=4)
plt.legend();
In [15]:
plt.plot(np.log(x),y,'.')
xx = np.log(x).reshape(-1,1)
for i in [500,200,100,20]:
    dtree = DecisionTreeRegressor(max_depth=6, min_samples_split=i)
    dtree.fit(xx, y)
    plt.plot(np.log(x), dtree.predict(xx), label=str(i), alpha=0.8, lw=4)
plt.legend();
In [16]:
#let's also include logminority as a predictor going forward
xtemp = np.log(Xtrain['minority'].values)
Xtrain = Xtrain.assign(logminority = xtemp)
Xtest = Xtest.assign(logminority = np.log(Xtest['minority'].values))
Xtrain.head()
Out[16]:
population hispanic minority female unemployed income nodegree bachelor inactivity obesity density cancer logminority
1569 17792 11.214 10.165 49.393 7.2 47781 11.0 22.4 17.0 24.4 12.6 231.8 2.318950
306 34410 2.767 24.961 49.938 5.3 64533 14.4 22.3 22.3 31.2 79.5 154.3 3.217315
1153 13736 5.875 2.856 49.122 5.7 51285 14.2 9.1 32.4 35.3 170.5 213.1 1.049422
934 769644 22.427 50.419 51.923 6.7 54603 16.2 32.3 25.6 28.7 324.4 202.5 3.920368
1415 8882 1.793 3.118 50.663 4.3 43178 8.9 14.0 30.1 32.8 23.0 224.3 1.137192

**Exercise 1**

  1. Perform 5-fold cross-validation to determine what the best max_depth would be for a single regression tree using the entire 'Xtrain' feature set.
  2. Visualize the results with mean +/- 2 sd's across the validation sets.
In [17]:
from sklearn.model_selection import cross_val_score

# your code here
depths = list(range(1, 21))
train_scores = []
cvmeans = []
cvstds = []
cv_scores = []
for depth in depths:
    dtree = DecisionTreeRegressor(max_depth=depth)
    # Perform 5-fold cross validation and store results
    train_scores.append(dtree.fit(Xtrain, ytrain).score(Xtrain, ytrain))
    scores = cross_val_score(estimator=dtree, X=Xtrain, y=ytrain, cv=5)
    cvmeans.append(scores.mean())
    cvstds.append(scores.std())

cvmeans = np.array(cvmeans)
cvstds = np.array(cvstds)
In [18]:
# your code here
# plot means and shade the 2 SD interval
plt.plot(depths, cvmeans, '*-', label="Mean CV")
plt.fill_between(depths, cvmeans - 2*cvstds, cvmeans + 2*cvstds, alpha=0.3)
ylim = plt.ylim()
plt.plot(depths, train_scores, '-+', label="Train")
plt.legend()
plt.ylabel("Accuracy")
plt.xlabel("Max Depth")
plt.xticks(depths);

Based on the plot above, it's clear that the training $R^2$ increases towards one as max_depth increases, while the cross-validation $R^2$ maxes out around 62-63% at a max_depth of about 7-9. Any of those values would be reasonable to choose for a best predictive model. To prevent overfitting and for parsimony, I would choose a max_depth of 7.
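As a quick sanity check (my addition, not part of the original solution), we could refit a single tree at that depth and score it once on the held-out test set; the exact value will vary with the train/test split and the CV run above.

# Refit a single tree at the chosen depth and evaluate it on the test set.
dtree_best = DecisionTreeRegressor(max_depth=7)
dtree_best.fit(Xtrain, ytrain)
print("Test R^2 at max_depth=7:", dtree_best.score(Xtest, ytest))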

OK, with this discussion in mind, let's improve this model with bagging.

Bootstrap-Aggregating (called Bagging)

What's the basic idea?

  • A single decision tree is likely to overfit.
  • So let's introduce replication through bootstrap sampling.
  • Bagging uses bootstrap resampling to create different training datasets. This way each training run will give us a different tree.
  • Added bonus: the left-out points can be used as a natural "validation" set, so there is no need to set aside a separate one.
  • Since we have many trees that we will average over for prediction, we can choose a large max_depth and be OK: we rely on the law of large numbers to shrink the variance of these high-variance, low-bias individual trees.
In [19]:
from sklearn.utils import resample

ntrees = 500
estimators = []
R2s = []
yhats_test = np.zeros((Xtest.shape[0], ntrees))

plt.plot(np.log(x),y,'.')
for i in range(ntrees):
    simpletree = DecisionTreeRegressor(max_depth=3)
    boot_xx, boot_y = resample(Xtrain[['logminority']], ytrain)
    estimators = np.append(estimators,simpletree.fit(boot_xx, boot_y))
    R2s = np.append(R2s,simpletree.score(Xtest[['logminority']], ytest))
    yhats_test[:,i] = simpletree.predict(Xtest[['logminority']])
    plt.plot(np.log(x), simpletree.predict(np.log(x).reshape(-1,1)), 'red', alpha=0.05)
In [20]:
yhats_test.shape
Out[20]:
(614, 500)

**Exercise 2**

  1. Edit the code below (which is just copied from above) to refit many bagged trees on the entire Xtrain feature set (without the plot...lots of predictors now so difficult to plot).
  2. Summarize how each of the separate trees performed (both numerically and visually) using $R^2$ as the metric. How do they perform on average?
  3. Combine the trees into one prediction and evaluate it using $R^2$.
  4. Briefly discuss the results. How will the results above change if 'max_depth' is increased? What if it is decreased?
In [21]:
from sklearn.metrics import r2_score


ntrees = 500
estimators = []
R2s = []
yhats_test = np.zeros((Xtest.shape[0], ntrees))

for i in range(ntrees):
    simpletree = DecisionTreeRegressor(max_depth=4)
    boot_xx, boot_y = resample(Xtrain, ytrain)
    estimators = np.append(estimators,simpletree.fit(boot_xx, boot_y))
    R2s = np.append(R2s,simpletree.score(Xtest, ytest))
    yhats_test[:,i] = simpletree.predict(Xtest)

# your code here
#let's look at how the bagging does:
plt.hist(R2s);
print("Average R-squared score of the individual trees:", np.mean(R2s))

bagged_test_predictions = np.mean(yhats_test, axis=1)
print("R-squared score of the aggregated trees:",r2_score(ytest,bagged_test_predictions))
Average R-squared score of the individual trees: 0.552367431975
R-squared score of the aggregated trees: 0.631678199922
  1. Notice that the aggregated result, obtained by averaging the predictions across the 500 different regression tree models, scores an $R^2$ in the test set of about 63% (see the output above), which outperforms all of the separate individual models. We would expect this to improve further for larger values of max_depth, and in fact it should improve quite a lot. Bagging reduces the variance of 'high variance' models by leveraging the law of large numbers, but does not improve bias nearly as much. Essentially, it is an indirect way of 'smoothing' these discretized step functions by jittering where the jumps occur. See the visual above. If we change max_depth to something larger in the code above, we do see quite an improvement in predictive ability in the test set. A max_depth of 20 led to a test $R^2$ of 78.5%, for example.
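For reference, sklearn packages this resample-and-average loop as BaggingRegressor; below is a minimal sketch (my addition, using the older base_estimator argument name consistent with the rest of this lab) that should behave much like the manual loop above.

from sklearn.ensemble import BaggingRegressor

# Sketch: built-in bagging of depth-4 regression trees, with OOB scoring.
bag = BaggingRegressor(base_estimator=DecisionTreeRegressor(max_depth=4),
                       n_estimators=500, oob_score=True, n_jobs=-1)
bag.fit(Xtrain, ytrain)
print("OOB R^2:", bag.oob_score_)
print("Test R^2:", bag.score(Xtest, ytest))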

Random Forests

What's the basic idea?

Bagging alone does not provide enough randomization, because even after bootstrapping we are mostly training on the same data points with the same variables, and so we retain much of the overfitting.

So we will build each tree by splitting on a "random" subset of predictors at each split (hence, each is a 'random tree'). This can't be done with just one predictor, but with more predictors we can randomly choose which predictors to consider at each split, and how many of them. Then we combine many 'random trees' together by averaging their predictions, and this gets us a forest of random trees: a random forest.

Below we create a hyperparameter grid. We are preparing to use the bootstrap points not used in training for validation.

max_features : int, float, string or None, optional (default="auto")
- The number of features to consider when looking for the best split.
  • max_features: the default considers all the features at each split and is probably prone to overfitting. You'll want to validate on this.
  • You can "validate" on n_estimators (the number of trees) as well, but often you will just look for a plateau in the error as the number of trees grows, as seen below.
  • From decision trees you also get max_depth, min_samples_split, and min_samples_leaf, but you might as well leave those at their defaults to get maximally expanded trees.
In [22]:
from sklearn.ensemble import RandomForestRegressor
In [23]:
# code from 
# Adventures in scikit-learn's Random Forest by Gregory Saunders
from itertools import product
from collections import OrderedDict
param_dict = OrderedDict(
    n_estimators = [400, 600, 800],
    max_features = [0.2, 0.4, 0.6, 0.8]
)

param_dict.values()
Out[23]:
odict_values([[400, 600, 800], [0.2, 0.4, 0.6, 0.8]])

Using the OOB score.

We have been putting "validate" in quotes. This is because the bootstrap gives us left-over points! So we'll now engage in our very own version of a grid-search, done over the out-of-bag scores that sklearn gives us for free

In [24]:
from itertools import product
In [25]:
#make sure ytrain is the correct data type...in case you have warnings
#print(yytrain.shape,ytrain.shape,Xtrain.shape)
#ytrain = np.ravel(ytrain)

# Let's tune the two 'hyperparameters' we based our grid on earlier, using the OOB score
results = {}
estimators= {}
for ntrees, maxf in product(*param_dict.values()):
    params = (ntrees, maxf)
    est = RandomForestRegressor(oob_score=True, 
                                n_estimators=ntrees, max_features=maxf, max_depth=50, n_jobs=-1)
    est.fit(Xtrain, ytrain)
    results[params] = est.oob_score_
    estimators[params] = est
outparams = max(results, key = results.get)
outparams
Out[25]:
(800, 0.6)
In [26]:
rf1 = estimators[outparams]
In [27]:
results
Out[27]:
{(400, 0.2): 0.75545034159631475,
 (400, 0.4): 0.76904520627814343,
 (400, 0.6): 0.7714063389285728,
 (400, 0.8): 0.77093228908586808,
 (600, 0.2): 0.75389417406993875,
 (600, 0.4): 0.77059680071750347,
 (600, 0.6): 0.77163543964275172,
 (600, 0.8): 0.77293158741686452,
 (800, 0.2): 0.75642142962369552,
 (800, 0.4): 0.77153354385823691,
 (800, 0.6): 0.77302008538924816,
 (800, 0.8): 0.77211871587521286}
In [28]:
rf1.score(Xtest, ytest)
Out[28]:
0.77953336933057904

Finally, you can find the feature importance of each predictor in this random forest model. Whenever a feature is used to split a node in a tree in the forest, the algorithm logs the decrease in the splitting criterion (Gini impurity for classification, MSE here for regression). This is accumulated over all trees and reported in est.feature_importances_.
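Note that sklearn normalizes these importances so they sum to 1, so each value can be read as a share of the total impurity reduction. A quick check (my addition):

# Sketch: feature importances as a sorted table; they sum to 1.
importances = pd.Series(rf1.feature_importances_, index=Xtrain.columns).sort_values(ascending=False)
print(importances)
print("Sum:", importances.sum())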

In [29]:
pd.Series(rf1.feature_importances_,index=list(Xtrain)).sort_values().plot(kind="barh")
Out[29]:

Since our response isn't very symmetric, we may want to down-weight the influence of outliers by using mean_absolute_error instead.

In [30]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(ytest, rf1.predict(Xtest))
Out[30]:
11.606474552117264

**Bonus Exercise (we will likely skip this)**

  1. Tune the 'RandomForestRegressor' above to minimize 'mean_absolute_error' instead of the default mean squared error.

Note: sklearn supports this (criterion='mae') since v0.18, but does not have completely arbitrary loss functions for Random Forests.

In [31]:
# your code here

# Note: there were estimation/convergence issues using MAE with CV, 
# so just one example is provided below.
rf_mae = RandomForestRegressor(oob_score=True, criterion='mae',
                                n_estimators=400, max_features=0.4, max_depth=20, n_jobs=-1)
rf_mae.fit(Xtrain, ytrain)
Out[31]:
RandomForestRegressor(bootstrap=True, criterion='mae', max_depth=20,
           max_features=0.4, max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=400, n_jobs=-1,
           oob_score=True, random_state=None, verbose=0, warm_start=False)
In [43]:
#rf_mae = estimators[outparams]
print(mean_absolute_error(ytest, rf_mae.predict(Xtest)))
print(rf_mae.score(Xtest, ytest))
11.7608364149
0.772670131678

This model would be expected to perform better on the test set when measured by mean absolute error, but not as well when measured by mean squared error (though this is not guaranteed). Here we did not cross-validate the hyperparameters, which might explain why its mean absolute error is actually slightly worse.

Note: instead of using OOB scoring, we could do cross-validation (with GridSearchCV); cv=3 is roughly comparable (similar approximate train-vs.-validation set sizes). But this will take much more time, as you are doing the fit 3 times on top of the bootstraps, so at least three times as long!

In [51]:
param_dict2 = OrderedDict(
    n_estimators = [400,600,800],
    max_features = [0.2, 0.4, 0.6, 0.8]
)
In [52]:
from sklearn.model_selection import GridSearchCV
est2 = RandomForestRegressor(oob_score=False)
gs = GridSearchCV(est2, param_grid = param_dict2, cv=3, n_jobs=-1)
gs.fit(Xtrain, ytrain)
Out[52]:
GridSearchCV(cv=3, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=OrderedDict([('n_estimators', [400, 600, 800]), ('max_features', [0.2, 0.4, 0.6, 0.8])]),
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
In [53]:
rf2 = gs.best_estimator_
rf2
Out[53]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=0.6, max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
In [54]:
gs.best_score_
Out[54]:
0.7624526002930303

**Exercise 3: thinking questions**

  1. What are the 3 hyperparameters for a random forest (one of the hyperparameters comes in many flavors)?
  2. Which hyperparameters need to be tuned?

Answers

  1. The 3 hyperparameters are max_features, max_depth (or something related, like min_samples_split, to control the complexity of each tree), and n_estimators (in classification, the splitting criterion is another one you could consider).

  2. max_features and max_depth need to be tuned, while the higher n_estimators the better (though there are clearly diminishing returns, so just use something 'largish' in the hundreds). The best-tuned max_features and max_depth will depend on the number of predictors you are considering, the number of observations in the training set, and whether it is a regression or a classification problem. There are lots of rules of thumb out there, but cross-validate if you can.
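One sketch of such a tuning run (my addition; the grid values are arbitrary and this will be slow), again using the free OOB score:

# Sketch: tune max_depth together with max_features via the OOB score.
results_depth = {}
for maxf, maxd in product([0.2, 0.4, 0.6], [10, 20, None]):
    est = RandomForestRegressor(oob_score=True, n_estimators=400,
                                max_features=maxf, max_depth=maxd, n_jobs=-1)
    est.fit(Xtrain, ytrain)
    results_depth[(maxf, maxd)] = est.oob_score_
print(max(results_depth, key=results_depth.get))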

Seeing error as a function of the proportion of predictors at each split

Let's look at how max_features affects the out-of-bag error as the number of trees grows.

In [55]:
# from http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html

feats = param_dict['max_features']
# 
error_rate = OrderedDict((label, []) for label in feats)

# Range of `n_estimators` values to explore.
min_estimators = 200
step_estimators = 100
num_steps = 6
max_estimators = min_estimators + step_estimators*num_steps
for label in feats:
    for i in range(min_estimators, max_estimators+1, step_estimators):
        clf = RandomForestRegressor(oob_score=True, max_features=label)
        clf.set_params(n_estimators=i)
        clf.fit(Xtrain, ytrain)

        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))
In [56]:
print(feats)
[0.2, 0.4, 0.6, 0.8]
In [57]:
# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()

Boosting Regression Trees

AdaBoost classification, which you will be doing in your homework, is a special case of a gradient-boosting algorithm. Gradient boosting is very much state of the art, and has major connections to logistic regression, gradient descent in a functional space, and search in information space. See Schapire and Freund's MIT Press book for details (Google is a wonderful thing).

But briefly, let us cover the idea here. The idea is that we will use a bunch of weak 'learners' (aka models) that are fit sequentially. The first one fits the signal, the second fits the first model's residuals, the third the second model's residuals, and so on. At each stage we upweight the places where the previous model did badly. First, let us illustrate.
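To make the residual-fitting idea concrete before the AdaBoost demo, here is a bare-bones, hand-rolled boosting sketch for squared loss (my own illustration, reusing the single log-minority predictor; AdaBoost itself reweights points rather than fitting residuals directly):

# Sketch: hand-rolled boosting with squared loss.
# Each stage fits a stump to the current residuals and adds a shrunken
# copy of its predictions to the running model.
X1 = np.log(Xtrain['minority'].values).reshape(-1, 1)
y1 = ytrain.values
f_hat = np.zeros_like(y1, dtype=float)    # current additive model
learning_rate = 0.1
for stage in range(200):
    resid = y1 - f_hat                    # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=1).fit(X1, resid)
    f_hat += learning_rate * stump.predict(X1)
print("Train R^2 of the hand-rolled booster:", r2_score(y1, f_hat))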

In [153]:
from sklearn.ensemble import AdaBoostRegressor
estab = AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=1),
                          n_estimators=500, learning_rate=1)
estab.fit(xx, y)
staged_predict_generator = estab.staged_predict(xx)
In [59]:
# code from http://nbviewer.jupyter.org/github/pprett/pydata-gbrt-tutorial/blob/master/gbrt-tutorial.ipynb
import time
from IPython import display
plt.plot(xx, y, '.');
counter = 0
for stagepred in staged_predict_generator:
    counter = counter + 1
    if counter in [1, 2, 4, 8, 10, 50, 100, 200]:
        plt.plot(xx, stagepred, alpha=0.7, label=str(counter), lw=4)
        plt.legend();
        display.display(plt.gcf())
        display.clear_output(wait=True)
        time.sleep(1)
In [160]:
estab2 = AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=2),
                          n_estimators=1000, learning_rate=0.3)
estab2.fit(Xtrain, ytrain)
print(estab2.score(Xtrain, ytrain))
print(estab2.score(Xtest, ytest))

staged_scores = estab2.staged_score(Xtest, ytest)
i=1
scores=[]
for stagesc in staged_scores:
    scores=np.append(scores,stagesc)
    i = i+1
plt.plot(scores);
0.574891322117
0.533152494119

Ok, so this demonstration helps us understand some things about boosting.

  • n_estimators is the number of trees, and thus the stage in the fitting. It also controls the complexity for us. The more trees we have the more we fit to the tiny details.
  • staged_predict gives us the prediction at each step
  • once again, max_depth from the underlying decision tree controls the depth of each tree. But here it also controls the amount of feature interaction we can capture, not just the scale of our fit. Clearly, increasing it increases the variance again.

Ideas from decision trees remain. For example, increase min_samples_leaf or decrease max_depth to reduce variance and increase the bias.

**Exercise 4**

  1. What do you expect to happen if you increase max_depth to 5? Edit the code above to explore the result.
  2. What do you expect to happen if you put max_depth back to 1 and decrease the learning_rate to 0.1? Edit the code above to explore the result.
  3. Do a little work to find some sort of 'best' values of max_depth and learning_rate. Does this result make sense?

Answers

  1. Increasing max_depth for each model does not really improve things. It will make each model more complex, which may not be a good thing. We want lots of little models that each are not overfit, so allowing any single model to have a lot of complexity (and thus overfit) could be a problem.
  2. The model will progress much more slowly towards fitting the data well. This may be a good thing: it is essentially a form of 'regularization' that helps prevent over-reliance on any one model, and it leaves future models more 'error' to explain.
  3. See code below.
In [163]:
# your code here
param_dict3 = OrderedDict(
    n_estimators = [200],
    learning_rate = [0.2, 0.4, 0.6, 0.8]
)

best_scores = []
max_depths = [5,10,15]
for depth in max_depths:
    estab = AdaBoostRegressor(base_estimator=
                              DecisionTreeRegressor(max_depth=depth))
    gs = GridSearchCV(estab, param_grid = param_dict3, cv=3, n_jobs=-1)
    gs.fit(Xtrain, ytrain)
    best_scores = np.append(best_scores, gs.best_score_)
In [164]:
best_scores
Out[164]:
array([ 0.73579725,  0.76361808,  0.76487793])

Note: what's the relationship between residuals and the gradient?

Pavlos showed in class that for the squared loss, taking the gradient in the "data point functional space", i.e. an N-dimensional space for N data points with each coordinate being $f(x_i)$, just gives us the residuals. It turns out that gradient descent is a more general idea, and one can use it for different losses. And the upweighting of poorly fit points in AdaBoost is simply a weighting by the gradient: if the gradient (or residual) is large, you are far away from the optimum in this functional space, and if it is 0, the gradient is flat and you are at the optimum.
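Concretely (filling in the algebra for the squared-loss case): for

$$L = \frac{1}{2}\sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2,$$

the partial derivative with respect to the coordinate $f(x_i)$ of the "function vector" $\big(f(x_1),\dots,f(x_N)\big)$ is

$$\frac{\partial L}{\partial f(x_i)} = -\big(y_i - f(x_i)\big),$$

so the negative gradient is exactly the vector of residuals, and fitting the next tree to the residuals is a gradient-descent step in this space.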

The ideas from the general theory of gradient descent tell us this: we can slow the learning by shrinking the predictions of each tree by some small number, called the learning_rate. This "shrinkage" helps us not overshoot, and for a finite number of iterations it also ensures we don't overfit, by keeping us in the neighborhood of the minimum rather than exactly at it! But we might need to increase the number of iterations somewhat to get into that neighborhood.
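sklearn's GradientBoostingRegressor exposes exactly these knobs (n_estimators, learning_rate, and the base tree's max_depth); here is a minimal sketch of the shrinkage trade-off on our data (my addition; the settings are not tuned).

from sklearn.ensemble import GradientBoostingRegressor

# Sketch: a smaller learning_rate learns more slowly but, with enough trees,
# usually generalizes at least as well as an unshrunk fit.
for lr in [1.0, 0.1]:
    gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=lr, max_depth=2)
    gbr.fit(Xtrain, ytrain)
    print("learning_rate =", lr, " test R^2 =", round(gbr.score(Xtest, ytest), 3))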