Title

Exercise 2: Hyperparameter tuning

Description

Tuning the hyperparameters

Random Forests perform very well out-of-the-box with the default hyperparameters in sklearn. Some of the tunable parameters are (see the short sketch after this list):

  • The number of trees in the forest: n_estimators, int, default=100
  • The complexity of each tree: stop when a leaf has <= min_samples_leaf samples
  • The sampling scheme: number of features to consider at any given split: max_features {“auto”, “sqrt”, “log2”}, int or float, default=”auto”.
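As a quick illustration of how these map onto the sklearn API (the values below are placeholders, not the tuned values found later in this exercise), all three are passed directly to the constructor:

from sklearn.ensemble import RandomForestClassifier

# Illustrative settings only -- the rest of the exercise explores how to choose them.
rf = RandomForestClassifier(n_estimators=200,      # number of trees in the forest
                            min_samples_leaf=5,    # minimum samples required at a leaf
                            max_features='sqrt',   # features considered at each split
                            random_state=0)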

Hints:

RandomForestClassifier() : Defines the RandomForestClassifier and includes more details on the definition and range of values for its tunable parameters.

model.predict_proba(X) : Predict class probabilities for X

roc_auc_score(y_test, y_proba) : Calculates the area under the receiver operating characteristic curve (ROC AUC).

GridSearchCV() : Performs an exhaustive search over specified parameter values for an estimator.

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline
In [2]:
# Read the dataset and take a quick look
df = pd.read_csv("diabetes.csv")
df.head()
In [3]:
# Assign the predictor and response variables.
# Outcome is the response and all the other columns are the predictors
X = df.drop("Outcome", axis=1)
y = df['Outcome']
X.shape, y.shape
In [4]:
# set the seed for reproducibility of results
seed = 0
# split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=seed)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

Vanilla random forest

Start by training a Random Forest Classifier using the default parameters and calculate the Receiver Operating Characteristic Area Under the Curve (ROC AUC). This metric is better suited than accuracy for this classification problem, since it remains informative when the dataset is imbalanced.

In [5]:
### edTest(test_vanilla) ### 
from sklearn.metrics import roc_auc_score

# Define a Random Forest classifier with random_state = seed
vanilla_rf = __ 

# Fit the model on the training data
vanilla_rf.fit(__, __);

# Calculate AUC/ROC on the test set
y_proba = __[:, 1] 
auc = np.round(roc_auc_score(y_test, y_proba),2)
print(f'Plain RF AUC on test set:{auc}')
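A possible way to fill in the blanks above (a reference sketch, not the only valid answer, assuming the model is meant to be fit on the training split since the AUC is computed on the held-out test set):

# One possible completion of the cell above
vanilla_rf = RandomForestClassifier(random_state=seed)
vanilla_rf.fit(X_train, y_train)
y_proba = vanilla_rf.predict_proba(X_test)[:, 1]
auc = np.round(roc_auc_score(y_test, y_proba), 2)
print(f'Plain RF AUC on test set:{auc}')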
In [6]:
# number of samples and features
num_features = X_train.shape[1]
num_samples = X_train.shape[0]
num_samples, num_features

1. Number of trees, n_estimators, default = 100

The number of trees needs to be large enough for the out-of-bag (OOB) error to stabilize at its lowest value. Plot the OOB error of a random forest as a function of the number of trees. In sklearn, the trees in a RF are called estimators. A good starting point is 10 times the number of features; however, adjusting other hyperparameters will influence the optimal number of trees.

In [7]:
%%time
from collections import OrderedDict
clf = RandomForestClassifier(warm_start=True, 
                               oob_score=True,
                               min_samples_leaf=40,
                               max_depth = 10,
                               random_state=seed)

error_rate = {}

# Range of `n_estimators` values to explore.
min_estimators = 150
max_estimators = 500

for i in range(min_estimators, max_estimators + 1):
    clf.set_params(n_estimators=i) 
    clf.fit(X_train, y_train)

    # Record the OOB error for each `n_estimators=i` setting.
    oob_error = 1 - clf.oob_score_
    error_rate[i] = oob_error
In [8]:
%%time
# Generate the "OOB error rate" vs. "n_estimators" plot.
# OOB error rate = number misclassified / total observations (%)
xs = []
ys = []
for label, clf_err in error_rate.items():
    xs.append(label)
    ys.append(clf_err)   
plt.plot(xs, ys)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.show()
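One way to turn the plot into a number is to pick the smallest n_estimators whose OOB error is close to the minimum; a minimal sketch, where the 0.005 tolerance is an arbitrary choice and not part of the exercise:

# Smallest n_estimators whose OOB error is within 0.005 of the best observed value
best_err = min(error_rate.values())
stable_n = min(n for n, e in error_rate.items() if e <= best_err + 0.005)
print(f'OOB error first comes within 0.005 of its minimum at n_estimators={stable_n}')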

2. min_samples_leaf, default = 1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. We will plot the OOB error for several values of min_samples_leaf as a function of n_estimators.

In [9]:
%%time
from collections import OrderedDict
ensemble_clfs = [
    (1,
        RandomForestClassifier(warm_start=True, 
                               min_samples_leaf=1,
                               oob_score=True,
                               max_depth = 10,
                               random_state=seed)),
    (5,
        RandomForestClassifier(warm_start=True, 
                               min_samples_leaf=5,
                               oob_score=True,
                               max_depth = 10,
                               random_state=seed))
]

# Map a label (the value of `min_samples_leaf`) to a list of (model, oob error) tuples.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)

min_estimators = 80
max_estimators = 500

for label, clf in ensemble_clfs:
    for i in range(min_estimators, max_estimators + 1):
        clf.set_params(n_estimators=i) 
        clf.fit(X_train, y_train)

        # Record the OOB error for each model. Error is 1 - oob_score
        # oob_score: score of the training dataset obtained using an 
        # out-of-bag estimate.
        # OOB error rate is % of num_missclassified/total observations
        oob_error = 1 - clf.oob_score_
        error_rate[label].append((i, oob_error))

for label, clf_err in error_rate.items():
    xs, ys = zip(*clf_err)
    plt.plot(xs, ys, label=f'min_samples_leaf={label}')

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB error rate")
plt.legend(loc="upper right")
plt.show()
In [10]:
# Find the (n_estimators, min_samples_leaf) pair with the lowest OOB error
err = 100
best_num_estimators = 0
for label, clf_err in error_rate.items():
    num_estimators, error = min(clf_err, key=lambda n: (n[1], -n[0]))
    if error < err:
        err = error
        best_num_estimators = num_estimators
        best_leaf = label

print(f'Optimum num of estimators: {best_num_estimators} \nmin_samples_leaf: {best_leaf}')

Re-train the Random Forest Classifier using the new values for these parameters and calculate the ROC AUC. Also set another parameter, max_features, the number of features to consider when looking for the best split.

In [11]:
### edTest(test_estimators) ### 
estimators_rf = RandomForestClassifier(n_estimators= best_num_estimators,
                                    random_state=seed,
                                    oob_score=True,
                                    min_samples_leaf=best_leaf,
                                    max_features='sqrt') 

# Fit the model on the training data
estimators_rf.fit(X_train, y_train);

# Calculate AUC/ROC on the test set
y_proba = __[:, 1]
estimators_auc = np.round(roc_auc_score(y_test, y_proba),2)
print(f'Educated RF AUC on test set:{estimators_auc}')
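The remaining blank is again the predicted positive-class probabilities on the test set; one possible completion:

# Possible completion of the blank above
y_proba = estimators_rf.predict_proba(X_test)[:, 1]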

Look at the model's parameters

In [12]:
estimators_rf.get_params()

Once we have some idea of the range of optimal values for the number of trees and perhaps a couple of other parameters, and if we have enough computing power, we can perform an exhaustive search over other parameter values.

In [13]:
%%time
from sklearn.model_selection import GridSearchCV

do_grid_search = True

if do_grid_search:
    rf = RandomForestClassifier(n_jobs=-1,
                               n_estimators= best_num_estimators,
                               oob_score=True,
                               max_features = 'sqrt',
                               min_samples_leaf=best_leaf,
                               random_state=seed).fit(X_train,y_train)

    param_grid = {
        'min_samples_split': [2, 5, 10]}
    
    scoring = {'AUC': 'roc_auc'}
    
    grid_search = GridSearchCV(rf, 
                               param_grid, 
                               scoring=scoring, 
                               refit='AUC', 
                               return_train_score=True, 
                               n_jobs=-1)
    
    results = grid_search.fit(X_train, y_train)
    print(results.best_estimator_.get_params())
    best_rf = results.best_estimator_
    # Calculate AUC/ROC
    y_proba = best_rf.predict_proba(X_test)[:, 1]
    auc = np.round(roc_auc_score(y_test, y_proba),2)
    print(f'GridSearchCV RF AUC on test set:{auc}')
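Because return_train_score=True was requested, the cross-validation results can also be inspected to compare train and validation AUC for each candidate; a minimal sketch, assuming the grid search above has actually been run (do_grid_search = True):

# Compare mean train vs. cross-validated AUC for each min_samples_split candidate
cv_results = pd.DataFrame(results.cv_results_)
print(cv_results[['param_min_samples_split', 'mean_train_AUC', 'mean_test_AUC']])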