Key Word(s): Bagging, Bootstrap, Aggregate, Underfitting, Overfitting, Out of Bag Error (OOB), Variable Importance, Random Forest, hyper-parameters, MDI
Title :¶
Exercise: Feature Importance
The goal of this exercise is to compare two feature importance methods: Mean Decrease in Impurity (MDI) and Permutation Importance. For a discussion of the merits of each, go to this link.
Description :¶
Instructions:¶
- Read the dataset heart.csv as a pandas dataframe, and take a quick look at the data.
- Assign the predictor and response variables as per the instructions given in the scaffold.
- Set a max_depth value.
- Define a DecisionTreeClassifier and fit on the entire data.
- Define a RandomForestClassifier and fit on the entire data.
- Calculate Permutation Importance for each of the two models. Remember that the MDI is computed automatically by sklearn once the classifiers are fitted.
- Use the routines provided to display the feature importances as bar plots. The plots will look similar to the one given above.
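Permutation importance, which the instructions ask you to compute, can be illustrated by hand: shuffle one column at a time and record how much the model's score drops. The sketch below is a simplified illustration on synthetic data, not sklearn's implementation. The function name manual_permutation_importance and the toy dataset are assumptions introduced here, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)  # accuracy on the unshuffled data
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its relationship with the response
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - model.score(X_perm, y))
        importances[j] = np.mean(drops)
    return importances


# Synthetic stand-in data (assumption): 4 generated predictors plus 1 constant column
X_gen, y_gen = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_gen = np.column_stack([X_gen, np.zeros(len(X_gen))])

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_gen, y_gen)
imp = manual_permutation_importance(model, X_gen, y_gen)
print(imp)
```

Shuffling the constant column cannot change any prediction, so its importance is exactly zero. For the exercise itself, the sklearn routine listed in the hints is the one to use.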
Hints:¶
forest.feature_importances_ Returns the impurity-based feature importances (MDI) of a fitted model.
sklearn.inspection.permutation_importance() Calculates the permutation-based feature importance.
sklearn.ensemble.RandomForestClassifier() Returns a random forest classifier object.
sklearn.tree.DecisionTreeClassifier() Returns a decision tree classifier object.
NOTE - MDI is computed automatically by sklearn for a fitted RandomForestClassifier and/or DecisionTreeClassifier and is exposed through the feature_importances_ attribute.
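As a rough illustration of how the two APIs in the hints differ in use, the sketch below fits a small forest on synthetic data and reads both kinds of importance. The dataset, the variable names (X_demo, demo_forest, and so on), and the parameter values are assumptions chosen for the demo; they are not the exercise's data or its expected settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data (assumption; the exercise uses heart.csv)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

demo_forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
demo_forest.fit(X_demo, y_demo)

# MDI: read directly from the fitted model, no extra computation to request
mdi = demo_forest.feature_importances_

# Permutation importance: a separate call that re-scores the model on shuffled data
perm = permutation_importance(
    demo_forest, X_demo, y_demo, n_repeats=10, random_state=0
)

print("MDI:", mdi)
print("Permutation (mean over repeats):", perm.importances_mean)
```

The MDI values are normalized to sum to one, while permutation importances are score drops and can be zero or slightly negative for uninformative features.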
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier
from helper import plot_permute_importance, plot_feature_importance
%matplotlib inline
# Read the dataset "heart.csv"
df = pd.read_csv("heart.csv")
# Take a quick look at the data
df.head()
# Assign the predictor and response variables.
# 'AHD' is the response and all the other columns are the predictors
X = ___
y = ___
# Set the model parameters
# The random state is fixed for testing purposes
random_state = 44
# Choose a `max_depth` for your trees
max_depth = ___
SINGLE TREE¶
### edTest(test_decision_tree) ###
# Define a Decision Tree classifier with random_state as the above defined variable
# Set the maximum depth to be max_depth
tree = ___
# Fit the model on the entire data
tree.fit(X, y);
# Using Permutation Importance to get the importance of features for the Decision Tree
# with random_state as the above defined variable
tree_result = ___
RANDOM FOREST¶
### edTest(test_random_forest) ###
# Define a Random Forest classifier with random_state as the above defined variable
# Set the maximum depth to be max_depth and use 10 estimators
forest = ___
# Fit the model on the entire data
forest.fit(X, y);
# Use Permutation Importance to get the importance of features for the Random Forest model
# with random_state as the above defined variable
forest_result = ___
PLOTTING THE FEATURE RANKING¶
# Helper code to visualize the feature importance using 'MDI'
plot_feature_importance(tree,forest,X,y);
# Helper code to visualize the feature importance using 'permutation feature importance'
plot_permute_importance(tree_result,forest_result,X,y);
⏸ A common criticism of the MDI method is that it assigns a lot of importance to noisy features (more here). Did you make such an observation in the plots above?
### edTest(test_chow1) ###
# Type your answer within the quotes given
answer1 = '___'
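The criticism raised in the question above can be probed on toy data. The sketch below is an illustration under stated assumptions, not a result from heart.csv: it builds two binary predictors that carry signal, adds one continuous column of pure noise, flips 20% of the labels, and fits an unrestricted tree. Evaluating permutation importance on a held-out half is a choice made here to expose the contrast; the exercise itself fits and evaluates on the entire data.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400

# Two binary predictors with signal, plus one continuous pure-noise column
x0 = rng.integers(0, 2, size=n)
x1 = rng.integers(0, 2, size=n)
noise = rng.normal(size=n)
X_toy = np.column_stack([x0, x1, noise])

# The label depends on x0 and x1 only; 20% of labels are flipped at random
y_clean = x0 | x1
flip = rng.random(n) < 0.2
y_toy = np.where(flip, 1 - y_clean, y_clean)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0
)

# An unrestricted tree can only memorize the flipped labels by splitting on noise
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)

mdi = deep_tree.feature_importances_
perm = permutation_importance(deep_tree, X_te, y_te, n_repeats=20, random_state=0)

print("MDI (x0, x1, noise):", mdi)
print("Held-out permutation (x0, x1, noise):", perm.importances_mean)
```

Because samples that share the same x0 and x1 can only be separated through the noise column, the deep tree necessarily gives that column a nonzero MDI. Whether permutation importance treats it differently is what the printed values let you inspect.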
⏸ After marking, change the max_depth for your classifiers to a very low value such as 3, and check whether the relative importance of the predictors changes.
### edTest(test_chow2) ###
# Type your answer within the quotes given
answer2 = '___'