Instructions:
- Read the dataset heart.csv as a pandas dataframe, and take a quick look at the data.
- Assign the predictor and response variables as per the instructions given in the scaffold.
- Set a max_depth value.
- Define a DecisionTreeClassifier and fit it on the entire data.
- Define a RandomForestClassifier and fit it on the entire data.
- Calculate the permutation importance for each of the two models. Remember that the MDI is computed automatically by sklearn when you fit the classifiers.
- Use the routines provided to display the feature importances as bar plots. The plots will look similar to the one given above.
Hints:
forest.feature_importances_ : The impurity-based (MDI) feature importances, computed when the model is fit.
sklearn.inspection.permutation_importance() : Calculates the permutation-based feature importance.
sklearn.ensemble.RandomForestClassifier() : Returns a random forest classifier object.
sklearn.tree.DecisionTreeClassifier() : Returns a decision tree classifier object.
NOTE - MDI is computed automatically by sklearn when you fit a RandomForestClassifier and/or a DecisionTreeClassifier.
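To see the difference between the two APIs in action, here is a minimal sketch on a small synthetic dataset (not heart.csv; the data and parameter values are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic toy data: only the first feature is informative
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

# MDI importances are a by-product of fitting -- no extra call needed
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(forest.feature_importances_)  # normalized, sums to ~1

# Permutation importances require an explicit call after fitting
result = permutation_importance(forest, X, y, random_state=0)
print(result.importances_mean)
```

Note that `feature_importances_` is an attribute available immediately after `fit`, while `permutation_importance` re-evaluates the fitted model with each feature shuffled in turn.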
In [ ]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier
from helper import plot_permute_importance, plot_feature_importance
%matplotlib inline
In [ ]:
# Read the dataset and take a quick look
df = pd.read_csv("heart.csv")
df.head()
In [ ]:
# Assign the predictor and response variables.
# 'AHD' is the response and all the other columns are the predictors
X = ___
y = ___
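One way to fill in these blanks, sketched on a hypothetical mini-version of heart.csv (the rows below are invented for illustration; the real file has more columns and observations):

```python
import pandas as pd

# Hypothetical stand-in for heart.csv with an 'AHD' response column
df = pd.DataFrame({
    "Age":  [63, 67, 67, 37],
    "Chol": [233, 286, 229, 250],
    "AHD":  ["Yes", "Yes", "Yes", "No"],
})

# 'AHD' is the response; every other column is a predictor
X = df.drop(columns=["AHD"])
y = df["AHD"]
print(X.columns.tolist())
```

`df.drop(columns=["AHD"])` returns a copy without the response column, so `X` holds all remaining predictors regardless of how many there are.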
In [ ]:
# Set the parameters
# The random state is fixed for testing purposes
random_state = 44
# Choose a `max_depth` for your trees
max_depth = ___
SINGLE TREE
In [ ]:
### edTest(test_decision_tree) ###
# Define a Decision Tree classifier with random_state as the above defined variable
# Set the maximum depth to be max_depth
tree = ___
# Fit the model on the entire data
tree.fit(X, y);
# Using Permutation Importance to get the importance of features for the Decision Tree
# With random_state as the above defined variable
tree_result = ___
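A possible completion of this cell, shown self-contained on synthetic stand-in data (the `max_depth` value and the data are assumptions for illustration, not the graded answer):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance

random_state = 44
max_depth = 3  # assumed value for illustration

# Synthetic stand-in for the heart.csv predictors/response
rng = np.random.RandomState(random_state)
X = rng.rand(150, 4)
y = (X[:, 1] > 0.5).astype(int)  # only feature 1 drives the response

# Decision tree with the fixed random state and chosen depth
tree = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
tree.fit(X, y)

# Permutation importance of the fitted tree, with the same random state
tree_result = permutation_importance(tree, X, y, random_state=random_state)
print(tree_result.importances_mean)
```

Because the response depends only on feature 1, its permutation importance should dominate the other three.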
RANDOM FOREST
In [ ]:
### edTest(test_random_forest) ###
# Define a Random Forest classifier with random_state as the above defined variable
# Set the maximum depth to be max_depth and use 10 estimators
forest = ___
# Fit the model on the entire data
forest.fit(X, y);
# Use Permutation Importance to get the importance of features for the Random Forest model
# With random_state as the above defined variable
forest_result = ___
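A matching sketch for the random forest cell, again on synthetic stand-in data (the `max_depth` value and data are assumptions; only `n_estimators=10` comes from the instructions above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

random_state = 44
max_depth = 3  # assumed value for illustration

# Synthetic stand-in for the heart.csv predictors/response
rng = np.random.RandomState(random_state)
X = rng.rand(150, 4)
y = (X[:, 1] > 0.5).astype(int)

# Random forest with 10 estimators, the fixed random state, and chosen depth
forest = RandomForestClassifier(n_estimators=10, max_depth=max_depth,
                                random_state=random_state)
forest.fit(X, y)

# Permutation importance of the fitted forest, with the same random state
forest_result = permutation_importance(forest, X, y, random_state=random_state)
print(forest_result.importances_mean)
```

The call pattern is identical to the single-tree case; only the estimator changes.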
PLOTTING THE FEATURE RANKING
In [ ]:
# Use the helper code given to visualize the feature importance using 'MDI'
plot_feature_importance(tree, forest, X, y);

# Use the helper code given to visualize the feature importance using 'permutation feature importance'
plot_permute_importance(tree_result, forest_result, X, y);
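The signatures of the `helper` routines are not shown here, but a bar plot of MDI importances like the one they produce can be sketched directly with matplotlib (synthetic data and feature names are invented for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Synthetic toy data with named features
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)
features = ["f0", "f1", "f2"]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# One bar per feature, height = MDI importance
fig, ax = plt.subplots()
ax.bar(features, forest.feature_importances_)
ax.set_ylabel("MDI importance")
ax.set_title("Random Forest feature importance (MDI)")
fig.savefig("importances.png")
```

With real data you would pass the dataframe's column names instead of the invented `features` list.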
Your answer here
Q2. After marking, change the max_depth for your classifiers to a very low value such as $3$, and see whether the relative importance of the predictors changes.
Your answer here