Exercise: Feature Importance

Description

The goal of this exercise is to compare two feature importance methods: MDI (Mean Decrease in Impurity) and Permutation Importance. For a discussion of the merits of each, see the scikit-learn example "Permutation Importance vs Random Forest Feature Importance (MDI)".

Instructions:

  • Read the dataset heart.csv as a pandas dataframe, and take a quick look at the data.
  • Assign the predictor and response variables as per the instructions given in the scaffold.
  • Set a max_depth value.
  • Define a DecisionTreeClassifier and fit on the entire data.
  • Define a RandomForestClassifier and fit on the entire data.
  • Calculate the Permutation Importance for each of the two models. Remember that MDI is computed automatically by sklearn when the classifiers are fit.
  • Use the routines provided to display the feature importances as bar plots.

Hints:

forest.feature_importances_ : The impurity-based (MDI) feature importances, computed automatically when the model is fit.

sklearn.inspection.permutation_importance() : Calculates the permutation-based feature importance of a fitted estimator on a given dataset.

sklearn.ensemble.RandomForestClassifier() : Returns a random forest classifier object.

sklearn.tree.DecisionTreeClassifier() : Returns a decision tree classifier object.

NOTE - MDI is automatically computed by sklearn when you fit a RandomForestClassifier and/or a DecisionTreeClassifier.
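As a minimal sketch of how the two APIs fit together, here is both methods run on a synthetic dataset (the data and variable names below are illustrative stand-ins, not `heart.csv`, so the exercise blanks stay yours to fill):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the exercise data
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(X_demo, y_demo)

# MDI: computed during fit, exposed as an attribute (sums to 1)
mdi = rf.feature_importances_

# Permutation importance: computed after fit by shuffling each column
# and measuring the drop in score
perm = permutation_importance(rf, X_demo, y_demo,
                              n_repeats=5, random_state=0)

print(mdi.round(3))
print(perm.importances_mean.round(3))
```

Note that `feature_importances_` is an attribute of the fitted model, while `permutation_importance` is a separate function that takes the fitted model and a dataset.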

In [ ]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier
from helper import plot_permute_importance, plot_feature_importance

%matplotlib inline
In [ ]:
# Read the dataset and take a quick look

df = pd.read_csv("heart.csv")

df.head()
In [ ]:
# Assign the predictor and response variables.

# 'AHD' is the response and all the other columns are the predictors
X = ___
y = ___
In [ ]:
# Set the parameters

# The random state is fixed for testing purposes

random_state = 44

# Choose a `max_depth` for your trees 

max_depth = ___

SINGLE TREE

In [ ]:
### edTest(test_decision_tree) ###

# Define a Decision Tree classifier with random_state as the above defined variable
# Set the maximum depth to be max_depth

tree = ___

# Fit the model on the entire data
tree.fit(X, y);

# Using Permutation Importance to get the importance of features for the Decision Tree 
# With random_state as the above defined variable

tree_result = ___

RANDOM FOREST

In [ ]:
### edTest(test_random_forest) ###

# Define a Random Forest classifier with random_state as the above defined variable
# Set the maximum depth to be max_depth and use 10 estimators

forest = ___

# Fit the model on the entire data

forest.fit(X, y);

# Use Permutation Importance to get the importance of features for the Random Forest model 
# With random_state as the above defined variable

forest_result = ___

PLOTTING THE FEATURE RANKING

In [ ]:
# Use the helper code given to visualize the feature importance using 'MDI'

plot_feature_importance(tree, forest, X, y);

# Use the helper code given to visualize the feature importance using 'permutation feature importance'

plot_permute_importance(tree_result, forest_result, X, y);

Mindchow 🍲

Q1. A common criticism of the MDI method is that it assigns a lot of importance to noisy features. Did you make such an observation in the plots above?

Your answer here
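A small sketch of the criticism in Q1, using synthetic data (all names below are ours, not part of the exercise): a fully grown tree is free to split on a pure-noise column and so credits it with positive MDI, while permutation importance measured on held-out data stays close to zero for that column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_sig, y_sig = make_classification(n_samples=400, n_features=3,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
# Append a pure-noise column that carries no information about y
X_all = np.hstack([X_sig, rng.normal(size=(400, 1))])

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_sig, random_state=0)

# No max_depth: the tree grows until pure, splitting on noise along the way
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

mdi_noise = deep_tree.feature_importances_[-1]   # MDI credits the noise column
perm = permutation_importance(deep_tree, X_te, y_te,
                              n_repeats=10, random_state=0)
perm_noise = perm.importances_mean[-1]           # near zero on held-out data

print(f"MDI on noise: {mdi_noise:.3f}, permutation on noise: {perm_noise:.3f}")
```

Evaluating permutation importance on a held-out set is what makes the difference here; on the training data an overfit tree can appear to "use" the noise profitably.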

Q2. After marking, change the max_depth of your classifiers to a very low value, such as $3$, and see whether the relative importance of the predictors changes.

Your answer here
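One way to see why Q2's change matters, sketched on synthetic data (names and parameters below are illustrative assumptions): a depth-3 tree has at most 7 internal nodes, so it can spread MDI across only a handful of features, while an unrestricted tree typically touches many more.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)

# Same data, two depth settings
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
deep = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Count how many features receive nonzero MDI in each case
shallow_used = int((shallow.feature_importances_ > 0).sum())
deep_used = int((deep.feature_importances_ > 0).sum())

print("features used (shallow):", shallow_used)
print("features used (deep):   ", deep_used)
```

With fewer splits available, a shallow tree concentrates importance on the strongest predictors, which is one reason the relative ranking can shift when you lower `max_depth`.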