Key Word(s): Logistic Regression, Classification

Title

Exercise 1 - Basic Multi-classification

Description

The goal of the exercise is to get comfortable using multiclass classification models. Eventually, you will produce a plot similar to the one given below:

Instructions:

We are trying to predict the type of iris in the classic Iris data set from its measured characteristics.

Load the Iris data set and convert to a data frame.
Fit multinomial & OvR logistic regressions and a $k$-NN model.
Compute the accuracy of the models.
Plot the classification boundaries against the two predictors used.

Hints:

sklearn.linear_model.LogisticRegression() : Creates a logistic regression classifier

model.fit() : Fits the model to the given data

model.predict() : Predicts a class label for each observation (logistic or k-NN classifiers)

model.predict_proba() : Predicts the probability of every class in the response (logistic or k-NN classifiers); the probabilities should add up to 1 for each observation

LogisticRegression.coef_ and .intercept_ : Hold the estimated $\beta$ coefficients of a fitted logistic regression model

model.score() : Returns the classification accuracy on the given data and labels

matplotlib.pyplot.pcolormesh() : Draws a pseudocolor plot on a rectangular grid; used here to shade the predicted class regions
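
To see how the hinted calls fit together without giving away the exercise, here is a minimal sketch on synthetic data. The `make_blobs` data, the variable names, and the helper `fit_and_inspect` are assumptions made for illustration; none of them are part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression


def fit_and_inspect(random_state=0):
    """Fit a logistic regression to toy 3-class data and collect a few summaries.

    Hypothetical helper for illustration only; it is not part of the exercise.
    """
    # Synthetic data: 90 points, 2 predictors, 3 classes (an assumption, not the Iris data)
    X, y = make_blobs(n_samples=90, centers=3, n_features=2, random_state=random_state)

    model = LogisticRegression(max_iter=1000).fit(X, y)

    labels = model.predict(X)        # one class label per observation
    probs = model.predict_proba(X)   # one column of probabilities per class

    return {
        "coef_shape": model.coef_.shape,            # (n_classes, n_predictors)
        "intercept_shape": model.intercept_.shape,  # (n_classes,)
        "row_sums": probs.sum(axis=1),              # each row should sum to 1
        "score": model.score(X, y),                 # mean accuracy
        "manual_accuracy": np.mean(labels == y),    # the same quantity computed by hand
    }


summary = fit_and_inspect()
print(summary["coef_shape"], summary["intercept_shape"])
print(summary["score"], summary["manual_accuracy"])
```

The same `fit`, `predict`, `predict_proba`, and `score` calls apply to a `KNeighborsClassifier`, except that a k-NN model has no `coef_` or `intercept_`.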

Note: This exercise is auto-graded and you can try multiple attempts.

In [1]:

%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [2]:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

Irises

Read in the data set and convert to a Pandas data frame:

In [3]:

raw = datasets.load_iris()
iris = pd.DataFrame(raw['data'],columns=raw['feature_names'])
iris['type'] = raw['target'] 
iris.head()

Note: the violin plot below is 'inverted': it puts the model's response variable on the x-axis. This is fine for exploration.

In [4]:

sns.violinplot(y=iris['sepal length (cm)'], x=iris['type'], split=True);

In [5]:

# Create a violin plot to compare petal length 
# across the types of irises

sns.violinplot(___);

Here we fit our first model (the OvR logistic) and print out the coefficients:

In [7]:

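# Unpenalized one-vs-rest (OvR) logistic regression on the two sepal predictors.
# Note: recent scikit-learn releases expect penalty=None in place of the string 'none'.
logit_ovr = LogisticRegression(penalty='none', multi_class='ovr', max_iter=1000).fit(
    iris[['sepal length (cm)','sepal width (cm)']], iris['type'])
print(logit_ovr.intercept_)
print(logit_ovr.coef_)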

In [10]:

# we can predict classes or probabilities
print(logit_ovr.predict(iris[['sepal length (cm)','sepal width (cm)']])[0:5])
print(logit_ovr.predict_proba(iris[['sepal length (cm)','sepal width (cm)']])[0:5])

In [21]:

# and calculate accuracy
print(logit_ovr.score(iris[['sepal length (cm)','sepal width (cm)']],iris['type']))
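
The OvR and multinomial models differ in how they turn linear scores into class probabilities. As a rough sketch: an OvR model passes each class's score through its own sigmoid and then rescales each row to sum to 1, whereas a multinomial model applies a softmax across the classes. The scores below are made-up numbers and the function names are hypothetical; this illustrates the idea rather than reproducing scikit-learn's internals.

```python
import numpy as np


def ovr_probabilities(scores):
    """Sigmoid per class, then normalize each row (one-vs-rest style)."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return p / p.sum(axis=1, keepdims=True)


def softmax_probabilities(scores):
    """Softmax across classes (multinomial style)."""
    # Subtract the row maximum for numerical stability; the result is unchanged
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


# Made-up linear scores for 2 observations and 3 classes (illustrative values only)
scores = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])

p_ovr = ovr_probabilities(scores)
p_multi = softmax_probabilities(scores)

print(p_ovr)
print(p_multi)
```

Given identical scores, both rules pick the same most likely class, but the probabilities differ. In practice the two models are also fitted differently, so their coefficients and decision boundaries need not match.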

Now it's your turn, this time with the multinomial logistic regression.

In [22]:

### edTest(test_multinomial) ###

# Fit the model and print out the coefficients
logit_multi = LogisticRegression(___).fit(___)
intercept = logit_multi.intercept_
coefs = logit_multi.coef_
print(intercept)
print(coefs)

In [24]:

### edTest(test_multinomialaccuracy) ###

multi_accuracy = ___
print(multi_accuracy)

In [33]:

# Plot the decision boundary. 
x1_range = iris['sepal length (cm)'].max() - iris['sepal length (cm)'].min()
x2_range = iris['sepal width (cm)'].max() - iris['sepal width (cm)'].min()
x1_min, x1_max = iris['sepal length (cm)'].min()-0.1*x1_range, iris['sepal length (cm)'].max() +0.1*x1_range
x2_min, x2_max = iris['sepal width (cm)'].min()-0.1*x2_range, iris['sepal width (cm)'].max() + 0.1*x2_range

step = .05 
x1x, x2x = np.meshgrid(np.arange(x1_min, x1_max, step), np.arange(x2_min, x2_max, step))
y_hat_ovr = logit_ovr.predict(np.c_[x1x.ravel(), x2x.ravel()])
y_hat_multi = ___


fig, (ax1, ax2) = plt.subplots(1, 2,  figsize=(15, 6))

ax1.pcolormesh(x1x, x2x, y_hat_ovr.reshape(x1x.shape), cmap=plt.cm.Paired,alpha = 0.5)
ax1.scatter(iris['sepal length (cm)'], iris['sepal width (cm)'], c=iris['type'], edgecolors='k', cmap=plt.cm.Paired)

### your job is to create the same plot, but for the multinomial
#####
# your code here
#####

plt.show()
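
The grid construction in the plotting cell above can be examined on a tiny example. This sketch uses only NumPy; the grid values and the stand-in prediction rule are hypothetical and chosen so that the shapes are easy to follow.

```python
import numpy as np

# Tiny grid: 3 values along x1 and 2 values along x2 (illustrative values only)
x1_values = np.arange(0.0, 3.0, 1.0)    # [0., 1., 2.]
x2_values = np.arange(10.0, 12.0, 1.0)  # [10., 11.]
x1x, x2x = np.meshgrid(x1_values, x2_values)

# Flatten both coordinate arrays and pair them column-wise: one row per grid point
grid_points = np.c_[x1x.ravel(), x2x.ravel()]

# Stand-in for model.predict: label each point by whether x1 >= 1 (hypothetical rule)
fake_predictions = (grid_points[:, 0] >= 1.0).astype(int)

# Reshape back to the grid layout so it lines up with x1x and x2x for pcolormesh
prediction_grid = fake_predictions.reshape(x1x.shape)

print(x1x.shape, grid_points.shape, prediction_grid.shape)
print(prediction_grid)
```

A fitted classifier expects one row per observation, which is why the grid is flattened before calling `predict` and reshaped afterwards for plotting.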

In [27]:

# Fit a k-NN model (k=5) to the same data
knn5 = KNeighborsClassifier(___).fit(___)
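
For intuition about what a k-NN classifier does when it predicts, here is a from-scratch sketch in NumPy. The helper `knn_predict`, its Euclidean distance, its tie-breaking rule, and the toy data are all assumptions for illustration; this is not scikit-learn's implementation.

```python
import numpy as np


def knn_predict(X_train, y_train, X_new, k=3):
    """Classify each row of X_new by majority vote among its k nearest training points.

    Hypothetical helper: uses Euclidean distance and breaks vote ties in favour
    of the smallest class label.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_new = np.asarray(X_new, dtype=float)

    # Pairwise squared Euclidean distances, shape (n_new, n_train)
    diffs = X_new[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)

    # Indices of the k closest training points for each new point
    nearest = np.argsort(dists, axis=1)[:, :k]

    # Majority vote over the neighbours' labels
    predictions = []
    for neighbour_labels in y_train[nearest]:
        labels, counts = np.unique(neighbour_labels, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Made-up training data: two well-separated clusters (illustrative values only)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, [[0.05, 0.05], [5.0, 5.1]], k=3))
```

Because the prediction depends only on nearby training points, k-NN boundaries can be irregular, unlike the straight-line boundaries of a logistic regression on the raw predictors.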

In [ ]:

### edTest(test_knnaccuracy) ###

#Calculate the accuracy
knn5_accuracy = ___
print(knn5_accuracy)

In [32]:

# and plot the classification boundary

y_hat_knn5 = knn5.predict(np.c_[x1x.ravel(), x2x.ravel()])

fig, ax1 = plt.subplots(1, 1,  figsize=(8, 6))

ax1.pcolormesh(x1x, x2x, y_hat_knn5.reshape(x1x.shape), cmap=plt.cm.Paired,alpha = 0.5)
# Plot also the training points
ax1.scatter(iris['sepal length (cm)'], iris['sepal width (cm)'], c=iris['type'], edgecolors='k', cmap=plt.cm.Paired)

plt.show()