Key Word(s): Logistic Regression, Classification
Title
Exercise 2 [Not Graded!] - A Walkthrough Example
Description
The aim of this exercise is to let you do some analysis on your own with less structure.
Dataset Description:
The dataset used here is the Wine data set (another commonly used sklearn dataset). Use this to answer the questions embedded in the Notebook.
Instructions:
- Read the data.
- Do some explorations.
- Fit some multiclass models.
- Interpret these models.
Hints:
sklearn.linear_model.LogisticRegression() : Creates a logistic regression classifier
.fit() : Fits the model to the given training data
.predict() : Predicts class labels from a fitted model (logistic regression or k-NN classifier)
.predict_proba() : Predicts the probability of each class in the response from a fitted model (the probabilities sum to 1 for each observation)
LogisticRegression.coef_ and .intercept_ : Retrieve the estimated β coefficients and intercept from a fitted logistic regression model
sklearn.neighbors.KNeighborsClassifier : Fits a k-NN classification model
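As a quick illustration, here is a minimal sketch of how these calls fit together, using a tiny synthetic dataset (not the wine data used in this exercise):

```python
# Minimal sketch of the hinted sklearn calls on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

logit = LogisticRegression().fit(X, y)   # fit the model
print(logit.coef_, logit.intercept_)     # estimated beta coefficients
print(logit.predict(X[:3]))              # hard class predictions
print(logit.predict_proba(X[:3]))        # probabilities; each row sums to 1

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))
```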
Note: This exercise is NOT auto-graded.
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
#import sklearn.metrics as met
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
First, read in the data set and take a peek at it:
raw = datasets.load_wine()
X_full = pd.DataFrame(raw['data'],columns=raw['feature_names'])
y = raw['target']
print(X_full.shape,y.shape)
X_full.head()
Q1: Perform a 70-30 train_test_split using random_state=109 and shuffle=True. Why is it important to shuffle here?
### your code here
your answer here
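One way the split could look (a hedged sketch, not a canonical solution; it reloads the data so the cell runs standalone):

```python
# Sketch of Q1: a 70-30 split with random_state=109 and shuffle=True.
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

raw = datasets.load_wine()
X_full = pd.DataFrame(raw['data'], columns=raw['feature_names'])
y = raw['target']

X_train, X_test, y_train, y_test = train_test_split(
    X_full, y, test_size=0.3, random_state=109, shuffle=True)
print(X_train.shape, X_test.shape)
```

Shuffling matters here because load_wine returns the observations ordered by class, so an unshuffled split would leave the train and test sets with very different class mixes.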
Q2: Explore the data a little. Visualize the marginal association (i.e., the bivariate relationship) of wine type with amount of alcohol, level of malic acid, and total level of phenols. Which predictor seems to have the strongest association with the response?
### your code here
your answer here
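One possible exploration (a sketch; seaborn boxplots are assumed here, though stripplots or histograms by class would work just as well):

```python
# Sketch of Q2: boxplots of three predictors by wine class.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets

raw = datasets.load_wine()
df = pd.DataFrame(raw['data'], columns=raw['feature_names'])
df['wine_class'] = raw['target']

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ['alcohol', 'malic_acid', 'total_phenols']):
    sns.boxplot(data=df, x='wine_class', y=col, ax=ax)
fig.tight_layout()
fig.savefig('q2_boxplots.png')
```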
Q3: Fit 3 different models with ['alcohol','malic_acid'] as the predictors: (1) a standard logistic regression to predict a binary indicator for class 0 (you'll have to create it yourself), (2) a multinomial logistic regression to predict all 3 classes, and (3) a one-vs-rest (OvR) logistic regression to predict all 3 classes. Compare the results.
### your code here
your answer here
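A sketch of the three fits (note: newer sklearn versions deprecate LogisticRegression's multi_class argument, so OvR is shown via OneVsRestClassifier; the multinomial fit is sklearn's default for a multiclass response):

```python
# Sketch of Q3: three logistic regressions on alcohol and malic_acid.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

raw = datasets.load_wine()
X = raw['data'][:, [0, 1]]          # columns 0, 1 are alcohol, malic_acid
y = raw['target']

# (1) binary: indicator for class 0 vs. the rest
y_is_0 = (y == 0).astype(int)
binary = LogisticRegression(max_iter=5000).fit(X, y_is_0)

# (2) multinomial (softmax) over all 3 classes -- sklearn's default
multi = LogisticRegression(max_iter=5000).fit(X, y)

# (3) one-vs-rest: three separate binary logistic regressions
ovr = OneVsRestClassifier(LogisticRegression(max_iter=5000)).fit(X, y)

print(binary.coef_)        # shape (1, 2)
print(multi.coef_)         # shape (3, 2): one row of betas per class
print([est.coef_ for est in ovr.estimators_])
```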
Q4: For the multinomial model, use the estimated coefficients to calculate the predicted probabilities by hand. Feel free to confirm with the predict_proba command.
### your code here
your answer here
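The multinomial probabilities can be reproduced from coef_ and intercept_ with a softmax over the per-class linear scores. A hedged sketch (refitting on the full data so the cell runs standalone):

```python
# Sketch of Q4: multinomial predicted probabilities computed by hand.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

raw = datasets.load_wine()
X = raw['data'][:, [0, 1]]          # alcohol, malic_acid
y = raw['target']
multi = LogisticRegression(max_iter=5000).fit(X, y)

scores = X @ multi.coef_.T + multi.intercept_   # linear score per class
scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
probs_by_hand = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Matches sklearn's built-in computation
print(np.allclose(probs_by_hand, multi.predict_proba(X)))
```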
Q5: For the OvR model, use predict_proba() to estimate the predicted probabilities in the test set, and manually use these to calculate the predicted classes. Feel free to confirm with the predict command.
### your code here
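A sketch of one approach: the predicted class is the argmax of each row of probabilities. OneVsRestClassifier stands in here for the deprecated multi_class='ovr' option:

```python
# Sketch of Q5: manual class predictions from OvR probabilities.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

raw = datasets.load_wine()
X = raw['data'][:, [0, 1]]          # alcohol, malic_acid
y = raw['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=109, shuffle=True)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=5000)).fit(X_train, y_train)
probs = ovr.predict_proba(X_test)       # rows renormalized to sum to 1
manual_classes = probs.argmax(axis=1)   # most probable class per observation

print(np.array_equal(manual_classes, ovr.predict(X_test)))
```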
Q6: How could you use the predict_proba() command and 'change the threshold' in the multiclass setting to affect the predictive accuracies within each class? Note: it is not as simple as moving a single threshold, because there is no single threshold in the multiclass setting.
your answer here
Q7: Compare the accuracies in both train and test for both the multinomial and OvR logistic regressions. Which seems to be performing better? Is there any evidence of overfitting? How could this be corrected?
### your code here
your answer here
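A sketch of the comparison (accuracy via .score; a gap between train and test accuracy is the usual overfitting signal):

```python
# Sketch of Q7: train vs. test accuracy for the two multiclass models.
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

raw = datasets.load_wine()
X = raw['data'][:, [0, 1]]          # alcohol, malic_acid
y = raw['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=109, shuffle=True)

for name, model in [
    ('multinomial', LogisticRegression(max_iter=5000)),
    ('OvR', OneVsRestClassifier(LogisticRegression(max_iter=5000))),
]:
    model.fit(X_train, y_train)
    print(name,
          'train acc:', round(model.score(X_train, y_train), 3),
          'test acc:', round(model.score(X_test, y_test), 3))
```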
Q8: Create the classification boundaries for the two multiclass logistic regression models above. How do they compare?
### your code here
your answer here
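One common way to draw the boundaries: predict on a dense grid of (alcohol, malic_acid) values and contour-plot the predicted labels. A hedged sketch for the multinomial model (the same grid works for the OvR fit):

```python
# Sketch of Q8: classification boundary via predictions on a grid.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

raw = datasets.load_wine()
X = raw['data'][:, [0, 1]]          # alcohol, malic_acid
y = raw['target']
multi = LogisticRegression(max_iter=5000).fit(X, y)

xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))
grid_pred = multi.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, grid_pred, alpha=0.3)           # shaded class regions
ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', s=20)
ax.set_xlabel('alcohol')
ax.set_ylabel('malic_acid')
fig.savefig('q8_boundary.png')
```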
Q9: Fit 3 different k-NN classification models, for $k = 3, 10, 30$. Visualize the classification boundaries for these 3 models and compare the results. Which seem to be overfit?
### your code here
your answer here
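A sketch of the three fits (a small k that gives near-perfect train accuracy but noticeably worse test accuracy is the usual overfitting tell; the boundaries themselves can be drawn with the same grid-prediction trick sketched for Q8):

```python
# Sketch of Q9: k-NN classifiers for k = 3, 10, 30.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

raw = datasets.load_wine()
X = raw['data'][:, [0, 1]]          # alcohol, malic_acid
y = raw['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=109, shuffle=True)

for k in (3, 10, 30):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print('k =', k,
          'train acc:', round(knn.score(X_train, y_train), 3),
          'test acc:', round(knn.score(X_test, y_test), 3))
```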
Q10: How could you visualize the classification boundary for any of these models if there were a single predictor? What if there were more than 2 predictors?
your answer here