Exercise 2 [Not Graded!] - A Walkthrough Example

Description

The aim of this exercise is to let you do some analysis on your own with less structure.

Dataset Description:

The dataset used here is the Wine data set (another commonly used sklearn dataset). Use this to answer the questions embedded in the Notebook.

Instructions:

  1. Read the data.
  2. Do some exploration.
  3. Fit some multiclass models.
  4. Interpret these models.

Hints:

sklearn.linear_model.LogisticRegression() : Generates a logistic regression classifier

.fit() : Fits the model to the given data

.predict() : Predicts with the estimated model (logistic regression or k-NN classifiers) to produce hard class predictions

.predict_proba() : Predicts with the estimated model (logistic regression or k-NN classifiers) to produce predicted probabilities for all the classes in the response (they should add up to 1 for each observation)

LogisticRegression.coef_ and .intercept_ : Pull off the estimated β coefficients from a fitted logistic regression model

sklearn.neighbors.KNeighborsClassifier : Fits a k-NN classification model

Note: This exercise is NOT auto-graded.

In [ ]:
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier 
#import sklearn.metrics as met

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

First, read in the dataset and take a peek at it:

In [ ]:
raw = datasets.load_wine()
X_full = pd.DataFrame(raw['data'], columns=raw['feature_names'])
y = raw['target']
print(X_full.shape, y.shape)
In [ ]:
X_full.head()

Q1: Perform a 70-30 train_test_split using random_state=109 and shuffle=True. Why is it important to shuffle here?

In [ ]:
### your code here

your answer here
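
A minimal sketch of one possible split (the names X_train, X_test, y_train, y_test are choices of this sketch, carried into the later sketches). Shuffling matters here because load_wine returns its rows sorted by class, so an unshuffled 70-30 split would leave the class proportions in train and test badly skewed:

In [ ]:
# 70-30 split; shuffle=True matters because the rows come sorted by class
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y, test_size=0.3, random_state=109, shuffle=True)
print(X_train.shape, X_test.shape)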

Q2: Explore the data a little. Visualize the marginal association (aka, bivariate relationship) of wine type to amount of alcohol, level of malic acid, and total level of phenols. Which predictor seems to have the strongest association with the response?

In [ ]:
### your code here

your answer here
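
One way to eyeball the marginal associations, assuming the split from the Q1 sketch (boxplots of each predictor grouped by class; strip or violin plots would work just as well):

In [ ]:
# one panel per predictor, grouped by wine class
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ['alcohol', 'malic_acid', 'total_phenols']):
    sns.boxplot(x=y_train, y=X_train[col].values, ax=ax)
    ax.set_xlabel('wine class')
    ax.set_ylabel(col)
plt.show()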

Q3: Fit 3 different models with ['alcohol','malic_acid'] as the predictors: (1) a standard logistic regression to predict a binary indicator for class 0 (you'll have to create it yourself), (2) a multinomial logistic regression to predict all 3 classes, and (3) an OvR logistic regression to predict all 3 classes. Compare the results.

In [ ]:
### your code here

your answer here
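
A sketch of the three fits, reusing the split from Q1. Note: older sklearn releases spell OvR as LogisticRegression(multi_class='ovr'), but that argument is deprecated in recent versions, so this sketch uses OneVsRestClassifier instead; the variable names are assumptions carried into the later sketches:

In [ ]:
from sklearn.multiclass import OneVsRestClassifier

# restrict to the two requested predictors
X2_train = X_train[['alcohol', 'malic_acid']]
X2_test = X_test[['alcohol', 'malic_acid']]

# (1) binary logistic regression on a hand-made indicator for class 0
y_train_bin = (y_train == 0).astype(int)
logit_bin = LogisticRegression().fit(X2_train, y_train_bin)

# (2) multinomial (softmax) logistic regression: the default for
#     multiclass y with the lbfgs solver in recent sklearn versions
logit_multi = LogisticRegression().fit(X2_train, y_train)

# (3) one-vs-rest: one binary logistic regression per class
logit_ovr = OneVsRestClassifier(LogisticRegression()).fit(X2_train, y_train)

print(logit_bin.coef_, logit_bin.intercept_)
print(logit_multi.coef_, logit_multi.intercept_)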

Q4: For the multinomial model, use the estimated coefficients to calculate the predicted probabilities by hand. Feel free to confirm with the predict_proba command.

In [ ]:
### your code here

your answer here
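
A sketch of the by-hand calculation, assuming logit_multi from the Q3 sketch. For a multinomial fit, predict_proba is the softmax of the linear scores:

In [ ]:
# by hand: P(class k | x) = exp(score_k) / sum_j exp(score_j)
scores = X2_test.values @ logit_multi.coef_.T + logit_multi.intercept_
scores = scores - scores.max(axis=1, keepdims=True)  # stabilize exp()
probs_by_hand = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# should agree with sklearn
print(np.allclose(probs_by_hand, logit_multi.predict_proba(X2_test)))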

Q5: For the OvR model, use predict_proba() to estimate the predicted probabilities in the test set, and use these to calculate the predicted classes by hand. Feel free to confirm with the predict command.

In [ ]:
### your code here
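
A sketch, assuming logit_ovr from Q3: the predicted class is the one with the largest predicted probability, i.e. a row-wise argmax:

In [ ]:
probs_ovr = logit_ovr.predict_proba(X2_test)
# map the argmax column index back to a class label via classes_
classes_by_hand = logit_ovr.classes_[np.argmax(probs_ovr, axis=1)]
print(np.array_equal(classes_by_hand, logit_ovr.predict(X2_test)))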

Q6: How could you use the predict_proba() command and 'change the threshold' in the multiclass setting to affect the predictive accuracy within each class? Note: it is not as simple as moving a single cutoff, because there is no single threshold in the multiclass setting.

your answer here
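
One possible scheme (the weights below are arbitrary and purely illustrative): scale each class's predicted probability before taking the argmax, which trades accuracy in the down-weighted classes for accuracy in the up-weighted one:

In [ ]:
# up-weighting class 2 makes it easier to predict class 2 (and harder
# to predict the others); the weights are a hypothetical choice
weights = np.array([1.0, 1.0, 2.0])
weighted = logit_multi.predict_proba(X2_test) * weights
y_pred_weighted = logit_multi.classes_[np.argmax(weighted, axis=1)]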

Q7: Compare the accuracies in both train and test for both the multinomial and OvR logistic regressions. Which seems to be performing better? Is there any evidence of overfitting? How could this be corrected?

In [ ]:
### your code here

your answer here
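
A sketch of the comparison, assuming the models from the Q3 sketch (for classifiers, score reports accuracy):

In [ ]:
for name, model in [('multinomial', logit_multi), ('OvR', logit_ovr)]:
    print(name,
          'train:', model.score(X2_train, y_train),
          'test:', model.score(X2_test, y_test))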

Q8: Create the classification boundaries for the two multiclass logistic regression models above. How do they compare?

In [ ]:
### your code here

your answer here
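
A sketch of the standard grid-evaluation trick, assuming the Q3 models: predict at every point of a fine grid over the two predictors and shade the result:

In [ ]:
# grid spanning the observed range of the two predictors
x1, x2 = np.meshgrid(
    np.linspace(X2_train['alcohol'].min(), X2_train['alcohol'].max(), 200),
    np.linspace(X2_train['malic_acid'].min(), X2_train['malic_acid'].max(), 200))
grid = pd.DataFrame(np.c_[x1.ravel(), x2.ravel()],
                    columns=['alcohol', 'malic_acid'])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (name, model) in zip(axes, [('multinomial', logit_multi), ('OvR', logit_ovr)]):
    ax.contourf(x1, x2, model.predict(grid).reshape(x1.shape), alpha=0.3)
    ax.scatter(X2_train['alcohol'], X2_train['malic_acid'], c=y_train, edgecolor='k')
    ax.set(title=name, xlabel='alcohol', ylabel='malic_acid')
plt.show()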

Q9: Fit 3 different k-NN classification models: for $k = 3, 10, 30$. Visualize the classification boundaries for these 3 models and compare the results. Which seem to be overfit?

In [ ]:
### your code here
In [ ]:
### your code here

your answer here
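
A sketch reusing the grid from the Q8 sketch; smaller k tends to give more jagged boundaries (higher variance), larger k smoother ones:

In [ ]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, k in zip(axes, [3, 10, 30]):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X2_train, y_train)
    ax.contourf(x1, x2, knn.predict(grid).reshape(x1.shape), alpha=0.3)
    ax.scatter(X2_train['alcohol'], X2_train['malic_acid'], c=y_train, edgecolor='k')
    ax.set_title(f'k-NN, k = {k}')
plt.show()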

Q10: How could you visualize the classification boundary for any of these models if there were a single predictor? What if there were more than 2 predictors?

your answer here
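
With a single predictor the 'boundary' lives on a line, so one can plot the predicted class over a 1-D grid of that predictor, as in the sketch below (using alcohol alone; k = 10 is an arbitrary choice). With more than 2 predictors there is no direct picture: common workarounds are plotting 2-D slices with the remaining predictors held fixed (e.g. at their means), or projecting onto the first two principal components.

In [ ]:
# 1-D analogue of the boundary plots above, using alcohol alone
alcohol_grid = pd.DataFrame(
    {'alcohol': np.linspace(X_train['alcohol'].min(), X_train['alcohol'].max(), 300)})
knn1d = KNeighborsClassifier(n_neighbors=10).fit(X_train[['alcohol']], y_train)
plt.step(alcohol_grid['alcohol'], knn1d.predict(alcohol_grid))
plt.xlabel('alcohol')
plt.ylabel('predicted class')
plt.show()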