%%html
<style>
.jp-MarkdownCell {background-color: cornsilk;}
.text_cell {background-color: cornsilk;}
</style>
CS109A Introduction to Data Science
Standard Section 5: Logistic Regression, Multiple Logistic Regression, and kNN Classification¶
Harvard University
Fall 2020
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Section Leaders: Marios Mattheakis, Mike Sedelmeyer, N Ouyang
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get(
"https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css"
).text
HTML(styles)
Contents¶
In this section we will be covering Linear Regression, Logistic Regression, and kNN Classification, using the Iris dataset for our examples and the Wine dataset for our breakout exercises.¶
Specifically, we will:
Fit a LINEAR regression model for classification, understand drawbacks, and interpret results
Fit simple and higher-order polynomial LOGISTIC regression models, visualize their decision boundaries, and interpret results
Consider the impact of different regularization parameters on the above polynomial model
Fit a kNN classification model and compare its decision boundaries
For our breakout exercises, we will:
Standardize our data and fit a simple logistic regression model
Plot decision boundaries for the wine model we fit
Use `LogisticRegressionCV` to cross-validate a Lasso regularization parameter and examine which coefficients it shrinks to zero
By the end, we hope you will be comfortable building your own classification models.
For this section we will be using the following packages:
# Data and Stats packages
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# NEW packages
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
# Visualization packages
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
1. Import and explore data¶
For Section 5 we will be using datasets available directly within `sklearn`.
- For section examples, we will be using Fisher's famous "Iris" dataset
- For our breakout sessions, we will be using the "Wine" dataset
To learn more about `sklearn` datasets, please see the documentation: https://scikit-learn.org/stable/datasets/index.html
dataset = datasets.load_iris()
print(dataset.keys())
What are the target and features?¶
print(dataset.target_names)
print(dataset.feature_names)
# Uncomment if you want to see much, much more info on this dataset
# print(dataset['DESCR'])
As we can see, the response variable (y) is the flower type, which has 3 classes:
- setosa
- versicolor
- virginica
These are three different species of Iris flowers.
The 4 predictor variables are flower characteristics (x):
- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
Although this is a toy dataset, in real datasets it will be important to understand the underlying features. Below is a comparison of the three species and their sepals/petals, as well as the overall flower.
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(data=dataset.target, columns=['species'])
print(X.shape)
print(y.shape)
display(X.head())
display(X.describe())
display(y.head())
display(y.describe())
Explore Data¶
Check which variables have high correlations and distinctive patterns with the response.
Any patterns worth mentioning?
full_df = pd.concat([X,y], axis=1)
full_df.head()
sns.pairplot(full_df)
plt.show()
Are the features sepal/petal length and width uniformly distributed, or do you observe some clusters of data points? What do you expect? Let's add color according to our response variable with `hue='species'`.
sns.pairplot(full_df, hue='species')
plt.show()
Some features, like 'petal length' and 'petal width', have very high correlations and distinctive patterns with the response variable 'flower species'. If we used these features to predict the flower species, classification would not be very difficult: certain ranges of 'petal length' and 'petal width' are strongly associated with a specific species and almost separate our classes perfectly.
Just for illustration purposes we will continue to use only `'sepal width (cm)'` and `'sepal length (cm)'`. We are making the problem harder for ourselves by only using 'weaker' or less-discriminative features.
X = X[['sepal width (cm)', 'sepal length (cm)']]
Train-Test Split¶
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
2. Fit a LINEAR regression model for classification, understand drawbacks, and interpret results¶
model_linear_sklearn = LinearRegression()
#Training
model_linear_sklearn.fit(X_train, y_train)
#Predict
y_pred_train = model_linear_sklearn.predict(X_train)
y_pred_test = model_linear_sklearn.predict(X_test)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')
y_train[:5]
y_pred_train[:5]
The fact that our linear regression outputs continuous predictions is one of the major drawbacks of using linear regression for classification. We can address this in two ways:
- Simply round our predictions using `np.round()` and convert them to an integer data type with `.astype(int)`
- Or, use a modified algorithm that has bounded outputs (more about Logistic Regression later)
np.round(y_pred_train[:5])
np.round(y_pred_train[:5]).astype(int)
model_linear_sklearn = LinearRegression()
#Training
model_linear_sklearn.fit(X_train, y_train)
#Predict
y_pred_train = np.round(model_linear_sklearn.predict(X_train)).astype(int)
y_pred_test = np.round(model_linear_sklearn.predict(X_test)).astype(int)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')
Get Performance by Class (Lookup Confusion Matrix)¶
- Each row of the matrix represents the instances in an actual class
- Each column represents the instances in a predicted class (or vice versa)
- The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).
confusion_matrix_linear = pd.crosstab(
y_test.values.flatten(),
y_pred_test.flatten(),
rownames=['Actual Class'],
colnames=['Predicted Class'],
)
display(confusion_matrix_linear)
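As an aside, `sklearn` can produce the same table directly. A minimal sketch we add for illustration (it should agree with the `pd.crosstab` output above):
# sklearn's built-in confusion_matrix returns the same counts as a numpy array,
# with rows = actual classes and columns = predicted classes.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test.values.flatten(), y_pred_test.flatten()))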
Can we use Linear Regression for classification?
Four Assumptions of Linear Regression:
- Linearity: Our dependent variable Y is a linear combination of the explanatory variables X (and the error terms)
- Observations are independent of one another
- Constant variance of the error terms (i.e. homoscedasticity)
- The error terms are normally distributed ~ $N(0,\sigma^2)$
Suppose we have a binary outcome variable. Can we use Linear Regression?
Then we will have the following problems:
- The error terms are heteroskedastic
- $\epsilon$ is not normally distributed because Y takes on only two values
- The predicted probabilities can be greater than 1 or less than 0
Datasets where linear regression is problematic:
- Binary response data where there are only two outcomes (yes/no, 0/1, etc.)
- Categorical or ordinal data of any type, where the outcome is one of a number of discrete (possibly ordered) classes
- Count data in which the outcome is restricted to non-negative integers.
- Continuous data in which the noise is not normally distributed
This is where Generalized Linear Models (GLMs), of which logistic regression is a specific type, come to the rescue!
- Logistic regression is most useful for predicting binary or multi-class responses.
3. Fit a simple LOGISTIC regression model for classification, compare performance, and interpret results¶
The logistic regression formula:
$$\hat{p}= \dfrac{e^{w^T x}}{1+e^{w^T x}}$$
This is equivalent to (divide the numerator and denominator by $e^{w^T x}$):
$$\hat{p}= \dfrac{1}{1+e^{-w^T x}}$$
#Training
model_logistic = LogisticRegression(C=100).fit(X_train, y_train)
#Predict
y_pred_train = model_logistic.predict(X_train)
y_pred_test = model_logistic.predict(X_test)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')
Let's compare logistic regression against linear regression predictions¶
To do this, we will hold one of our 2 predictors constant so we can easily visualize and compare our prediction curves.
We fix `X_train['sepal width (cm)']` to its mean value:
x_1 = X_train['sepal width (cm)']
x_1_range = np.ones_like(x_2_range)*x_1.mean()
We vary `X_train['sepal length (cm)']` from its minimum to its maximum and look at how the predicted class evolves:
x_2 = X_train['sepal length (cm)']
x_2_min, x_2_max = x_2.min(), x_2.max()+0.3
x_2_range = np.arange(x_2_min, x_2_max, 0.003)
# Making our input features (x_2 varying, x_1 constant = mean of x_1)
x_1 = X_train['sepal width (cm)']
x_2 = X_train['sepal length (cm)']
x_2_min, x_2_max = x_2.min()-0.1, x_2.max()+0.3
x_2_range = np.arange(x_2_min, x_2_max, 0.003)
x_1_range = np.ones_like(x_2_range)*x_1.mean()
# Construct our input features
X_with_varying_x_2 = np.stack(
[x_1_range.ravel(), x_2_range.ravel()],
axis = 1
)
# Make linear Predictions
prediction_linear = model_linear_sklearn.predict(X_with_varying_x_2)
# Make logistic Predictions
prediction_proba = model_logistic.predict_proba(X_with_varying_x_2)
prediction_thresholded = model_logistic.predict(X_with_varying_x_2)
f, axes = plt.subplots(1,2, figsize=(14,6))
# Plot Linear Predictions
axes[0].plot(
x_2_range, prediction_linear,
'k--',
label= 'Linear Output (raw = continuous)'
)
axes[0].plot(
x_2_range, np.round(prediction_linear),
'k-',
alpha=1,
label= 'Linear Predicted Class (rounded = integer)',
)
axes[0].plot(
x_2_range, np.round(prediction_thresholded),
'r-',
alpha=0.5,
label= 'Logistic Predicted Class (as shown on right)',
)
axes[0].legend()
axes[0].set_title(
'LINEAR Regression:\nRaw output and rounded output',
fontsize=14,
)
axes[0].set_yticks([0, 1, 2, 3])
axes[0].set_ylabel('Predicted Class', fontsize=12)
# Plot Logistic Predictions
for i in sorted(set(prediction_thresholded)):
axes[1].plot(
x_2_range, prediction_proba[:,i],
linewidth=2,
label= '$\hat{{P}}\:(Y={})$'.format(i)
)
axes[1].fill_between(
x_2_range[prediction_thresholded==i], 1, 0,
alpha=0.2,
edgecolor='gray',
)
axes[1].legend()
axes[1].set_title(
"LOGISTIC Regression: predicted probability\nper class and the predicted class",
fontsize=14,
)
axes[1].set_ylabel('Predicted Probability', fontsize=12)
for ax in axes.flat:
ax.tick_params(labelsize=12)
ax.set_xlabel('sepal length (cm)', fontsize=12)
ax.legend(fontsize=11, framealpha=1, edgecolor='k')
plt.tight_layout()
plt.show()
How does our Logistic Regression come up with multiple class predictions?¶
- Each class $y_i$ has a sigmoid function that tries to predict the probability of the tested input belonging to that specific class $y_i$.
- In our case we have 3 classes, and thus 3 sigmoid functions (the blue, orange, and green lines in the right figure).
`LogisticRegression().predict_proba(...)`: returns probability estimates $P(y_i|x)$ for each class $y_i$. In our case `.predict_proba(...)` returns 3 values (one for each class). In the figure we observe that:
- we have a high probability of predicting Class 0 in regions with low 'sepal length' values (left).
- we have a high probability of predicting Class 1 in regions with medium 'sepal length' values (middle).
- we have a high probability of predicting Class 2 in regions with high 'sepal length' values (right).
`LogisticRegression().predict(...)`: returns 1 value: the predicted class label. The class with the highest probability given by `.predict_proba(...)` is exactly the predicted class output of `.predict(...)`.
- In the figure our final prediction is the red line.
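To make the relationship between `.predict_proba(...)` and `.predict(...)` concrete, here is a quick check we add for illustration (our class labels are 0, 1, 2, so the column index returned by `argmax` coincides with the label):
# The predicted class is the class with the largest predicted probability.
proba_test = model_logistic.predict_proba(X_test)
print(np.all(np.argmax(proba_test, axis=1) == model_logistic.predict(X_test)))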
Get Performance by Class (Lookup Confusion Matrix)¶
confusion_matrix_logistic = pd.crosstab(
y_test.values.flatten(),
y_pred_test.flatten(),
rownames=['Actual Class'],
colnames=['Predicted Class'],
)
display(confusion_matrix_logistic)
Let's compare the confusion matrix of our linear model to the confusion matrix of our logistic regression model:
print('======================================')
print('Confusion Matrix Linear Regression:')
display(confusion_matrix_linear)
print('\n======================================')
print('Confusion Matrix Logistic Regression:')
display(confusion_matrix_logistic)
print('======================================')
Now we observe that the logistic regression only predicts valid class labels, whereas the rounded linear regression can also produce out-of-range labels that don't correspond to any actual class.
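To see this directly, we can list the labels each model actually predicts on the test set (a small check we add for illustration):
# Labels produced by the rounded linear regression vs. the logistic regression.
print("Rounded linear regression labels:", np.unique(np.round(model_linear_sklearn.predict(X_test)).astype(int)))
print("Logistic regression labels:      ", np.unique(model_logistic.predict(X_test)))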
Breakout Session #1¶
Fit and examine a multiclass logistic regression model
- Load and get to know a new dataset
- Select our predictors and response variable
- Generate a train-test-split
- Scale our data
- Perform logistic regression and investigate our results
Image source: https://en.wikipedia.org/wiki/Wine
dataset_wine = datasets.load_wine()
print(dataset_wine.keys())
print()
print(dataset_wine['DESCR'])
X_wine = pd.DataFrame(data=dataset_wine.data, columns=dataset_wine.feature_names)
y_wine = pd.DataFrame(data=dataset_wine.target, columns=['class'])
print(X_wine.shape)
print(y_wine.shape)
full_df_wine = pd.concat([X_wine, y_wine], axis=1)
full_df_wine.head()
Uncomment and run the pairplot in the cell below if you wish to see a scatter-matrix of all variables contained in the wine dataset.
Please note: This plotting code will take a minute or so to run.
# sns.pairplot(full_df_wine, hue='class')
# plt.suptitle('Wine dataset features, colored by cultivator class', fontsize=30, y=1)
# plt.show()
The wine dataset provides many variables to use in our model.
Because we have limited time for this exercise, we will pretend we have already conducted EDA on our dataset and have chosen to focus solely on the predictors `alcohol` and `flavanoids`.
Run the code below to subset our full wine dataframe and to perform our train-test-split.
predictors = ['alcohol', 'flavanoids']
full_df_wine_train, full_df_wine_test, = train_test_split(
full_df_wine[predictors+['class']],
test_size=0.2,
random_state=109,
shuffle=True,
stratify=full_df_wine['class'],
)
- Now let's take a quick look at our training data to get a sense for how `alcohol` and `flavanoids` relate to one another, as well as to our cultivator `class` response variable.
sns.scatterplot(
data=full_df_wine_train,
x='alcohol', y='flavanoids',
hue='class',
)
plt.title(
'Wine training observations,\ncolored by cultivator class',
fontsize=14
)
plt.show()
What do you notice about the scale of our two predictor variables?
- It looks like the means are shifted relative to one another, but the total ranges of values are similar.
- Let's standardize our predictors to center our points and to ensure that both variables are commonly scaled.
- IMPORTANT: Remember to fit your scaler ONLY on your training data and to transform both train and test from that training fit. NEVER fit your scaler on the test set.
###################################
## SUBSET OUR X AND y DATAFRAMES ##
###################################
X_wine_train, y_wine_train = full_df_wine_train[predictors], full_df_wine_train['class']
X_wine_test, y_wine_test = full_df_wine_test[predictors], full_df_wine_test['class']
print(
"Summary statistics for our training predictors BEFORE standardizing:\n\n"
"{}\n".format(X_wine_train.describe())
)
##########################
## SCALE THE PREDICTORS ##
##########################
# Be certain to ONLY EVER fit your scaler on X train (NEVER fit it on test)
scaler = StandardScaler().fit(X_wine_train[predictors])
# Use your train-fitted scaler to transform both X train and X test
X_wine_train[predictors] = scaler.transform(X_wine_train[predictors])
X_wine_test[predictors] = scaler.transform(X_wine_test[predictors])
print(
"Summary statistics for our training predictors AFTER standardizing:\n\n"
"{}".format(X_wine_train.describe())
)
# Give it a shot!
model1_wine = ...
# %load '../solutions/breakout_1_sol.py'
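If you get stuck, here is one possible minimal approach (a hedged sketch; the loaded solution file may differ):
# Minimal sketch: fit a multiclass logistic regression on the standardized
# wine predictors and report train/test accuracy.
model1_wine = LogisticRegression().fit(X_wine_train, y_wine_train)
train_acc = accuracy_score(y_wine_train, model1_wine.predict(X_wine_train)) * 100
test_acc = accuracy_score(y_wine_test, model1_wine.predict(X_wine_test)) * 100
print("Training Set Accuracy:", str(train_acc) + '%')
print("Testing Set Accuracy:", str(test_acc) + '%')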
4. Visualize Predictions and Decision boundaries¶
What are decision boundaries:
- In general, a pattern classifier carves up (or tesselates or partitions) the feature space into volumes called decision regions.
- All feature vectors in a decision region are assigned to the same category.
X_train.head()
def plot_points(ax):
for i, y_class in enumerate(set(y_train.values.flatten())):
index = (y_train == y_class).values
ax.scatter(
X_train[index]['sepal width (cm)'],
X_train[index]['sepal length (cm)'],
c=colors[i],
marker=markers[i],
s=65,
edgecolor='w',
label=names[i],
)
f, ax = plt.subplots(1, 1, figsize=(7, 6))
colors = ["tab:blue", "tab:orange","tab:green"]
markers = ["s", "o", "D"]
names = dataset.target_names
plot_points(ax)
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
ax.set_title('Classes of Flowers')
ax.set_ylabel('sepal length (cm)')
ax.set_xlabel('sepal width (cm)')
plt.tight_layout()
plt.show()
Plotting the decision boundary by using the following functions:¶
`np.meshgrid()`
- `meshgrid` returns coordinate matrices from coordinate vectors, effectively constructing a grid.
- The documentation: https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html
`plt.contourf()`
- `contourf` draws filled contours.
- The documentation: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.contourf.html#matplotlib.pyplot.contourf
What is meshgrid? A quick numpy dive...¶
How does `meshgrid` work?
It creates two 2D arrays covering all the points in the grid: one for the x coordinates and one for the y coordinates.
We will go step-by-step to create a 3-by-4 grid below:
x_vector = np.linspace(-6, 6, 3) # start, stop, num pts
y_vector = np.linspace(-2, 1, 4)
print(f"x_vector: {x_vector} \t shape: {x_vector.shape}")
print(f"y_vector: {y_vector} \t shape: {y_vector.shape}")
# Return 2D x,y points of a grid
x_grid, y_grid = np.meshgrid(x_vector, y_vector)
print(f"x_grid, shape {x_grid.shape}:\n\n{x_grid}\n")
print(f"y_grid, shape {y_grid.shape}:\n\n{y_grid}")
Additional `numpy` methods you will notice used in our decision boundary plots:
- `np.ravel()`, which returns a contiguous flattened array.
- `np.stack()`, which concatenates a sequence of arrays along a new axis.
x_grid_ravel = x_grid.ravel()
y_grid_ravel = y_grid.ravel()
print(f"x_grid.ravel(), shape {x_grid_ravel.shape}:\n\n{x_grid_ravel}\n")
print(f"y_grid.ravel(), shape {y_grid_ravel.shape}:\n\n{y_grid_ravel}")
xy_grid_stacked = np.stack(
(x_grid.ravel(), y_grid.ravel()),
axis=1
)
print(f"x and y stacked as columns in output:\n\n{xy_grid_stacked}\n")
xy_grid_stacked = np.stack(
(x_grid.ravel(), y_grid.ravel()),
axis=0
)
print(f"x and y stacked as rows in output:\n\n{xy_grid_stacked}")
Now, with our `numpy` review over, let's get back to plotting our boundaries!¶
- First, we will create a very fine (spacing of 0.003!) `np.meshgrid` of points to evaluate and color.
x_1 = X_train['sepal width (cm)']
x_2 = X_train['sepal length (cm)']
# Just for illustration purposes we use a margin of 0.2 to the
# left, right, top and bottom of our minimal and maximal points.
# This way our minimal and maximal points won't lie exactly
# on the axis.
x_1_min, x_1_max = x_1.min() - 0.2, x_1.max() + 0.2
x_2_min, x_2_max = x_2.min() - 0.2, x_2.max() + 0.2
xx_1, xx_2 = np.meshgrid(
np.arange(x_1_min, x_1_max, 0.003),
np.arange(x_2_min, x_2_max, 0.003),
)
- Then, we will generate predictions for each point in our grid and use `plt.contourf` to plot those results!
# Plotting decision regions
f, ax = plt.subplots(1, 1, figsize=(7, 6))
X_mesh = np.stack((xx_1.ravel(), xx_2.ravel()),axis=1)
Z = model_logistic.predict(X_mesh)
Z = Z.reshape(xx_1.shape)
# contourf(): Takes in x,y, and z values -- all 2D arrays of same size
ax.contourf(xx_1, xx_2, Z, alpha=0.3, colors=colors, levels=2)
plot_points(ax)
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
ax.set_title('Classes of Flowers')
ax.set_ylabel('sepal length (cm)')
ax.set_xlabel('sepal width (cm)')
plt.tight_layout()
plt.show()
Why are the decision boundaries of this Logistic Regression linear?¶
Imagine the simple case where we have only a 2 class classification problem:
The logistic regression formula can be written as: $$ \hat{p} = \cfrac{e^{w^T x}}{1+e^{w^T x}} $$
Dividing the numerator and denominator by $e^{w^T x}$, this is equivalent to $$ \hat{p} = \cfrac{1}{1+e^{-w^T x}} $$
Expanding $w^T x$, we have $x_1$ (sepal width), $x_2$ (sepal length), and our intercept $x_0$ (constant = 1):
$$ w^T x = \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} $$
Which makes our formula for $\hat{p}$ equivalent to:
$$ \hat{p} = \cfrac{1}{1 +e^{\displaystyle -(w_0 \cdot 1 + w_1 \cdot x_1 + w_2 \cdot x_2)}} $$
Since we don't use higher order polynomial features like $x_1^2, x_2^2$, our logistic model only depends on the first order features $x_1$ and $x_2$.
What do we have to do to find the decision boundary?
The decision boundaries are exactly at the position where our algorithm "hesitates" when predicting which class to classify. The output probability of our sigmoid (or softmax) is exactly 0.5. Solving our sigmoid function for $p=0.5$:
$$ \begin{split} 0.5 &= \cfrac{1}{1 +e^{\displaystyle -(w_0 \cdot 1 + w_1 x_1 + w_2 x_2)}} \\ 0.5 &= \cfrac{1}{1 + 1} \\ e^{\displaystyle -(w_0 \cdot 1 + w_1 x_1 + w_2 x_2)} &= 1 \\ & \\ -(w_0 \cdot 1 + w_1 x_1 + w_2 x_2 ) &= 0 \end{split} $$
When we only use two predictor features, this constraint of $p=0.5$ results in a linear equation in $x_1$ and $x_2$; thus we observe a linear decision boundary.
In our case we have three classes, so each pairwise boundary is the set of points where two classes are predicted with equal probability, which is again linear in $x_1$ and $x_2$.
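To make the algebra concrete, here is a small illustration we add (not part of the original section code; the variable names are ours): fit a binary logistic regression on classes 0 and 1 only and solve $w_0 + w_1 x_1 + w_2 x_2 = 0$ for the boundary line.
# Minimal sketch: binary subproblem (setosa vs. versicolor) and its analytic boundary.
binary_idx = (y_train['species'] < 2).values
X_bin = X_train[binary_idx]
y_bin = y_train[binary_idx].values.ravel()
model_bin = LogisticRegression(C=100, max_iter=1000).fit(X_bin, y_bin)
w0 = model_bin.intercept_[0]
w1, w2 = model_bin.coef_[0]  # coefficients for sepal width (x_1) and sepal length (x_2)
x_1_line = np.linspace(X_bin['sepal width (cm)'].min(), X_bin['sepal width (cm)'].max(), 50)
x_2_line = -(w0 + w1 * x_1_line) / w2  # the points where the model predicts p = 0.5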
5. Fit a higher order polynomial logistic regression model for classification, compare performance, plot decision boundaries, and interpret results¶
In this section we will create degree-2 polynomial features for both of our predictors and re-fit our model and examine the results.
X_train.head()
X_train_poly = X_train.copy()
X_train_poly['sepal width (cm)^2'] = X_train['sepal width (cm)']**2
X_train_poly['sepal length (cm)^2'] = X_train['sepal length (cm)']**2
X_test_poly = X_test.copy()
X_test_poly['sepal width (cm)^2'] = X_test_poly['sepal width (cm)']**2
X_test_poly['sepal length (cm)^2'] = X_test_poly['sepal length (cm)']**2
X_test_poly.head()
#Training
model_logistic_poly = LogisticRegression(penalty="none").fit(X_train_poly, y_train)
#Predict
y_pred_train = model_logistic_poly.predict(X_train_poly)
y_pred_test = model_logistic_poly.predict(X_test_poly)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')
Adding the squared terms makes our model more flexible, which raises the risk of overfitting.
- How would you test if this is happening?
- How would you inhibit this phenomenon from happening?
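One way to answer the first question (a sketch we add for illustration): estimate out-of-sample accuracy with cross-validation on the training set and compare it with the training accuracy. Section 6 below revisits the second question using regularization.
# Minimal sketch: 5-fold cross-validated accuracy of the unregularized polynomial model.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(
    LogisticRegression(penalty="none", max_iter=1000),
    X_train_poly, y_train.values.ravel(),
    cv=5,
)
print("Mean CV accuracy: {:.1f}%".format(cv_scores.mean() * 100))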
# Plotting decision regions
f, ax = plt.subplots(1, 1, figsize=(6, 6))
X_mesh_poly = np.stack(
(xx_1.ravel(), xx_2.ravel(), xx_1.ravel()**2,xx_2.ravel()**2),
axis=1,
)
Z = model_logistic_poly.predict(X_mesh_poly)
Z = Z.reshape(xx_1.shape)
ax.contourf(xx_1, xx_2, Z, alpha=0.3, colors=colors, levels=2)
plot_points(ax)
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
ax.set_title('Classes of Flowers')
ax.set_ylabel('sepal length (cm)')
ax.set_xlabel('sepal width (cm)')
plt.show()
Decision boundaries for higher order logistic regression¶
Let's return to the mathematical representation of our logistic regression model:
$$\hat{p}= \dfrac{e^{w^T x}}{1+e^{w^T x}}$$
Which is equivalent to:
$$\hat{p}= \dfrac{1}{1+e^{-w^T x}}$$
Now we use $x_1$ (sepal width), $x_2$ (sepal length), an intercept $x_0$ (constant = 1), PLUS two higher order terms while making predictions:
$x_1^2 = [\text{sepal width}]^2$
$x_2^2 = [\text{sepal length}]^2$
Now solving for $p=0.5$ results in an equation also dependent on $x_1^2$ and $x_2^2$: thus we observe non-linear decision boundaries:
$$ \begin{split} 0.5 &= \cfrac{1}{1 +e^{\displaystyle -(w_0 \cdot 1 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2)}} \\ 0.5 &= \cfrac{1}{1 + 1} \\ e^{\displaystyle -(w_0 \cdot 1 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2)} &= 1 \\ & \\ -(w_0 \cdot 1 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2) &= 0 \end{split} $$
Breakout Session #2¶
Plot boundaries for our prior breakout model
- In this session, we will plot decision boundaries for the `model1_wine` model we fit during Breakout Session #1.
# We give you the plotting function here
def plot_wine_2d_boundaries(X_data, y_data, predictors, model, xx_1, xx_2):
"""Plots 2-dimensional decision boundaries for a fitted sklearn model
:param X_data: pd.DataFrame object containing your predictor variables
:param y_data: pd.Series object containing your response variable
:param predictors: list of predictor names corresponding with X_data columns
:param model: sklearn fitted model object
:param xx_1: np.array object of 2-dimensions, generated using np.meshgrid
:param xx_2: np.array object of 2-dimensions, generated using np.meshgrid
"""
def plot_points(ax):
for i, y_class in enumerate(set(y_data.values.flatten())):
index = (y_data == y_class).values
ax.scatter(
X_data[index][predictors[0]],
X_data[index][predictors[1]],
c=colors[i],
marker=markers[i],
s=65,
edgecolor='w',
label="class {}".format(i),
)
# Plotting decision regions
f, ax = plt.subplots(1, 1, figsize=(7, 6))
X_mesh = np.stack((xx_1.ravel(), xx_2.ravel()), axis=1)
Z = model.predict(X_mesh)
Z = Z.reshape(xx_1.shape)
ax.contourf(xx_1, xx_2, Z, alpha=0.3, colors=colors, levels=2)
plot_points(ax)
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
ax.set_title(
'Wine cultivator class prediction\ndecision boundaries',
fontsize=16
)
ax.set_xlabel(predictors[0], fontsize=12)
ax.set_ylabel(predictors[1], fontsize=12)
plt.tight_layout()
plt.show()
## Your code here
# Give breakout 2 a try! Just adapt the iris code to the wine dataset
# Create your meshgrid arrays similar to xx_1 and xx_2
# Be certain to use reasonable min and max bounds for your data
xx_1_wine, xx_2_wine = ..., ...
# Then plot your boundaries
plot_wine_2d_boundaries( ... )
# %load '../solutions/breakout_2_sol.py'
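If you get stuck, here is one possible minimal approach (a hedged sketch; the loaded solution file may differ). It reuses `model1_wine` from Breakout #1 and the global `colors`/`markers` lists defined earlier.
# Minimal sketch: a fine meshgrid over the standardized wine predictors,
# then the helper above draws the decision regions.
x_a = X_wine_train['alcohol']
x_f = X_wine_train['flavanoids']
xx_1_wine, xx_2_wine = np.meshgrid(
    np.arange(x_a.min() - 0.2, x_a.max() + 0.2, 0.01),
    np.arange(x_f.min() - 0.2, x_f.max() + 0.2, 0.01),
)
plot_wine_2d_boundaries(X_wine_train, y_wine_train, predictors, model1_wine, xx_1_wine, xx_2_wine)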
6. Fit regularized polynomial logistic regression models and examine the results¶
%%time
f, ax = plt.subplots(1, 3, sharey=True, figsize=(6*3, 6))
model_logistics =[]
model_logistics_test_accs_scores =[]
model_logistics_train_accs_scores =[]
for test, C in enumerate([10000, 100, 1]):
model_logistics.append(LogisticRegression(C=C).fit(X_train_poly, y_train))
y_pred_train = model_logistics[test].predict(X_train_poly)
y_pred_test = model_logistics[test].predict(X_test_poly)
model_logistics_train_accs_scores.append(accuracy_score(y_train, y_pred_train)*100)
model_logistics_test_accs_scores.append(accuracy_score(y_test, y_pred_test)*100)
Z = model_logistics[test].predict(X_mesh_poly)
Z = Z.reshape(xx_1.shape)
ax[test].contourf(xx_1, xx_2, Z, alpha=0.3, colors=colors, levels=2)
plot_points(ax[test])
ax[test].set_title('Classes of Flowers, with C = '+ str(C), fontsize=16)
ax[test].set_xlabel('sepal width (cm)', fontsize=12)
if test==0:
ax[test].legend(loc='upper left', ncol=1, fontsize=12)
ax[test].set_ylabel('sepal length (cm)', fontsize=12)
plt.tight_layout()
plt.show()
What do you observe?
- How are the decision boundaries looking?
- What happens when the regularization term `C` changes?
- You may want to look at the documentation of `sklearn.linear_model.LogisticRegression()` to see how the `C` argument works:
# To get the documentation uncomment and run the following command:
# LogisticRegression?
What do you expect regarding the evolution of the norm of the coefficients of our models when the regularization term `C` changes?
Our list contains all 3 models with different values for `C` (take a look at the first parameter within the brackets):
model_logistics
for test, model in enumerate(model_logistics):
print('\nRegularization parameter : \tC = {}'.format(model.C))
print("Training Set Accuracy : \t{}".format(model_logistics_train_accs_scores[test])+'%')
print("Testing Set Accuracy : \t\t{}".format(model_logistics_test_accs_scores[test])+'%')
print('Mean absolute coefficient : \t{:0.2f}'.format(np.mean(np.abs(model.coef_))))
Interpretation of Results: What happens when our regularization parameter `C` decreases?
In `sklearn`, `C` is the inverse of the regularization strength, so the amount of regularization increases. This results in:
- The training set accuracy decreasing a little bit (not much of a problem)
- The TEST accuracy INCREASING a little bit (better generalization!)
- The size of our coefficients DECREASING on average.
Breakout Exercise #3¶
Tune and fit a Lasso regularized model using cross-validation.
- Perform Lasso regularized logistic regression and choose an appropriate `C` by using cross-validation
- Confirm whether any predictors are identified as unimportant (i.e. $w_i=0$)
There are a number of different tools built into `sklearn` that help you to perform cross-validation.
- A lot of examples in this class so far have used `sklearn.model_selection.cross_validate` as the primary means for performing cross-validation.
- BUT WAIT!!! As it turns out, `sklearn` provides a very useful tool for performing logistic regression cross-validation across a range of regularization hyperparameters.
For this problem, you should use `sklearn.linear_model.LogisticRegressionCV`.
## Your code here.
# %load '../solutions/breakout_3_sol.py'
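If you get stuck, here is one possible minimal approach (a hedged sketch; the loaded solution file may differ, and the variable name `model3_wine` is just our choice). `LogisticRegressionCV` searches a grid of `C` values with k-fold cross-validation, and the `'saga'` solver supports the L1 (Lasso) penalty.
# Minimal sketch: cross-validate C for an L1-penalized (Lasso) logistic regression.
model3_wine = LogisticRegressionCV(
    Cs=10,          # 10 candidate C values on a log scale
    cv=5,           # 5-fold cross-validation
    penalty='l1',
    solver='saga',
    max_iter=5000,
).fit(X_wine_train, y_wine_train)
print("Chosen C per class:", model3_wine.C_)
print("Coefficients (zeros indicate dropped predictors):\n", model3_wine.coef_)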
7. kNN Classification¶
In this section, we will fit a kNN-classification model, plot our decision boundaries, and interpret the results.
You can read the `KNeighborsClassifier` documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
#Training
model_KNN_classifier = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
#Predict
y_pred_train = model_KNN_classifier.predict(X_train)
y_pred_test = model_KNN_classifier.predict(X_test)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')
The fact that we have a big gap in performance between the training and test sets means we are overfitting. This can be explained by our choice to use `n_neighbors=1`.
Please note: The below code cell will take several minutes to run, primarily because of the large number of values in our meshgrid against which each kNN model must generate predictions.
What does this tell us about the efficiency of the kNN algorithm vs. regularized logistic regression?
- Notice that the comparable set of 3 subplots generated using logistic regression only took about 1 second to run.
%%time
# Similar to what we did above for varying values C
# let's explore the effect that k in kNN classification has on our decision boundaries
ks = [1, 50, 100]
X_mesh = np.stack((xx_1.ravel(), xx_2.ravel()),axis=1)
Z = model_logistic.predict(X_mesh)
Z = Z.reshape(xx_1.shape)
f, ax = plt.subplots(1, 3, sharey=True, figsize=(6*3, 6))
model_ks = []
model_ks_test_accs_scores = []
model_ks_train_accs_scores = []
for test, k in enumerate(ks):
model_ks.append(KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train))
y_pred_train = model_ks[test].predict(X_train)
y_pred_test = model_ks[test].predict(X_test)
model_ks_train_accs_scores.append(accuracy_score(y_train, y_pred_train)*100)
model_ks_test_accs_scores.append(accuracy_score(y_test, y_pred_test)*100)
Z = model_ks[test].predict(X_mesh)
Z = Z.reshape(xx_1.shape)
ax[test].contourf(xx_1, xx_2, Z, alpha=0.3, colors=colors, levels=2)
plot_points(ax[test])
ax[test].set_title('Classes of Flowers, with k = '+ str(k), fontsize=16)
ax[test].set_xlabel('sepal width (cm)', fontsize=12)
if test==0:
ax[test].legend(loc='upper left', ncol=1, fontsize=12)
ax[test].set_ylabel('sepal length (cm)', fontsize=12)
plt.tight_layout()
plt.show()
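If we wanted to choose `k` more systematically rather than eyeballing three values, cross-validation is one option (a sketch we add for illustration; the candidate `k` values are arbitrary):
# Minimal sketch: pick k by 5-fold cross-validated accuracy on the training set.
from sklearn.model_selection import cross_val_score
k_candidates = [1, 3, 5, 10, 25, 50, 75]
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train.values.ravel(), cv=5).mean()
    for k in k_candidates
]
best_k = k_candidates[int(np.argmax(cv_means))]
print("Best k by 5-fold CV:", best_k)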
8. Introducing `sklearn` pipelines (optional)¶
Pipelines can be used to sequentially apply a list of transforms (e.g. scaling, polynomial feature creation) and a final estimator.
Instead of manually building our polynomial features, which can take many lines of code, we can use a pipeline to sequentially create the polynomial terms before fitting our logistic regression. Scaling can also be done inside the `make_pipeline`.
The documentation: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html
Previously we did:
X_train_poly = X_train.copy()
X_train_poly['sepal width (cm)^2'] = X_train['sepal width (cm)']**2
X_train_poly['sepal length (cm)^2'] = X_train['sepal length (cm)']**2
X_test_poly = X_test.copy()
X_test_poly['sepal width (cm)^2'] = X_test_poly['sepal width (cm)']**2
X_test_poly['sepal length (cm)^2'] = X_test_poly['sepal length (cm)']**2
Now it is a one-liner:
make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LogisticRegression())
# your code here
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
polynomial_logreg_estimator = make_pipeline(
PolynomialFeatures(degree=2,include_bias=False),
LogisticRegression(),
)
#Training
polynomial_logreg_estimator.fit(X_train, y_train)
#Predict
y_pred_train = polynomial_logreg_estimator.predict(X_train)
y_pred_test = polynomial_logreg_estimator.predict(X_test)
#Performance Evaluation
train_score = accuracy_score(y_train, y_pred_train)*100
test_score = accuracy_score(y_test, y_pred_test)*100
print("Training Set Accuracy:",str(train_score)+'%')
print("Testing Set Accuracy:",str(test_score)+'%')