Key Word(s): logistic regression, linear regression, mle
CS-109A Introduction to Data Science
Lab 6: Logistic Regression¶
Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras.
Contributors: Will Claybaugh, David Sondak, Chris Tanner
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Learning Goals¶
In this lab, we'll explore different models used to predict which of several labels applies to a new datapoint based on labels observed in the training data.
By the end of this lab, you should:
- Be familiar with the sklearn implementations of
  - Linear Regression
  - Logistic Regression
- Be able to make an informed choice of model based on the data at hand
- (Bonus) Structure your sklearn code into Pipelines to make building, fitting, and tracking your models easier
- (Bonus) Apply weights to each class in the model to achieve your desired tradeoffs between discovery and false alarm in various classes
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
from sklearn.model_selection import train_test_split
Part 1: The Wine Dataset¶
The dataset contains 11 chemical features of various wines, along with experts' rating of that wine's quality. The quality scale technically runs from 1-10, but only 3-9 are actually used in the data.
Our goal will be to distinguish good wines from bad wines based on their chemical properties.
Read-in and checking¶
We do the usual read-in and verification of the data:
wines_df = pd.read_csv("../data/wines.csv", index_col=0)
wines_df.head()
wines_df.describe()
Building the training/test data¶
As usual, we split the data before we begin our analysis.
Today, we take the 'quality' variable as our target. There's a debate to be had about the best way to handle this variable. It has 10 categories (1-10), though only 3-9 are used. While the variable is definitely ordinal (we can put the categories in an order everyone agrees on), it probably isn't a simple numeric feature; it's not clear whether the gap between a 5 and a 6 wine is the same as the gap between an 8 and a 9.
Ordinal regression is one possibility for our analysis (beyond the scope of this course), but we'll view the quality variable as categorical. Further, we'll simplify it down to 'good' and 'bad' wines (quality at or above 7, and quality at or below 6, respectively). This binary column already exists in the data, under the name 'good'.
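As a quick check (a sketch, not part of the original lab), the 'good' indicator could be reconstructed from 'quality' with a one-liner; here we only verify that it matches the provided column rather than adding anything to the dataframe.
# Sketch (illustrative only): derive a good/bad indicator from 'quality' and
# confirm it matches the 'good' column that ships with the data. We expect True
# if 'good' was indeed built with the quality >= 7 threshold described above.
good_check = (wines_df['quality'] >= 7).astype(int)
print("Matches provided 'good' column:", (good_check == wines_df['good']).all())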
wines_train, wines_test = train_test_split(wines_df, test_size=0.2, random_state=8, stratify=wines_df['good'])
x_train = wines_train.drop(['quality','good'], axis=1)
y_train = wines_train['good']
x_test = wines_test.drop(['quality','good'], axis=1)
y_test = wines_test['good']
x_train.head()
Now that we've split, let's explore some patterns in the data
from pandas.plotting import scatter_matrix
scatter_matrix(wines_train, figsize=(30,20));
It looks like there aren't any particularly strong correlations among the predictors (except perhaps between total and free sulfur dioxide), so we're safe to keep them all. It also looks like the different quality categories have roughly the same distribution of most variables, with volatile/fixed acidity and alcohol seeming like the most useful predictors.
Part 2 (Introduction): Binary Logistic Regression¶
Linear regression is usually a good baseline model, but since the outcome we're trying to predict only takes values 0 and 1 we'll want to use logistic regression instead of basic linear regression.
We'll begin with statsmodels, because cs109 likes confidence intervals and checking that coefficients make sense.
import statsmodels.api as sm
sm_fitted_logit = sm.Logit(y_train, sm.add_constant(x_train)).fit()
#sm_fitted_logit.summary() ### ORIGINAL VERSION. GAVE AttributeError: module 'scipy.stats' has no attribute 'chisqprob'
sm_fitted_logit.summary2() ### WORKS
Let's talk about the output:
First, "optimization terminated successfully". Recall that linear regression and its simple formula for the optimal betas is a rarity in machine learning and statistics: most models are fit to the data algorithmically, not via a formula. This message is letting us know that the algorithm seems to have worked.
Second, the pseudo $R^2$ is rather low (.23). As with regular $R^2$, we might take this as a sign that the model is struggling.
Finally, let's look at the coefficients.
- Several of the coefficients are statistically significant, including
- Fixed acidity - good
- Volatile Acidity - bad
- Residual Sugar - good (judges have a sweet tooth?)
- Chlorides - bad
- Sulphates - good
- Alcohol - good (judges like getting drunk?)
- The rest only reach a coefficient size we would often observe by chance alone, without any actual effect from the predictor
More formal interpretations of coefficients are long-winded: "A one unit increase in alcohol (holding all else constant) results in a predicted 0.494 increase in the log odds of a wine being classified as good."
We can't be more precise because the effect of one unit of alcohol depends on how much alcohol there already is. The one unit increase/decrease matters more if the wine is otherwise on the border between good and bad. If the wine is undrinkable (in the far left tail of the sigmoidal curve) one unit of alcohol barely moves the probability, while if the wine is in the middle of the curve that unit of alcohol has much more practical impact.
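To make this concrete, here is a small sketch (using the 0.494 alcohol coefficient quoted above and a few hypothetical starting log odds) showing that the same log-odds bump translates into very different probability changes depending on where the wine starts on the sigmoid.
# Sketch: the same +0.494 change in log odds moves the predicted probability
# by very different amounts depending on the starting point on the sigmoid.
from scipy.special import expit  # the logistic (sigmoid) function
starting_log_odds = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])  # hypothetical wines
before = expit(starting_log_odds)
after = expit(starting_log_odds + 0.494)
for b, a in zip(before, after):
    print("P(good) {:.3f} -> {:.3f} (change {:+.3f})".format(b, a, a - b))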
- Are there any bones you'd like to pick with the model I've laid out? Can you think of a better logistic regression model?
Prediction¶
One of the really cool features of logistic regression is that it hands back probabilities of a given case being 1 or 0, rather than just 1s and 0s. That lets us do neat things like set different cutoffs for what counts as a 1 and do ROC analysis and so on. Here, we'll just set the cutoff at 0.5: if a 1 is reported as more likely, predict a 1. (You can play with the cutoff yourself and see if you can make the model do better by trading false positives and false negatives).
Because this is statsmodels, we'll need to import a tool or do the test set score calculation ourselves. Here, it's easy enough to implement:
- do the predictions
- compare with .5
- find out what percentage of our binary predictions matched the truth
sm_binary_prediction = sm_fitted_logit.predict(sm.add_constant(x_test)) >= .5
np.sum(y_test == sm_binary_prediction) / len(y_test)
Wow! 80% is a pretty good performance! We can pretty much tell the bad wines from the good.
Here's a sanity check:
np.sum(y_test == 0) / len(y_test)
Oh... no... wait. A model that says "all wines are bad" also scores 80% on the test set. Our fancy model isn't really doing that well.
Moral of the story: Before you congratulate a model, think of a truly trivial model to compare it to.
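If you want to play with the cutoff mentioned above, a minimal sketch like the following sweeps a few thresholds and reports test accuracy (it reuses sm_fitted_logit; the particular thresholds are arbitrary choices).
# Sketch: vary the classification cutoff and watch how test accuracy changes.
# 0.5 is only one possible threshold; the values below are arbitrary.
predicted_probs = sm_fitted_logit.predict(sm.add_constant(x_test))
for cutoff in [0.1, 0.25, 0.5, 0.75]:
    binary_prediction = predicted_probs >= cutoff
    print("cutoff = {:.2f}, accuracy = {:.3f}".format(cutoff, np.mean(y_test == binary_prediction)))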
- Re-create the results above, but this time work with sklearn. Use the LogisticRegression class. Follow the usual .fit, .score procedure. To match statsmodels' coefficient values (roughly), you will need to adjust the input parameters:
  - C
  - solver
  - One other parameter
  - See the sklearn documentation
Hint: statsmodels uses a Newton-Raphson method to optimize the beta values.
from sklearn.linear_model import LogisticRegression
print("target:\n{}".format(sm_fitted_logit.params))
#
#fitted_lr = LogisticRegression(C=___, solver=___, ___)
Answer:
# your code here
from sklearn.linear_model import LogisticRegression
fitted_lr = LogisticRegression(C=1000000, solver='newton-cg', max_iter=250).fit(x_train,y_train)
print(fitted_lr.coef_)
print("Test set score:", fitted_lr.score(x_test,y_test))
# uncomment me and execute me - this will erase your cell ...
#%load solutions/sklearn_logistic.py
Speaker note: When presenting solution, model reading the documentation from the webpage. How does one know where to look?
Speaker note: Mention the wide variety of solvers and how (some) use different levels of derivatives to converge in fewer steps
The Decision Boundary¶
One powerful way to think about classification models is to consider where and how they draw the line between predicting "class A" and "class B". The code below lets you play with a 2d logistic regression. Points towards yellow will be predicted as 1s, points towards violet will be predicted as 0s.
from scipy.special import expit
def plot_logistic_contour(beta0, betaX, betaY, betaXY=0, betaX2=0, betaY2=0):
    delta = .1
    x_values = np.arange(-3.0, 3.0, delta)
    y_values = np.arange(-3.0, 3.0, delta)
    x_grid, y_grid = np.meshgrid(x_values, y_values)
    logistic_output = expit(beta0 + betaX*x_grid + betaY*y_grid
                            + betaXY*x_grid*y_grid + betaX2*x_grid**2 + betaY2*y_grid**2)
    contour_figure = plt.contour(x_grid, y_grid, logistic_output)
    plt.clabel(contour_figure, inline=1, fontsize=10)
    plt.xlim(-3, 3)
    plt.ylim(-3, 3)
    plt.show()
#plot_logistic_contour(beta0=1, betaX=2, betaY=3, betaXY=0, betaY2=.1)
# Use this cell to experiment
plot_logistic_contour(beta0=1, betaX=2, betaY=3)
- What happens to the decision boundary as the coefficient on X increases?
- What happens if you increase the Y coefficient to match?
- What does the constant term control?
- What impact do higher-order and interaction terms have on the boundary?
- What parameter settings should I show the class?
Answers:
your answer here
- The boundary tips towards vertical
- The boundary is in the same place as it was originally, but is squished together. The model is much more certain about how to predict points a given distance from the boundary
- It shifts the boundary, perpendicular to its current orientation
- Including squared terms allows quadratic decision boundaries, and the interaction term allows hyperbolic boundaries
# %load solutions/boundaries.txt
Part 3 (The Real Challenge): Multiclass Classification¶
Before we move on, let's consider a more common use case of logistic regression: predicting not just a binary variable, but what level a categorical variable will take. Instead of breaking the quality variable into 'good' and 'other', let's discretize it into 'good', 'medium', and 'bad'.
# copy the original data so that we're free to make changes
wines_df_recode = wines_df.copy()
# use the 'cut' function to reduce a variable down to particular bins. With bin edges [0,4,7,10],
# the lowest bin is quality 1-4, the next is 5-7, and the last is 8-10
wines_df_recode['quality'] = pd.cut(wines_df_recode['quality'],[0,4,7,10], labels=[0,1,2])
# drop the un-needed columns
x_data = wines_df_recode.drop(['quality','good'], axis=1)
y_data = wines_df_recode['quality']
x_train,x_test, y_train,y_test = train_test_split(x_data, y_data, test_size=.2, random_state=8, stratify=y_data)
print(wines_df['quality'].head())
print(wines_df_recode['quality'].head())
The cut function obviously stores a lot of extra information for us. It's a very useful tool for discretizing an existing variable.
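To see what cut actually produced, a quick sketch like this inspects the new categorical dtype and the number of wines in each bin.
# Sketch: inspect the categories created by pd.cut and their counts
print(wines_df_recode['quality'].dtype)
print(wines_df_recode['quality'].value_counts().sort_index())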
- Adapt your earlier logistic regression code to fit to the new training data. What is stored in .coef_ and .intercept_?
- How well does this model predict the test data?
- Put the model's performance in context. Think of a trivial model to compare to, and provide its accuracy score on the test set.
Answers:
1.
# your code here
from sklearn.linear_model import LogisticRegression
fitted_lr = LogisticRegression(C=1000000, solver='newton-cg', max_iter=250).fit(x_train,y_train)
print("Coefficients:")
print(fitted_lr.coef_)
print("Intercepts:")
print(fitted_lr.intercept_)
# %load solutions/multi_logistic.py
your answer here
1. We get three sets of coefficients, and three intercepts.
We need three sets because (under the default 'one versus rest' strategy) we fit three models. When predicting, model 1 reports a probability of the new example coming from class A or from the cloud of remaining classes. Model 2 reports the probability of whether the example comes from class B or the cloud of remaining classes, and so on. We take this set of scores and pick the biggest one (we classify as whichever class has the biggest ratio of "this class" to "not this class").
# %load solutions/multi_logistic.txt
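As a sanity check on that description (a sketch, not part of the official solution), the predicted classes can be reconstructed by hand from .coef_ and .intercept_: compute each class's linear score and take the argmax. Because the class labels here happen to be 0, 1, and 2, the argmax index can be compared directly to the output of .predict.
# Sketch: rebuild the multiclass predictions from the fitted coefficients.
# Each row of coef_, plus its intercept, gives one class's linear score;
# the predicted class is the one with the largest score.
scores = x_test.values @ fitted_lr.coef_.T + fitted_lr.intercept_
manual_predictions = scores.argmax(axis=1)
print("Agrees with .predict():", np.all(manual_predictions == fitted_lr.predict(x_test)))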
2.
# your code here
fitted_lr.score(x_test,y_test)
# %load solutions/score1.py
your answer here
2. The model does pretty well at predicting the test data...
3.
# make a dumb prediction that always guesses 1, the most common class
# your code here
dumb_prediction = np.ones(len(y_test))
np.sum(y_test == dumb_prediction) / len(y_test)
# %load solutions/trivial_model.py
your solution here
But a trivial model that always guesses the most likely class does really well on the test set, too.
# %load solutions/3.3.txt
Summary¶
- Logistic regression extends OLS to work naturally with a dependent variable that's only ever 0 and 1.
- In fact, Logistic regression is even more general and is used for predicting the probability of an example belonging to each of $N$ classes.
- The code for the two cases is identical and works just like LinearRegression: .fit, .score, and all the rest
- Significant predictors do not imply that the model actually works well. Significance can be driven by data size alone.
- The data aren't enough to do what we want
Warning: Logistic regression tries to hand back valid probabilities. As with all models, you can't trust the results until you validate them: if you're going to use raw probabilities instead of just predicted classes, take the time to verify that, if you pool all cases where the model says "I'm 30% confident it's class A", about 30% of them actually are class A.
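A minimal sketch of such a calibration check, assuming the binary fit from Part 2 (sm_fitted_logit) and the original wines_test split are still in memory, bins the predicted probabilities and compares each bin's average predicted probability to the observed fraction of good wines.
# Sketch of a calibration check for the binary model from Part 2.
# In each probability bin, the model's average predicted probability should
# roughly match the fraction of wines that actually were good.
binary_x_test = wines_test.drop(['quality', 'good'], axis=1)
binary_y_test = wines_test['good']
probs = sm_fitted_logit.predict(sm.add_constant(binary_x_test))
calib_df = pd.DataFrame({'predicted': probs.values, 'actual': binary_y_test.values})
calib_df['bin'] = pd.cut(calib_df['predicted'], bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0])
print(calib_df.groupby('bin')[['predicted', 'actual']].mean())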
Part 4: Dimensionality Reduction¶
Our models are clearly struggling, but it's hard to tell why. Let's use PCA to shrink the problem down to 2d (with as little loss as possible) and see if that gives us a clue about what makes this problem tough.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# scale the datasets
scale_transformer = StandardScaler(copy=True).fit(x_train)
x_train_scaled = scale_transformer.transform(x_train)
x_test_scaled = scale_transformer.transform(x_test)
# reduce dimensions
pca_transformer = PCA(2).fit(x_train_scaled)
x_train_2d = pca_transformer.transform(x_train_scaled)
x_test_2d = pca_transformer.transform(x_test_scaled)
print(x_train_2d.shape)
x_train_2d[0:5,:]
Some comments:
- Both scaling and reducing dimension follow the same pattern: we fit the object to the training data, then use .transform to convert the training and test data. This ensures that, for instance, we scale the test data using the training mean and variance, not its own mean and variance
- We need to equalize the variance of each feature before applying PCA; otherwise the features with the largest spread would dominate, and our PCA dimensions would just be those features.
## plot each group
# notice that we set up lists to track each group's plotting color and label
colors = ['r','c','b']
label_text = ["Bad Wines", "Medium Wines", "Good Wines"]
# and we loop over the different groups
for cur_quality in [0,1,2]:
    cur_df = x_train_2d[y_train==cur_quality]
    plt.scatter(cur_df[:,0], cur_df[:,1], c=colors[cur_quality], label=label_text[cur_quality])
# all plots need labels
plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.legend();
Well, that gives us some idea of why the problem is difficult: the good wines and bad wines are hiding right among the average wines. It does look like the wines separate into two groups, though, possibly one for reds and one for whites.
- What critique can you make against the plot above? Why does this plot not prove that the different wines are hopelessly similar?
- The wine data we've used so far consist entirely of continuous predictors. Would PCA work with categorical data?
Answer:
your answer here
- The PCA dimensions are chosen without regard to the y variable. Thus it is possible that the very next PCA dimension will lift the red points up out of the page, push the blue points down into it, and leave the cyan points where they are; such a dimension would separate the different types of wine and make classification easy.
- PCA would not work with categorical data. PCA requires there to be a meaningful notion of distance between points. Categorical or ordinal data is not enough.
# %load solutions/4.txt
- Edit the code above to plot the locations of red wines and white wines
Answer:
# your code here
## plot each group
# notice that we set up lists to track each group's plotting color and label
colors = ['r','c','b']
label_text = ["Reds", "Whites"]
# and we loop over the different groups
for cur_color in [0,1]:
    cur_df = x_train_2d[x_train['red']==cur_color]
    plt.scatter(cur_df[:,0], cur_df[:,1], c=colors[cur_color], label=label_text[cur_color])
# all plots need labels
plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.legend();
Evaluating PCA - Variance Explained¶
One of the criticisms we made of the PCA plot was that it's lost something from the original data.
Let's actually investigate how much of the original data's structure the 2d PCA captures. We'll look at the explained_variance_ratio_ attribute of the PCA fit. This lists, in order, the proportion of the x data's total variance that is captured by the nth PCA dimension.
var_explained = pca_transformer.explained_variance_ratio_
print("Variance explained by each PCA component:", var_explained)
print("Total Variance Explained:", np.sum(var_explained))
The first PCA dimension captures 33% of the variance in the data, and the second PCA dimension adds another 20%. Together, we've got about half of the total variation in the training data covered with just these two dimensions.
- Fit a PCA that finds the first 10 PCA components of our training data
- Use np.cumsum to print out the variance we'd be able to explain by using n PCA dimensions for n=1 through 10
- Does the 10-dimension PCA agree with the 2d PCA on how much variance the first components explain? Do the 10d and 2d PCAs find the same first two dimensions? Why or why not?
- Make a plot of number of PCA dimensions against total variance explained. What PCA dimension looks good to you?
Hint: np.cumsum stands for 'cumulative sum', so np.cumsum([1,3,2,-1,2]) is [1,4,6,5,7]
Answer:
#your code here
pca_10_transformer = PCA(10).fit(x_train_scaled)
pca_10_transformer
np.cumsum(pca_10_transformer.explained_variance_ratio_)
3.
your answer here
The 10d PCA and the 2d PCA agree about how much variance the first two components explain. The 10d and 2d PCA give the same components in the same order. This means it's safe to simply fit a PCA with the largest number of components you expect you will need, and take a subset as appropriate.
# %load solutions/6.3.txt
4.
#your code here
plt.scatter(range(1,11),np.cumsum(pca_10_transformer.explained_variance_ratio_))
plt.xlabel("PCA Dimension")
plt.ylabel("Total Variance Captured")
plt.title("Variance Explained by PCA");
# %load solutions/6.4.py
A PCA dimension of 3, 4, or 5 looks good to me. These values are roughly where we hit diminishing returns on variance explained.
Plots like the one above are called 'Scree' or 'Elbow' plots. They are often used to heuristically select a good number of PCA dimensions.
Summary¶
- PCA maps a high-dimensional space into a lower dimensional space.
- The PCA dimensions are ordered by how much of the original data's variance they capture
- There are other cool and useful properties of the PCA dimensions (orthogonal, etc.). See a textbook.
- PCA on a given dataset always gives the same dimensions in the same order.
- You can select the number of dimensions by fitting a big PCA and examining a plot of the cumulative variance explained.
Part 5: Did we fail?¶
None of the models worked, and we can't tell good wines from bad. Was it all a waste of time and money?
Not really. All analyses are a roll of the dice. Some analyses fail, like this one did, because the data at hand just don't support the task we've set out.
What can we do about it?
- Be honest about the methods and the null result. Lots of analyses fail.
- Collect a dataset you think has a better chance of success. Maybe we collected the wrong chemical signals...
- Keep trying new approaches. Just beware of overfitting the data you're validating on. Always have a test set locked away for when the final model is built.
- Change the question. Maybe something you noticed during analysis seems interesting or useful (classifying red versus white). But again, the more you try, the more you might overfit, so keep some test data locked away.
- Just move on. If the odds of success start to seem small, maybe you need a new project.
The Moral of the Lab¶
- Sometimes, the data just aren't enough to adequately predict outcomes.
- In this lab we saw that no amount of modeling finesse would let us use a wine's chemical properties to tell good wines and bad wines from mediocre ones.
- The chemical properties were very good at telling red wines from whites, however.
- PCA helped us visualize the data and confirm that the highly rated wines just aren't chemically distinct from the other wines.
- NOT ALL ANALYSES YIELD USEFUL RESULTS. Sometimes (arguably most of the time), the data aren't suitable for a task or just don't have anything interesting to say.
Part 6 (Sidebar): Pipelines¶
Remember when we were trying to adapt our LDA model to run on the training data with 'red' dropped? We had to invent new variable names and define functions and it was generally much harder than it needed to be. Pipelines are sklearn's tool for packaging an entire analysis together into a single object. This enables convenient inspection, saving, deployment, and (yes) cross validation of the model.
Let's look at an example (we'll switch the model to KNN to justify some later analysis).
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
knn_pipeline = Pipeline(
    [
        ('scaling', StandardScaler()),      # scale all columns
        ('dim_reduction', PCA()),           # PCA to reduce dimension
        ('model', KNeighborsClassifier())   # KNN to predict
    ]
)
# run with default settings
knn_pipeline.fit(x_train, y_train)
print("Test set score (default parameters)", knn_pipeline.score(x_test, y_test))
# particular sub-component settings are accessed with the component name, two
# underscores, and the parameter name
knn_pipeline.set_params(dim_reduction__n_components = 2, model__n_neighbors = 5)
knn_pipeline.fit(x_train, y_train)
print("Test set score (updated parameters)", knn_pipeline.score(x_test, y_test))
There's also a convenience function make_pipeline that lets us skip naming the different steps. Notice the default names are all-lowercase versions of the class names (standardscaler, pca, kneighborsclassifier).
from sklearn.pipeline import make_pipeline
knn_pipeline = make_pipeline(StandardScaler(), PCA(), KNeighborsClassifier())
knn_pipeline
It's easy to run the whole modelling process on new data:
red_model = knn_pipeline.fit(x_train.drop('red', axis=1), x_train['red'])
red_model.score(x_test.drop('red', axis=1), x_test['red'])
As promised, cross validation tools work directly with the pipeline object.
from sklearn.model_selection import cross_val_score
cross_val_score(knn_pipeline, x_train, y_train, cv=3)
from sklearn.model_selection import GridSearchCV
search_dict = {
    'pca__n_components': [3,5,10],
    'kneighborsclassifier__n_neighbors': [1,2,3,4,5]
}
cv_results = GridSearchCV(knn_pipeline, search_dict, cv=3).fit(x_train, y_train)
cv_results.best_params_
Note: In general, more PCA components will work better for prediction. However, KNN often performs worse as dimension increases, meaning there may be a meaningful balance point between capturing more variance and a space small enough for KNN to work well.
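To see that balance point directly, a sketch like this one pulls the full grid of mean cross-validation scores out of cv_results_ and arranges it by PCA dimension and neighbor count.
# Sketch: look at the whole grid of mean CV scores, not just best_params_,
# to see the tradeoff between PCA dimension and number of neighbors.
cv_table = pd.DataFrame(cv_results.cv_results_)
print(cv_table.pivot_table(index='param_pca__n_components',
                           columns='param_kneighborsclassifier__n_neighbors',
                           values='mean_test_score'))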
Part 7 (Sidebar): Weighting the training points¶
Some models can accept weights on the training points to give them greater priority in the model's fitting process. This can be useful if, for instance, certain classes are rare but we want to be sure the model classifies them correctly (e.g. we're trying to classify cancers and one form is rare but very aggressive). In general, weighting training points is like moving along the ROC curve: we adjust the model so that the mistakes it makes are more in line with our tastes.
Let's see this in action with a logistic regression:
unweighted_lr = LogisticRegression(C=1000000).fit(x_train,y_train)
weight_dict = {0:100, 1:1, 2:100}
weighted_lr = LogisticRegression(C=1000000, class_weight=weight_dict).fit(x_train,y_train)
from sklearn.metrics import confusion_matrix
print("Rows: True Lables (Bad, Medium, Good), \nColummns: Predicted Lables (Bad, Medium, Good)")
print()
print("unweighted:")
print(confusion_matrix(y_test, unweighted_lr.predict(x_test)))
print("weighted:")
print(confusion_matrix(y_test, weighted_lr.predict(x_test)))
Without weighting, the model plays it safe and predicts that all of the test set wines are medium. With weighting, the model is told to care more about getting the bad and good wines right. The model does as we've asked and correctly IDs 3 good/bad test wines, at the price of 17 falsely bad wines and 16 falsely good wines. However, if identifying bad and good wines is, as implied, 100 times more important than identifying medium wines, we've made a really good trade.
- What happens if you give a weight of 0 to the medium wines?
- What weighting gives results that accord with your personal sense of what the model should be doing? How many actually-medium bottles is a single good bottle worth?
Answers:
- The model learns a classification rule that never predicts 'medium'. It's as if we dropped the medium wines from training.
- 100, 1, 100 looks the best to me. We get a 1-in-8 success rate on the wines flagged as good. However, I found these values by looking at the test set confusion matrix; it's not clear they'd maintain the 1-in-8 ratio on new data.
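If you want to explore the tradeoff yourself, a small sketch like this loops over a few candidate weightings (arbitrary values) and prints the resulting test confusion matrices. Keep in mind the caveat above about tuning weights against the test set.
# Sketch: compare confusion matrices for a few arbitrary class weightings.
# Beware: repeatedly tuning weights against the test set risks overfitting to it.
for weights in [{0: 10, 1: 1, 2: 10}, {0: 100, 1: 1, 2: 100}, {0: 1000, 1: 1, 2: 1000}]:
    weighted_model = LogisticRegression(C=1000000, solver='newton-cg', max_iter=250,
                                        class_weight=weights).fit(x_train, y_train)
    print("weights:", weights)
    print(confusion_matrix(y_test, weighted_model.predict(x_test)))
    print()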