CS-109A Introduction to Data Science

Lab 6: Logistic Regression

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras.
Contributors: Will Claybaugh, David Sondak, Chris Tanner


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

Learning Goals

In this lab, we'll explore different models used to predict which of several labels applies to a new datapoint based on labels observed in the training data.

By the end of this lab, you should:

  • Be familiar with the sklearn implementations of
    • Linear Regression
    • Logistic Regression
  • Be able to make an informed choice of model based on the data at hand
  • (Bonus) Structure your sklearn code into Pipelines to make building, fitting, and tracking your models easier
  • (Bonus) Apply weights to each class in the model to achieve your desired tradeoffs between discovery and false alarm in various classes
In [2]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

from sklearn.model_selection import train_test_split

Part 1: The Wine Dataset

The dataset contains 11 chemical features of various wines, along with experts' ratings of each wine's quality. The quality scale technically runs from 1 to 10, but only ratings 3 through 9 are actually used in the data (in the 1,000-wine sample loaded below, the observed range is 3 to 8).

Our goal will be to distinguish good wines from bad wines based on their chemical properties.

Read-in and checking

We do the usual read-in and verification of the data:

In [5]:
wines_df = pd.read_csv("../data/wines.csv", index_col=0)

wines_df.head()
Out[5]:
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  red  good
0            8.9             0.590         0.50             2.0      0.337                 27.0                  81.0  0.99640  3.04       1.61      9.5        6    1     0
1            7.7             0.690         0.22             1.9      0.084                 18.0                  94.0  0.99610  3.31       0.48      9.5        5    1     0
2            8.8             0.685         0.26             1.6      0.088                 16.0                  23.0  0.99694  3.32       0.47      9.4        5    1     0
3           11.4             0.460         0.50             2.7      0.122                  4.0                  17.0  1.00060  3.13       0.70     10.2        5    1     0
4            8.8             0.240         0.54             2.5      0.083                 25.0                  57.0  0.99830  3.39       0.54      9.2        5    1     0
In [6]:
wines_df.describe()
Out[6]:
       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide  total sulfur dioxide      density           pH    sulphates      alcohol      quality         red         good
count    1000.000000       1000.000000   1000.00000     1000.000000  1000.000000           1000.00000            1000.00000  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000  1000.00000  1000.000000
mean        7.558400          0.397455      0.30676        4.489250     0.067218             25.29650              91.03100     0.995351     3.251980     0.572990    10.489433     5.796000     0.50000     0.189000
std         1.559455          0.189923      0.16783        4.112419     0.046931             17.06237              59.57269     0.002850     0.164416     0.169583     1.151195     0.844451     0.50025     0.391705
min         3.800000          0.080000      0.00000        0.800000     0.009000              1.00000               6.00000     0.987400     2.740000     0.280000     8.500000     3.000000     0.00000     0.000000
25%         6.500000          0.260000      0.22000        1.800000     0.042000             12.00000              37.75000     0.993480     3.140000     0.460000     9.500000     5.000000     0.00000     0.000000
50%         7.200000          0.340000      0.30000        2.400000     0.060000             22.00000              86.00000     0.995690     3.240000     0.550000    10.300000     6.000000     0.50000     0.000000
75%         8.200000          0.520000      0.40000        6.100000     0.080000             35.00000             135.00000     0.997400     3.360000     0.650000    11.300000     6.000000     1.00000     0.000000
max        15.500000          1.580000      1.00000       26.050000     0.611000            131.00000             313.00000     1.003690     3.900000     2.000000    14.000000     8.000000     1.00000     1.000000
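
Beyond `head()` and `describe()`, two quick checks are often worth running at this stage: a count of missing values per column and a look at the class balance. The sketch below is only an illustration. It uses a hand-typed copy of the five rows printed above (with a subset of the columns), and `sample_df` is a hypothetical name; in the lab, the same calls would be applied to `wines_df`.

```python
import pandas as pd

# Hand-typed copy of the first five rows shown above (subset of columns).
# sample_df is a hypothetical stand-in for wines_df.
sample_df = pd.DataFrame({
    "fixed acidity": [8.9, 7.7, 8.8, 11.4, 8.8],
    "alcohol": [9.5, 9.5, 9.4, 10.2, 9.2],
    "quality": [6, 5, 5, 5, 5],
    "red": [1, 1, 1, 1, 1],
    "good": [0, 0, 0, 0, 0],
})

# Number of missing entries in each column
missing_per_column = sample_df.isna().sum()

# Because 'good' is coded 0/1, its mean is the share of good wines
good_share = sample_df["good"].mean()

print(missing_per_column)
print("share of good wines in this sample:", good_share)
```

Read the same way, the mean of 'good' in the summary table (0.189) says that about 19% of the 1,000 wines are labeled good, so the two classes are noticeably imbalanced.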

Building the training/test data

As usual, we split the data before we begin our analysis.

Today, we take the 'quality' variable as our target. There's a debate to be had about the best way to handle this variable. It has 10 categories (1-10), though only 3-9 are used. The variable is definitely ordinal (we can put the categories in an order everyone agrees on), but it probably isn't a simple numeric feature: it's not clear whether the gap between a 5 and a 6 wine is the same as the gap between an 8 and a 9.

Ordinal regression is one possibility for our analysis (beyond the scope of this course), but we'll view the quality variable as categorical. Further, we'll simplify it down to 'good' and 'bad' wines (quality at or above 7, and quality at or below 6, respectively). This binary column already exists in the data, under the name 'good'.
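
The thresholding rule described above can be sketched in one line of pandas. This is only an illustration of how a column like 'good' could be derived; the lab's data file already ships with it, and the `quality` values below are made up for the example.

```python
import pandas as pd

# Hypothetical quality ratings, chosen to cover both sides of the cutoff
quality = pd.Series([3, 5, 6, 7, 8, 6, 7], name="quality")

# 1 for quality at or above 7 ("good"), 0 for quality at or below 6 ("bad")
good = (quality >= 7).astype(int)

print(good.tolist())  # [0, 0, 0, 1, 1, 0, 1]
```

Comparing the result of such a rule against the existing 'good' column is a cheap way to confirm that the label means what we think it means.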

In [7]:
wines_train, wines_test = train_test_split(wines_df, test_size=0.2, random_state=8, stratify=wines_df['good'])

x_train = wines_train.drop(['quality','good'], axis=1)
y_train = wines_train['good']

x_test = wines_test.drop(['quality','good'], axis=1)
y_test = wines_test['good']

x_train.head()
Out[7]:
     fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  red
744            7.6              0.30         0.37             1.6      0.087                 27.0                 177.0  0.99438  3.09       0.50      9.8    0
51             7.6              0.29         0.49             2.7      0.092                 25.0                  60.0  0.99710  3.31       0.61     10.1    1
213           13.2              0.46         0.52             2.2      0.071                 12.0                  35.0  1.00060  3.10       0.56      9.0    1
883            8.6              0.33         0.34            11.8      0.059                 42.0                 240.0  0.99882  3.17       0.52     10.0    0
98             7.7              0.41         0.76             1.8      0.611                  8.0                  45.0  0.99680  3.06       1.26      9.4    1
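
The `stratify` argument in the split above asks `train_test_split` to keep the share of good wines roughly equal in the training and test sets. The sketch below illustrates that behavior on a synthetic stand-in for the data: `toy_df` and its `alcohol` values are made up, and only the size (1,000 rows) and the number of good wines (189, from the mean of 0.189 reported by `describe()`) mirror the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for wines_df: 1000 rows, 189 of them labeled good
toy_df = pd.DataFrame({
    "alcohol": np.linspace(8.5, 14.0, 1000),   # made-up feature values
    "good": np.array([1] * 189 + [0] * 811),
})

# Same split settings as in the lab cell above
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=8, stratify=toy_df["good"]
)

# Share of good wines in each piece; both should sit close to 0.189
train_share = toy_train["good"].mean()
test_share = toy_test["good"].mean()

print(len(toy_train), len(toy_test))
print(round(train_share, 3), round(test_share, 3))
```

Without `stratify`, a small test set can end up with noticeably more or fewer good wines than the training set purely by chance, which makes test metrics harder to interpret.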

Now that we've split the data, let's explore some patterns in the training set.

In [8]:
from pandas.plotting import scatter_matrix

scatter_matrix(wines_train, figsize=(30,20));
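
With 14 columns, the scatter matrix has 196 panels and can be hard to read. A numeric complement is to look at each feature's correlation with the target. The sketch below is a minimal illustration on made-up numbers: `toy_train` and its values are hypothetical and say nothing about the real wines; in the lab the same call could be applied to `wines_train`.

```python
import pandas as pd

# Hypothetical training data: values invented purely for illustration
toy_train = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.5, 12.0, 12.5],
    "volatile acidity": [0.70, 0.60, 0.65, 0.30, 0.35, 0.25],
    "good": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the binary target,
# sorted from most negative to most positive
corr_with_good = toy_train.corr()["good"].drop("good").sort_values()

print(corr_with_good)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a screening tool rather than a substitute for the plots.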