CS109A Introduction to Data Science
Standard Section 7: Multiclass Classification¶
Harvard University
Fall 2018
Section Leaders: Mehul Smriti Raje, Ken Arnold, Karan Motwani, Cecilia Garraffo
Instructors: Pavlos Protopapas, Kevin Rader
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
from matplotlib.patches import Ellipse
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import confusion_matrix
from scipy import linalg
from scipy.stats import multivariate_normal
import itertools
import time
from itertools import combinations
from IPython.display import Image
from pandas.plotting import scatter_matrix
import seaborn as sns
sns.set(style='white')
This data is from the Sloan Digital Sky Survey's SkyServer¶
The Sloan Digital Sky Survey is a project to map a large part of the universe using a 2.5-meter telescope located at Apache Point Observatory in New Mexico.¶
Image('../fig/sdss.jpg')
*Image from SDSS
A few physics concepts for this section:¶
1) Blackbody Radiation:¶
All objects with a temperature above absolute zero (0 K, -273.15 degrees C) emit energy in the form of electromagnetic radiation. A blackbody is a theoretical or model body which absorbs all radiation falling on it, reflecting or transmitting none. It is a hypothetical object which is a “perfect” absorber and a “perfect” emitter of radiation over all wavelengths. The spectral distribution of the thermal energy radiated by a blackbody (i.e. the pattern of the intensity of the radiation over a range of wavelengths or frequencies) depends only on its temperature.¶
Based on this concept, astronomers use the distribution of measured colors in an object as a proxy for its temperature.¶
Image('../fig/blackbody_radn_curves.png')
*Image & text from Swinburne University of Technology (http://astronomy.swin.edu.au/cosmos/B/Blackbody+Radiation)
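To make the blackbody picture concrete, here is a minimal sketch (not part of the original notebook's pipeline; the temperatures are arbitrary examples) that plots Planck's law, the formula behind the curves in the figure above:
# Planck's law: B(wl, T) = (2 h c^2 / wl^5) / (exp(h c / (wl k T)) - 1)
h, c, kB = 6.626e-34, 2.998e8, 1.381e-23   # SI units
wl = np.linspace(1e-7, 3e-6, 500)          # 100 nm to 3 microns
for T in [3000, 4000, 5000, 6000]:         # temperatures in Kelvin (arbitrary examples)
    B = (2 * h * c**2 / wl**5) / np.expm1(h * c / (wl * kB * T))
    plt.plot(wl * 1e9, B, label='T = {} K'.format(T))
plt.xlabel('wavelength (nm)')
plt.ylabel('spectral radiance (W sr$^{-1}$ m$^{-3}$)')
plt.legend();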
2) Redshift:¶
Image('../fig/redshift.png')
*Image from Wikipedia.
Because the universe is expanding, objects farther from us recede faster (think of a balloon being inflated: the relative velocity of two points far apart on its surface is larger than that of two nearby points).¶
Image('../fig/expansion.jpeg')
Combining these two concepts, it is possible to estimate the distance to a body based on the energy distribution of its emission.¶
Image('../fig/quasar.jpeg')
*Image from Wikipedia.
Data Preparation¶
Load and investigate¶
sky = pd.read_csv('data/skyserver2.csv')
display(sky.head())
display(sky.describe())
sky.info()
Transform predictors¶
Intensities to colors¶
The single-letter columns are the spectral energy in certain bands of wavelengths, so the differences between them tell us about color. Astronomers typically look at the following differences:
sky['u-g'] = sky['u'] - sky['g']
sky['g-r'] = sky['g'] - sky['r']
sky['r-i'] = sky['r'] - sky['i']
sky.head()
# Double check: (u-g) + (g-r) telescopes to (u-r), so this should be ~0
max(abs(
(sky['u'] - sky['r']) -
(sky['u-g'] + sky['g-r'])))
Log-transform Redshift¶
Redshift makes much more sense on a log scale:
plt.boxplot(sky.redshift, vert=False);
# Negative redshift is physically impossible, so the np.maximum clips everything below 0 to 0.
# log of 0 is -inf, so shift up by a small constant
sky['log_redshift'] = np.log(np.maximum(sky.redshift, 0) + 1e-6)
plt.boxplot(sky.log_redshift, vert=False);
Make target numeric¶
sky['class'].value_counts()
sky['class'].value_counts(normalize=True)
QSO stands for quasi-stellar object (quasar); quasars represent about 10% of our data.
We'll want to represent our target classes numerically.
Exercise: Could we use a dummy variable encoding for our 3 targets?
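For reference, here's what a dummy (one-hot) encoding of the class column looks like; whether it works as a single numeric response is the point of the exercise.
# One indicator column per class:
pd.get_dummies(sky['class']).head()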
First we make a mapping from names to ids and back:
tgt_id2name = ['GALAXY', 'STAR', 'QSO']
tgt_name2id = {name: idx for idx, name in enumerate(tgt_id2name)}
print("Class 1 is", tgt_id2name[1])
print("Galaxy is class", tgt_name2id['GALAXY'])
Now we use that mapping to make a new numeric column.
sky['response'] = [tgt_name2id[cls] for cls in sky['class']]
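An equivalent, arguably more idiomatic pandas spelling uses Series.map with the same dictionary:
# Same result as the list comprehension above.
sky['response'] = sky['class'].map(tgt_name2id)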
Separate data into train and test sets¶
data_train_raw, data_test_raw = train_test_split(sky, test_size=.3, random_state=2001)
len(data_train_raw), len(data_test_raw)
# This is the somewhat magic way:
data_train_raw.groupby('class').log_redshift.apply(lambda group: sns.kdeplot(group));
# Less magic, but less functional.
data_train_raw.groupby('class').log_redshift.hist()
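If the groupby/apply one-liner feels too magic, an explicit loop does the same thing and also gives us a legend (a sketch; the column and class names come from the cells above):
# Explicit version of the KDE plot above: one curve per class, with a legend.
for name, group in data_train_raw.groupby('class'):
    sns.kdeplot(group.log_redshift, label=name)
plt.xlabel('log_redshift')
plt.legend();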
Scale data¶
def scale_df(df, means, stds):
cols_to_scale = means.index
df = df.copy()
df[cols_to_scale] = (df[cols_to_scale] - means) / stds
return df
cols_to_scale = ['u-g', 'g-r', 'r-i']
train_means = data_train_raw[cols_to_scale].mean(axis=0)
train_stds = data_train_raw[cols_to_scale].std(axis=0)
data_train = scale_df(data_train_raw, train_means, train_stds)
data_test = scale_df(data_test_raw, train_means, train_stds)
# Check results.
data_train[cols_to_scale].describe()
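Note that the test set was scaled with the training means and stds, so its scaled columns won't be exactly mean-0 and std-1; that's expected, since scaling must not peek at the test data.
# Test-set statistics should be close to, but not exactly, 0 and 1.
data_test[cols_to_scale].describe()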
Visualize!¶
def scatter_stars(ax, df, columns, class_labels, class_colors, s=5,
xlim=[-4, 4], ylim=[-4, 4], **kw):
for idx, (color, name) in enumerate(zip(class_colors, class_labels)):
subset = df[df['class'] == name]
ax.scatter(
subset[columns[0]], subset[columns[1]],
label=name,
c=color, s=s, **kw)
ax.set(xlabel=columns[0], ylabel=columns[1], xlim=xlim, ylim=ylim)
ax.legend()
data_sample = data_train.sample(frac=.05)
fig, ax = plt.subplots(1,1, figsize=(16,10))
scatter_stars(
ax, data_sample,
columns=['u-g', 'g-r'],
# For overplotting demo, switch which of this line and the following is commented.
# class_labels=['GALAXY', 'STAR', 'QSO'],
class_labels=['STAR', 'GALAXY', 'QSO'],
class_colors=['r', 'g', 'b'], s=10, alpha=.5)
Caution: the overplotting between Star and Galaxy is deceptive: whichever class is drawn last hides the points underneath it, so the apparent separation depends on plotting order (try swapping the commented class_labels line above).
Make classification task¶
predictors = ['u-g', 'g-r']
X_train = data_train.loc[:,predictors]
y_train = data_train.loc[:,'response']
X_test = data_test.loc[:,predictors]
y_test = data_test.loc[:,'response']
Multi-class Classification Approaches¶
When we look at multiclass classifiers, the overall accuracy (the mean of y_true == y_pred) is interesting, but it doesn't show whether we're making more mistakes on one class than on another. In particular, it down-weights mistakes on minority classes (here, quasars). Mistakes on minority classes may be far more important than their frequency suggests (especially if we're dealing with people), so let's also look at confusion matrices.
# Code credit: sklearn example
def plot_confusion_matrix(ax, cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
    This function plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
# print("Normalized confusion matrix")
# else:
# print('Confusion matrix, without normalization')
img = ax.imshow(cm, interpolation='nearest', cmap=cmap)
plt.colorbar(img, ax=ax)
tick_marks = np.arange(len(classes))
ax.set(xticks=tick_marks, yticks=tick_marks)
ax.set_xticklabels(classes, rotation=45)
ax.set_yticklabels(classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
ax.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
ax.set(title=title, ylabel='True label', xlabel='Predicted label')
For example, if we always predict "star", we'd get a confusion matrix like:
fig, ax = plt.subplots()
constant_predictions = np.full(len(y_test), 1)  # class 1 is STAR
cnf_matrix = confusion_matrix(y_test, constant_predictions)
plot_confusion_matrix(ax, cnf_matrix, classes=tgt_id2name, normalize=True)
print("Accuracy:", np.mean(constant_predictions == y_test))
Here's our routine from hw7 for plotting decision boundaries:
def overlay_decision_boundary(ax, model, colors=None, nx=200, ny=200, desaturate=.5):
"""
A function that visualizes the decision boundaries of a classifier.
ax: Matplotlib Axes to plot on
model: Classifier (has a `.predict` method)
    colors: list of colors to use. Use color colors[i] for class i.
    nx, ny: number of mesh points to evaluate the classifier on
desaturate: how much to desaturate each of the colors (for better contrast with the sample points)
"""
# Create mesh
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()
xx, yy = np.meshgrid(
np.linspace(xmin, xmax, nx),
np.linspace(ymin, ymax, ny))
X = np.c_[xx.flatten(), yy.flatten()]
# Predict on mesh of points
    model = getattr(model, 'predict', model)  # accept either an estimator or a bare predict function
    y = model(X)
    y = y.reshape((ny, nx))  # np.meshgrid output has shape (ny, nx)
# Generate colormap.
if colors is None:
colors = sns.utils.get_color_cycle()
    y -= y.min()  # If first class is not 0, shift so class ids start at 0.
    assert np.max(y) < len(colors), "need at least one color per class"
colors = [sns.utils.desaturate(color, desaturate) for color in colors]
cmap = matplotlib.colors.ListedColormap(colors)
# Plot decision surface
ax.pcolormesh(xx, yy, y, zorder=-2, cmap=cmap, norm=matplotlib.colors.NoNorm(), vmin=0, vmax=y.max() + 1)
    # xx and yy already have shape (ny, nx) from np.meshgrid, so no reshape is needed.
ax.contour(xx, yy, y, colors="black", linewidths=1, zorder=-1)
Logistic Regression¶
Estimation time (it surprises me how fast this is... really?)
%time ovr = LogisticRegression(multi_class='ovr', solver='lbfgs').fit(X_train, y_train)
%timeit ovr = LogisticRegression(multi_class='ovr', solver='lbfgs').fit(X_train, y_train)
Inference time:
ovr = LogisticRegression(multi_class='ovr', solver='lbfgs').fit(X_train, y_train)
%timeit predictions = ovr.predict(X_test)
What's one-vs-rest doing?¶
One-vs-Rest fits 3 different classifiers: galaxy-vs-rest, star-vs-rest, and quasar-vs-rest. For example, the galaxy-vs-rest model is:
is_galaxy = (y_train == 0).astype(int)
galaxy_vs_rest = LogisticRegression(solver='lbfgs').fit(X_train, is_galaxy)
Notice how the coefficients are essentially the same (up to optimizer tolerance) for this model...
galaxy_vs_rest.coef_, galaxy_vs_rest.intercept_
...and class 0 of the one-vs-rest model:
ovr.coef_[0], ovr.intercept_[0]
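To see how the three binary classifiers combine into a single prediction, here's a quick check (a sketch relying on sklearn's convention that predict takes the argmax of the per-class decision scores):
# OvR predicts the class whose binary classifier is most confident.
scores = ovr.decision_function(X_test)  # shape (n_samples, 3): one score per class
manual_predictions = np.argmax(scores, axis=1)
print(np.mean(manual_predictions == ovr.predict(X_test)))  # expect 1.0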
class_colors = ['r', 'g', 'b']  # one color per class: galaxy, star, quasar
fig, ax = plt.subplots()
scatter_stars(
ax, data_sample,
columns=['u-g', 'g-r'],
class_labels=tgt_id2name, class_colors=class_colors, edgecolor='black', lw=.25, s=10)
overlay_decision_boundary(ax, galaxy_vs_rest, colors=class_colors, desaturate=.3)
CV for hyperparameter selection¶
The coefficients of Logistic Regression can be regularized by an L2 (magnitude) penalty. Let's use cross-validation to select a good penalty parameter.
%time lrm_ovr = LogisticRegressionCV(multi_class = 'ovr', cv=5).fit(X_train, y_train)
# We could also fit a Multinomial model, but that's beyond our scope here.
print('OvR Logistic Regression: accuracy on train={:.2%}, test={:.2%}'.format(
lrm_ovr.score(X_train, y_train), lrm_ovr.score(X_test, y_test)))
Why might CPU times be greater than wall time? Hint: your CPU probably has multiple cores...
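We can also inspect which regularization strength the cross-validation chose; for a one-vs-rest fit, LogisticRegressionCV stores one chosen C per class:
# Candidate C values tried, and the one selected for each binary classifier.
print("Candidates:", lrm_ovr.Cs_)
print("Chosen C per class:", lrm_ovr.C_)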
fig, ax = plt.subplots()
scatter_stars(
ax, data_sample,
columns=['u-g', 'g-r'],
class_labels=tgt_id2name, class_colors=class_colors, edgecolor='black', lw=.25, s=10)
overlay_decision_boundary(ax, lrm_ovr, colors=class_colors, desaturate=.3)
k-NN¶
kNN estimation time:
%time knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
Inference time:
%time knn_predictions = knn.predict(X_test)
Let's use a sklearn meta-estimator to find the best value of k. (On the homework you should use cross_val_score; we're doing it a different way here so you can't just copy-paste code for one of the answers.)
from sklearn.model_selection import GridSearchCV
kvals = [10, 50, 100]
t = time.time()
knn = GridSearchCV(
KNeighborsClassifier(),
{'n_neighbors': kvals},
cv=5).fit(X_train, y_train)
runtime_knn = time.time()-t
%timeit predictions = knn.predict(X_test)
print("Best k:", knn.best_params_['n_neighbors'])
print('kNN Test Score: {:.4f}, runtime: {:.2f}s'.format(knn.score(X_test, y_test), runtime_knn))
plt.plot(kvals, knn.cv_results_['mean_test_score'], '-*')
plt.fill_between(
kvals,
knn.cv_results_['mean_test_score'] - 2 * knn.cv_results_['std_test_score'],
knn.cv_results_['mean_test_score'] + 2 * knn.cv_results_['std_test_score'],
alpha=.3)
plt.xlabel('k')
plt.ylabel('validation accuracy +- 2 std');
fig, ax = plt.subplots()
scatter_stars(
ax, data_sample,
columns=['u-g', 'g-r'],
class_labels=tgt_id2name, class_colors=class_colors, edgecolor='black', lw=.25, s=10)
overlay_decision_boundary(ax, knn, colors=class_colors, desaturate=.3)
LDA and QDA¶
LDA and QDA model each class as a Normal distribution. (Does that make sense for our data?)
To classify a sample, we use Bayes' rule:
$$P(y=k \mid X) = \frac{P(X \mid y=k)\,P(y=k)}{P(X)} = \frac{P(X \mid y=k)\,P(y=k)}{\sum_{l} P(X \mid y=l)\,P(y=l)}$$
So the most likely class is:
$$\hat{y} = \arg\max_k P(y=k \mid X) = \arg\max_k P(X \mid y=k)\,P(y=k)$$
lda = LinearDiscriminantAnalysis(store_covariance=True)  # could also set priors=[1, 1, 1]
qda = QuadraticDiscriminantAnalysis(store_covariance=True)  # could also set priors=[1, 1, 1]
t = time.time()
lda.fit(X_train, y_train)
fittime_lda = time.time()-t
t = time.time()
qda.fit(X_train, y_train)
fittime_qda = time.time()-t
t = time.time()
lda.predict(X_test)
infertime_lda = time.time() - t
t = time.time()
qda.predict(X_test)
infertime_qda = time.time() - t
print('LDA accuracy train={:.2%}, test: {:.2%}, fit time {:.2f}s, predict time {:.2f}s'.format(
lda.score(X_train, y_train), lda.score(X_test, y_test), fittime_lda, infertime_lda))
print('QDA accuracy train={:.2%}, test: {:.2%}, fit time {:.2f}s, predict time {:.2f}s'.format(
qda.score(X_train, y_train), qda.score(X_test, y_test), fittime_qda, infertime_qda))
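To connect Bayes' rule above to the fitted model, here's a sketch that computes the QDA posterior by hand from the stored means, covariances, and priors (it should agree with qda.predict up to numerical details; multivariate_normal was imported at the top):
def qda_posterior(model, X):
    # P(X | y=k) for each class, using the fitted Gaussian for that class.
    likelihoods = np.stack([
        multivariate_normal.pdf(X, mean=model.means_[k], cov=model.covariance_[k])
        for k in range(len(model.priors_))], axis=1)
    joint = likelihoods * model.priors_  # P(X | y=k) P(y=k)
    return joint / joint.sum(axis=1, keepdims=True)

manual_pred = np.argmax(qda_posterior(qda, X_test.values), axis=1)
print(np.mean(manual_pred == qda.predict(X_test)))  # expect (almost exactly) 1.0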
Let's look at the model that LDA and QDA fit to the data. Do they make sense?
# Code credit: http://scikit-learn.org/stable/auto_examples/classification/plot_lda_qda.html
def plot_ellipse(ax, mean, cov, color):
v, w = np.linalg.eigh(cov)
u = w[0] / np.linalg.norm(w[0])
angle = np.arctan(u[1] / u[0])
angle = 180 * angle / np.pi # convert to degrees
    # filled Gaussian at 2 standard deviations
ell = matplotlib.patches.Ellipse(mean, 2 * v[0] ** 0.5, 2 * v[1] ** 0.5,
180 + angle, facecolor=color,
edgecolor='yellow',
linewidth=2, zorder=2)
ell.set_clip_box(ax.bbox)
ell.set_alpha(0.5)
ax.add_artist(ell)
fitted_models = [lda, qda]
titles = ['LDA','QDA']
class_colors=['r', 'g', 'b']
f, axes = plt.subplots(1,2, figsize = (16,7))
for i, (model, ax, title) in enumerate(zip(fitted_models, axes, titles)):
scatter_stars(
ax, data_sample,
columns=['u-g', 'g-r'],
class_labels=tgt_id2name, class_colors=class_colors, edgecolor='black', lw=.5, s=10)
for target, color in enumerate(class_colors):
if isinstance(model, QuadraticDiscriminantAnalysis):
# For QDA
covariance = model.covariance_[target]
else:
# LDA
covariance = model.covariance_
plot_ellipse(ax, model.means_[target], covariance, color)
ax.set_title(title)
What about those priors? Generally we estimate them from the data:
lda.priors_
data_train.response.value_counts(normalize=True)
... but we could use domain knowledge to set them.
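For example, if we had reason to believe the three classes were equally likely a priori, we could refit with uniform priors (a sketch; whether uniform priors are justified is a domain question):
# Refit LDA with uniform class priors instead of the empirical frequencies.
lda_uniform = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3]).fit(X_train, y_train)
print('Uniform-prior LDA: accuracy on test={:.2%}'.format(lda_uniform.score(X_test, y_test)))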
fig, axs = plt.subplots(ncols=2, figsize=(16,8))
for ax, title, model in zip(axs, 'LDA QDA'.split(), [lda, qda]):
cnf_matrix = confusion_matrix(y_test, model.predict(X_test))
# cnf_matrix = confusion_matrix(np.full_like(y_test, 2), y_test)
plot_confusion_matrix(ax, cnf_matrix, classes=tgt_id2name, normalize=True)
ax.set_title(title)
# Backport of balanced_accuracy_score from sklearn 0.20
import warnings

def balanced_accuracy_score(y_true, y_pred):
    C = confusion_matrix(y_true, y_pred)
with np.errstate(divide='ignore', invalid='ignore'):
per_class = np.diag(C) / C.sum(axis=1)
if np.any(np.isnan(per_class)):
warnings.warn('y_pred contains classes not in y_true')
per_class = per_class[~np.isnan(per_class)]
score = np.mean(per_class)
return score
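As a sanity check, the always-"star" baseline from earlier gets a balanced accuracy of about 1/3 (perfect recall on one class, zero on the other two), even though its plain accuracy equals the overall frequency of stars:
print(balanced_accuracy_score(y_test, constant_predictions))  # expect ~0.33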
fitted_models = [lda, qda]
titles = ['LDA','QDA']
class_colors=['r', 'g', 'b']
f, axes = plt.subplots(1,2, figsize = (16,7))
data_sample = data_train.sample(frac=.05)
for i, (model, ax, title) in enumerate(zip(fitted_models, axes, titles)):
scatter_stars(
ax, data_sample,
columns=['u-g', 'g-r'],
class_labels=tgt_id2name, class_colors=class_colors, edgecolor='black', lw=.25, s=10)
overlay_decision_boundary(ax, model, colors=class_colors, desaturate=.3)
ax.set_title(title)
# print("{}: accuracy on train={:.2%}, test={:.2%}".format(
# title, balanced_accuracy_score(y_train, model.predict(X_train)),
# balanced_accuracy_score(y_test, model.predict(X_test))))
We are not doing great at separating stars from galaxies. What other information could help us distinguish these two? This is where domain knowledge becomes important. Stars can be observed only when they are relatively close to us, while the galaxies we observe that look like stars (point-source-like) are much farther away. Redshift is a measure of distance and is one of the columns in our dataset. Let's see what happens if we include it.¶
plot_columns = ['u-g', 'g-r' , 'log_redshift' ]
col_pairs = list(combinations(plot_columns, 2))
n = len(col_pairs)
fig, axes = plt.subplots(nrows=n, figsize=(12, 7 * n))
data_sample = data_train.sample(frac=.1)
for ax, columns in zip(axes, col_pairs):
scatter_stars(ax, data_sample, columns, tgt_id2name, class_colors, alpha=.3)
fig.tight_layout()
predictors = ['u-g', 'log_redshift']
X_train = data_train.loc[:,predictors]
y_train = data_train.loc[:,'response']
X_test = data_test.loc[:,predictors]
y_test = data_test.loc[:,'response']
lda.fit(X_train, y_train)
qda.fit(X_train, y_train)
fitted_models = [lda, qda]
titles = ['LDA','QDA']
class_colors=['r', 'g', 'b']
f, axes = plt.subplots(1,2, figsize = (16,7))
data_sample = data_train.sample(frac=.05)
for i, (model, ax, title) in enumerate(zip(fitted_models, axes, titles)):
scatter_stars(
ax, data_sample,
columns=predictors,
class_labels=tgt_id2name, class_colors=class_colors, edgecolor='black', lw=.25, s=10)
overlay_decision_boundary(ax, model, colors=class_colors, desaturate=.3)
ax.set_title(title)
print("{}: accuracy on train={:.2%}, test={:.2%}".format(
title, model.score(X_train, y_train), model.score(X_test, y_test)))