CS109A Introduction to Data Science

Lab 8: Discriminant Analysis - A tale of two cities (Solutions)

Harvard University
Fall 2018
Instructors: Pavlos Protopapas, Kevin Rader
Authors: Rahul Dave and Margo Levine


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Table of Contents

  1. Learning Goals
  2. Temporal patterns in urban demographics
  3. Geographic patterns in urban demographics

Part 0: Learning Goals

In this lab we'll work with demographics of a region of the cities of Pavlopolis and Kevinsville from the years 2000 to 2010. We'll use the data to predict household economic status from its geographical location.

By the end of this lab, you will be able to:

  • Implement different classifiers and calculate predictive accuracy of these classifiers,

  • Compare different classifiers for general predictive accuracy.

In [1]:
import numpy as np
import pandas as pd
import scipy as sp
from scipy.stats import mode
from sklearn import linear_model
import matplotlib
import matplotlib.pyplot as plt
from sklearn import discriminant_analysis as da
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split
%matplotlib inline

Predicting Urban Demographic Changes

Part 1: Temporal patterns in urban demographics

For this first part of lab, we aim to build a model for the city of Pavlopolis that accurately predicts household economic status from its geographical location. We want the model to be robust, that is, the model should be accurate over many years (through 2010 and beyond).

The data is contained in dataset_1_year_2000.txt, ..., dataset_1_year_2010.txt. The first two columns of each dataset are the adjusted latitude and longitude of some randomly sampled houses. The last column contains economic status of a household:

0: low-income,

1: middle-class,

2: high-income

You are given some additional information

Due to the migration of people in and out of the city, the distribution of each economic group over this region changes over the years. The city of Pavlopolis estimates that in this region there is approximately a 25% yearly increase in high-income households; and a 25% decrease in the remaining population, with the decrease being roughly the same amongst both the middle class and lower income households.

Visualization

We start by importing and visualizing the data.

In [2]:
#Load one file, check the data type, check that the data matches the description above
data2000_glance = np.loadtxt('../data/dataset_1_year_2000.txt')
print(type(data2000_glance))
data2000_glance[:10, :]

Out[2]:
array([[ 0.54432757,  0.6245105 ,  2.        ],
       [ 0.59468476,  0.72391292,  2.        ],
       [ 0.70018033,  0.78249208,  2.        ],
       [ 0.60126202,  0.97181171,  2.        ],
       [ 0.63199522,  0.74850249,  2.        ],
       [ 0.49427826,  0.60880108,  2.        ],
       [ 0.6096247 ,  0.67672357,  2.        ],
       [ 0.76899244,  0.86092404,  2.        ],
       [ 0.59429045,  0.82042128,  2.        ],
       [ 0.45593116,  0.74428878,  2.        ]])
In [3]:
#Visualize data for every other year
fig, ax = plt.subplots(2, 3, figsize = (12, 5))

#Create index for subplots
ind = 0
#Iterate from year 0 to 10, with steps of 2
for i in range(0, 11, 2):
    
    #Load data
    data = np.loadtxt('../data/dataset_1_year_' + str(2000 + i) + '.txt')
    
    #Split into predictor/response
    x = data[:, :-1]
    y = data[:, -1]

    #Plot each class for the current year in different colors         
    ax[ind // 3, ind % 3].scatter(x[y == 2, 0], x[y == 2, 1], color='green', alpha=0.4,
                                 label = '2-High income')
    ax[ind // 3, ind % 3].scatter(x[y == 1, 0], x[y == 1 ,1], 
                                 color='blue', alpha=0.4,
                                 label = '1-Middle class')
    ax[ind // 3, ind % 3].scatter(x[y == 0, 0], x[y == 0, 1], 
                                 color='red', alpha=0.4,
                                 label = '0-Low income')
    
    # LABEL AXIS, TITLE
    ax[ind // 3, ind % 3].set_xlabel('Latitude')
    ax[ind // 3, ind % 3].set_ylabel('Longitude')
    ax[ind // 3, ind % 3].set_title(str(2000 + i))
    
    #Update index
    ind = ind + 1

plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
plt.tight_layout()

EXERCISE: What do you observe from the data visualizations?

YOUR DISCUSSION HERE:

  • The number of households, and there areal distribution in middle and low income brackets are decreasing over the years while the number in the high income bracket is increasing (i.e. the cluster of green gets bigger, the other clusters get smaller).
  • The number of dwellings in the middle and low income neighborhoods are decreasing, while the number is increasing in the high income neighborhood (i.e. there are less dots of any color in the red and blue regions). This can be due to the fact that, while some housing in the middle and low income neighborhoods are becoming high-income housing, simultaneously, the number of residential housing options may be decreasing in these neighborhoods, e.g. changes in zoning.

Model comparison

Let's compare the the following four classification schemes (on predictive accuracy) for addressing the problem of predicting household economic status from geographic location:

  1. Training a logistic regression model on one year and use it repeatedly
  2. Training a Q/LDA model on one year and use it repeatedly
  3. Training a KNN model on one year and use it repeatedly
  4. Using Bayes Theorem to update LDA/QDA classifiers on a year-by-year basis

We'll now train three single models (logistic regression, LDA, and KNN) once, and explore the accuracy of these models over time.

EXERCISE: Fill in the code below to fit a logistic regression (variable logreg, with C=1000), KNN (variable knn, for the default k=5), and lda (variable lda) models on the 2000 data.

In [31]:
#Load data for year 2000
data = np.loadtxt('../data/dataset_1_year_2000.txt')

#Split predictors, response
x = data[:, :-1]
y = data[:, -1]

#your code begins here
#Fit a logistic regression model on year 2000 data
logreg = linear_model.LogisticRegression(C=1000)
logreg.fit(x, y)

#Fit a lda model on year 2000 data
lda = da.LinearDiscriminantAnalysis()
lda.fit(x, y)

#Fit a knn model on year 2000 data
knn = KNN()
knn.fit(x, y)
#your code ends here

logregtrain = logreg.score(x, y)
ldatrain = lda.score(x, y)
knntrain = knn.score(x, y)

acc_log = []
acc_lda = []
acc_knn = []

#Iterate through all the years
for i in range(1, 11):
    #Load data for year 2000 + i 
    data_i = np.loadtxt('../data/dataset_1_year_' + str(2000 + i) + '.txt')
    
    #Split predictors, response
    x_i = data_i[:, :-1]
    y_i = data_i[:, -1]

    #Compute predictive accuracies of various models trained on 2000 data
    acc_log.append(logreg.score(x_i, y_i))
    acc_lda.append(lda.score(x_i, y_i))
    acc_knn.append(knn.score(x_i, y_i))
            
#Plot accuracy over years
years = np.arange(1,11) + 2000 # x-axis is years 2001-2010
width = 0.25 # width of bar

plt.figure(figsize = (10,3))
plt.bar(years, acc_log, width, color='blue', alpha=0.5, label='LogReg')
plt.axhline(logregtrain, 0, 1, color="blue", label="LogReg 2000")
plt.bar(years + width, acc_lda, width, color='red', alpha=0.5, label='LDA')
plt.axhline(ldatrain, 0, 1, color="red", label="LDA 2000")


plt.bar(years + 2 * width, acc_knn, width, color='green', alpha=0.5, label='KNN, k=5')
plt.axhline(knntrain, 0, 1, color="green", label="KNN 2000")



#Labels
plt.xlabel('Year')
plt.ylabel('Prediction Accuracy')
plt.legend(loc = 'upper left', bbox_to_anchor=(1,1)) # legend upper left outside
plt.xticks(years + width + 0.125, years); # set x-ticks spacing

EXERCISE: Are these three models robust? That is, do these models perform well over the years? What about Overfitting?

YOUR DISCUSSION HERE: All three models perform well in the years immediately following 2000 but their performaces decreases drastically as the years go on. Thus, training the model once and using it for many years is not a great idea.

We'll now consider the remaining modeling option:

Training a Q/LDA model yearly

What does it mean to train a QDA or LDA model for each year? Recall that Q/LDA classifiers first model the distribution of the data within each class using a normal distribution, $\mathcal{N}(\mu, \Sigma)$. Then the Q/LDA models the distribution of the data amongst the classes. For us, since we have three classes, this means that our "model" consists of the following 9 pieces of information (three for each class, two to define the distribution of data in the class and one to define the class proportion):

  1. Class 0:

    • Distribution: $p(x | y = 0) = \mathcal{N}(\mu_0, \Sigma_0)$
    • Proportion: $p(y=0)=\pi_0 = \frac{\text{data in Class 0}}{\text{total number of data}}$
  2. Class 1:

    • Distribution: $p(x | y = 1) = \mathcal{N}(\mu_1, \Sigma_1)$
    • Proportion: $p(y=1)=\pi_1 = \frac{\text{data in Class 1}}{\text{total number of data}}$
  3. Class 2:

    • Distribution: $p(x | y = 2) = \mathcal{N}(\mu_2, \Sigma_2)$
    • Proportion: $p(y=2)=\pi_2 = \frac{\text{data in Class 2}}{\text{total number of data}}$

Retraining a Q/LDA model means re-estimating $\mu, \Sigma$ and $\pi$ for each class.

EXERCISE: At the beginning of lab we were given information on the projected yearly growth rate of the classes, so we don't need the entire set of population data to re-estimate $\pi$ for each class. Use this information to write down estimates for $\pi_2^{\text{year i + 1}}$. Then, assuming that the total population is approximately constant over the years, estimate $\pi_0^{\text{year i + 1}}$ and $\pi_1^{\text{year i + 1}}$.

YOUR DISCUSSION HERE: We have $$ \pi_2^{\text{year i + 1}} = \pi_2^{\text{year i}} + 0.25 * \pi_2^{\text{year i}} = 1.25 * \pi_2^{\text{year i}} $$ Since the total population stays roughly constant throughout the years, and the remaining two classes are decreasing by the same amount each year, and since $\pi_0^{2000} \approx \pi_1^{2000}$, we estimate the proportions for Class 0 and Class 1: $$ \pi_0^{\text{year i + 1}} = \pi_1^{\text{year i + 1}} = \frac{1 - 1.25 * \pi_2^{\text{year i}}}{2} $$ How do we know that the total population is constant? We don't for certain. The number of observations for each year is approximately the same, but these are random samples from the population so they may have just been chosen to be the same.

On the other hand, if the distribution of the data within each class changes (i.e. $\mu$ and $\Sigma$ changes within each class) then we will need re-estimate them using new data each year. Let's check, using data visualization, to see if this is indeed the case.

In [8]:
#Visualize data for every other year
fig, ax = plt.subplots(3, 6, figsize = (23, 10))

#Iterate from year 0 to 10, with steps of 2
for i in range(0, 11, 2):
    #Load data
    data = np.loadtxt('../data/dataset_1_year_' + str(2000 + i) + '.txt')
    
    #Split into predictor/response
    x = data[:, :-1]
    y = data[:, -1]
    
    
    #Plot each class for the current year in different colors
    ax[0, i // 2].scatter(x[y == 2, 0], x[y == 2, 1],
                         color='g', alpha=0.6,
                         label = '2-High income')
    ax[1, i // 2].scatter(x[y == 1, 0], x[y == 1 ,1], 
                         color='b', alpha=0.6,
                         label = '1-Middle class')
    ax[2, i // 2].scatter(x[y == 0, 0], x[y == 0, 1], 
                         color='r', alpha=0.6,
                         label = '0-Low income')
    
    #Labels
    ax[0, i // 2].set_xlabel('Latitude')
    ax[0, i // 2].set_ylabel('Longitude')
    ax[0, i // 2].set_title(str(2000 + i))
    
    #Labels
    ax[1, i // 2].set_xlabel('Latitude')
    ax[1, i // 2].set_ylabel('Longitude')
    ax[1, i // 2].set_title(str(2000 + i))
    
    # LABEL AXIS, TITLE
    ax[2, i // 2].set_xlabel('Latitude')
    ax[2, i // 2].set_ylabel('Longitude')
    ax[2, i // 2].set_title(str(2000 + i))


plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
plt.tight_layout()

EXERCISE: Comment on how the distribution in each class appears to change over the years. Does this effect the efficiency of Q/LDA models?

YOUR DISCUSSION HERE: It's reasonable to conclude from visualization of the data in each class over the years that the distribution within each class is not changing. That is, the data in each class is approximately normally distributed and the shapes of these distributions are approximately constant over time (the number of points in the class don't theoretically contribute to the shape of the distribution, shape describes the variance in each direction). From this we find it reasonable to conclude that the parameters $\mu$ and $\Sigma$ for each class do not need to be updated each year. Thus, updating our Q/LDA model yearly involves only re-estimating the class proportions using a formula based on projected yearly growth!

So how efficient is our Q/LDA model? We train the model once on a single year's worth of data. Then, for any subsequent year, the update is just a few lines of arithmetic. For example, say we train our Q/LDA model on 2000 data, given a point from 2001, $x^{2001}$, we predict the class label for $x^{2001}$ by:

  1. find the probability that $x^{2001}$ will be labeled Class 0 to 2 using our model from 2000: $$ \begin{aligned} p_{2000}(y=0 | x^{2001}) &= p_{2000}(x^{2001} | y=0) p_{2000}(y=0) = \mathcal{N}(x^{2001}; \mu_0, \Sigma_0) \pi_0^{2000}\\ p_{2000}(y=1 | x^{2001}) &= p_{2000}(x^{2001} | y=1) p_{2000}(y=1) = \mathcal{N}(x^{2001}; \mu_1, \Sigma_1) \pi_1^{2000}\\ p_{2000}(y=2 | x^{2001}) &= p_{2000}(x^{2001} | y=2) p_{2000}(y=2) = \mathcal{N}(x^{2001}; \mu_2, \Sigma_2) \pi_2^{2000} \end{aligned} $$
  2. adjust the probabilities predicted using the 2000 model, by updating $\pi_i$ with 2001 class proportions: $$ \begin{aligned} p_{2001}(y=0 | x^{2001}) &= p_{2000}(y=0 | x^{2001}) \frac{\pi_0^{2001}}{\pi_0^{2000}}\\ p_{2001}(y=1 | x^{2001}) &= p_{2000}(y=1 | x^{2001}) \frac{\pi_1^{2001}}{\pi_1^{2000}}\\ p_{2001}(y=2 | x^{2001}) &= p_{2000}(y=2 | x^{2001}) \frac{\pi_2^{2001}}{\pi_2^{2000}}\\ \end{aligned} $$
  3. take the maximum of $\{p_{2001}(y=0 | x^{2001}), p_{2001}(y=1 | x^{2001}), p_{2001}(y=2 | x^{2001}) \}$ and predict the class label corresponding to the maximum probability.

EXERCISE: Fill in the code below to re-estimate the class proportions (see equations from two exercises ago).

In [11]:
#Load data for year 2000
data = np.loadtxt('../data/dataset_1_year_2000.txt')

#Split predictors, response
x = data[:, :-1]
y = data[:, -1]

#your code begins here
#Fit a lda model on year 2000 data
lda = da.LinearDiscriminantAnalysis()
lda.fit(x, y)
#your code ends here

ldatrain = lda.score(x, y)


acc_lda_corrected = [] # Store corrected LDA accuracies

#Store class proportions in 2000
p0_2000 = (y==0).mean()
p1_2000 = (y==1).mean()
p2_2000 = (y==2).mean()

#Store class proportions in current year
p0_current = (y==0).mean()
p1_current = (y==1).mean()
p2_current = (y==2).mean()

for k in range(1, 11):
    #Load data
    data_i = np.loadtxt('datasets/dataset_1_year_' + str(2000 + k) + '.txt')
    
    #Split into predictor, response
    x_i = data_i[:, :-1]
    y_i = data_i[:, -1]
    
    #x_train, x_test, y_train, y_test = train_test_split(x_i, y_i, test_size=0.30, random_state=0)
    
    #your code begins here
    #Re-estimate class proportions (25% increase in p2, adjust p0, p1 accordingly)
    p2_current = p2_current * 1.25
    p0_current = (1 - p2_current) / 2
    p1_current = (1 - p2_current) / 2
    #your code ends here
    
    #Re-estimate class label probabilities 
    pred_logprob = lda.predict_log_proba(x_i) # compute log-probabilities
    pred_logprob[:, 0] = pred_logprob[:, 0] + np.log(p0_current / p0_2000) # correction for class 0
    pred_logprob[:, 1] = pred_logprob[:, 1] + np.log(p1_current / p1_2000) # correction for class 1
    pred_logprob[:, 2] = pred_logprob[:, 2] + np.log(p2_current / p2_2000) # correction for class 2

    
    #Predict class label using re-estimated probabilities
    y_pred = pred_logprob.argmax(axis = 1)
    
    #Compute accuracy 
    acc_lda_corrected.append(np.mean(y_i == y_pred))
    
#Plot accuracy over years
years = np.arange(1, 11) + 2000 # x-axis is years 2001-2010
width = 0.25 # width of bar

plt.figure(figsize=(10,3))
plt.bar(years + width, acc_lda, width, color='red', alpha=0.5, label='LDA')
plt.bar(years + 2 * width, acc_lda_corrected, width, color='green', alpha=0.5, label='LDA yearly')



#Labels
plt.xlabel('Year')
plt.ylabel('Prediction Accuracy')
plt.legend(loc = 'upper left', bbox_to_anchor=(1, 1)) # legend upper left outside
plt.title('Overall predictive accuracy of LDA models')
plt.xticks(years + width + 0.125, years); # set x-ticks spacing
plt.axhline(ldatrain, 0, 1, color="red", label="LDA 2000")
Out[11]:

Compared to training a model once and using it repeatedly, training the Q/LDA yearly performs better in terms of predictive accuracy. It isn't significantly more expensive though.

Note: Why LDA and not QDA? Well since we've already observed that the distributions of data within the three classes are very simliar (in terms of shape) we can assume that the covariance matrices for all three classes will be the same - hence our model is LDA rather than QDA

Conclusions: Based on the data visualization from the first 5 years, it's clear that while the class proportions are changing, the class distributions are not (the center and shape of the three classes are constant through time). Thus, updating our LDA on a yearly basis is simple, as this just requires us to adjust our estimates of the class proportions $\pi$.

Part 2: Geographic patterns in urban demographics

In dataset_2.txt and dataset_3.txt you have the demographic information for a random sample of houses in two regions in Kevinsville. There are only two economic brackets for the households in these datasets:

0: low-income or middle-class,

1: high-income.

The data for each region is shown in the visualizations below.

In [12]:
##need to add legend
fig, ax = plt.subplots(1, 2, figsize=(10, 3))

data = np.loadtxt('../data/dataset_2.txt')
x = data[:,0:-1]
y = data[:,-1]

# Plot data
ax[0].scatter(x[y == 1, 0], x[y == 1, 1], color='b', alpha=0.2)
ax[0].scatter(x[y == 0, 0], x[y == 0, 1], color='r', alpha=0.2)

# Label axes, set title
ax[0].set_title('dataset_2')
ax[0].set_xlabel('Latitude')
ax[0].set_ylabel('Longitude')

data = np.loadtxt('datasets/dataset_3.txt')
x = data[:,0:-1]
y = data[:,-1]

# Plot data
ax[1].scatter(x[y == 1, 0], x[y == 1, 1], color='b', alpha=0.2)
ax[1].scatter(x[y == 0, 0], x[y == 0, 1], color='r', alpha=0.2)

# Label axes, set title
ax[1].set_title('dataset_3')
ax[1].set_xlabel('Latitude')
ax[1].set_ylabel('Longitude')

plt.show()

EXERCISE: For each region, comment on the appropriateness of using KNN for classification.

YOUR DISCUSSION HERE: Since the classes in dataset_2 overlap, we anticipate that KNN will do poorly; similarly, since the classes in dataset_3 are well-separated, we expect KNN to do well.

Below is code for visualizing decision boundaries.

In [27]:
#--------  plot_decision_boundary
# A function that visualizes the data and the decision boundaries
# Input: 
#      x (predictors)
#      y (labels)
#      poly_flag (a boolean parameter, fits quadratic model if true, otherwise linear)
#      title (title for plot)
#      ax (a set of axes to plot on)
# Returns: 
#      ax (axes with data and decision boundaries)

def plot_decision_boundary(x, y, model, poly_flag, title, ax):
    # Plot data
    ax.scatter(x[y == 1, 0], x[y == 1, 1], c='red', label='Normal', cmap=plt.cm.coolwarm, alpha=0.2, s=20)
    ax.scatter(x[y == 0, 0], x[y == 0, 1], c='blue', label='Normal', cmap=plt.cm.coolwarm, alpha=0.2, s=20)
    
    # Create mesh
    interval = np.arange(0,1,0.01)
    n = np.size(interval)
    x1, x2 = np.meshgrid(interval, interval)
    x1 = x1.reshape(-1, 1)
    x2 = x2.reshape(-1, 1)
    xx = np.concatenate((x1, x2), axis=1)

    # Predict on mesh points
    if(poly_flag):
        quad_features = preprocessing.PolynomialFeatures(degree=2)
        xx = quad_features.fit_transform(xx)
    yy = model.predict(xx)    
    yy = yy.reshape((n, n))

    # Plot decision surface
    x1 = x1.reshape(n, n)
    x2 = x2.reshape(n, n)
    ax.contourf(x1, x2, yy, cmap=plt.cm.coolwarm, alpha=0.1)

    
    # Label axes, set title
    ax.set_title(title)
    ax.set_xlabel('Latitude')
    ax.set_ylabel('Longitude')
    
    return ax
In [28]:
#--------  fit_and_plot_models
# A function that visualizes the data and the decision boundaries
# Input: 
#      file_name (name of the file containing dataset)
# Returns: 
#      None

def fit_and_plot_models(file_name): 
    data = np.loadtxt(file_name)
    x = data[:,0:-1]
    y = data[:,-1]

    fig, ax = plt.subplots(1, 4, figsize=(15, 3))

    # Plain Logistic Regression
    logreg = linear_model.LogisticRegression()
    logreg.fit(x, y)
    acc_logreg = logreg.score(x, y)

    str_title = 'LogReg (acc = ' + str(acc_logreg) + ')'
    ax[0] = plot_decision_boundary(x, y, logreg, False, str_title, ax[0])

    # LDA
    lda = da.LinearDiscriminantAnalysis()
    lda.fit(x, y)
    acc_lda = lda.score(x, y)

    str_title = 'LDA (acc = ' + str(acc_lda) + ')'
    ax[1] = plot_decision_boundary(x, y, lda, False, str_title, ax[1])

    # QDA
    qda = da.QuadraticDiscriminantAnalysis()
    qda.fit(x, y)
    acc_qda = qda.score(x, y)

    str_title = 'QDA (acc = ' + str(acc_qda) + ')'
    ax[2] = plot_decision_boundary(x, y, qda, False, str_title, ax[2])

    # Logistic Regression with Quadratic Terms
    quad_features = preprocessing.PolynomialFeatures(degree = 2)
    x_expanded = quad_features.fit_transform(x)
    logreg_poly = linear_model.LogisticRegression(C=1000)
    logreg_poly.fit(x_expanded, y)
    acc_logreg_poly = logreg_poly.score(x_expanded, y)
    
    str_title = 'LogReg-poly (acc = ' + str(acc_logreg_poly) + ')'
    ax[3] = plot_decision_boundary(x, y, logreg_poly, True, str_title, ax[3])
    
    plt.tight_layout()
    plt.show()

EXERCISE: Choose between Q/LDA and logistic regression by visualizing various types decision boundaries.

In [29]:
#your code here
fit_and_plot_models('../data/dataset_2.txt')
In [30]:
#your code here
fit_and_plot_models('../data/dataset_3.txt')

YOUR DISCUSSION HERE: In the both datasets, the optimal decision boundary is quadratic (based on over all training accuracy and common sense). So both LDA and linear Logistic Regression yield poor prediction accuracy even on the training set.

Since the first data set satisfies the model assumptions in QDA (i.e. the distribution of the data within each class is normal), we are not surprised that QDA provides a similar fit as that of (unregularized) logistic with quadratic boundaries (the latter may be overfitting).

For the second data set, however, the model assumptions in QDA fail (the red class is nomally distributed, but the blue is not). On the other hand, logistic regression make no assumptions regarding the distribution of the data, it makes more sense to recommend a logistic regression model (although you should incorporate regularization to curb the over fitting).