CS109B Data Science 2: Advanced Topics in Data Science

Lab 7 - Clustering¶

Harvard University
Spring 2019
Instructors: Mark Glickman and Pavlos Protopapas

In [1]:

## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2019-CS109B/master/content/styles/cs109.css").text
HTML(styles)

Out[1]:

In [2]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [3]:

import rpy2

# import os
# os.environ['R_HOME'] = "/usr/share/anaconda3/lib/R"

from rpy2.robjects.packages import importr
r_utils = importr('utils')

In [4]:

#If there are errors about missing R packages, run the relevant code below:
#r_utils.install_packages('aplpack')
# r_utils.install_packages('TeachingDemos')
# r_utils.install_packages('ggplot2')
# r_utils.install_packages('cluster')
# r_utils.install_packages('factoextra')
# r_utils.install_packages('dbscan')
#
# #If you need to install ggplot2 version 2.2.1, then run the following code:
# r_utils.install_packages("devtools")
# r_remotes = importr("remotes")
# r_remotes.install_version("ggplot2",version="2.2.1")

Import and plot the data¶

We begin, as always, by importing and exploring the data. Here, we're exploring the crazy cluster data from the end of yesterday's lecture. As you know, by the end we'll have an essentially automated method for extracting the pattern in these data.

Python¶

In [5]:

multishapes = pd.read_csv("data/multishapes.csv")
scaled_df = multishapes[['x','y']]
scaled_df.describe()

Out[5]:

	x	y
count	1100.000000	1100.000000
mean	-0.081222	-0.625431
std	0.644967	1.176170
min	-1.489180	-3.353462
25%	-0.478839	-1.126752
50%	-0.132920	-0.297040
75%	0.366072	0.250817
max	1.492208	1.253874

In [6]:

scaled_df.plot.scatter(x='x',y='y',c='Black',title="Multishapes data",figsize=(11,8.5));

R¶

We'll work in parallel in R, as some later function work better in that language.

In [7]:

from rpy2.robjects import pandas2ri
pandas2ri.activate()

r_scaled_df = pandas2ri.py2ri(scaled_df)

There is an 'automatic' function in rpy2 to convert a pnadas data frame into an R object. However, this function isn't included in the package's documentation, so use it at your own risk.

The old style of building data frames will always work.

In [8]:

r_scaled_df = rpy2.robjects.DataFrame({"x": rpy2.robjects.FloatVector(scaled_df['x']), 
                                  "y": rpy2.robjects.FloatVector(scaled_df['y'])
                                 })
r_scaled_df

Out[8]:

R/rpy2 DataFrame (1100 x 2)

x	y
-0.803739	-0.853053
0.852851	0.367618
0.927180	-0.274902
-0.752626	-0.511565
...	...
0.930758	-2.366531
0.989555	-2.475225
0.979200	-2.637578
1.121931	-2.619054

Discussion

Above, we named a data frame 'scaled', but didn't actually scale it! When we're clustering should we scale or not?

Kmeans¶

We kick things off with the old workhorse of clustering: KMeans. The "who cares, it runs" code is below, but first a small conceptual digression on how/why Kmeans does[n't] work: http://web.stanford.edu/class/ee103/visualizations/kmeans/kmeans.html

Lessons:

Initializations matter; run multiple times
Total Squared dsitance should never get worse during an update
KMeans can struggle with clusters that are close together; they can get lumped into one
There's no notion of 'not part of any cluster' or 'part of two clusters'

Python¶

In [9]:

from sklearn.cluster import KMeans

fitted_km = KMeans(n_clusters=5, init='random', n_init=5, random_state=109).fit(scaled_df)

display(fitted_km.cluster_centers_)
display(fitted_km.labels_[0:10])

array([[-0.35981394, -0.49384068],
       [-0.71581926, -2.48493748],
       [ 0.97688793, -2.38945886],
       [-0.28306103,  0.5234012 ],
       [ 0.58969517, -0.07953374]])

array([0, 4, 4, 0, 4, 4, 3, 0, 0, 0])

Above, we see two useful components of the kmeans fit: the coordinates of the 5 cluster centers, and the clusters assigned to (the first few) points of data

R¶

In R, Kmeans is part of the 'stats' library, so we first importr 'stats', then call the .kmeans function within that grouping. As always, we refer to R's documentation to figure out the name of the function and its arguments. Link to the documentation

We also set R's random number generator, which allows us to get exatcly reproducible results.

In [10]:

r_base = importr('base')
r_base.set_seed(109) #set seed for random number generation

r_stats = importr('stats')
r_km_out = r_stats.kmeans(scaled_df, 5, nstart=5)

display(r_km_out)
display(list(r_km_out.names))
display(r_km_out.rx2("size"))

ListVector with 9 elements.

cluster

IntVector with 1100 elements.

4

3

4

...

2

centers

Matrix with 10 elements.

-0.275288

0.976888

0.590826

-0.361774

...

-2.389459

-0.084235

-0.491818

-2.484937

totss

FloatVector with 1 elements.

1977.496339

...

size

IntVector with 5 elements.

261

60

294

275

210

iter

IntVector with 1 elements.

3

ifault

IntVector with 1 elements.

0

['cluster',
 'centers',
 'totss',
 'withinss',
 'tot.withinss',
 'betweenss',
 'size',
 'iter',
 'ifault']

IntVector with 5 elements.

261

60

294

275

210

Recall that R functions typically return something between a named list and a dictionary. You can display it directly, and access particular segments via .rx2()

Plotting¶

Python¶

As of 2019, python doesn't have many specialized plotting methods, you mostly have to do it yourself. Take note of matplotlib's c= argument to color items in a plot, and stacking two different plotting functions in the same cell.

In [11]:

plt.scatter(scaled_df['x'],scaled_df['y'], c=fitted_km.labels_);
plt.scatter(fitted_km.cluster_centers_[:,0],fitted_km.cluster_centers_[:,1], c='r', marker='h', s=100);

R¶

As before, R is loaded with custom plotting. We're upgrading how we call R for plotting-- once you %load_ext rpy2.ipython (once per notebook, like %matplotlib inline) use %R to run a single line of R code, or %%R to run a whole cell of R code. Transfer R objects from python to r by adding -i and the object name on the %%R line.

Summary:

%load_ext rpy2.ipython once
%%R when you want to run multiple lines of R code
- Won't send outputs back into python
- Will display plots in the notebook
Move robject variables to R via -i

In [12]:

%load_ext rpy2.ipython

In [13]:

%%R -i r_km_out -i r_scaled_df
library(factoextra)
fviz_cluster(r_km_out, r_scaled_df, geom = "point")

C:\Users\Will\Anaconda3\envs\109b\lib\site-packages\rpy2-2.9.4-py3.6-win-amd64.egg\rpy2\robjects\pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

Selecting size: Elbow¶

This method measures the total (squared) distance from points to their cluster's centroid. Within a given cluster, this is equivalent, up to a factor of $2n$, to the summed (squared) distance from each point to the other points in the cluster (the phrasing we use in laer methods). Look for the place(s) where distance stops decreasing as much.

Python¶

In [14]:

wss = []
for i in range(1,11):
    fitx = KMeans(n_clusters=i, init='random', n_init=5, random_state=109).fit(scaled_df)
    wss.append(fitx.inertia_)

plt.figure(figsize=(11,8.5))
plt.plot(range(1,11), wss, 'bx-')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal $k$')
plt.show()

R¶

Nearly all of our plots in R today come from the handy library factoextra (and nearly all of our plots will be from the fviz_clust function). Becuase we ran library(factoextra) in R earlier, the library is still loaded and we can invoke fviz_clust directly. Also, the -i is optional: R remembers what r_scaled_df is from before, but it can be wise to always send over the latest version.

Summary:

R remembers things you've loaded, saved, or passed it before, just like variables persist across cells in jupyter
library(factoextra) and then fviz_nbclust is your main plotting tool
Remember to set nstart= above its default of 1 whenever you use Kmeans in R
Documentation here

In [15]:

%%R -i r_scaled_df
fviz_nbclust(r_scaled_df, kmeans, method="wss", nstart = 5)

C:\Users\Will\Anaconda3\envs\109b\lib\site-packages\rpy2-2.9.4-py3.6-win-amd64.egg\rpy2\robjects\pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

Selecting size: Silhouette¶

Silhouette scores measure how close an observation is (on average) to points in its cluster, compared to the next-closest cluster's points. The range is [-1,1]; 0 indicates a point on the decision boundary (equal average closeness to points in both clusters), and negative values mean that datum might be better in a different cluster.

The silhouette score plotted below is the average of the above score across all points.

Python¶

The silhouette score is a metric available in sklearn. We have to manually loop over values of K, calculate, and plot.

In [16]:

from sklearn.metrics import silhouette_score

scores = [0]
for i in range(2,11):
    fitx = KMeans(n_clusters=i, init='random', n_init=5, random_state=109).fit(scaled_df)
    score = silhouette_score(scaled_df, fitx.labels_)
    scores.append(score)
    
plt.figure(figsize=(11,8.5))
plt.plot(range(1,11), np.array(scores), 'bx-')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Average Silhouette')
plt.title('The Elbow Method showing the optimal $k$')
plt.show()

R¶

Again, fviz_clust will do the work, we just need to pass it a different method=.

In [17]:

%%R -i r_scaled_df
fviz_nbclust(r_scaled_df, kmeans, method="silhouette", nstart = 5)

C:\Users\Will\Anaconda3\envs\109b\lib\site-packages\rpy2-2.9.4-py3.6-win-amd64.egg\rpy2\robjects\pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

Selecting size: Gap Statistic¶

The gap statistic compares within-cluster distances (like in silhouette), but instead of comparing against the second-best existing cluster for that point, it compares our clustering's overall average to average we'd see when data don't have clusters at all.

In essence, the within-cluster distances (in the elbow plot) will go down just becuse we have more clusters. We additionally calculate how much they'd go down on non-clustered data with the same spread as our data and subtract that trend out to produce the plot below.

Python¶

Again, we'd have to code it up ourselves. Though there are some implementations online, they're not ready for immediate use.

In [18]:

# you'd have to implement it yourself

R¶

And again, fviz_clust will do the work, we just need to pass it a different method=.

In [19]:

%%R -i r_scaled_df
fviz_nbclust(r_scaled_df, kmeans, method="gap", nstart = 5)

Exercise 1

Determine the optimal number of clusters
Re-fit a KNN model with that number of clusters

In [20]:

#your code here.

Asessing Fit: Silhouette¶

Python¶

Below, we borrow from an SKlearn example. The second plot may be overkill.

The second plot is JUST the first two dimensions in the data. It is not a PCA plot
If you only need the raw silhouette scores, use the silhouette_samples function

In [21]:

from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
#modified code from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

def silplot(X, clusterer, pointlabels=None):
    cluster_labels = clusterer.labels_
    n_clusters = clusterer.n_clusters
    
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(11,8.5)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters = ", n_clusters,
          ", the average silhouette_score is ", silhouette_avg,".",sep="")

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(0,n_clusters+1):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=200, lw=0, alpha=0.7,
                c=colors, edgecolor='k')
    xs = X[:, 0]
    ys = X[:, 1]
    
    if pointlabels is not None:
        for i in range(len(xs)):
            plt.text(xs[i],ys[i],pointlabels[i])

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % int(i), alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

In [22]:

silplot(scaled_df.values, fitted_km)

For n_clusters = 5, the average silhouette_score is 0.41118597712860394.

R¶

To get the plot in R, we need to jump through a few hoops

import the cluster library (once)
call the silhouette function on cluster assignments and the inter-point distances
call fviz_silhouette on the result

Luckily, we can run multiple lines of R at a time

In [23]:

%%R -i r_km_out -i r_scaled_df
library(cluster)
sil = silhouette(r_km_out$cluster, dist(r_scaled_df))
fviz_silhouette(sil)

C:\Users\Will\Anaconda3\envs\109b\lib\site-packages\rpy2-2.9.4-py3.6-win-amd64.egg\rpy2\robjects\pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

Exercise 2

Display the silhouette plots for your own model

Discussion

How do we know which of two silhouette plots is better?

In [24]:

#your code here

Algomerative¶

Aglomerative clustering merges clusters together from the bottom up. There are many possible rules about which cluster(s) to combine next. Ward's rule wants to have the lowest possible total within-cluster distance, so it merges the two clusters that will harm this objective least.

Python¶

Scipy has Ward's method implemented, though the call sequence is a little convoluted, and the code is slow.

In [25]:

import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

plt.figure(figsize=(11,8.5))
dist_mat = pdist(scaled_df, metric="euclidean")
ward_data = hac.ward(dist_mat)

hac.dendrogram(ward_data);

R¶

R's code runs much faster, and the code is cleaner. Note that hclust is from the cluster library and you'd need to import it if you haven't already.

In [26]:

%%R -i r_scaled_df
stacked_cluster = hclust(dist(r_scaled_df), method = "ward.D")
plot(stacked_cluster)

Discussion

How do you read a plot like the above? What are valid options for number of clusters, and how can you tell? Are some more valid than others?

DBscan¶

DBscan is one of many alternative clustering algorithms, that uses an intuitive notion of denseness to define clusters, rather than defining them by a central point as in Kmeans.

Python¶

DBscan is implemented in good 'ol sklearn, but there aren't great tools for working out the epsilon parameter.

In [27]:

# I couldn't find any easy code for epsilon-tuning plot

In [28]:

from sklearn.cluster import DBSCAN

fitted_dbscan = DBSCAN(eps=0.15).fit(scaled_df)

plt.scatter(scaled_df['x'],scaled_df['y'], c=fitted_dbscan.labels_);

R¶

R's dbscan is in the dbscan library. It comes with kNNdistplot for tuning epsilon, and fviz_cluster will make a nice plot.

In [29]:

%%R -i r_scaled_df
library(dbscan)
kNNdistplot(r_scaled_df,k=5)

C:\Users\Will\Anaconda3\envs\109b\lib\site-packages\rpy2-2.9.4-py3.6-win-amd64.egg\rpy2\robjects\pandas2ri.py:191: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  res = PandasDataFrame.from_items(items)

Remember, we can import and run functions like dbscan within Python

In [30]:

r_dbscan = importr("dbscan")

r_db_out = r_dbscan.dbscan(r_scaled_df, eps=0.15, minPts = 5)

Or run directly in R

In [31]:

%%R -i r_scaled_df

r_db_out = dbscan(r_scaled_df, eps=0.15, minPts=5)
fviz_cluster(r_db_out, r_scaled_df, ellipse = FALSE, geom = "point")

Exercise 3

Discussion

Use cross validation to select the optimal values of N and epsilon

In [32]:

#your code here?

Streching: The arrest data and guided practice¶

In this section you get to transfer the skills and code we learned above to new data.

Because of how much this resembles your individual HW, we'll go over the solutions in lab, but not post them in writing.

As always, we start by loading and exploring the data

In [33]:

arrest_data = pd.read_csv("data/USArrests.csv")
arrest_data['A'] = arrest_data['A'].astype('float64')
arrest_data['UrbanPop'] = arrest_data['UrbanPop'].astype('float64')
arrest_data.head()

Out[33]:

	State	M	A	UrbanPop	R
0	Alabama	13.2	236.0	58.0	21.2
1	Alaska	10.0	263.0	48.0	44.5
2	Arizona	8.1	294.0	80.0	31.0
3	Arkansas	8.8	190.0	50.0	19.5
4	California	9.0	276.0	91.0	40.6

Exercise

Scale the data

Discussion

Or don't

In [34]:

#your code here

Exercise

Convert the pandas dataframe to R

In [35]:

#your code here

Exercise

How many [KMeans] clusters do the 50 states fall into?
Remember: we've seen three different methods for determining number of clusters

In [36]:

#your code here

In [37]:

#your code here

In [38]:

#your code here

Exercise

Fit a k-means cclustering with the number of clusters you think is best

In [39]:

#your code here

Exercise

How good is your clustering?

In [40]:

#your code here

Exercise

Run an aglomerative clustering.
What's the benefit of this method? How many clusters does it suggest?

In [41]:

#your code here

Exercise

Run dbscan on this data. Remember to tune epsilon.
How well does it perform?

In [42]:

#your code here

Discussion

How did you synthsize the different suggestions for number of clusters?
What method, and what clustering seems correct?
Why is clustering useful?

Bonus: A Hint About Why Clustering The States Is Hard¶

In [43]:

numeric_cols = ['M','A','UrbanPop','R']
pca = PCA()
scaled_df = pd.DataFrame(StandardScaler().fit_transform(arrest_data[numeric_cols]),
                         columns = arrest_data[numeric_cols].columns,
                         index = arrest_data.index) 
pca_df = pca.fit_transform(scaled_df)

In [44]:

def biplot(scaled_data, fitted_pca, axis_labels, point_labels):
    
    pca_results = fitted_pca.transform(scaled_data)
    pca1_scores = pca_results[:,0]
    pca2_scores = pca_results[:,1]
    
    # plot each point in 2D post-PCA space
    plt.scatter(pca1_scores,pca2_scores)
    
    # label each point
    for i in range(len(pca1_scores)):
        plt.text(pca1_scores[i],pca2_scores[i], point_labels[i])
    
    #for each original dimension, plot what an increase of 1 in that dimension means in this space
    for i in range(fitted_pca.components_.shape[1]):
        raw_dims_delta_on_pca1 = fitted_pca.components_[0,i]
        raw_dims_delta_on_pca2 = fitted_pca.components_[1,i]
        plt.arrow(0, 0, raw_dims_delta_on_pca1, raw_dims_delta_on_pca2 ,color = 'r',alpha = 1)
        plt.text(raw_dims_delta_on_pca1*1.1, raw_dims_delta_on_pca2*1.1, axis_labels[i], color = 'g', ha = 'center', va = 'center')

plt.figure(figsize=(11,8.5))
plt.xlim(-3.5,3.5)
plt.ylim(-3.5,3.5)
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()
biplot(scaled_df, pca, axis_labels=scaled_df.columns, point_labels=arrest_data['State'])