


Title

Clustering Example Notebook

Description:

This notebook provides example code based on the lecture material.

If you wish to run or edit the notebook, we recommend downloading it and running it either on your local machine or on JupyterHub.

In [1]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

Old Faithful and Clustering

In [2]:
faithful = pd.read_csv("data/faithful.csv")
display(faithful.head())
display(faithful.describe())
eruptions waiting
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62
4 4.533 85
eruptions waiting
count 272.000000 272.000000
mean 3.487783 70.897059
std 1.141371 13.594974
min 1.600000 43.000000
25% 2.162750 58.000000
50% 4.000000 76.000000
75% 4.454250 82.000000
max 5.100000 96.000000
In [3]:
import seaborn as sns

plt.figure(figsize=(10,5))
plt.scatter(faithful["eruptions"], faithful["waiting"])
plt.xlabel("eruptions")
plt.ylabel("waiting")
plt.xlim(0,6)
plt.ylim(30,100)

plt.figure(figsize=(10,5))
sns.kdeplot(x=faithful["eruptions"], y=faithful["waiting"])
plt.scatter(faithful["eruptions"], faithful["waiting"])
plt.xlim(0,6)
plt.ylim(30,100)
plt.show()

There are two distinct modes to the data: one cluster with eruption durations of roughly 1 to 3 minutes and short waiting times, and a second cluster with longer eruptions and longer waiting times. Notably, there are very few eruptions in the middle.
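As a quick preview of the clustering methods covered later in this notebook, here is a minimal sketch (not part of the original analysis) that fits k-means with two clusters to the scaled Old Faithful data; the resulting assignments should line up with the two visible modes.

In [ ]:
from sklearn.cluster import KMeans

# Scale first so that 'waiting' (tens of minutes) doesn't dominate 'eruptions' (a few minutes)
faithful_scaled = StandardScaler().fit_transform(faithful[["eruptions", "waiting"]])
km2 = KMeans(n_clusters=2, n_init=25, random_state=0).fit(faithful_scaled)

plt.figure(figsize=(10,5))
plt.scatter(faithful["eruptions"], faithful["waiting"], c=km2.labels_)
plt.xlabel("eruptions")
plt.ylabel("waiting")
plt.show()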

Review: PCA

First, we import data on different types of crime in each US state

In [4]:
USArrests = pd.read_csv("data/USArrests.csv")
USArrests['StateAbbrv'] = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT","DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ","NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC","SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
display(USArrests.head())
display(USArrests.describe())
State Murder Assault UrbanPop Rape StateAbbrv
0 Alabama 13.2 236 58 21.2 AL
1 Alaska 10.0 263 48 44.5 AK
2 Arizona 8.1 294 80 31.0 AZ
3 Arkansas 8.8 190 50 19.5 AR
4 California 9.0 276 91 40.6 CA
Murder Assault UrbanPop Rape
count 50.00000 50.000000 50.000000 50.000000
mean 7.78800 170.760000 65.540000 21.232000
std 4.35551 83.337661 14.474763 9.366385
min 0.80000 45.000000 32.000000 7.300000
25% 4.07500 109.000000 54.500000 15.075000
50% 7.25000 159.000000 66.000000 20.100000
75% 11.25000 249.000000 77.750000 26.175000
max 17.40000 337.000000 91.000000 46.000000

The data has more dimensions than we can easily visualize, so we use PCA to condense it. As usual, we scale the data before applying PCA. (Note that we scale everything, rather than fitting the scaler on a training set and carrying that scaling over to future data; we won't be using a test set here, so it's fine to use all the data for scaling.)

In [5]:
from sklearn import preprocessing
df = USArrests[['Murder','Assault','UrbanPop','Rape']]
scaled_df = pd.DataFrame(preprocessing.scale(df), index=USArrests['State'], columns = df.columns)

fitted_pca = PCA().fit(scaled_df)
USArrests_pca = fitted_pca.transform(scaled_df)
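As an aside, the StandardScaler imported at the top of the notebook produces the same standardization as preprocessing.scale. A minimal sketch of the equivalent fit/transform pattern, which is handy when the same scaling must later be applied to new data:

In [ ]:
# Equivalent standardization via StandardScaler (imported above).
# The fitted scaler can be reused on new data with scaler.transform(new_df).
scaler = StandardScaler().fit(df)
scaled_df_alt = pd.DataFrame(scaler.transform(df), index=USArrests['State'], columns=df.columns)

# Both approaches give the same values (up to floating-point error)
print(np.allclose(scaled_df_alt.values, scaled_df.values))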

The biplot function below plots the data on the first two principal components and adds some helpful annotations.

In [6]:
def biplot(scaled_data, fitted_pca, original_dim_labels, point_labels):

    pca_results = fitted_pca.transform(scaled_data)
    pca1_scores = pca_results[:,0]
    pca2_scores = pca_results[:,1]

    # plot each point in 2D post-PCA space
    plt.scatter(pca1_scores,pca2_scores)

    # label each point
    for i in range(len(pca1_scores)):
        plt.text(pca1_scores[i],pca2_scores[i], point_labels[i])

    #for each original dimension, plot what an increase of 1 in that dimension means in this space
    for i in range(fitted_pca.components_.shape[1]):
        raw_dims_delta_on_pca1 = fitted_pca.components_[0,i]
        raw_dims_delta_on_pca2 = fitted_pca.components_[1,i]
        plt.arrow(0, 0, raw_dims_delta_on_pca1, raw_dims_delta_on_pca2 ,color = 'r',alpha = 1)
        plt.text(raw_dims_delta_on_pca1*1.1, raw_dims_delta_on_pca2*1.1, original_dim_labels[i], color = 'g', ha = 'center', va = 'center')

plt.figure(figsize=(8.5,8.5))
plt.xlim(-3.5,3.5)
plt.ylim(-3.5,3.5)
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()

biplot(scaled_df, fitted_pca,
       original_dim_labels=scaled_df.columns,
       point_labels=USArrests['StateAbbrv'])

The red arrows and green text give us a sense of direction. If any state had 'Murder' increase by one (scaled) unit, it would move in the direction of the 'Murder' arrow by that amount. An increase of one (scaled) unit in both 'Murder' and 'UrbanPop' would apply both moves.

We can also make inferences about what combination of crimes and urban population puts California at its observed point.
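To make that concrete, a state's PC scores are just its (centered) scaled feature values projected onto the PCA loading vectors. A minimal sketch checking this by hand for California:

In [ ]:
# California's first two PC scores, computed by hand from the loadings
cal = scaled_df.loc['California']
print((cal - fitted_pca.mean_) @ fitted_pca.components_[:2].T)

# ...which matches the corresponding row of the transformed data
cal_idx = USArrests.index[USArrests['State'] == 'California'][0]
print(USArrests_pca[cal_idx, :2])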

Extra: Variance Captured

As usual, we want to know what proportion of the variance each PC captures.

In [7]:
plt.figure(figsize=(11,8.5))
plt.plot(range(1,5),fitted_pca.explained_variance_ratio_,"-o")
plt.xlabel("Principal Component")
plt.ylabel("Proportion of Variance Explained")
plt.ylim(0,1)
plt.show()

print("Proportion of variance explained by each PC:")
print(fitted_pca.explained_variance_ratio_)
Proportion of variance explained by each PC:
[0.62006039 0.24744129 0.0891408  0.04335752]

Even more usefully, we can plot how much of the total variance we'd capture by using the first N PCs. The two-PC biplot above captures 86.7% of the total variance.

In [8]:
plt.figure(figsize=(11,8.5))
plt.plot(range(1,5),np.cumsum(fitted_pca.explained_variance_ratio_),"-o")
plt.xlabel("Principal Component")
plt.ylabel("Cumulative Proportion of Variance Explained")
plt.ylim(0,1.1)
plt.show()

print("Total variance capturted when using N PCA components:")
print(np.cumsum(fitted_pca.explained_variance_ratio_))
Total variance capturted when using N PCA components:
[0.62006039 0.86750168 0.95664248 1.        ]
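A common follow-up question is how many components are needed to reach some target fraction of the variance, say 90%. A minimal sketch:

In [ ]:
# Smallest number of PCs capturing at least 90% of the total variance
cum_var = np.cumsum(fitted_pca.explained_variance_ratio_)
print(int(np.argmax(cum_var >= 0.90)) + 1)  # 3, given the proportions printed above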

Scaling and Distances

Returning to the arrest/crime data, we inspect a random sample of states, first on the raw scale and then after standardization.

In [9]:
np.random.seed(123)
arrests_sample = USArrests.sample(6)
arrests_sample
Out[9]:
State Murder Assault UrbanPop Rape StateAbbrv
10 Hawaii 5.3 46 83 20.2 HI
13 Indiana 7.2 113 65 21.0 IN
30 New Mexico 11.4 285 70 32.1 NM
46 Washington 4.0 145 73 26.2 WA
18 Maine 2.1 83 51 7.8 ME
0 Alabama 13.2 236 58 21.2 AL
In [10]:
np.random.seed(123)
np.round(scaled_df.sample(6),2)
Out[10]:
Murder Assault UrbanPop Rape
State
Hawaii -0.58 -1.51 1.22 -0.11
Indiana -0.14 -0.70 -0.04 -0.03
New Mexico 0.84 1.38 0.31 1.17
Washington -0.88 -0.31 0.52 0.54
Maine -1.32 -1.06 -1.01 -1.45
Alabama 1.26 0.79 -0.53 -0.00

Distances

One of the key ideas in clustering is the distance or dissimilarity between points. Euclidean distance is common, though one is free to define domain-specific measures of how similar or distant two observations are.

In [11]:
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

The pdist function computes the distances between all pairs of data points (which can be quite expensive for large data sets). squareform turns the condensed result into a square numpy array (the condensed format avoids storing redundant values).

The distances between a handful of states are shown below. Hawaii and Indiana are relatively similar on these variables, while Maine and New Mexico are relatively different.

In [12]:
dist_eucl = pdist(scaled_df,metric="euclidean")
distances = pd.DataFrame(squareform(dist_eucl), index=USArrests["State"].values, columns=USArrests["State"].values)

sample_distances = distances.loc[arrests_sample["State"], arrests_sample["State"]]
sample_distances
Out[12]:
Hawaii Indiana New Mexico Washington Maine Alabama
Hawaii 0.000000 1.561769 3.586656 1.560979 2.743631 3.422932
Indiana 1.561769 0.000000 2.617305 1.152154 2.124266 2.097219
New Mexico 3.586656 2.617305 0.000000 2.504780 4.390177 1.615635
Washington 1.560979 1.152154 2.504780 0.000000 2.655948 2.675068
Maine 2.743631 2.124266 4.390177 2.655948 0.000000 3.520494
Alabama 3.422932 2.097219 1.615635 2.675068 3.520494 0.000000
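As noted above, Euclidean distance is just one choice of dissimilarity. A minimal sketch of the same computation with the Manhattan (city-block) metric:

In [ ]:
# Same pairwise distances, but using the Manhattan (city-block) metric
dist_manh = pdist(scaled_df, metric="cityblock")
distances_manh = pd.DataFrame(squareform(dist_manh),
                              index=USArrests["State"].values,
                              columns=USArrests["State"].values)
distances_manh.loc[arrests_sample["State"], arrests_sample["State"]].round(2)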

For visualization, we can make a heatmap of the sample states' distances.

In [13]:
plt.figure(figsize=(11,8.5))
sns.heatmap(sample_distances,cmap="mako")
plt.show()

We can likewise plot a heatmap of the distances between all the states.

In [14]:
import seaborn as sns

plt.figure(figsize=(11,8.5))
sns.heatmap(distances)
plt.show()

K-means

K-means is a classical, workhorse clustering algorithm and a common place to start. It assumes there are K cluster centers and, starting from random guesses, iteratively improves its estimate of where those centers are.
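To make "iteratively improves its estimate" concrete, here is a minimal NumPy sketch of a single Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its points). The scikit-learn call below handles the full loop, multiple restarts, and convergence checking for us.

In [ ]:
# One illustrative Lloyd iteration with k=3, starting from randomly chosen data points
X = scaled_df.values
rng = np.random.default_rng(0)
centers = X[rng.choice(len(X), size=3, replace=False)]

# Assignment step: label each point with its nearest center
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # shape (n_points, k)
labels = dists.argmin(axis=1)

# Update step: move each center to the mean of its assigned points
centers = np.array([X[labels == j].mean(axis=0) for j in range(3)])

# K-means repeats these two steps until the assignments stop changing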

In [15]:
from sklearn.cluster import KMeans

#random_state parameter sets seed for random number generation
arrests_km = KMeans(n_clusters=3,n_init=25,random_state=123).fit(scaled_df)

arrests_km.cluster_centers_
Out[15]:
array([[-0.49440658, -0.3864845 ,  0.58167593, -0.26431024],
       [ 1.01513667,  1.02412028,  0.19959126,  0.85556386],
       [-0.88515915, -1.0213324 , -0.94990286, -0.92016524]])

We can read off where the 3 cluster centers are. (The value 3 is chosen arbitrarily; soon we'll see how to tell what number of clusters works best.)

In [16]:
pd.DataFrame(arrests_km.cluster_centers_,columns=['Murder','Assault','UrbanPop','Rape'])
Out[16]:
Murder Assault UrbanPop Rape
0 -0.494407 -0.386484 0.581676 -0.264310
1 1.015137 1.024120 0.199591 0.855564
2 -0.885159 -1.021332 -0.949903 -0.920165

The .labels_ attribute tells us which cluster each point was assigned to.

In [17]:
scaled_df_cluster = scaled_df.copy()
scaled_df_cluster['Cluster'] = arrests_km.labels_
scaled_df_cluster.head()
Out[17]:
Murder Assault UrbanPop Rape Cluster
State
Alabama 1.255179 0.790787 -0.526195 -0.003451 1
Alaska 0.513019 1.118060 -1.224067 2.509424 1
Arizona 0.072361 1.493817 1.009122 1.053466 1
Arkansas 0.234708 0.233212 -1.084492 -0.186794 2
California 0.281093 1.275635 1.776781 2.088814 1

The mean of the points in each cluster is the cluster center found by K-means

In [18]:
scaled_df_cluster.groupby('Cluster').mean()
Out[18]:
Murder Assault UrbanPop Rape
Cluster
0 -0.494407 -0.386484 0.581676 -0.264310
1 1.015137 1.024120 0.199591 0.855564
2 -0.885159 -1.021332 -0.949903 -0.920165

Silhouette Plots

Silhouette plots give rich information on the quality of a clustering. Each point's silhouette coefficient, s = (b - a) / max(a, b), compares its mean distance a to the other points in its own cluster with its mean distance b to the points in the nearest other cluster; values near 1 indicate a well-separated point, values near 0 a point on a cluster boundary, and negative values a point that may be in the wrong cluster.
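As a sanity check on that definition, a minimal sketch computing one state's silhouette by hand from the distance matrix and the k=3 labels found above, and comparing it with scikit-learn's value:

In [ ]:
from sklearn.metrics import silhouette_samples

# Hand-compute the silhouette coefficient for Alabama
labels = arrests_km.labels_
i = USArrests.index[USArrests['State'] == 'Alabama'][0]

d_i = distances.iloc[i]                                        # distances from Alabama to every state
same = (labels == labels[i]) & (np.arange(len(labels)) != i)   # other members of Alabama's cluster
a = d_i[same].mean()
b = min(d_i[labels == k].mean() for k in set(labels) if k != labels[i])

print("by hand:", (b - a) / max(a, b))
print("sklearn:", silhouette_samples(scaled_df, labels)[i])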

In [19]:
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
#modified code from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

def silplot(X, cluster_labels, clusterer, pointlabels=None):
    n_clusters = clusterer.n_clusters

    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(11,8.5)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])

    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters = ", n_clusters,
          ", the average silhouette_score is ", silhouette_avg,".",sep="")

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)

    # axes will be first 2 PCA components
    pca = PCA(n_components=2).fit(X)
    X_pca = pca.transform(X)
    ax2.scatter(X_pca[:, 0], X_pca[:, 1], marker='.', s=200, lw=0, alpha=0.7,
                c=colors, edgecolor='k')
    xs = X_pca[:, 0]
    ys = X_pca[:, 1]


    if pointlabels is not None:
        for i in range(len(xs)):
            plt.text(xs[i],ys[i],pointlabels[i])

    # Labeling the clusters (transform to PCA space for plotting)
    centers = pca.transform(clusterer.cluster_centers_)
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % int(i), alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("PC1")
    ax2.set_ylabel("PC2")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
In [20]:
fitted_km = KMeans(n_clusters=4,n_init=25,random_state=123).fit(scaled_df)
cluster_labels = fitted_km.labels_

silplot(scaled_df.values, cluster_labels, fitted_km)
For n_clusters = 4, the average silhouette_score is 0.33968891433344395.
In [21]:
# Objects with negative silhouette
sil = silhouette_samples(scaled_df, fitted_km.labels_)

USArrests.loc[sil<=0,:]
Out[21]:
State Murder Assault UrbanPop Rape StateAbbrv
24 Missouri 9.0 178 70 28.2 MO

Elbow plots

An elbow plot fits K-means for a range of values of k and plots the within-cluster sum of squares (sklearn's inertia_) against k. We look for the 'elbow': the value of k beyond which adding more clusters stops producing large reductions in inertia.

In [22]:
wss = []
for i in range(1,11):
    fitx = KMeans(n_clusters=i, init='random', n_init=5, random_state=109).fit(scaled_df)
    wss.append(fitx.inertia_)

plt.figure(figsize=(11,8.5))
plt.plot(range(1,11), wss, 'bx-')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal $k$')
plt.show()

Silhouette Score

Another approach is to compute the average silhouette score across all points for each candidate number of clusters and choose the k that maximizes it.

In [23]:
from sklearn.metrics import silhouette_score

scores = [0]  # placeholder: the silhouette score is undefined for a single cluster
for i in range(2,11):
    fitx = KMeans(n_clusters=i, init='random', n_init=5, random_state=109).fit(scaled_df)
    score = silhouette_score(scaled_df, fitx.labels_)
    scores.append(score)

print("Optimized at", max(range(len(scores)), key=scores.__getitem__)+1, "clusters")

plt.figure(figsize=(11,8.5))
plt.plot(range(1,11), np.array(scores), 'bx-')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Average Silhouette')
plt.title('The Silhouette Method showing the optimal $k$')
plt.show()
Optimized at 2 clusters

Gap Statistic

The gap statistic compares the within-cluster dispersion of each clustering to the dispersion expected under a reference (uniform random) null distribution; larger gaps indicate more cluster structure than we would expect from featureless data.

In [39]:
# requires the external 'gap-stat' package (pip install gap-stat)
from gap_statistic import OptimalK

gs_obj = OptimalK()

n_clusters = gs_obj(scaled_df.values, n_refs=50, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)
Optimal clusters:  4
In [40]:
gs_obj.gap_df.head()
Out[40]:
n_clusters gap_value gap* ref_dispersion_std sk sk* diff diff*
0 1.0 0.222159 50.359063 17.148526 0.070837 53.728106 -0.295650 52.722809
1 2.0 0.601961 87.335301 16.146239 0.084152 89.699046 0.076047 89.206030
2 3.0 0.610179 70.534332 12.829924 0.084266 72.405061 -0.072176 71.992951
3 4.0 0.765419 68.017105 10.291019 0.083064 69.475724 0.380796 69.630492
4 5.0 0.478021 41.794611 10.025900 0.093398 43.407999 -0.180014 42.850792
In [41]:
gs_obj.plot_results()

Hierarchical Clustering

K-means is a very 'hard' clustering: points belong to exactly one cluster, no matter what. A hierarchical clustering instead creates a nesting of clusters, formed as existing clusters are merged or split.

Dendrograms (literally, tree diagrams) show the pattern of splits and merges.

In [27]:
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

plt.figure(figsize=(11,8.5))
dist_mat = pdist(scaled_df, metric="euclidean")
ward_data = hac.ward(dist_mat)

hac.dendrogram(ward_data, labels=USArrests["State"].values);
plt.show()
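To get a flat clustering out of the hierarchy, we can cut the tree at a chosen number of clusters. A minimal sketch using scipy's fcluster on the ward_data computed above:

In [ ]:
# Cut the Ward tree into 4 flat clusters (labels are 1-based) and count the cluster sizes
flat_labels = hac.fcluster(ward_data, t=4, criterion='maxclust')
pd.Series(flat_labels, index=USArrests['State'].values).value_counts()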

DBSCAN

DBSCAN is a more modern clustering approach that allows points to not be part of any cluster, and determines the number of clusters by itself.

First, let's look at our data.

In [28]:
multishapes = pd.read_csv("data/multishapes.csv")
ms = multishapes[['x','y']]
In [29]:
msplot = ms.plot.scatter(x='x',y='y',c='Black',title="Multishapes data",figsize=(11,8.5))
msplot.set_xlabel("X")
msplot.set_ylabel("Y")
plt.show()

To the eye, there's a pretty clear structure to the data

However, K-means struggles to find a good clustering

In [30]:
shape_km = KMeans(n_clusters=5,n_init=25,random_state=123).fit(ms)
In [31]:
plt.figure(figsize=(10,10))
plt.scatter(ms['x'],ms['y'], c=shape_km.labels_);
plt.scatter(shape_km.cluster_centers_[:,0],shape_km.cluster_centers_[:,1], c='r', marker='h', s=100);
#todo: labels? different markers?

DBSCAN uses a handful of parameters, including the number of neighbors a point must have to be considered 'core' (min_samples) and the distance within which those neighbors must fall (epsilon). Most reasonable values of min_samples yield the same results, but tuning epsilon is important.

The function below implements the authors' suggestion for setting epsilon: look at each point's nearest-neighbor distances and find the level at which they begin to grow rapidly.

In [32]:
from sklearn.neighbors import NearestNeighbors

def plot_epsilon(df, min_samples):

    fitted_neighbors = NearestNeighbors(n_neighbors=min_samples).fit(df)
    distances, indices = fitted_neighbors.kneighbors(df)

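    # kneighbors on the training data returns each point as its own 0th neighbor (distance 0),
    # so the last column holds the distance to the (min_samples-1)-th nearest other point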
    dist_to_nth_nearest_neighbor = distances[:,-1]
    plt.plot(np.sort(dist_to_nth_nearest_neighbor))
    plt.xlabel("Index\n(sorted by increasing distances)")
    plt.ylabel("{}-NN Distance (epsilon)".format(min_samples-1))
    plt.tick_params(right=True, labelright=True)
In [33]:
plot_epsilon(ms, 3)

The major slope occurs around eps=0.15 when min_samples is set to 3.

In [34]:
from sklearn.cluster import DBSCAN

fitted_dbscan = DBSCAN(eps=0.15).fit(ms)

plt.figure(figsize=(10,10))
plt.scatter(ms['x'],ms['y'], c=fitted_dbscan.labels_);

We see good results with the suggested epsilon. A lower epsilon (0.12) won't quite merge all the clusters.

DBSCAN on crime data

Returning to the crime data, let's tune epsilon and see what clusters are returned

In [35]:
plot_epsilon(scaled_df, 5)

The elbow occurs somewhere around 1.4 to 1.6. Choosing a value at either end of that range does not change the resulting clustering.

In [36]:
fitted_dbscan = DBSCAN(eps=1.4,min_samples=5).fit(scaled_df)
fitted_dbscan.labels_
Out[36]:
array([ 0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

At this epsilon and min_samples, all but one state are assigned to cluster 0. The remaining point (Alaska, labeled -1) is not part of any cluster. DBSCAN is not particularly effective on this data set.
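A quick way to see this is to tabulate the label counts (DBSCAN uses -1 for noise points). A minimal sketch:

In [ ]:
# Count how many states fall in each DBSCAN cluster (-1 = noise)
print(pd.Series(fitted_dbscan.labels_, index=USArrests['State'].values).value_counts())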