Key Word(s): clustering, PCA, K-means, silhouette, elbow method, gap statistic, hierarchical clustering, DBSCAN
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
Old Faithful and Clustering¶
faithful = pd.read_csv("data/faithful.csv")
display(faithful.head())
display(faithful.describe())
import seaborn as sns
plt.figure(figsize=(10,5))
plt.scatter(faithful["eruptions"], faithful["waiting"])
plt.xlabel("eruptions")
plt.ylabel("waiting")
plt.xlim(0,6)
plt.ylim(30,100)
plt.figure(figsize=(10,5))
sns.kdeplot(x=faithful["eruptions"], y=faithful["waiting"])
plt.scatter(faithful["eruptions"], faithful["waiting"])
plt.xlim(0,6)
plt.ylim(30,100)
plt.show()
There are two distinct modes to the data: one with eruption durations of roughly 1 to 3 minutes and short waiting times, and a second cluster with longer eruptions and longer waiting times. Notably, there are very few eruptions in the middle.
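As a quick check on that last claim, we can count how many eruptions fall between the two modes (the 2.5 to 3.5 minute band below is an arbitrary choice, added here for illustration):
# Count eruptions whose duration falls between the two modes
middle = faithful["eruptions"].between(2.5, 3.5)
print("eruptions between 2.5 and 3.5 minutes:", middle.sum(), "out of", len(faithful))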
Review: PCA¶
First, we import data on different types of crime in each US state
USArrests = pd.read_csv("data/USArrests.csv")
USArrests['StateAbbrv'] = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT","DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ","NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC","SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
display(USArrests.head())
display(USArrests.describe())
The data has more dimensions than we can easily visualize, so we use PCA to condense it. As usual, we scale the data before applying PCA. (Note that we scale everything rather than fitting a scaler on a training set and carrying it over to future data; we won't be using a test set here, so it's fine to use all the data for scaling.)
from sklearn import preprocessing
df = USArrests[['Murder','Assault','UrbanPop','Rape']]
scaled_df = pd.DataFrame(preprocessing.scale(df), index=USArrests['State'], columns = df.columns)
fitted_pca = PCA().fit(scaled_df)
USArrests_pca = fitted_pca.transform(scaled_df)
The biplot function plots the first two PCA components, and provides some helpful annotations
def biplot(scaled_data, fitted_pca, original_dim_labels, point_labels):
    pca_results = fitted_pca.transform(scaled_data)
    pca1_scores = pca_results[:,0]
    pca2_scores = pca_results[:,1]

    # plot each point in 2D post-PCA space
    plt.scatter(pca1_scores, pca2_scores)

    # label each point
    for i in range(len(pca1_scores)):
        plt.text(pca1_scores[i], pca2_scores[i], point_labels[i])

    # for each original dimension, plot what an increase of 1 in that dimension means in this space
    for i in range(fitted_pca.components_.shape[1]):
        raw_dims_delta_on_pca1 = fitted_pca.components_[0,i]
        raw_dims_delta_on_pca2 = fitted_pca.components_[1,i]
        plt.arrow(0, 0, raw_dims_delta_on_pca1, raw_dims_delta_on_pca2, color='r', alpha=1)
        plt.text(raw_dims_delta_on_pca1*1.1, raw_dims_delta_on_pca2*1.1, original_dim_labels[i], color='g', ha='center', va='center')
plt.figure(figsize=(8.5,8.5))
plt.xlim(-3.5,3.5)
plt.ylim(-3.5,3.5)
plt.xlabel("PC{}".format(1))
plt.ylabel("PC{}".format(2))
plt.grid()
biplot(scaled_df, fitted_pca,
original_dim_labels=scaled_df.columns,
point_labels=USArrests['StateAbbrv'])
The red arrows and green text give us a sense of direction. If any state had 'Murder' increase by one (scaled) unit, it would move in the direction of the 'Murder' arrow by that amount. An increase of one (scaled) unit in both 'Murder' and 'UrbanPop' would apply both moves.
We can also make inferences about what combination of crimes and population puts California at its observed point.
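To connect the arrows to concrete numbers (a small addition to the original write-up), we can display the first two rows of the loadings matrix and California's coordinates in PC space:
# Loadings: how much each original (scaled) variable contributes to PC1 and PC2
loadings = pd.DataFrame(fitted_pca.components_[:2],
                        columns=scaled_df.columns, index=['PC1', 'PC2'])
display(loadings)

# California's position in the biplot
ca = list(USArrests['StateAbbrv']).index('CA')
print("California (PC1, PC2):", USArrests_pca[ca, :2])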
Extra: Variance Captured¶
As usual, we want to know what proportion of the variance each PC captures
plt.figure(figsize=(11,8.5))
plt.plot(range(1,5),fitted_pca.explained_variance_ratio_,"-o")
plt.xlabel("Principal Component")
plt.ylabel("Proportion of Variance Explained")
plt.ylim(0,1)
plt.show()
print("Proportion of variance explained by each PC:")
print(fitted_pca.explained_variance_ratio_)
Even more usefully, we can plot how much of the total variation we'd capture by using the first N PCs. The two-PC plot above captures 86.7% of the total variance.
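To verify that number, we can sum the first two explained-variance ratios directly:
print("Variance captured by the first two PCs:",
      fitted_pca.explained_variance_ratio_[:2].sum())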
plt.figure(figsize=(11,8.5))
plt.plot(range(1,5),np.cumsum(fitted_pca.explained_variance_ratio_),"-o")
plt.xlabel("Principal Component")
plt.ylabel("Cumulative Proportion of Variance Explained")
plt.ylim(0,1.1)
plt.show()
print("Total variance captured when using N PCA components:")
print(np.cumsum(fitted_pca.explained_variance_ratio_))
Scaling and Distances¶
Returning to the arrest/crime data, we again inspect a sample of the data, both raw and scaled
np.random.seed(123)
arrests_sample = USArrests.sample(6)
arrests_sample
np.random.seed(123)
np.round(scaled_df.sample(6),2)
Distances¶
One of the key ideas in clustering is the distance or dissimilarity between points. Euclidean distance is common, though one is free to define domain-specific measures of how similar/distant two observations are.
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
The pdist function computes the distances between all pairs of data points (which can be quite expensive for large data). It returns those distances in a condensed format that stores each pairwise distance only once; squareform expands that condensed result into the familiar square, symmetric distance matrix.
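For intuition about the two formats, here is a tiny made-up example with three 2-D points (not part of the arrest data):
# pdist returns the 3 pairwise distances in condensed form;
# squareform expands them into a symmetric 3x3 matrix with zeros on the diagonal.
tiny = np.array([[0, 0], [3, 4], [6, 8]])
print(pdist(tiny, metric="euclidean"))              # distances (0,1), (0,2), (1,2)
print(squareform(pdist(tiny, metric="euclidean")))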
The distances between a handful of states are shown below. Hawaii and Indiana are relatively similar on these variables, while Maine and New Mexico are relatively different.
dist_eucl = pdist(scaled_df,metric="euclidean")
distances = pd.DataFrame(squareform(dist_eucl), index=USArrests["State"].values, columns=USArrests["State"].values)
sample_distances = distances.loc[arrests_sample["State"], arrests_sample["State"]]
sample_distances
For visualization, we can make a heatmap of the sample states' distances
plt.figure(figsize=(11,8.5))
sns.heatmap(sample_distances,cmap="mako")
plt.show()
We can likewise heatmap all the states.
plt.figure(figsize=(11,8.5))
sns.heatmap(distances)
plt.show()
Kmeans¶
K-means is a classical, workhorse clustering algorithm and a common place to start. It assumes there are K centers and, starting from random guesses, iteratively improves its guess about where the centers must be.
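To make the "starting from random guesses" part concrete, here is a minimal sketch of the underlying loop (Lloyd's algorithm). This is an illustration only, not sklearn's implementation (which adds k-means++ initialization, multiple restarts, and convergence checks); the helper name naive_kmeans is made up here, and it assumes no cluster ever ends up empty.
from scipy.spatial.distance import cdist

def naive_kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random starting centers
    for _ in range(n_iter):
        labels = cdist(X, centers).argmin(axis=1)            # assign each point to its nearest center
        centers = np.array([X[labels == j].mean(axis=0)      # move each center to the mean of its points
                            for j in range(k)])
    return centers, labels

naive_centers, naive_labels = naive_kmeans(scaled_df.values, 3)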
from sklearn.cluster import KMeans
#random_state parameter sets seed for random number generation
arrests_km = KMeans(n_clusters=3,n_init=25,random_state=123).fit(scaled_df)
arrests_km.cluster_centers_
We can read off where the 3 cluster centers are. (The value 3 is chosen arbitrarily; soon we'll see how to tell what number of clusters seems to work best.)
pd.DataFrame(arrests_km.cluster_centers_,columns=['Murder','Assault','UrbanPop','Rape'])
The .labels_ attribute tells us which cluster each point was assigned to
scaled_df_cluster = scaled_df.copy()
scaled_df_cluster['Cluster'] = arrests_km.labels_
scaled_df_cluster.head()
The mean of the points in each cluster is the cluster center found by K-means
scaled_df_cluster.groupby('Cluster').mean()
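As a quick check (added here), these per-cluster means should match the fitted centers up to floating-point error, since the rows of .cluster_centers_ are ordered by cluster label:
print(np.allclose(scaled_df_cluster.groupby('Cluster').mean().values,
                  arrests_km.cluster_centers_))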
Silhouette Plots¶
Silhouette plots give rich information on the quality of a clustering
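For reference, the silhouette value of a single point is s = (b - a) / max(a, b), where a is the point's mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. As a sanity check (added here, not part of the original lab), we can compute this by hand for one state and compare it with sklearn's silhouette_samples, using the 3-cluster fit from above:
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

X_arr = scaled_df.values
km_labels = arrests_km.labels_
i = 0                                                        # first state in the table
same = km_labels == km_labels[i]
a = cdist(X_arr[[i]], X_arr[same]).sum() / (same.sum() - 1)  # mean distance to own cluster, excluding the point itself
b = min(cdist(X_arr[[i]], X_arr[km_labels == k]).mean()      # mean distance to the closest other cluster
        for k in set(km_labels) if k != km_labels[i])
print("by hand:", (b - a) / max(a, b),
      "| sklearn:", silhouette_samples(X_arr, km_labels)[i])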
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
#modified code from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
def silplot(X, cluster_labels, clusterer, pointlabels=None):
    n_clusters = clusterer.n_clusters

    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(11, 8.5)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])

    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters = ", n_clusters,
          ", the average silhouette_score is ", silhouette_avg, ".", sep="")

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)

    # axes will be first 2 PCA components
    pca = PCA(n_components=2).fit(X)
    X_pca = pca.transform(X)
    ax2.scatter(X_pca[:, 0], X_pca[:, 1], marker='.', s=200, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    xs = X_pca[:, 0]
    ys = X_pca[:, 1]
    if pointlabels is not None:
        for i in range(len(xs)):
            plt.text(xs[i], ys[i], pointlabels[i])

    # Labeling the clusters (transform to PCA space for plotting)
    centers = pca.transform(clusterer.cluster_centers_)

    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % int(i), alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("PC1")
    ax2.set_ylabel("PC2")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
fitted_km = KMeans(n_clusters=4,n_init=25,random_state=123).fit(scaled_df)
cluster_labels = fitted_km.labels_
silplot(scaled_df.values, cluster_labels, fitted_km)
# Objects with negative silhouette
sil = silhouette_samples(scaled_df, fitted_km.labels_)
USArrests.loc[sil<=0,:]
Elbow plots¶
An elbow plot shows the total within-cluster sum of squares (sklearn's inertia_) as a function of the number of clusters; a reasonable k is where the curve bends and additional clusters yield only small improvements.
wss = []
for i in range(1,11):
    fitx = KMeans(n_clusters=i, init='random', n_init=5, random_state=109).fit(scaled_df)
    wss.append(fitx.inertia_)
plt.figure(figsize=(11,8.5))
plt.plot(range(1,11), wss, 'bx-')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal $k$')
plt.show()
Silhouette Score¶
from sklearn.metrics import silhouette_score
scores = [0] # silhouette score for 1 cluster
for i in range(2,11):
    fitx = KMeans(n_clusters=i, init='random', n_init=5, random_state=109).fit(scaled_df)
    score = silhouette_score(scaled_df, fitx.labels_)
    scores.append(score)
print("Optimized at", max(range(len(scores)), key=scores.__getitem__)+1, "clusters")
plt.figure(figsize=(11,8.5))
plt.plot(range(1,11), np.array(scores), 'bx-')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Average Silhouette')
plt.title('The Silhouette Method showing the optimal $k$')
plt.show()
Gap Statistic¶
The gap statistic compares the within-cluster dispersion at each k to what we would expect under a reference distribution with no cluster structure, and favors values of k where the gap between the two is large.
# need to install the 'gap-stat' package first (pip install gap-stat)
from gap_statistic import OptimalK
gs_obj = OptimalK()
n_clusters = gs_obj(scaled_df.values, n_refs=50, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)
gs_obj.gap_df.head()
gs_obj.plot_results()
Hierarchical Clustering¶
K-means is a very 'hard' clustering: points belong to exactly one cluster, no matter what. Hierarchical clustering instead creates a nesting of clusters, built by repeatedly merging (or splitting) existing clusters.
Dendrograms (literally: tree diagrams) show the pattern of splits/merges.
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist
plt.figure(figsize=(11,8.5))
dist_mat = pdist(scaled_df, metric="euclidean")
ward_data = hac.ward(dist_mat)
hac.dendrogram(ward_data, labels=USArrests["State"].values);
plt.show()
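The dendrogram can be cut at any height to obtain a flat clustering. As an illustration (not part of the original lab), cutting the Ward tree into three clusters:
# fcluster cuts the linkage tree into a requested number of flat clusters
flat_clusters = hac.fcluster(ward_data, t=3, criterion='maxclust')
pd.Series(flat_clusters, index=USArrests["State"].values).head(10)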
DBSCAN¶
DBSCAN is a more modern clustering approach that allows points to not be part of any cluster, and determines the number of clusters by itself.
First, let's look at our data
multishapes = pd.read_csv("data/multishapes.csv")
ms = multishapes[['x','y']]
msplot = ms.plot.scatter(x='x',y='y',c='Black',title="Multishapes data",figsize=(11,8.5))
msplot.set_xlabel("X")
msplot.set_ylabel("Y")
plt.show()
To the eye, there's a pretty clear structure to the data
However, K-means struggles to find a good clustering
shape_km = KMeans(n_clusters=5,n_init=25,random_state=123).fit(ms)
plt.figure(figsize=(10,10))
plt.scatter(ms['x'],ms['y'], c=shape_km.labels_);
plt.scatter(shape_km.cluster_centers_[:,0], shape_km.cluster_centers_[:,1],
            c='r', marker='h', s=100);  # red hexagons mark the K-means cluster centers
DBSCAN uses a handful of parameters, including the number of neighbors a point must have to be considered 'core' (min_samples) and the distance within which those neighbors must fall (epsilon). Most reasonable values of min_samples yield the same results, but tuning epsilon is important.
The function below implements the authors' suggestion for setting epsilon: look at the nearest-neighbor distances and find a level where they begin to grow rapidly.
from sklearn.neighbors import NearestNeighbors
def plot_epsilon(df, min_samples):
    fitted_neighbors = NearestNeighbors(n_neighbors=min_samples).fit(df)
    distances, indices = fitted_neighbors.kneighbors(df)
    dist_to_nth_nearest_neighbor = distances[:,-1]
    plt.plot(np.sort(dist_to_nth_nearest_neighbor))
    plt.xlabel("Index\n(sorted by increasing distances)")
    plt.ylabel("{}-NN Distance (epsilon)".format(min_samples-1))
    plt.tick_params(right=True, labelright=True)
plot_epsilon(ms, 3)
The sharp increase in slope occurs around eps=0.15 when min_samples is set to 3.
from sklearn.cluster import DBSCAN
fitted_dbscan = DBSCAN(eps=0.15).fit(ms)
plt.figure(figsize=(10,10))
plt.scatter(ms['x'],ms['y'], c=fitted_dbscan.labels_);
We see good results with the suggested epsilon. A lower epsilon (0.12) won't quite merge all of the clusters.
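A quick summary of what DBSCAN found (added here; the label -1 marks noise points that belong to no cluster):
ms_labels = fitted_dbscan.labels_
n_found = len(set(ms_labels)) - (1 if -1 in ms_labels else 0)
print("clusters found:", n_found, "| noise points:", (ms_labels == -1).sum())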
DBSCAN on crime data¶
Returning to the crime data, let's tune epsilon and see what clusters are returned
plot_epsilon(scaled_df, 5)
The optimal value is somewhere around 1.4 to 1.6. Choosing one value over the other does not result in a different clustering.
fitted_dbscan = DBSCAN(eps=1.4,min_samples=5).fit(scaled_df)
fitted_dbscan.labels_
At this epsilon and min_samples, all but one state are included in cluster 0. The remaining point (Alaska) is not part of any cluster. DBSCAN is not particularly effective on this data set.
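We can confirm which state is treated as noise by selecting the points labeled -1:
# Rows labeled -1 belong to no DBSCAN cluster
USArrests.loc[fitted_dbscan.labels_ == -1, ['State', 'Murder', 'Assault', 'UrbanPop', 'Rape']]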