What is Silhouette analysis (S.A.)?: S.A. is a way to measure how close each point in a cluster is to the points in its neighboring clusters. It’s a neat way to find the optimum value for k during k-means clustering. Silhouette values lie in the range [-1, 1]. A value of +1 indicates that the sample is far away from its neighboring cluster and very close to the cluster it is assigned to. Similarly, a value of -1 indicates that the point is closer to its neighboring cluster than to the cluster it is assigned to. A value of 0 means the point lies on the boundary between the two clusters. A value of +1 is ideal and -1 is least preferred. Hence, the higher the value, the better the cluster configuration.
Mathematically: Let’s define the Silhouette for each sample in the data set.
For an example (i) in the data, let’s define a(i) to be the mean distance from point (i) to all the other points in the cluster it is assigned to (A). We can interpret a(i) as how well the point is assigned to its cluster: the smaller the value, the better the assignment.
Similarly, let’s define b(i) to be the mean distance from point (i) to the points in its closest neighboring cluster (B). Cluster (B) is the cluster that point (i) is not assigned to, but whose mean distance to point (i) is the smallest among all the other clusters.
Thus, the silhouette s(i) can be calculated as
$$s(i) = \frac{b(i) - a(i)}{\max(b(i),\, a(i))}$$
We can easily say that s(i) lies in the range of [-1,1].
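To see why the range holds, it can help to split the formula into cases. The following is just the same definition rewritten, assuming a(i) and b(i) are non-negative distances that are not both zero:

$$
s(i) =
\begin{cases}
1 - \dfrac{a(i)}{b(i)}, & a(i) < b(i) \\
0, & a(i) = b(i) \\
\dfrac{b(i)}{a(i)} - 1, & a(i) > b(i)
\end{cases}
$$

In the first case the ratio a(i)/b(i) is between 0 and 1, so s(i) is in (0, 1]. In the last case b(i)/a(i) is between 0 and 1, so s(i) is in [-1, 0).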
For s(i) to be close to 1, a(i) has to be very small compared to b(i), i.e. a(i) <<< b(i). This happens when point (i) is very close to the other points in its assigned cluster. A large value of b(i) implies that the point is extremely far from its next closest cluster. Hence, s(i) close to 1 indicates that data point (i) is well matched to its cluster assignment.
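To make the definitions of a(i), b(i) and s(i) concrete, here is a minimal from-scratch sketch in plain Python. The function name `silhouette_values`, the toy points and the labels are made up for illustration. It assumes Euclidean distance, at least two clusters, and a score of 0 for a point that sits alone in its cluster.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def silhouette_values(points, labels):
    """Return s(i) for every point, using the a(i) and b(i) definitions above."""
    # Group point indices by their assigned cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Assumption: a point alone in its cluster gets a score of 0.
            scores.append(0.0)
            continue
        # a(i): mean distance to the other points in the assigned cluster (A).
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the points of any other cluster (B).
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical toy data: two tight, well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]
scores = silhouette_values(points, labels)

# Same points, but the third one is deliberately put in the wrong cluster.
bad_labels = [0, 0, 1, 1, 1, 1]
bad_scores = silhouette_values(points, bad_labels)
```

With the correct labels every point has a(i) much smaller than b(i), so every score is close to 1. In the second run the misassigned point is far from its own cluster and close to the other one, so its score turns negative.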
The above definition gives the Silhouette score for each data item. Although it’s nice for visual analysis, it doesn’t quickly tell us the S.A. score for the entire clustering.
Mean Silhouette score: The mean score is calculated by simply taking the mean of the silhouette scores of all the examples in the data set. This gives us one value representing the Silhouette score of the entire clustering.
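As a quick illustration, scikit-learn exposes both quantities: `silhouette_samples` returns one s(i) per example and `silhouette_score` returns their mean. The sketch below assumes scikit-learn and NumPy are installed, and the blob data is synthetic and made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

per_sample = silhouette_samples(X, labels)  # one s(i) per example
mean_score = silhouette_score(X, labels)    # mean of s(i) over all examples

print(f"mean silhouette score: {mean_score:.3f}")
```

The two agree: `mean_score` is the average of `per_sample`, up to floating-point rounding.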
Advantages of using S.A.: The biggest advantage of using the S.A. score for finding the best number of clusters is that you can use it on an unlabelled data set, which is usually the case when running k-means. Hence, I prefer it over other clustering scores like V-measure, Adjusted Rand Index, homogeneity, etc., which need ground-truth labels.
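Because no labels are needed, picking k can be as simple as trying a few values and keeping the one with the highest mean score. This is a rough sketch on synthetic data, assuming scikit-learn is installed. The candidate range 2 to 6 and the blob centers are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical unlabelled data: three well-separated blobs (true labels discarded).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the k with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On data this clean the mean score peaks at the true number of blobs. On real data the peak is usually less sharp, which is why the per-cluster plots are worth looking at as well.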
Left pic: depicts the sorted silhouette scores of the points in each cluster. The black region is the plot of the silhouette scores for examples belonging to cluster 0, whereas the green region is the plot of the silhouette scores for examples belonging to cluster 1.
The red dotted line is the mean silhouette score for the clustering under consideration. The value is roughly 0.7.
For this to be a good value for the number of clusters, one should consider the following points:
- Firstly, the mean value should be as close to 1 as possible.
- Secondly, the plot of each cluster should be above the mean value as much as possible. Any plot region below the mean value is not desirable.
- Lastly, the width of the plot should be as uniform as possible.
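The three checks above can also be read off numerically instead of from the plot. The helper below is a hypothetical sketch in plain Python: `check_silhouette_plot` is a made-up name, and the scores and labels are invented values, not output from a real clustering.

```python
from collections import defaultdict


def check_silhouette_plot(scores, labels):
    """Summarise the mean score, cluster sizes and share of points above the mean."""
    by_cluster = defaultdict(list)
    for s, label in zip(scores, labels):
        by_cluster[label].append(s)

    mean_score = sum(scores) / len(scores)
    report = {"mean": mean_score, "clusters": {}}
    for label, vals in sorted(by_cluster.items()):
        report["clusters"][label] = {
            # Cluster size corresponds to the width of its band in the plot.
            "size": len(vals),
            # Fraction of the cluster's points that score above the overall mean.
            "share_above_mean": sum(v > mean_score for v in vals) / len(vals),
        }
    return report


# Invented per-sample silhouette scores for two clusters of three points each.
scores = [0.8, 0.7, 0.9, 0.6, 0.2, 0.1]
labels = [0, 0, 0, 1, 1, 1]
report = check_silhouette_plot(scores, labels)
print(report)
```

Here the clusters are equal in size, but only one of the three points in cluster 1 scores above the mean of 0.55, which is the kind of pattern the second point warns about.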
Right pic is the visualization of the cluster assignment.
Going by the points above, the following S.A. score graph is not desirable, as a few of the clusters have samples scoring below the mean silhouette score. Also, the clusters are not uniform in width.
That’s all for this post. In my upcoming post I will go through a real-life example of how to use Silhouette analysis for selecting the number of clusters for k-means clustering.
a) Silhouette analysis: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
b) Silhouette clustering: https://en.wikipedia.org/wiki/Silhouette_(clustering)