In the post, I will give an introduction to k-means algorithm.
For first time readers, k-means is a clustering algorithm which is used to find similar clusters for a set of n elements. Number of clusters `k` required is provided an as input from the user.
K-means is an iterative algorithm where the task is to assign each sample to a cluster center such that the distance between the sample and the cluster center is minimized. For a given n samples, there could be many such cluster assignments. Hence, the k-means does not always lead to global minimum.
Given a set of n samples (X1, X2, X3 … Xn) where each observation is d-dimension real vector and k cluster the task is to partition the n samples in k sets S = ( S_1, S_2, S_3, … S_K) so as to minimize the sum of square of the distance between the sample and its cluster assignment.
inertia = 0 for i in K: for each x in S_i: inertia += distance (x - u_i) * (x - u_i) where u_i is the mean point of cluster S_i the task is to find the K Sets such that the value of inertia is minimized.
where μi is the mean of points in Si
One should note that k-means is a two set process:
a) Assignment step: where each sample is assigned to its closest cluster
b) Update step: Where the centroid of each cluster is updated by taking a mean of all the samples in that cluster.
Scikit-lean and k-means: Scikit-learn has a k-means implementation which one can use to perform clustering.
km = class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)
The main input parameters to play around with are:
- n_clusters = which defines the k numbers of clusters you want.
- n_init = Number of times k-means will run with different seeds. This is helpful since k-means can lead to local minimum.
- init: Initialization strategy. Few possible options are ‘k-means++’, ‘random’. K-means++ selects the k cluster is a smarter way to speed up convergence, whereas random selects at random k samples from the dataset.
- max_iter: Total number of times the k-means algorithm will run maximum number of times for a given run.
After convergence, the k-means algorithm returns the following parameters:
- labels_: labels of each input sample
- interia_: The final value which we were minimizing.
- cluster_centers_ [k, n_features]: An array consisting of the coordinates of the centroid for each cluster. The dimension of the centroid (features) is same as the dimension of the each of the input sample(d-dimension).
In the next post(part 2), I will discuss about various “Clustering evaluation techniques“. I will touch upon the following topics
- Adjusted Rand index
- Mutual Information based scores
- Homogeneity, completeness and V-measure
- Silhouette Coefficient
The last one Silhouette Coefficient is very helpful in deciding the best cluster assignment when you don’t have the true labels of the cluster. The other three depends upon the true labels of the cluster to be present. More of them in next post.