Learnings from Data Science and ML 1

MLE.train
2 min read · Jan 13, 2021


KMeans is one of the most widely used clustering algorithms. Here are two common questions about KMeans that I feel everyone should know.

Why does KMeans suffer from the curse of dimensionality?

The reason is that as the number of dimensions grows, pairwise Euclidean distances between data points concentrate: the distance to the nearest point and the distance to the farthest point both grow, and the gap between them becomes negligible relative to the distances themselves. In the limit, the maximum and minimum distances between any two points are effectively the same, so distance-based cluster assignments stop being meaningful.
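You can see this concentration effect with a quick experiment. Here is a minimal sketch (assuming NumPy and SciPy are available, with arbitrarily chosen sample sizes and dimensions) that samples random points and compares the smallest and largest pairwise distances as dimensionality grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Compare nearest and farthest pairwise distances as dimensionality grows
for d in [2, 10, 100, 1000, 10000]:
    points = rng.random((200, d))   # 200 uniform points in the unit hypercube
    dists = pdist(points)           # all unique pairwise Euclidean distances
    print(f"d={d:>5}  min={dists.min():7.3f}  max={dists.max():7.3f}  "
          f"max/min={dists.max() / dists.min():6.2f}")
```

In low dimensions the max/min ratio is large; as d grows it approaches 1, which is exactly the concentration described above.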

This is why we use PCA or other dimensionality reduction techniques to keep only the dimensions that contribute most to the variance, i.e. the directions in which the data is most spread out and which therefore carry useful information for clustering.
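As a rough sketch of this pipeline (using scikit-learn, with made-up data and cluster counts purely for illustration), you can reduce the data to the components explaining most of the variance before clustering:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic high-dimensional data: 4 clusters in 500 dimensions (assumed values)
X, _ = make_blobs(n_samples=1000, n_features=500, centers=4, random_state=0)

# Keep only the principal components that explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(f"reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")

# Cluster in the reduced space, where Euclidean distances are more meaningful
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
```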

How to choose a “good” K in KMeans?

The ideal K would be one for which the data points within each cluster are close together, while there are still enough clusters to separate the data effectively. One way to find a suitable value is to compute, for each candidate k, the average distance from the points to their assigned cluster centroid.

Plotting this average distance against k, we look for an elbow: the point where the average distance stops dropping sharply as k increases. That k is the most suitable one, because increasing k further won't really capture any additional clusters in the data; if there were points clustered far away from every existing centroid, adding a centroid for them would have reduced the average distance sharply.
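Here is a minimal sketch of the elbow plot (using scikit-learn's inertia, the within-cluster sum of squared distances, as a proxy for the average centroid distance described above; the synthetic data and range of k are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

# inertia_ = sum of squared distances of points to their nearest centroid
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.title("Look for the elbow where the curve flattens")
plt.show()
```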

Some of these concepts are taken from the Udacity Machine Learning Engineer Nanodegree.


