Tricks: K-means Clustering #Datascience
Question 1) What are the issues with random initialisation of centroids in the K-means algorithm, and how can they be overcome?
Initialisation of the centroids is one of the most important steps of the K-means algorithm. Often, a random selection of initial centroids does not lead to an optimal solution. To overcome this problem, the algorithm is run multiple times with different random initialisations, the sum of squared errors (SSE) is calculated for each run, and the set of centroids with the minimum SSE is selected. Even though this is a very simple method, it is not foolproof: the results of multiple random initialisations depend on the dataset and on the number of clusters selected, so this still does not guarantee an optimal output every time.
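A minimal NumPy sketch of this best-of-several-initialisations strategy (the function names and the toy data are illustrative, not from any standard API):

```python
import numpy as np

def kmeans(X, k, rng, n_iters=100):
    """One run of Lloyd's algorithm from a random centroid initialisation."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, sse

def best_of_n_inits(X, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the solution with the lowest SSE."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(n_init)), key=lambda r: r[2])
```

scikit-learn's `KMeans` exposes the same idea through its `n_init` parameter.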
Another method first selects the centroid of all the data points as the initial centroid. Then, for each successive initial centroid, the point farthest from the already selected centroids is chosen. This procedure ensures that the initial centroids are well separated. Its disadvantage is that finding the farthest point is computationally expensive; to avoid this cost, the initialisation can be carried out on a random subset of the dataset.
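The farthest-point procedure can be sketched as follows (the function name is hypothetical):

```python
import numpy as np

def farthest_point_init(X, k):
    """First centroid: the mean of all points. Each further centroid:
    the point farthest from every centroid chosen so far."""
    centroids = [X.mean(axis=0)]
    for _ in range(k - 1):
        # Distance from each point to its nearest already-chosen centroid.
        d = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                           axis=2).min(axis=1)
        centroids.append(X[d.argmax()])
    return np.array(centroids)
```

To keep the farthest-point search cheap on a large dataset, the same routine can be run on a random subset of the rows rather than the full array.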
Other ways of handling this are bisecting K-means (covered in a question below) and fixing any remaining issues in a post-processing step after clustering is done.
Question 2) How are outliers handled by the K-means algorithm?
Handling of outliers differs from case to case: sometimes they provide very useful information, and sometimes they severely affect the results of the analysis. With that said, let's look at some of the issues that outliers cause for the K-means algorithm.
In the presence of outliers, the centroids are not a true representation of their clusters, and the sum of squared errors (SSE) becomes very high. Small clusters may also merge with outliers, which does not reflect the natural cluster patterns in the data. For these reasons, outliers should be removed before clustering the data.
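One simple option for removing outliers before clustering is a z-score filter; the threshold of 3 below is a common rule of thumb, not a universal choice:

```python
import numpy as np

def drop_outliers(X, z_thresh=3.0):
    """Keep only the rows whose z-score is below z_thresh in every feature."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return X[(z < z_thresh).all(axis=1)]
```

More robust alternatives (e.g. filtering on distance to the nearest centroid after a trial clustering) follow the same pattern of dropping rows before the final K-means run.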
Question 3) What is the objective function for measuring the quality of clustering in case of the K-means algorithm with Euclidean distance?
The sum of squared errors (SSE) is used as the objective function for K-means clustering with Euclidean distance. The Euclidean distance is calculated from each data point to its nearest centroid; these distances are squared and summed over all points to obtain the SSE. The aim of the algorithm is to minimise the SSE. Note that the SSE is computed over all the clusters formed by the K-means algorithm.
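The objective is short enough to state directly in code (the function name is illustrative; scikit-learn reports the same quantity as `inertia_`):

```python
import numpy as np

def sse(X, centroids, labels):
    """Sum, over all points, of the squared Euclidean distance from the
    point to the centroid of the cluster it is assigned to."""
    centroids = np.asarray(centroids)
    return float(((X - centroids[labels]) ** 2).sum())
```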
Question 4) What is Bisecting K-means?
In Bisecting K-means, all the data objects are first split into two clusters. One of these clusters is then selected and divided into two, and this continues until a pre-decided number K of clusters is formed. The cluster chosen for splitting can depend on a number of factors, such as its size or its SSE, and this choice of splitting criterion decides the type of clusters formed. The centroids produced by Bisecting K-means are often used as the initial centroids for the standard K-means algorithm, which further refines the clusters. Because each K-means step is applied locally to the cluster being split rather than to the complete dataset, the final result is not necessarily a local minimum of the total SSE over the whole dataset. Bisecting K-means has fewer issues with initialisation because several trial splits are performed and the split with the lowest SSE is kept, and it only deals with two centroids at a time. Using Bisecting K-means as a base for K-means improves the performance of the latter.
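A sketch of the procedure in NumPy, assuming the largest-SSE cluster is split next (cluster size is another common criterion); all names here are illustrative:

```python
import numpy as np

def two_means(X, rng, n_trials=5):
    """Split X into two clusters, trying several random initialisations
    and keeping the trial split with the lowest SSE."""
    best = None
    for _ in range(n_trials):
        centroids = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(100):
            labels = np.linalg.norm(X[:, None] - centroids[None],
                                    axis=2).argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(2)])
            if np.allclose(new, centroids):
                break
            centroids = new
        sse = float(((X - centroids[labels]) ** 2).sum())
        if best is None or sse < best[1]:
            best = (labels, sse)
    return best[0]

def bisecting_kmeans(X, k, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    rng = np.random.default_rng(seed)
    clusters = [X]
    while len(clusters) < k:
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        labels = two_means(target, rng)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters
```

The centroids of the resulting clusters can then seed a final standard K-means pass for refinement.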
Question 5) Is K-means clustering suitable for all shapes and sizes of clusters?
K-means is not suitable for all shapes, sizes, and densities of clusters. If the natural clusters of a dataset deviate strongly from a spherical shape, K-means has great difficulty detecting them. It also struggles when the sizes and densities of the clusters differ by a large margin. This is mostly because the SSE objective favours compact, spherical clusters; it is not suited to clusters with non-spherical shapes or widely varying sizes and densities.
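A small worked example makes this concrete. For two elongated parallel clusters, a compact left/right cut has a lower SSE than the natural split into the two lines, so an SSE-minimising algorithm prefers the wrong partition (the data and function name below are constructed for illustration):

```python
import numpy as np

def split_sse(X, labels):
    """Total SSE of a partition of X given cluster labels."""
    return sum(float(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum())
               for j in np.unique(labels))

# Two elongated parallel clusters: 11 points along y = 0 and 11 along y = 2.
x = np.arange(11.0)
X = np.vstack([np.column_stack([x, np.zeros(11)]),
               np.column_stack([x, np.full(11, 2.0)])])

natural = np.r_[np.zeros(11, int), np.ones(11, int)]  # the two lines
vertical = (X[:, 0] >= 5).astype(int)                 # a left/right cut

# The natural split scores SSE = 220.0; the compact left/right cut
# scores SSE = 77.0, so minimising SSE favours the wrong partition.
```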
Best Regards,
Deepak Chaubey