Tricks: K-means Clustering #Datascience
Question 1) What are the issues with random initialisation of centroids in the K-means algorithm, and how can they be overcome?
Initialisation of the centroids is one of the most important steps of the K-means algorithm. Often, a random selection of initial centroids does not lead to an optimal solution. To overcome this problem, the algorithm is run multiple times with different random initialisations, the sum of squared errors (SSE) is calculated for each run, and the set of centroids with the minimum SSE is selected. Even though this is a very simple method, it is not foolproof: the results of multiple random initialisations depend on the dataset and on the number of clusters selected, so this still does not guarantee an optimal output every time.
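A minimal NumPy sketch of this best-of-several-initialisations strategy (the function names and the toy data are illustrative, not from any standard API):

```python
import numpy as np

def kmeans(X, k, rng, n_iters=100):
    """One run of Lloyd's algorithm from a random centroid initialisation."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, sse

def best_of_n_inits(X, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the solution with the lowest SSE."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(n_init)), key=lambda r: r[2])
```

scikit-learn's `KMeans` exposes the same idea through its `n_init` parameter.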
Another method first selects the centroid of all the data points as the initial centroid. Then, for each successive initial centroid, the point farthest from the already selected centroids is chosen. This procedure ensures that the initial centroids are well separated. Its disadvantage is that finding the farthest point is computationally expensive; to avoid this cost, the initialisation can be carried out on a random subset of the dataset.
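The farthest-point procedure can be sketched as follows (the function name is hypothetical):

```python
import numpy as np

def farthest_point_init(X, k):
    """First centroid: the mean of all points. Each further centroid:
    the point farthest from every centroid chosen so far."""
    centroids = [X.mean(axis=0)]
    for _ in range(k - 1):
        # Distance from each point to its nearest already-chosen centroid.
        d = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                           axis=2).min(axis=1)
        centroids.append(X[d.argmax()])
    return np.array(centroids)
```

To keep the farthest-point search cheap on a large dataset, the same routine can be run on a random subset of the rows rather than the full array.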
Other ways of handling this are bisecting K-means (covered in a question below) and fixing any remaining issues in a post-processing step after clustering is done.
Question 2) How are outliers handled by the K-means algorithm?
Handling of outliers differs from case to case: sometimes they provide very useful information, and sometimes they severely affect the results of the analysis. With that said, let's look at some of the issues that outliers cause for the K-means algorithm.
In the presence of outliers, the centroids are not a true representation of their clusters, and the sum of squared errors (SSE) becomes very high. Small clusters may also merge with outliers, which does not reflect the natural cluster patterns in the data. For these reasons, outliers should be removed before clustering the data.
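One simple option for removing outliers before clustering is a z-score filter; the threshold of 3 below is a common rule of thumb, not a universal choice:

```python
import numpy as np

def drop_outliers(X, z_thresh=3.0):
    """Keep only the rows whose z-score is below z_thresh in every feature."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return X[(z < z_thresh).all(axis=1)]
```

More robust alternatives (e.g. filtering on distance to the nearest centroid after a trial clustering) follow the same pattern of dropping rows before the final K-means run.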
Question 3) What is the objective function for measuring the quality of clustering in case of the K-means algorithm with Euclidean distance?
The sum of squared errors (SSE) is used as the objective function for K-means clustering with Euclidean distance. The Euclidean distance is calculated from each data point to its nearest centroid; these distances are squared and summed over all points to obtain the SSE. The aim of the algorithm is to minimise the SSE. Note that the SSE is computed over all the clusters formed by the K-means algorithm.
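The objective is short enough to state directly in code (the function name is illustrative; scikit-learn reports the same quantity as `inertia_`):

```python
import numpy as np

def sse(X, centroids, labels):
    """Sum, over all points, of the squared Euclidean distance from the
    point to the centroid of the cluster it is assigned to."""
    centroids = np.asarray(centroids)
    return float(((X - centroids[labels]) ** 2).sum())
```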
Question 4) What is Bisecting K-means?
In Bisecting K-means, all the data objects are first split into two clusters. One of these clusters is then selected and divided into two, and this continues until a pre-decided number K of clusters is formed. The cluster chosen for splitting can depend on a number of factors, such as its size or its SSE, and this choice of splitting criterion decides the type of clusters formed. The centroids produced by Bisecting K-means are often used as the initial centroids for the standard K-means algorithm, which further refines the clusters. Because each K-means step is applied locally to the cluster being split rather than to the complete dataset, the final result is not necessarily a local minimum of the total SSE over the whole dataset. Bisecting K-means has fewer issues with initialisation because several trial splits are performed and the split with the lowest SSE is kept, and it only deals with two centroids at a time. Using Bisecting K-means as a base for K-means improves the performance of the latter.
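A sketch of the procedure in NumPy, assuming the largest-SSE cluster is split next (cluster size is another common criterion); all names here are illustrative:

```python
import numpy as np

def two_means(X, rng, n_trials=5):
    """Split X into two clusters, trying several random initialisations
    and keeping the trial split with the lowest SSE."""
    best = None
    for _ in range(n_trials):
        centroids = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(100):
            labels = np.linalg.norm(X[:, None] - centroids[None],
                                    axis=2).argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(2)])
            if np.allclose(new, centroids):
                break
            centroids = new
        sse = float(((X - centroids[labels]) ** 2).sum())
        if best is None or sse < best[1]:
            best = (labels, sse)
    return best[0]

def bisecting_kmeans(X, k, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    rng = np.random.default_rng(seed)
    clusters = [X]
    while len(clusters) < k:
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        labels = two_means(target, rng)
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters
```

The centroids of the resulting clusters can then seed a final standard K-means pass for refinement.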
Question 5) Is K-means clustering suitable for all shapes and sizes of clusters?
K-means is not suitable for all shapes, sizes, and densities of clusters. If the natural clusters of a dataset deviate strongly from a spherical shape, K-means has great difficulty detecting them. It also struggles when the sizes and densities of the clusters differ by a large margin. This is mostly because the SSE objective favours compact, spherical clusters; it is not suited to clusters with non-spherical shapes or widely varying sizes and densities.
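A small worked example makes this concrete. For two elongated parallel clusters, a compact left/right cut has a lower SSE than the natural split into the two lines, so an SSE-minimising algorithm prefers the wrong partition (the data and function name below are constructed for illustration):

```python
import numpy as np

def split_sse(X, labels):
    """Total SSE of a partition of X given cluster labels."""
    return sum(float(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum())
               for j in np.unique(labels))

# Two elongated parallel clusters: 11 points along y = 0 and 11 along y = 2.
x = np.arange(11.0)
X = np.vstack([np.column_stack([x, np.zeros(11)]),
               np.column_stack([x, np.full(11, 2.0)])])

natural = np.r_[np.zeros(11, int), np.ones(11, int)]  # the two lines
vertical = (X[:, 0] >= 5).astype(int)                 # a left/right cut

# The natural split scores SSE = 220.0; the compact left/right cut
# scores SSE = 77.0, so minimising SSE favours the wrong partition.
```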
Best Regards,
Deepak Chaubey