K-Means Clustering Algorithm
What is clustering?
Clustering is the process of grouping objects so that those in the same group (cluster) are more similar to each other than to objects in other groups.
Why would we want to cluster?
Clustering is a core task in data analysis and data mining. Because it groups objects without needing predefined labels, it lets you explore and summarize a data set, revealing structure that would otherwise stay hidden.
How would you determine clusters?
Probably the best-known method is the elbow method: the within-cluster sum of squares is computed for each candidate number of clusters and plotted, and the user looks for the point where the slope changes from steep to shallow (the "elbow") to choose the optimal number of clusters.
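As a quick sketch of the elbow method, the loop below computes the within-cluster sum of squares (which scikit-learn exposes as `inertia_`) for a range of k values. The data set, seed, and k range here are all illustrative, not taken from the article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two well-separated blobs (not the article's data set)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Within-cluster sum of squares (sklearn calls it inertia) for k = 1..10
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Plotting inertias against k shows a sharp drop up to the "elbow"
# (here around k = 2), then only a shallow decline afterwards.
print(inertias)
```

Plotting `inertias` against `range(1, 11)` gives the elbow curve described above.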
How the K-means algorithm works
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which serve as the starting points for every cluster, and then performs iterative (repeated) calculations to optimize the positions of the centroids.
It stops creating and optimizing clusters when either:
- The centroids have stabilized: their values no longer change between iterations, so the clustering has converged.
- The defined number of iterations has been reached.
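The loop and both stopping conditions above can be seen in a minimal NumPy sketch (an illustration of the idea only, not scikit-learn's actual implementation; the function name and defaults are made up):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd-style K-means loop, for illustration only."""
    rng = np.random.default_rng(seed)
    # Start from k randomly selected data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):  # cap on the number of iterations
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop early once the centroids have stabilized
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In practice you would use scikit-learn's `KMeans`, which adds smarter initialization and handles edge cases such as empty clusters.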
Wikipedia says!
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both k-means and Gaussian mixture modeling. They both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the Gaussian mixture model allows clusters to have different shapes.
Uses!
K-means is used to find groups that have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:
- Behavioral segmentation:
  - Segment by purchase history
  - Segment by activities on application, website, or platform
  - Define personas based on interests
  - Create profiles based on activity monitoring
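Once a model has been fitted to segmentation data like this, new records are assigned to the nearest centroid with `predict`. A minimal sketch on made-up two-feature "customer" data (the features and values are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-feature customer data forming two groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# A new, unseen record is assigned to the cluster with the nearest centroid
new_customer = np.array([[5.5, 6.2]])
print(kmeans.predict(new_customer))
```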
The Most Awaited Part: The K-means Algorithm in Code
Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline
Step 2: Generate random data
We will generate random numbers (data) with the help of NumPy in Python.

X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()
This produces a scatter plot of the generated data, similar to the sample output shown here.
Step 3: Use Scikit-Learn and find the centroid
from sklearn.cluster import KMeans
Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
Kmean.cluster_centers_
Step 4: Display
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
# Centroid coordinates taken from Kmean.cluster_centers_ (they will differ per run)
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()

Here is some sample output.
Here is the full code for future reference

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()

Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
Kmean.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
# Centroid coordinates taken from Kmean.cluster_centers_ (they will differ per run)
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()