K-Means Clustering Algorithm

What is clustering?

Clustering is the process of grouping objects so that those in the same group (cluster) are more similar to each other than to objects in other groups.

Why would we want to cluster?

Clustering is a core task in data analysis and data mining. Because it groups objects without needing labels, it lets us discover structure in data that has not been annotated, which is often the first step in exploring a new data set.

How would you determine clusters?

Probably the best-known method is the elbow method: the within-cluster sum of squares is computed for each candidate number of clusters and graphed, and the user looks for the point where the slope changes from steep to shallow (the "elbow") to pick the optimal number of clusters.
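
The elbow method described above can be sketched in a few lines with scikit-learn. This is a minimal illustration, assuming toy data generated with make_blobs; KMeans exposes the within-cluster sum of squares as its inertia_ attribute:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 true clusters (make_blobs is just for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-means for k = 1..8 and record the within-cluster sum of squares
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia always shrinks as k grows; the "elbow" is where the drop flattens out
for k, inertia in zip(range(1, 9), inertias):
    print(k, round(inertia, 1))
```

Plotting k against inertia makes the elbow easy to spot visually.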

How the K-means algorithm works

To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which serve as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.

It halts creating and optimizing clusters when either:

  • The centroids have stabilized — there is no change in their values because the clustering has been successful.
  • The defined number of iterations has been achieved.
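
The loop described above (assign each point to its nearest centroid, then move every centroid to the mean of its points) can be written directly in NumPy. This is a minimal from-scratch sketch for illustration only; the function name and signature are my own, not from any library, and the sketch assumes no cluster ever ends up empty:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch: random init, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Start from k randomly selected data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop early once the centroids have stabilized
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

The early exit implements the first stopping condition (stabilized centroids), while max_iters implements the second (a fixed iteration budget).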

Wikipedia says!

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (called the cluster center or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both k-means and Gaussian mixture modeling. They both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the Gaussian mixture model allows clusters to have different shapes.

Uses!

The k-means is used to find groups that have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
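
As a quick illustration of that last point, scikit-learn's predict assigns new observations to the nearest learned centroid. The data here is made up to mirror the two-group example used later in this article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups: one around (-1, -1), one around (2, 2)
X = np.vstack([-2 * np.random.rand(50, 2), 1 + 2 * np.random.rand(50, 2)])
kmean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# New, previously unseen points are assigned to the nearest cluster
new_points = np.array([[-1.0, -1.0], [2.0, 2.0]])
labels = kmean.predict(new_points)
print(labels)  # one cluster index per point; these two land in different clusters
```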

This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:

  • Behavioral segmentation:
      • Segment by purchase history
      • Segment by activities on an application, website, or platform
      • Define personas based on interests
      • Create profiles based on activity monitoring


The Long-Awaited Algorithm: K-means

Step 1: Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

Step 2: Generate random data

We will generate random numbers (data) with the help of NumPy in Python.

X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()

This will be a sample output


Step 3: Use Scikit-Learn and find the centroid

from sklearn.cluster import KMeans

Kmean = KMeans(n_clusters=2)
Kmean.fit(X)

Kmean.cluster_centers_
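
For reference, cluster_centers_ is a (2, 2) array with one row of coordinates per centroid, and the per-point assignments are available in labels_. A small self-contained sanity check, regenerating the same kind of random data as in Step 2 (random_state is added here only for reproducibility):

```python
import numpy as np
from sklearn.cluster import KMeans

# Same random data as in Step 2
X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1

kmean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmean.cluster_centers_)   # (2, 2) array: one row of coordinates per centroid
print(kmean.labels_[:5])        # cluster index (0 or 1) for the first five points
```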



Step 4: Display

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
# Centroid coordinates taken from the Kmean.cluster_centers_ output above
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()

Here is some sample output:


Here is the full code for future reference:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()

Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
Kmean.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
# Centroid coordinates taken from the Kmean.cluster_centers_ output above
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()

References

Google Colab Test Execution Link


