Clustering a Large - Very Large Database

Alok Kumar

Published May 1, 2016

Algorithm vs Memory

Determining how to do the clustering is not straightforward - Hierarchical, Partitional, Categorical or Large DB - which one? I will talk about little bit on Large DB here.

Clustering is not only about grouping things in some fashion somehow. It is much beyond that. It is much beyond the oft-quoted example of grouping the 3 species of Iris flowers based on petal-sepal-length-width dataset introduced by Prof R A Fisher.When clustering is applied to a real-world database ( not the one cooked in a lab), we practitioners face many challenges. These potential representative challenges are related to - handling outliers, handling dynamic data, selecting input data, no one correct answer to a clustering problem, semantic meaning of each cluster among other problems. Unlike classification there does not exist any pre-assigned label to things / objects/ persons we want to cluster. Therefore, when the clustering is done, and a set of clusters are created - we do not know the exact meaning of each cluster. We need the help of Domain Expert to assign a semantic meaning to each cluster.

Most people do things in a lab in a controlled environment. Only wearer knows where the shoes pinches. Those who do not work with the customer simply do not have the idea about the size of the underlying data and complexity involved in the real customers situations.

In real-world, a Data Scientist has to deal with much more complexity of the data. Think of a telecommunication company where a customer today may belong to one group may belong to some other group tomorrow or some future point of time. And then, the customers of a service providers are making-receiving several calls, sending-receiving texts, using VAS etc, they belong to one age group today and the other tomorrow, move from one income group to the other, get married and divorced. Some attributes change slowly, the others change very fast - some attributes change less frequently and others change very frequently so on and so forth. We can also think about the customers of a Bank - swiping cards at the PoS, ATM etc. Then holding other type asset and liability accounts in the Bank

There are many questions that process inside the mind of a Data Scientist which automation may not be capable of handling. There are many classical methods for Clustering that work better for categorical variables and then there are other classical methods that work better for numerical variables. The clustering of a big database throws up many challenges which may include (not limited to) - time complexity, space complexity and dynamic nature of the data (before they are aggregated at the level of the object of clustering). Then there are problems of OUTLIERS - some outlier detection technique are based on statistical techniques, which may require one to perform DISCORDANCY TEST and many other steps before you may select one approach over the other. Now these tests are not realistic for real-world data as they may not follow well-defined data distributions. So, some clustering techniques do not perform well in the presence of outliers. And then there may be large amount of objects that one may like to cluster ( not as in the case of biological data where one processes tiny amount of data). In such cases increasing the SPACE will not solve the problem surrounding the clustering a large database with dynamic data. You will have to have some other non-traditional clustering approach. That actually necessitates for building new algorithms as opposed to adding more space to the disk - yeah, I know space is coming cheap. The SOS is, however, we do have algorithms like DBSCAN and BIRCH for taking care of the above challenges.

Exercise - Here is teaser for you. We have 9 balls and 4 bags, we want to place ball in such a way that each bag comprises odd numbers of balls. Clue - Hierarchical Clustering. You may like to leave the answer in the comment box.

Alok Kumar 6y

Ramesh Baskaran Seema Kumar Ramesh kumar (SPSM)

Tim Daciuk 10y

Good comment Markus. I was not thinking that way, but it would work. My view was to put 3 balls in bag 1, 3 balls in bag 2, 2 balls in bag 3, one ball into bag 4, and the put bag 4 into bag 3. I think that it is the same general result, each bag has an odd number of balls. I guess the question now would become how many different permutations could one create for this exercise? You have one, I have one, that each meet the criteria. How many others are there with a 9 ball, 4 bag problem?

2 Reactions

Tyler Hajzus 10y

I enjoyed this article. Difficult teaser, but as long as we're putting bags in bags...put 3 bags into 1 and then place all 9 balls into our "multi-bag" single cluster - top down hierarchical clustering.

1 Reaction

Tim Daciuk 10y

Good article Alok. Couple of possible answers to the exercise. The 4 bags could contain 3-3-3-0 balls. If every bag must have a ball, then 3-3-2-1 and put the bang with 1 ball into the bag with 2 balls.

Clustering a Large - Very Large Database

Alok Kumar

More articles by Alok Kumar

Others also viewed

Enabling Agentic Data Science: The Model Context Protocol

Demystifying Data Storage: A Dive into Vector Databases

Data Science Bridges the Gap Between Business Challenges and Competitive Solutions

5 Tips for Better Analytics

The data sourcing pipeline is the real place of the (data) scientist

Why Data Modelling Is Becoming More About Meaning Than Structure

10 Interesting Data Science Facts You Need to Know

Data Science / Data Artist - who do we need?

Liquid Clustering with Databricks Delta Lake

The Case for Prescriptive Data Science

Explore content categories

More articles by Alok Kumar

Impact Analysis of Black Spot Rectification

Tracking Rumour Mongers in Covid-19 Times

SPSS Data Science in Insurance Verticals

Predictive Analytics using IBM SPSS Modeler in a Steel Manufacturing Company

Analytical Framework for Emergency Management

Big Data - Click & Mortar

SPSS Analytic Server

Automation of Predictive Analytics - A False Promise

I love IBM

SparkR DataFrames