Clustering a Large - Very Large Database

Algorithm vs Memory


Determining how to do the clustering is not straightforward: Hierarchical, Partitional, Categorical, or Large-Database methods - which one should you pick? I will talk a little about clustering Large Databases here.


Clustering is not merely about grouping things together in some fashion. It goes well beyond that - well beyond the oft-quoted example of grouping the three species of Iris flowers based on the petal/sepal length-width dataset introduced by Prof. R. A. Fisher. When clustering is applied to a real-world database (not one cooked up in a lab), we practitioners face many challenges. Representative challenges include handling outliers, handling dynamic data, selecting the input data, the absence of a single correct answer to a clustering problem, and determining the semantic meaning of each cluster, among other problems. Unlike classification, there are no pre-assigned labels for the things/objects/persons we want to cluster. Therefore, when the clustering is done and a set of clusters is created, we do not know the exact meaning of each cluster. We need the help of a domain expert to assign a semantic meaning to each one.
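To make this point concrete, here is a minimal sketch of clustering Fisher's Iris data - assuming scikit-learn as the tooling, which is my choice for illustration, not something prescribed above. Note what the algorithm actually returns: anonymous cluster ids, not species names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Petal/sepal length-width measurements; the species labels are deliberately ignored,
# because in a clustering setting no pre-assigned labels exist.
X = load_iris().data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The output is just the ids 0, 1, 2 - nothing here says "setosa" or "versicolor".
# Mapping each cluster id to a meaningful species name is the domain expert's job.
print(set(kmeans.labels_))
```

Even on this textbook dataset the semantic-meaning problem shows up immediately: the ids are arbitrary and can change between runs, so any downstream use of the clusters must go through that expert-assigned mapping.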

Most people do things in a lab, in a controlled environment. But only the wearer knows where the shoe pinches. Those who do not work with customers simply have no idea of the size of the underlying data and the complexity involved in real customer situations.


In the real world, a Data Scientist has to deal with much more complexity in the data. Think of a telecommunications company, where a customer who belongs to one group today may belong to some other group tomorrow or at some future point in time. The customers of a service provider are making and receiving several calls, sending and receiving texts, using value-added services, and so on; they belong to one age group today and another tomorrow, move from one income group to another, get married and divorced. Some attributes change slowly and others very fast; some change less frequently and others very frequently, and so on. We can also think about the customers of a bank: swiping cards at the PoS, withdrawing at the ATM, and holding various other asset and liability accounts with the bank.


There are many questions that run through the mind of a Data Scientist which automation may not be capable of handling. There are classical clustering methods that work better for categorical variables, and others that work better for numerical variables. Clustering a big database throws up many challenges, which include (but are not limited to) time complexity, space complexity, and the dynamic nature of the data (before it is aggregated at the level of the object of clustering). Then there is the problem of OUTLIERS: some outlier-detection techniques are based on statistical methods, which may require you to perform a DISCORDANCY TEST and many other steps before you can select one approach over another. These tests are often unrealistic for real-world data, which may not follow well-defined distributions. So some clustering techniques simply do not perform well in the presence of outliers. And then there may be a very large number of objects to cluster (unlike biological data, where one typically processes a tiny amount of data). In such cases, increasing the SPACE will not solve the problems surrounding clustering a large database with dynamic data. You will need some other, non-traditional clustering approach. That actually necessitates building new algorithms, as opposed to adding more space to the disk - yes, I know storage is coming cheap. The saving grace, however, is that we do have algorithms like DBSCAN and BIRCH to take care of the above challenges.
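As a small illustration of how DBSCAN sidesteps the outlier problem described above, here is a sketch on synthetic stand-in data (the blob positions, `eps`, and `min_samples` values are my assumptions for illustration, not recommendations). Density-based clustering does not force every point into a cluster; points that sit in no dense region are labelled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two dense blobs plus a handful of scattered points standing in for outliers.
blob1 = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
blob2 = rng.normal(loc=5.0, scale=0.3, size=(100, 2))
stray = rng.uniform(low=-3.0, high=8.0, size=(5, 2))
X = np.vstack([blob1, blob2, stray])

# eps = neighbourhood radius, min_samples = density threshold for a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points DBSCAN cannot attach to any dense region get the label -1 (noise),
# so outliers are isolated rather than dragging cluster boundaries around -
# exactly the failure mode that centroid-based methods suffer from.
print(sorted(set(labels)))
```

No discordancy test or distributional assumption is needed here; the density threshold itself decides what counts as noise, which is why DBSCAN holds up better on messy real-world data.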


Exercise - Here is a teaser for you. We have 9 balls and 4 bags, and we want to place the balls in such a way that each bag contains an odd number of balls. Clue: Hierarchical Clustering. You may like to leave your answer in the comment box.


Good comment Markus. I was not thinking that way, but it would work. My view was to put 3 balls in bag 1, 3 balls in bag 2, 2 balls in bag 3, one ball into bag 4, and then put bag 4 into bag 3. I think it is the same general result: each bag has an odd number of balls. I guess the question now becomes how many different permutations one could create for this exercise. You have one, I have one, and each meets the criteria. How many others are there with a 9-ball, 4-bag problem?
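The counting question above has a neat twist that a quick brute-force check makes obvious: with flat placements (no bags inside bags), there are no solutions at all, because four odd numbers always sum to an even number, and 9 is odd. Here is a short enumeration sketch confirming that parity argument; the nesting trick in the answers above is therefore not just one option, it is required.

```python
from itertools import product

# Enumerate every flat placement of 9 balls into 4 distinct bags (no nesting)
# and count the placements where all four bags hold an odd number of balls.
count = 0
for a, b, c in product(range(10), repeat=3):
    d = 9 - a - b - c          # the fourth bag gets whatever remains
    if d < 0:
        continue
    if all(n % 2 == 1 for n in (a, b, c, d)):
        count += 1

print(count)  # 0 - four odd counts sum to an even number, never to 9
```

So every valid answer must put at least one bag inside another - which is precisely the nested, cluster-within-cluster structure the "Hierarchical Clustering" clue was pointing at.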

I enjoyed this article. Difficult teaser, but as long as we're putting bags in bags... put 3 bags into 1 and then place all 9 balls into our "multi-bag" single cluster - top-down hierarchical clustering.

Good article Alok. A couple of possible answers to the exercise. The 4 bags could contain 3-3-3-0 balls. If every bag must have a ball, then 3-3-2-1, and put the bag with 1 ball into the bag with 2 balls.
