Clustering a Large - Very Large Database
Algorithm vs Memory
Determining how to do the clustering is not straightforward - Hierarchical, Partitional, Categorical or Large DB - which one? I will talk about little bit on Large DB here.
Clustering is not only about grouping things in some fashion somehow. It is much beyond that. It is much beyond the oft-quoted example of grouping the 3 species of Iris flowers based on petal-sepal-length-width dataset introduced by Prof R A Fisher.When clustering is applied to a real-world database ( not the one cooked in a lab), we practitioners face many challenges. These potential representative challenges are related to - handling outliers, handling dynamic data, selecting input data, no one correct answer to a clustering problem, semantic meaning of each cluster among other problems. Unlike classification there does not exist any pre-assigned label to things / objects/ persons we want to cluster. Therefore, when the clustering is done, and a set of clusters are created - we do not know the exact meaning of each cluster. We need the help of Domain Expert to assign a semantic meaning to each cluster.
Most people do things in a lab in a controlled environment. Only wearer knows where the shoes pinches. Those who do not work with the customer simply do not have the idea about the size of the underlying data and complexity involved in the real customers situations.
In real-world, a Data Scientist has to deal with much more complexity of the data. Think of a telecommunication company where a customer today may belong to one group may belong to some other group tomorrow or some future point of time. And then, the customers of a service providers are making-receiving several calls, sending-receiving texts, using VAS etc, they belong to one age group today and the other tomorrow, move from one income group to the other, get married and divorced. Some attributes change slowly, the others change very fast - some attributes change less frequently and others change very frequently so on and so forth. We can also think about the customers of a Bank - swiping cards at the PoS, ATM etc. Then holding other type asset and liability accounts in the Bank
There are many questions that process inside the mind of a Data Scientist which automation may not be capable of handling. There are many classical methods for Clustering that work better for categorical variables and then there are other classical methods that work better for numerical variables. The clustering of a big database throws up many challenges which may include (not limited to) - time complexity, space complexity and dynamic nature of the data (before they are aggregated at the level of the object of clustering). Then there are problems of OUTLIERS - some outlier detection technique are based on statistical techniques, which may require one to perform DISCORDANCY TEST and many other steps before you may select one approach over the other. Now these tests are not realistic for real-world data as they may not follow well-defined data distributions. So, some clustering techniques do not perform well in the presence of outliers. And then there may be large amount of objects that one may like to cluster ( not as in the case of biological data where one processes tiny amount of data). In such cases increasing the SPACE will not solve the problem surrounding the clustering a large database with dynamic data. You will have to have some other non-traditional clustering approach. That actually necessitates for building new algorithms as opposed to adding more space to the disk - yeah, I know space is coming cheap. The SOS is, however, we do have algorithms like DBSCAN and BIRCH for taking care of the above challenges.
Exercise - Here is teaser for you. We have 9 balls and 4 bags, we want to place ball in such a way that each bag comprises odd numbers of balls. Clue - Hierarchical Clustering. You may like to leave the answer in the comment box.
Ramesh Baskaran Seema Kumar Ramesh kumar (SPSM)
Good comment Markus. I was not thinking that way, but it would work. My view was to put 3 balls in bag 1, 3 balls in bag 2, 2 balls in bag 3, one ball into bag 4, and the put bag 4 into bag 3. I think that it is the same general result, each bag has an odd number of balls. I guess the question now would become how many different permutations could one create for this exercise? You have one, I have one, that each meet the criteria. How many others are there with a 9 ball, 4 bag problem?
I enjoyed this article. Difficult teaser, but as long as we're putting bags in bags...put 3 bags into 1 and then place all 9 balls into our "multi-bag" single cluster - top down hierarchical clustering.
Good article Alok. Couple of possible answers to the exercise. The 4 bags could contain 3-3-3-0 balls. If every bag must have a ball, then 3-3-2-1 and put the bang with 1 ball into the bag with 2 balls.