Text Mining: Think Beyond the Word-Cloud
A few weeks back, I was having a chat with a client for whom I was building a text mining application. He was mostly satisfied with how the work was going when he asked what sort of visualization I was planning to deploy, and I said ‘wordcloud’. Contrary to what I expected, he did not seem to like the idea. What he said diplomatically and in business language would read in plain English as ‘wordclouds are pretty obvious and commonplace, can we do better?’
It is true that when it comes to text (document) visualization, our options are very limited. On the other hand, we have some very powerful algorithms for mining text. Today, we have methods that can extract important, meaningful information from hundreds or thousands of documents. We are no longer limited to removing stop words, stemming, lemmatization and wrapping words. Given a document, we can train a model to generate a summary, we can find mutually contradictory sentences or paragraphs, and there are other widely known algorithms that we can deploy to get a lot of information. Two toolkits worth knowing are the NLTK package for Python (http://www.nltk.org/) and the NLP tools from Stanford (http://nlp.stanford.edu/). Both are really cool and can do a lot for effective text mining. The Stanford tools are particularly worth mentioning for their ‘Named Entity Recognition (NER)’ and dependency parsing features. To try them out, go to http://nlp.stanford.edu:8080/corenlp/process, enter a few sentences and click submit to see their power.
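For context, the word-level processing behind a typical wordcloud can be sketched in a few lines of plain Python. This is only a rough illustration: the function name word_frequencies and the tiny stop-word list are my own assumptions, and a real project would more likely lean on a library such as NLTK for tokenizing, stop words and stemming.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "we", "can"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, drop stop words, count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


if __name__ == "__main__":
    sample = "We can mine the text, and the text can tell us a lot."
    # The counts are what a wordcloud maps to font size.
    print(word_frequencies(sample).most_common(3))
```

Everything a plain wordcloud shows comes from counts like these, which is why it says nothing about how the words relate to each other.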
But, as I was saying, when it comes to visualization, we do not have anything that people are ready to pay for. Clients are hard to convince. Some efforts have been made to visualize parts of a document intelligently by mapping subheadings. Others have gone into mapping the adjacency, proximity, rarity and semantics of the words. These are still wordclouds, only slightly modified, but they have been largely successful. Clients not only liked the visuals but also found them more meaningful and action oriented.
As a next step, when we need more compactness and low diffusion, we can try something which I call ‘meaning based visualization’. The idea is simple and quite useful. Instead of chopping the document at word level, we parse it at sentence level, say, after every full stop. Next, we build a predictive model to estimate which sentences have similar, opposite or complementary meanings. We can ‘tie’ them together based on these attributes, or add further processing such as replacing certain words with synonyms to make the result more compact. This is certainly more intelligent than a usual wordcloud, but it is not as simple to build. As far as I know, there is currently no package or module that does this, so coding everything from scratch can be quite daunting for people who are not seasoned programmers.
Or maybe we can wait until somebody comes up with an open source package.
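In the meantime, here is a minimal sketch of the sentence-level idea, using only the Python standard library. It splits the text after every full stop and ‘ties’ together sentences whose wording overlaps. Please read it with some caveats: the function names, the 0.5 threshold and the bag-of-words cosine similarity are all my own assumptions, and the cosine score is only a crude stand-in for the predictive model described above. It catches similar wording only, not opposite or complementary meanings, and the naive split will trip over abbreviations such as ‘e.g.’.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Split the text after every full stop and drop empty pieces."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Count the lowercase words in one sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tie_sentences(text, threshold=0.5):
    """Return (sentences, pairs); pairs holds (i, j, score) for similar sentences."""
    sentences = split_sentences(text)
    bags = [bag_of_words(s) for s in sentences]
    pairs = []
    for i in range(len(bags)):
        for j in range(i + 1, len(bags)):
            score = cosine_similarity(bags[i], bags[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return sentences, pairs


if __name__ == "__main__":
    text = ("The delivery was late. The delivery was very late. "
            "I love the colour.")
    sentences, pairs = tie_sentences(text)
    for i, j, score in pairs:
        print(f"{sentences[i]!r} <-> {sentences[j]!r} (score {score:.2f})")
```

The pairs that come out could then be drawn as links between sentences instead of as a cloud of loose words. Swapping the cosine score for a trained model is where the real work would be.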
PS: The word-cloud image used in this article was built at http://www.wordle.net/create from the text of this article.