Text Mining: Think Beyond the Word Cloud

A few weeks back, I was chatting with a client for whom I was building a text mining application. He was mostly satisfied with how the work was going, until he asked what sort of visualization I was planning to deploy and I said ‘word cloud’. Contrary to what I expected, he did not seem to like the idea. What he said diplomatically, in business language, translates into plain English as ‘word clouds are pretty obvious and commonplace, can we do better?’

It is true that when it comes to text (document) visualization, our options are very limited. On the other hand, we have some very powerful algorithms for mining text. Today we have methods that can extract important, meaningful information from hundreds or thousands of documents. We are no longer limited to removing stop words, stemming, lemmatization and wrapping words. Given a document, we can train a model to generate a summary, we can find mutually contradictory sentences or paragraphs, and there are other widely known algorithms that we can deploy to extract a lot of information. Two examples are the NLTK package for Python (http://www.nltk.org/) and the NLP software from Stanford (http://nlp.stanford.edu/). Both are really cool and can do a lot for effective text mining. The Stanford software is particularly worth mentioning for its ‘Named Entity Recognition (NER)’ and dependency parsing features. To try them out, go to http://nlp.stanford.edu:8080/corenlp/process, enter a few sentences and click submit to see its power.
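For reference, the word-level preprocessing mentioned above, which is also roughly what feeds a typical word cloud, can be sketched in a few lines. This is plain Python with a toy stop-word list and a deliberately crude suffix stripper; both are assumptions made for illustration, and the function names are hypothetical. NLTK ships proper stop-word lists, stemmers and lemmatizers that would replace them in real work.

```python
import re
from collections import Counter

# Toy stop-word list (an assumption for illustration; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "was", "of", "to", "in", "it"}


def crude_stem(word):
    """Strip a few common English suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_frequencies(text):
    """Count stemmed, non-stop words: the raw material of a word cloud."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)


if __name__ == "__main__":
    sample = "The clients liked the charts and the client liked mining"
    for word, count in word_frequencies(sample).most_common():
        print(word, count)
```

Everything here works on isolated words, which is exactly why the resulting picture carries so little meaning on its own.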

But, as I was saying, when it comes to visualization, we do not have something that people are ready to pay for. Clients are hard to convince. Some efforts have been made to visualize parts of a document intelligently by mapping subheadings. Other efforts have gone into mapping the adjacency, proximity, rarity and semantics of words. These are still word clouds, only slightly modified, but they have been largely successful: clients not only liked the visuals but also found them more meaningful and action oriented.

As a next step, when we need more compactness and less diffusion, we can try something which I call ‘meaning based visualization’. The idea is simple and quite useful. Instead of chopping the document at word level, we parse it at sentence level, say, after every full stop. Next, we build a predictive model to identify sentences with similar, opposite or complementary meanings. We can ‘tie’ them together based on these attributes, or apply additional transformations such as replacing certain words with synonyms to make the result more compact. This is certainly more intelligent than a usual word cloud, but it is not as simple to build. Currently, there is no package or module available that does this, so coding everything from scratch can be quite daunting for people who are not seasoned programmers.
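To make the first two steps concrete, here is a minimal sketch under stated assumptions. It splits a document at sentence boundaries and ‘ties’ together sentences that look similar. The similarity measure is a simple bag-of-words cosine score, which is only a stand-in for the predictive model described above: it cannot detect opposite or complementary meanings. The function names, the stop-word list and the 0.5 threshold are all hypothetical choices.

```python
import math
import re
from collections import Counter

# Toy stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "and", "in", "it", "on"}


def split_sentences(text):
    """Chop a document at sentence level: after every '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def bag_of_words(sentence):
    """Lower-cased word counts with stop words removed."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tie_sentences(text, threshold=0.5):
    """Return the sentences and (i, j, score) links for similar pairs."""
    sentences = split_sentences(text)
    vectors = [bag_of_words(s) for s in sentences]
    links = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                links.append((i, j, round(score, 2)))
    return sentences, links


if __name__ == "__main__":
    doc = "The delivery was late. Delivery was very late again. The staff were friendly."
    sentences, links = tie_sentences(doc)
    for i, j, score in links:
        print(f"{sentences[i]!r} <-> {sentences[j]!r} (similarity {score})")
```

The links form a graph over sentences, which could then be drawn with any graph layout instead of scattering individual words. Swapping the cosine score for a trained model is where the real effort would go.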

Or maybe we can wait till somebody comes up with an open source package.

 

PS: The word cloud image used in this article was built at http://www.wordle.net/create from the text of this article.

  

 
