The Power of Text Mining & Analytics
It boggles the mind: as the runaway exponential growth of data continues into the 21st century, roughly 85% of that data is unstructured, as estimated by the MIT Sloan School of Management. It may consist of images, multimedia, or text, and it's the last of these we're focusing on here. Most analytics software packages come with a text mining feature, as well they should; R and Python also offer a range of text mining capabilities.
The categories of TMA (Text Mining & Analytics) are varied: document classification, sentiment analysis, author/style analysis, text topic mining, text term clustering, and term weighting and scoring. One can derive text topic or cluster variables that then serve as inputs to traditional analytics models, such as logistic regression or decision trees.
In an early forensic case of text mining for author/style analysis, it helped bring down the Unabomber (Ted Kaczynski). The FBI received an anonymous copy of his manifesto on the ills of modern society and technological development. In an effort to flush him out, the manifesto was published by The Washington Post at the behest of the FBI. Many people suggested possible suspects, but the tip that stood out came from the Unabomber's own brother, David Kaczynski. David provided letters and documents written by his brother to the FBI, and text analysts were able to determine that the author of the samples and of the manifesto was one and the same[1]. That was in the mid-1990s; imagine the possibilities today.
Text mining programs focus on terms that are neither highly common nor extremely rare; there's a "sweet spot" between these two extremes, as conveyed by a Zipf curve, which graphs the inverse relationship between a word's frequency and its rank.
So, the bend in the Zipf curve is that "sweet spot", or the Goldilocks zone as some would say, where terms have the most value. Treating a term's weight as its document count multiplied by its rank, you'd get a much higher value from a mid-curve term with #Docs = 200 and Rank = 1,000 (which comes to 200,000) than from a very common term at 800 × 10 (8,000) or a very rare one at 10 × 10,000 (100,000).
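To make that back-of-envelope #Docs × Rank calculation concrete, here's a minimal sketch in Python (the toy corpus is made up, and real packages use more refined weighting schemes): rank terms by how many documents they appear in, then weight each term by document frequency times rank.

```python
from collections import Counter

# Toy corpus: each "document" is one short text record (hypothetical data).
docs = [
    "the conflict in the region disrupted oil supply",
    "the payment was flagged as a facilitation payment",
    "oil prices rose after the conflict escalated",
    "the report praised social responsibility",
    "a facilitation payment is a classic red flag",
]

# Document frequency: number of documents each term appears in.
df = Counter()
for d in docs:
    for term in set(d.split()):
        df[term] += 1

# Rank terms from most to least common (rank 1 = most common), then
# weight each term by document frequency x rank, echoing the
# #Docs * Rank arithmetic above.
ranked = sorted(df.items(), key=lambda kv: -kv[1])
weights = {term: freq * (i + 1) for i, (term, freq) in enumerate(ranked)}

print(weights["conflict"], weights["the"])
```

On a corpus this small the products are noisy, but on a realistic collection the mid-curve terms are the ones that separate records from one another.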
The standard text mining algorithm is TF-IDF, or term frequency–inverse document frequency. This weighs how often a term appears within a given text record (or document, as the case may be) against the inverse of how many documents in the collection contain that term. So IDF provides a measure of a term's informational worth: a term that appears in nearly every document is common and tells you little, while a term concentrated in a few documents is a strong signal for those documents.
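In its textbook form, TF-IDF multiplies a term's in-document frequency by the log of the inverse of its document frequency. A minimal sketch (toy documents, and a simpler formula than the smoothed variants most libraries use):

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf score} dict per document."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        for t in set(toks):
            df[t] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # A term frequent in this document but rare across the
        # collection gets the highest weight.
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = [
    "oil supply and oil prices",
    "oil paints and canvas",
    "canvas tents for the field",
]
scores = tfidf(docs)
```

Note that "supply" (unique to one document) outscores "oil" (present in two) in the first document, even though "oil" appears there twice: rarity across the collection is doing the work.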
Now, where text mining has great capability is in distinguishing between the same word used as different parts of speech. A word may be a noun, verb, adjective, adverb, proper noun, etc., and some words like "conflict" can be a noun (obvious), a verb ("her priorities conflicted"), or an adjective-like modifier ("in a conflict zone"). Or, when talking about oil (singular), it typically refers to the energy sector; but if a text record uses oils (plural), it's more likely referring to the arts or to homeopathic remedies.
So, text mining and analytics software can "tag and bag" these as separate term types, and project them into the Zipf space accordingly. Thus, in a series of short U.N. papers, conflict as a verb stem would have a higher rank number but a lower frequency. The other parts of speech for "conflict" may be found in the "sweet spot" of the Zipf curve, or they may be too frequent to be of any use for highlighting a given record.
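One way to picture this "tag and bag" step is to treat each (word, part-of-speech) pair as a distinct term type. A simplified sketch with hand-tagged tokens (a real pipeline would produce the tags with a POS tagger such as those in NLTK or spaCy, and the documents here are invented):

```python
from collections import Counter

# Hand-tagged tokens standing in for a POS tagger's output.
tagged_docs = [
    [("conflict", "NOUN"), ("zone", "NOUN"), ("aid", "NOUN")],
    [("priorities", "NOUN"), ("conflict", "VERB"), ("often", "ADV")],
    [("conflict", "NOUN"), ("resolution", "NOUN")],
]

# "Tag and bag": count (word, POS) pairs as separate term types, so
# conflict-as-noun and conflict-as-verb each get their own frequency
# and their own place on the Zipf curve.
counts = Counter(pair for doc in tagged_docs for pair in doc)

print(counts[("conflict", "NOUN")])  # noun sense
print(counts[("conflict", "VERB")])  # verb sense
```

Because the two senses are counted separately, one can land in the sweet spot while the other is filtered out as too common or too rare.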
The text mining software is also able to perform concept linkage, which depicts interconnected words based on their relative proximity, i.e. their co-occurrence with one another. Often, when you scrutinize key terms in the "sweet spot" in terms of their concept linkage, you are likely to find other words that are similarly weighted.
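Concept linkage rests on co-occurrence counts. A minimal sketch, assuming record-level co-occurrence on an invented corpus (production tools typically use sliding windows and statistical association measures rather than raw pair counts):

```python
from collections import Counter
from itertools import combinations

docs = [
    "facilitation payment to foreign official",
    "round dollar payment to official",
    "quarterly report on oil supply",
]

# Count how often each unordered pair of terms appears in the same record.
cooc = Counter()
for d in docs:
    for a, b in combinations(sorted(set(d.split())), 2):
        cooc[(a, b)] += 1

# Pairs with high counts are candidates for linked "concepts".
print(cooc[("official", "payment")])
```

Terms that repeatedly co-occur, like "payment" and "official" here, end up linked in the concept map; terms confined to one record do not.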
One highly useful application of text mining is TTA, text topic analysis. This can either be applied automatically, to generate topics based on clustering of key words that occur in close proximity to each other, or the software will allow you to create custom text topics in which terms that you specify are clustered together. Maybe you wish to create a custom text topic for fraud suspicion, or for environmental and social responsibility, or for health and wellness. It all depends on the overarching domain. From that point, once you have trained a model to auto-classify records, you can feed it new data on which to automatically perform TTA classification.
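A custom text topic can be as simple as a curated term list. Here's a hedged sketch: the topic name, term list, and threshold below are illustrative inventions, not taken from any particular product.

```python
# Illustrative custom topic: terms an analyst might associate with
# fraud suspicion (hypothetical list).
fraud_topic = {"facilitation", "kickback", "bribe", "offshore"}

def topic_hits(record, topic_terms):
    """Count how many of the topic's terms appear in a text record."""
    return len(topic_terms & set(record.lower().split()))

records = [
    "Facilitation payment routed through an offshore account",
    "Quarterly wellness program update for staff",
]

# Flag a record as belonging to the topic if it hits 2+ topic terms.
flags = [topic_hits(r, fraud_topic) >= 2 for r in records]
print(flags)
```

In commercial tools the same idea is dressed up with stemming, synonyms, and learned weights, but the mechanic (does this record match enough of the topic's terms?) is the same.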
In one case study, a forensic technology and data science team was investigating the records of a large global corporation for potentially improper payments (PIPs) regarding suspected corrupt practices overseas. Using random forest, neural network, and other machine learning techniques, the team was able to tune the model to accord more weight to the variables driving the 450 PIPs. Specifically, for text variables, it turned out that when the phrases "facilitation payment", "help fee", "round dollar amounts", and "statistically anomalous payments" were present in the transactions[2], these were red flags.
Now here's where things really take an interesting turn: typically, a text mining program will remove "superfluous" common parts of speech like personal pronouns, prepositions, or conjunctions. BUT… when doing white-collar crime analysis in text mining, the over-use of first-person pronouns can be a big red flag. Plenty of people didn't see that coming. In his book Give and Take, Wharton professor Adam Grant speaks to the situation where a CEO regularly uses personal pronouns such as "I", "me", and "my" in corporate reports like information circulars. He invokes the example of disgraced former Enron chairman Ken Lay, whose corporate reports were peppered with these terms of ego. His photo was also approximately five times the size of a typical CEO photo in a corporate report[3].
Yes, a textbook case of despicable narcissism. And these are the patterns that can reveal whether a business might go south from rampant fraud and recklessness. It just goes to show that, in some cases, discarding seemingly "noisy" terms like personal pronouns is not such a good idea, because they hold more signal than you might think.
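That pronoun red flag is easy to quantify as a simple first-person-pronoun rate. A minimal sketch, with an invented pronoun list and sample sentences (a serious analysis would compare a CEO's rate against a baseline corpus of peer reports, not a single sentence):

```python
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def first_person_rate(text):
    """Share of tokens that are first-person singular pronouns."""
    tokens = [t.strip(".,").lower() for t in text.split()]
    return sum(t in FIRST_PERSON for t in tokens) / len(tokens)

humble = "The team delivered strong results this quarter."
egotistical = "I delivered my results and my vision drove the quarter."

# The ego-heavy report scores markedly higher on this crude metric.
print(first_person_rate(humble))
print(first_person_rate(egotistical))
```

The point is precisely that this signal only exists if your pipeline keeps pronouns instead of throwing them out with the stop words.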
As a last consideration for this article, when modeling with text mining outputs, you would do well to consider the interpretability of your model. If you want to persuade a given audience, or ensure that your model can be readily defended if someone challenges its merits, this is a must. With this in mind, text mining outputs, be they from text topic modeling or text clustering, produce both "raw" variables and binary variables for each cluster or topic. The binary variables are simpler: does this document or record meet the minimum threshold to belong to Topic 'n', or not? The raw variables, by contrast, are projections into a term vector space expressed as continuous decimal numbers, which gets a little more complicated.
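To make the raw-versus-binary distinction concrete, binarizing just thresholds each record's continuous topic score. The scores and cutoff below are made up for illustration:

```python
# Raw topic variables: each record's continuous loading on Topic n,
# as projected into the term vector space (values invented).
raw_topic_scores = [0.02, 0.41, 0.87, 0.15, 0.63]

CUTOFF = 0.30  # minimum threshold for topic membership (arbitrary)

# Binary topic variables: does the record meet the threshold or not?
binary_topic = [int(score >= CUTOFF) for score in raw_topic_scores]

print(binary_topic)  # → [0, 1, 1, 0, 1]
```

The binary version is far easier to explain to a skeptical audience ("this record did or didn't hit the fraud topic"), at the cost of discarding how strongly each record loaded on the topic.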
So, choose your text mining and modeling methods accordingly – and the great news is that models employing text variables can be combined with more structured variables. It doesn’t have to be one type or the other! On that note, stay tuned for my next article on hybrid approaches to machine learning.
[1] The FBI official website. History → Famous Cases and Criminals → The Unabomber. https://www.fbi.gov/history/famous-cases/unabomber
[2] Walden, Vincent M. "Using Technology-Assisted Review to Uncover Suspicious Transactions". FRAUD Magazine, Vol. 37, No. 67, Nov/Dec 2022, pp. 62-63.
[3] Grant, Adam. (2014). Give and Take: Why Helping Others Drives Our Success. Penguin Books, pp. 36-37.