Text Mining: Why context is king
Natural Language Processing has been around for decades, but it never quite reached the mainstream. Then cloud and parallel computing arrived and spurred interest in sleek, intelligent algorithms that could tackle huge volumes of unstructured data. This produced a seismic shift in the field and opened up a wealth of possibilities – not only for huge online sources, but also for smaller bodies of text such as customer feedback.
The greatest leap has been the ability to consider the context of the words being used – a shift captured in the phrase ‘from strings to things.’
Traditionally, if I wanted to find all mentions relating to the Apple brand, I would have struggled to find a way to search for it. I couldn’t search for “apple” and “brand” because people simply don’t talk in those terms. A better approach would have been to search for mentions of “apple” where “food” or “fruit” do not also appear. However, I would still receive a lot of nonsense that has nothing to do with what I’m looking for: the apple tree, the apple of my eye. The Big Apple would pop up too. And that’s only one search term.
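To make the brittleness concrete, here is a minimal sketch of that kind of keyword filter. The exclusion list and example sentences are invented for illustration:

```python
import re

# A naive "old way" brand filter: keep texts that mention "apple"
# but none of the obvious fruit words. The exclusion list is hypothetical.
EXCLUDE = {"fruit", "food", "tree", "pie", "juice"}

def mentions_apple_brand(text: str) -> bool:
    words = set(re.findall(r"[a-z']+", text.lower()))
    return "apple" in words and not words & EXCLUDE

print(mentions_apple_brand("The new Apple keynote was great"))  # True
print(mentions_apple_brand("I baked an apple pie"))             # False
print(mentions_apple_brand("She's the apple of my eye"))        # True - false positive!
print(mentions_apple_brand("Welcome to the Big Apple"))         # True - false positive!
```

Every false positive means another exclusion word, and every exclusion word risks throwing away genuine mentions.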
It’s easy to see why doing this the old way becomes inaccurate, inconsistent and hugely time-consuming, very quickly.
At the end of the process I would be reliant on a bunch of search terms and perhaps a code frame, allocating each term to a bucket of similar things. Then what happens if the underlying data changes? My static set-up can’t catch concepts beyond those I so narrowly specified. It might appear to work, but it doesn’t.
Fortunately, Natural Language Processing doesn’t work anything like that, at least not any more. We can now get at the context of words. We can see patterns in the proximity of words – that is, which words are typically used together across hundreds or thousands of pages. So when people speak of iOS, the new iPhone or the App Store, I know immediately which brand they’re talking about, because these words tend to be used together. No mention of Apple is required. I don’t even have to update my code frame to include the Apple Watch Sport.
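To see what “used together” looks like in practice, word-embedding models such as word2vec learn exactly this from co-occurrence. Below is a minimal sketch using the gensim library on a toy, invented corpus – real models are trained on many thousands of documents:

```python
from gensim.models import Word2Vec

# Toy corpus standing in for thousands of real documents (invented examples).
# "app_store" is pre-joined into a single token for simplicity.
sentences = [
    ["ios", "update", "broke", "my", "iphone"],
    ["downloaded", "the", "app", "from", "the", "app_store"],
    ["new", "iphone", "camera", "and", "ios", "features"],
    ["apple", "announced", "ios", "and", "app_store", "changes"],
    ["apple", "watch", "pairs", "with", "the", "iphone"],
] * 50  # repeat so this tiny corpus has enough co-occurrence signal

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

# Words used in similar contexts end up close together in the vector
# space, so "iphone" should rank terms like "ios" and "app_store"
# among its nearest neighbours.
print(model.wv.most_similar("iphone", topn=3))
</code>
```

No one told the model these words belong to the same brand; the proximity patterns in the text did all the work.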
Now here’s where things really hot up, because the implication is this: algorithms can learn that when “Glass” – or even slang like “glasses” – is mentioned, it’s relevant to Google and not Apple. All of this can be done without setting up a new search, or even knowing these phrases exist. In other words, exploration is built in, because the concepts are informed by the data itself, not by what I or a team of coders think the data should say. To take another example, algorithms can tell the difference between “Arsenal” the football team and an arsenal, the military building with lots of guns in it. When “gunners” is mentioned, it has to do with the football team, not an army. And when “barmy” is put next to “army”, we’re talking about something else entirely. Again, no search terms, no manual code frame. Neat, huh?
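To show that the surrounding words really do carry the disambiguating signal, here is a deliberately simple sketch: a small classifier trained on a handful of invented, hand-labelled sentences. (Real systems pick this signal up from the data without hand labelling, but the principle – context words decide the sense – is the same.)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-labelled training set (invented) for the two senses of "arsenal".
texts = [
    "the gunners won the match at the emirates",
    "arsenal signed a new striker this summer",
    "great goal from the arsenal midfielder",
    "the arsenal stored rifles and ammunition",
    "soldiers guarded the army arsenal depot",
    "the arsenal held weapons for the military",
]
labels = ["football", "football", "football", "military", "military", "military"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

# Context words like "gunners" or "rifles" drive the prediction,
# even though the word "arsenal" itself is ambiguous.
print(clf.predict(["the gunners kept a clean sheet"]))  # ['football']
print(clf.predict(["an arsenal full of rifles"]))       # ['military']
```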
Moreover, a collection of search terms can’t tell me whether the text is speaking positively or negatively about Apple or the Gunners. If I search for “like” and “Apple”, or “won” and “Arsenal”, I’m still stuck. Nor can I build a code frame to capture each and every combination of words. I need to consider the emotive context surrounding the mention, and it could literally be anything. This is one of the many reasons text mining models need to be built from scratch.
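There are off-the-shelf tools for scoring that emotive context. The sketch below uses NLTK’s VADER – a lexicon-based scorer rather than a model built from your own data, so it only gestures at the idea – to show how the words around a brand mention flip the verdict (the example sentences are invented):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download
sia = SentimentIntensityAnalyzer()

# The same brand word appears in both sentences, but the surrounding
# emotive context flips the compound score from positive to negative.
for text in ["I absolutely love my new Apple iPhone",
             "The latest Apple update completely ruined my phone"]:
    print(text, "->", sia.polarity_scores(text)["compound"])
```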
Without context, it is too easy to get the wrong end of the stick.
Thankfully, we no longer need to rely on static, biased, hard-wired code frames. Both the technology and the computing power to read terabytes of text are already available. It is possible to get the whole picture objectively, then reveal the concepts that really are in the text – even when the words we’re expecting are not!
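One family of techniques that reveals concepts straight from the data is topic modelling. As a minimal sketch (on a tiny invented corpus), gensim’s LDA groups words into topics purely from co-occurrence, with no search terms or code frame supplied up front:

```python
from gensim import corpora
from gensim.models import LdaModel

# Tiny invented corpus of pre-tokenised documents; real runs would
# use thousands of documents.
docs = [
    ["iphone", "ios", "app", "update", "battery"],
    ["ios", "app", "iphone", "camera", "screen"],
    ["arsenal", "gunners", "goal", "match", "league"],
    ["gunners", "match", "striker", "goal", "arsenal"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# LDA discovers topics from word co-occurrence alone - nobody tells
# it in advance that "iphone" and "gunners" belong to different worlds.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               random_state=1, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```

On real data, the topics that emerge are the concepts actually present in the text – including the ones we weren’t expecting.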