The Mathematics of Language

Mathematics is objective and precise, while language is subjective and ambiguous. In the popular picture, math is driven by the left brain and language by the right. Math is science while language is art! Yet despite these differences, math and language are meeting. This surprising fusion of diametric opposites is happening in the world of Natural Language Processing (NLP).

An excellent example of NLP is the suggestions offered while typing in a messenger such as WhatsApp. The moment I type ‘Do’, WhatsApp suggests ‘not’, ‘you’ and ‘I’. If I select ‘you’ from this list, the next suggestions are ‘think’, ‘have’ and ‘want’. This is based on n-gram probabilities. An n-gram is a sequence of n words that appear consecutively, and the probabilities are estimated beforehand by counting n-grams in large text corpora. So when we type ‘Do’, the bigrams that have ‘Do’ as their first word are looked up, and the second words with the highest probability are shown. After ‘you’ is selected, the trigrams beginning with ‘Do you’ are used to pick the next word, and so on. So it is a highly mathematical operation involving probabilities.
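The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus is invented, the function names `build_ngram_counts` and `suggest` are hypothetical, and a real keyboard would use a far larger corpus and some smoothing for unseen word sequences.

```python
from collections import Counter, defaultdict

def build_ngram_counts(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def suggest(counts, context, k=3):
    """Return up to k (word, probability) pairs, most probable first."""
    followers = counts.get(tuple(context))
    if not followers:
        return []
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]

# Toy corpus, invented for illustration only.
corpus = (
    "do you think so . do you have time . do you want tea . "
    "do not worry . do you think it works . do i know you ."
).split()

bigrams = build_ngram_counts(corpus, 2)   # context is one word
trigrams = build_ngram_counts(corpus, 3)  # context is two words

print(suggest(bigrams, ["do"]))          # candidates after 'do'
print(suggest(trigrams, ["do", "you"]))  # candidates after 'do you'
```

The probability of each suggestion is simply the count of that n-gram divided by the count of all n-grams sharing the same context, which is why everything can be calculated ahead of time and only looked up while typing.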

Several algorithms are used for different NLP tasks such as translation, spell check, summarization, voice recognition and information retrieval.

Consider the example of information retrieval, and in particular the step of extracting facts from free text. Given a sentence such as “It was in the spring of 2016 that Donald Trump singled out Ford Motors, calling its plans to build a plant in Mexico an absolute disgrace and promising it would not happen on his watch”, we need to extract ‘Donald Trump’ as the name of a person, ‘Ford Motors’ as a company and ‘Mexico’ as a location. This is called Named Entity Extraction (also known as Named Entity Recognition) and is used to convert unstructured data into structured data which can be easily queried or processed.

One way to do this would be to build explicit dictionaries of all names of people, companies and places. Then, when a sentence is given as input, we compare its words and phrases against these dictionaries. However, compiling comprehensive dictionaries in the first place would be a difficult task, not to mention keeping them constantly updated for additions and changes.
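A minimal sketch of the dictionary approach, assuming a tiny hand-built dictionary (the `GAZETTEER` contents and the function name `dictionary_entities` are hypothetical), shows both how simple the lookup is and where it breaks down:

```python
import re

# Hypothetical hand-built dictionaries; real ones would need constant upkeep.
GAZETTEER = {
    "PERSON": {"Donald Trump"},
    "COMPANY": {"Ford Motors"},
    "LOCATION": {"Mexico"},
}

def dictionary_entities(sentence, gazetteer):
    """Return (name, label) pairs for every dictionary entry found as whole words."""
    found = []
    for label, names in gazetteer.items():
        for name in sorted(names):
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, sentence):
                found.append((name, label))
    return found

sentence = (
    "It was in the spring of 2016 that Donald Trump singled out Ford Motors, "
    "calling its plans to build a plant in Mexico an absolute disgrace"
)
print(dictionary_entities(sentence, GAZETTEER))

# A name missing from the dictionaries is silently skipped.
print(dictionary_entities("Toyota opened a plant in Kentucky", GAZETTEER))
```

The second call returns an empty list because neither name was ever entered by hand, which is exactly the maintenance problem described above.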

The other way is to use machine learning techniques. We take a data set of a few thousand sentences and manually label all the named entities in them. This data set, called the training set, is fed into a model which looks at the content and the context (i.e. the surrounding words) and builds the rules automatically. An example of a context rule would be that a word following 'located in' is a place. The model is then tested on sentences it has not seen, and if it performs well it is ready for use. When a new sentence is given as input, the model parses it using the rules it has built itself and returns the named entities. The model is based on supervised machine learning techniques such as decision trees or neural networks. The advantage of this approach is that it is scalable: apart from labelling the training set, no one has to write or maintain rules and dictionaries by hand.
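To make the idea of learning from context concrete, here is a deliberately tiny sketch. It is not a decision tree or a neural network: it only counts which label follows each two-word context in an invented training set, and all names (`context_of`, `train`, `predict`) are hypothetical. Still, it shows how a rule such as "the word after 'located in' is a place" can come out of labelled data rather than being written by hand.

```python
from collections import Counter, defaultdict

def context_of(tokens, i):
    """Feature used here: the (up to) two lowercased words just before position i."""
    return tuple(t.lower() for t in tokens[max(0, i - 2):i])

def train(labelled_sentences):
    """Count how often each context is followed by each label."""
    counts = defaultdict(Counter)
    for tokens, labels in labelled_sentences:
        for i, label in enumerate(labels):
            counts[context_of(tokens, i)][label] += 1
    return counts

def predict(counts, tokens):
    """Give each token the label most often seen after its context; 'O' if unseen."""
    result = []
    for i in range(len(tokens)):
        seen = counts.get(context_of(tokens, i))
        result.append(seen.most_common(1)[0][0] if seen else "O")
    return result

# Invented training set; "O" marks a token that is not a named entity.
training_set = [
    ("The plant is located in Mexico".split(),
     ["O", "O", "O", "O", "O", "LOCATION"]),
    ("Our office is located in Pune".split(),
     ["O", "O", "O", "O", "O", "LOCATION"]),
    ("She works in finance".split(),
     ["O", "O", "O", "O"]),
]

model = train(training_set)
new_sentence = "The warehouse is located in Chennai".split()
print(list(zip(new_sentence, predict(model, new_sentence))))
```

'Chennai' never appears in the training set, yet it is labelled as a location because of the words before it. A real model would combine many such features (capitalisation, neighbouring words on both sides, word shape) and weigh them statistically.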

Though it is fascinating to understand how math is being used in linguistics, the question remains whether NLP algos can write books and poems. For basic informational articles, NLP algos are already doing that: many news stories today are written by machines. But can an NLP algo in the future write a piece of poetry that will move us emotionally? I doubt it, but I would rather say I don’t know!
