The Mathematics of Language

Sameer Kulkarni

Published Feb 4, 2017

Mathematics is objective and precise, while language is subjective and ambiguous. Math is popularly associated with the left brain and language with the right brain. Math is science, while language is art! Yet despite these differences, math and language are meeting. This surprising fusion of diametric opposites is happening in the world of Natural Language Processing (NLP).

An excellent example of NLP is the suggestions offered while typing in a messenger such as WhatsApp. The moment I type ‘Do’, WhatsApp suggests ‘not’, ‘you’ and ‘I’. If I select ‘you’ from this list, the next suggestions are ‘think’, ‘have’ and ‘want’. This is based on n-gram probabilities. An n-gram is a sequence of n words that appear consecutively, and the probabilities of a huge number of n-grams have been calculated beforehand from very large text corpora. So when we type ‘Do’, the bigrams with ‘Do’ as the first word are looked up, and the second words with the highest probabilities are shown. After ‘you’ is selected, the trigrams beginning with ‘Do you’ are used to suggest the next word, and so on. So it is a highly mathematical operation involving probabilities.
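The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the function names `train_ngrams` and `suggest`, and the plain relative-frequency estimate are all made up for this example, and a real keyboard would use far more text and a more refined model.

```python
from collections import Counter, defaultdict

def train_ngrams(tokens, n):
    """Count which word follows each (n-1)-word context in the text."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        counts[context][next_word] += 1
    return counts

def suggest(counts, context, k=3):
    """Return up to k next words with their estimated probabilities."""
    followers = counts.get(tuple(context))
    if not followers:
        return []  # context never seen in the training text
    total = sum(followers.values())
    return [(word, count / total) for word, count in followers.most_common(k)]

# Hypothetical toy corpus; real systems learn from very large corpora.
corpus = (
    "do you think so . do you have time . do you want tea . "
    "do not go . do you think it works . do i know you"
).split()

bigrams = train_ngrams(corpus, 2)   # one word of context
trigrams = train_ngrams(corpus, 3)  # two words of context

print(suggest(bigrams, ["do"]))          # suggestions after typing 'do'
print(suggest(trigrams, ["do", "you"]))  # suggestions after 'do you'
```

In this toy corpus ‘you’ follows ‘do’ in four out of six cases, so it comes out as the top suggestion, and ‘think’ is the most likely word after ‘do you’.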

Different algorithms are used for different NLP tasks such as translation, spell check, summarization, voice recognition, information retrieval and information extraction.

Consider the example of information extraction. If we are given a sentence such as “It was in the spring of 2016 that Donald Trump singled out Ford Motors, calling its plans to build a plant in Mexico an absolute disgrace and promising it would not happen on his watch”, we need to extract ‘Donald Trump’ as the name of a person, ‘Ford Motors’ as a company and ‘Mexico’ as a location. This is called Named Entity Extraction and is used to convert unstructured data into structured data which can be easily queried or processed.

One way to do this would be to build explicit dictionaries of all names of people, companies and places. Then, when we give a sentence as input, we compare its words against these dictionaries. However, in this approach, compiling comprehensive dictionaries would itself be a difficult task, not to mention the constant updates needed for additions and changes.
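A minimal sketch of the dictionary approach, assuming three tiny hand-made dictionaries, might look like this. The names `GAZETTEER` and `find_entities` are hypothetical, and the longest-match-first scan is just one simple way to handle multi-word names.

```python
# Hypothetical, tiny dictionaries; a real system would need far larger ones.
GAZETTEER = {
    "PERSON": {"donald trump"},
    "COMPANY": {"ford motors"},
    "LOCATION": {"mexico"},
}

def lookup(phrase, gazetteer):
    """Return the label of the dictionary containing the phrase, if any."""
    for label, names in gazetteer.items():
        if phrase.lower() in names:
            return label
    return None

def find_entities(sentence, gazetteer, max_len=3):
    """Scan left to right, trying the longest phrase first at each position."""
    tokens = sentence.replace(",", " ").replace(".", " ").split()
    found = []
    i = 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + length])
            label = lookup(phrase, gazetteer)
            if label:
                found.append((phrase, label))
                i += length  # skip past the matched phrase
                break
        else:
            i += 1  # no dictionary entry starts at this word
    return found

sentence = (
    "It was in the spring of 2016 that Donald Trump singled out Ford Motors, "
    "calling its plans to build a plant in Mexico an absolute disgrace"
)
print(find_entities(sentence, GAZETTEER))
```

The weakness is visible right away: a name that is missing from the dictionaries is simply never found.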

The other way is to use machine learning techniques. We take a data set of a few thousand sentences and manually label all the named entities in them. This data set, called the training set, is fed into a model which looks at the content and the context (i.e. the words around) and builds the rules automatically. An example of context would be that a word appearing after ‘located in’ is likely to be a place. Such rules are learned by the model rather than written by hand. The model is then tested, after which it is ready for use. When a new sentence is given as input, the model parses it based on the rules it has built itself and provides the named entities. The model is based on supervised machine learning techniques such as decision trees or neural networks. The advantage of this approach is that it is scalable: once the training set has been labeled, no further human intervention is needed to write or maintain rules.
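To make the idea concrete, here is a minimal sketch that uses a perceptron, one of the simplest supervised learners, chosen here only for brevity and not because the article prescribes it. The four labeled sentences, the feature names and the functions `features`, `train` and `tag` are assumptions made for this illustration; a real training set would have thousands of sentences and several entity types.

```python
from collections import defaultdict

def features(tokens, i):
    """Describe a token by its content and its context (the words before it)."""
    prev1 = tokens[i - 1].lower() if i > 0 else "<s>"
    prev2 = tokens[i - 2].lower() if i > 1 else "<s>"
    return [
        "bias",
        "word=" + tokens[i].lower(),
        "prev=" + prev1,
        "prev2=" + prev2 + "_" + prev1,
        "capitalized=" + str(tokens[i][0].isupper()),
    ]

def predict(weights, labels, feats):
    """Pick the label whose feature weights add up to the highest score."""
    scores = {lab: sum(weights[(f, lab)] for f in feats) for lab in labels}
    return max(labels, key=lambda lab: scores[lab])

def train(sentences, epochs=10):
    """Perceptron training: adjust the weights only when a guess is wrong."""
    labels = sorted({tag for _, tags in sentences for tag in tags})
    weights = defaultdict(float)
    for _ in range(epochs):
        for tokens, tags in sentences:
            for i, gold in enumerate(tags):
                feats = features(tokens, i)
                guess = predict(weights, labels, feats)
                if guess != gold:
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
    return weights, labels

def tag(weights, labels, tokens):
    """Label every token of a new sentence with the trained weights."""
    return [predict(weights, labels, features(tokens, i)) for i in range(len(tokens))]

# Hypothetical hand-labeled training set: PLACE marks a location, O anything else.
TRAIN = [
    ("The plant is located in Mexico".split(), ["O"] * 5 + ["PLACE"]),
    ("Our office is located in Pune".split(), ["O"] * 5 + ["PLACE"]),
    ("The factory is located in Detroit".split(), ["O"] * 5 + ["PLACE"]),
    ("They invested in Apple".split(), ["O"] * 4),
]

weights, labels = train(TRAIN)
print(tag(weights, labels, "The warehouse is located in Chennai".split()))
```

‘Chennai’ never appears in the training set, yet the model tags it as a place, because it has learned to give weight to the context ‘located in’ rather than to memorize the place names themselves.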

Though it is fascinating to understand how math is being used in linguistics, the question remains whether NLP algos can write books and poems. If we consider basic informational articles, NLP algos are already doing that. A lot of news stories generated today are written by machines. But can an NLP algo in the future write a piece of poetry which will make us emotional? I doubt it, but I would rather say I don’t know!

