Language Models in NLP
Language models have transformed NLP over the last decade. They emerged as one of the early successes of deep learning for NLP, where pre-trained language models were used to produce word embeddings. Their applications range widely, from machine translation, summarization, and question answering to conversational AI. The last 10 years span four generations of language modeling: (1) the first generation with word2vec and GloVe; (2) the second with contextual embeddings from sequence models, e.g., ELMo; (3) the third with Transformer-based models, e.g., BERT and USE; (4) the fourth with large language models such as GPT-3 and OPT. This is the first post in the series, starting with an overview of the first generation of language models like word2vec.
Word2vec, introduced in 2013, addressed the curse of dimensionality in NLP (where even similar words were hard to match) and opened the door for deep learning in NLP. It embeds words in a continuous vector space where semantically similar words map to nearby points. This enables powerful vector arithmetic: for example, taking the vector for "Rome", subtracting "Italy", and adding "France" yields a vector closest to the vector for "Paris".
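To make this concrete, here is a minimal sketch of the analogy above using gensim's pretrained Google News word2vec vectors (the model name is gensim's downloader id, not something from this post; the first call downloads roughly 1.6 GB):

```python
import gensim.downloader as api

# Load 300-dimensional word2vec vectors trained on Google News.
model = api.load("word2vec-google-news-300")

# Rome - Italy + France ~= Paris: positive terms are added and
# negative terms subtracted before the nearest-neighbor search.
result = model.most_similar(positive=["Rome", "France"], negative=["Italy"], topn=1)
print(result)  # [('Paris', ...)]
```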
Before word embeddings, a commonly used technique for word representations in NLP was the bag-of-words model, in which a text is represented as a dictionary of the words it contains and their counts. It is an easy feature-generation technique for text but suffers from several limitations. For example, it cannot capture the order of words. It also suffers from the curse of dimensionality: the bag-of-words model has a limited vocabulary, and even similar words are hard to match.
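A short illustration of the bag-of-words idea with scikit-learn's CountVectorizer (the example sentences are made up here, not from the post):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Each document becomes a vector of word counts; word order is lost,
# and the vector length grows with the vocabulary.
print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
```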
In word2vec, words are mapped to a dense vector space, typically of 300 dimensions. For example, the word "Paris" is mapped to a single 300-dimensional vector. Analogies emerge in this space: pairs like country-capital have similar vector differences (Mikolov et al., 2013).
How do you create these embeddings? Word2vec proposed two training approaches: continuous bag of words (CBOW), which predicts a word from its surrounding context, and skip-gram, which predicts the surrounding context from a given word.
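A minimal training sketch of both approaches with gensim follows; the toy corpus is invented for illustration, and real training needs a corpus of millions of sentences:

```python
from gensim.models import Word2Vec

sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["rome", "is", "the", "capital", "of", "italy"],
]

# sg=0 selects CBOW (predict the center word from its context);
# sg=1 selects skip-gram (predict context words from the center word).
cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

print(cbow.wv["paris"].shape)  # (300,)
```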
Subsequent models built on word2vec, such as GloVe and fastText. Word2vec changed the NLP world a decade ago and led to state-of-the-art deep learning applications like machine translation, text classification, summarization, entity extraction, question answering, and conversational AI.
In the next few posts I will cover the subsequent generations of language models: contextual (e.g., ELMo), Transformer-based (e.g., BERT and USE), and large language models (e.g., GPT-3). Stay tuned!