First BERT-based Urdu language model by the Urduhack team

Ikram Ali

Published Aug 18, 2020

Welcome to the world of language models. The idea of transfer learning in NLP took the world by storm when Google introduced BERT, a language model that set a new state of the art (SOTA) on many NLP benchmarks.

BERT models for other languages, such as Chinese, French, and German, inspired us to pre-train a BERT-like language model for Urdu.

Introducing roberta-urdu-small.

This is the first language model for Urdu, and we hope to release bigger and better models in the future with different architectures such as BERT and ALBERT. We have embarked on this language modelling journey and hope to make a great contribution. We would love to see support and contributions from the machine learning community.

Training Data

A huge amount of data is a prerequisite for pre-training a language model. Urdu text is available mainly in the form of news. We scraped the last 10 years of data from Pakistan’s popular Urdu newspapers, and the corpus also includes other resources.

Pre-training

We opted for the RoBERTa architecture instead of BERT for its better accuracy and better pre-training approach. The RoBERTa architecture is the same as BERT's, except that it is pre-trained only on the masked language modeling (MLM) task rather than on both MLM and next-sentence prediction (NSP). The NSP task does not contribute much to model performance on downstream tasks, and removing it reduces computational cost and training time.
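To make the MLM objective concrete, here is a minimal sketch of how one training example can be built from a list of token ids. It is an illustration and not the Urduhack training code: the `MASK_ID` value, the `mask_tokens` helper, and the 15% masking rate (the default from the original BERT paper) are assumptions, and the sketch always substitutes the mask token instead of following BERT's full replacement scheme.

```python
import random

MASK_ID = 4      # hypothetical id of the <mask> token in the vocabulary
IGNORE = -100    # label value that loss functions conventionally skip


def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Build (inputs, labels) for one masked-language-modeling example."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model
            labels.append(tok)       # the model must recover the original id
        else:
            inputs.append(tok)       # leave the token visible
            labels.append(IGNORE)    # no loss is computed at this position
    return inputs, labels


# Example with made-up token ids and a fixed seed for repeatability.
example_inputs, example_labels = mask_tokens(list(range(10, 30)), rng=random.Random(0))
```

Only the masked positions carry a label, so the model is scored purely on how well it reconstructs the hidden tokens. There is no second NSP objective to compute.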

The model was pre-trained for many epochs on a Tesla P100 GPU. Training took a long time.

Fill Masked Word

Let’s test our model on the MLM task. In this task, we randomly mask a word in our input sentence, and the model predicts the masked word.
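The masking step itself can be sketched in a few lines. The `mask_random_word` helper below is hypothetical (it is not part of transformers or urduhack) and assumes that words are separated by whitespace and that the mask token is written as `<mask>`.

```python
import random


def mask_random_word(sentence, mask_token="<mask>", rng=None):
    """Replace one randomly chosen word with the mask token.

    Returns the masked sentence and the word that was hidden.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        raise ValueError("sentence must contain at least one word")
    idx = rng.randrange(len(words))   # pick the position to hide
    target = words[idx]
    words[idx] = mask_token
    return " ".join(words), target


masked, target = mask_random_word("حکومت کی کوششیں رنگ لے آئیں")
print(masked)   # the sentence with one word replaced by <mask>
print(target)   # the word the model should predict
```

The masked sentence can then be passed to a fill-mask pipeline, and the hidden word compared against the model's top predictions.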

https://huggingface.co/urduhack/roberta-urdu-small

from transformers import pipeline

# Load the model and its tokenizer from the Hugging Face Hub (see the link above).
fill_mask = pipeline("fill-mask", model="urduhack/roberta-urdu-small", tokenizer="urduhack/roberta-urdu-small")

# The word hidden behind <mask> in this sentence is کوششیں.
results = fill_mask("حکومت کی <mask> رنگ لے آئیں")
for result in results:
    print(result)
# {'sequence': '<s>حکومت کی کوششیں رنگ لے آئیں</s>', 'score': 0.07726366072893143, 'token': 4356}
# {'sequence': '<s>حکومت کی خدمات رنگ لے آئیں</s>', 'score': 0.05506495386362076, 'token': 2284}
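The predictions come back as a list of dictionaries, so picking the best one is a small post-processing step. The sketch below uses the two results shown above as sample data. The `best_fill` helper is hypothetical, and it assumes each result has the `sequence` and `score` keys seen in that output.

```python
def best_fill(results):
    """Return (sentence, score) for the highest-scoring prediction."""
    top = max(results, key=lambda r: r["score"])
    # Strip the sentence-boundary markers that appear in the output above.
    sentence = top["sequence"].replace("<s>", "").replace("</s>", "").strip()
    return sentence, top["score"]


# Sample data copied from the output shown above.
sample = [
    {"sequence": "<s>حکومت کی کوششیں رنگ لے آئیں</s>", "score": 0.07726366072893143, "token": 4356},
    {"sequence": "<s>حکومت کی خدمات رنگ لے آئیں</s>", "score": 0.05506495386362076, "token": 2284},
]

sentence, score = best_fill(sample)
print(sentence, round(score, 4))
```

Here the top-ranked sentence restores the original word کوششیں, which is the word that was masked.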