Generating and Translating Articles Using AI: A Step-by-Step Guide

Generating articles using AI has never been easier, thanks to pre-trained models like GPT-2 and mBART. In this post, we'll walk through, step by step, how to use these models to generate an article on a given topic and then translate it into another language of your choice.

GPT-2 (Generative Pre-trained Transformer 2) is a pre-trained transformer-based language model developed by OpenAI. It is trained on a large dataset of web pages and can perform a variety of natural language processing tasks, such as summarization and text generation, without task-specific fine-tuning.

In this guide, we use the GPT-2 model to generate an article on the topic 'Benefits of Sleeping Early'. Because the model is pre-trained on a large dataset of web pages, it can generate coherent and meaningful text on a given topic.

mBART (Multilingual BART) is a pre-trained transformer-based sequence-to-sequence model developed by Facebook AI. It is pre-trained as a denoising autoencoder on monolingual data from many different languages and then fine-tuned for multilingual machine translation. The checkpoint used in this guide, mbart-large-50-one-to-many-mmt, translates from English into 49 other languages.

In this guide, we use the mBART model to translate the generated article from English into another language of your choice. Because the model is pre-trained and fine-tuned on data from many languages, it can translate the text accurately and fluently.

  • The pip install transformers command in step 1 installs the Hugging Face transformers library, which lets you use pre-trained transformer models such as GPT-2 and mBART with just a few lines of code.
  • In step 2, we import PyTorch and GPT-2 by running import torch and from transformers import GPT2LMHeadModel, GPT2Tokenizer. PyTorch is an open-source machine learning library that is used to train and run the GPT-2 model. The GPT2LMHeadModel and GPT2Tokenizer classes from the transformers library are used to load the GPT-2 model and tokenizer respectively.
  • In step 3, we load the GPT-2 tokenizer and model by running tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large") and model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id). The tokenizer encodes the input text and the model generates the output text. We are using the "gpt2-large" checkpoint, which has more parameters (774M) than the smaller GPT-2 checkpoints and generally produces more coherent text. GPT-2 has no dedicated padding token, so we reuse the end-of-sequence token as pad_token_id.
  • In step 4, we set the topic for the article by running topic = 'Benefits of Sleeping Early'. This sets the topic on which the GPT-2 model will generate an article.
  • In step 5, we encode the input topic by running input_ids = tokenizer.encode(topic, return_tensors='pt'). This converts the text of the topic into a numerical format that the GPT-2 model can understand. The return_tensors='pt' argument specifies that the input should be returned as a PyTorch tensor, which is the format that the GPT-2 model requires.
  • In step 6, we generate the article by running output = model.generate(input_ids, max_length=200, num_beams=30, no_repeat_ngram_size=4, early_stopping=True). The model.generate() function generates text from the encoded input. The max_length argument caps the output at 200 tokens (word pieces, not exactly words). The num_beams argument sets the number of candidate sequences that beam search explores in parallel; more beams search more thoroughly at the cost of speed. The no_repeat_ngram_size argument forbids any sequence of 4 tokens from appearing twice in the output, which cuts down on repetition. Setting early_stopping=True ends beam search as soon as enough complete candidates have been found. Beam search is only one decoding strategy; see the sampling sketch after this list for an alternative.
  • In step 7, we print the generated article by running print(tokenizer.decode(output[0], skip_special_tokens=True)). The tokenizer.decode() function converts the numerical output of the model back into text. The skip_special_tokens argument is set to True to remove any special tokens that the model may have added to the output.
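
As an aside to step 6, here is a minimal sketch of sampling-based decoding instead of beam search, assuming the model, tokenizer, and input_ids set up in the steps that follow. The do_sample, top_k, and temperature values below are illustrative choices, not from the original steps.

# A sampling-based alternative to beam search (illustrative values)
sampled = model.generate(
    input_ids,
    max_length=200,
    do_sample=True,   # sample from the token distribution instead of beam search
    top_k=50,         # consider only the 50 most likely next tokens at each step
    temperature=0.7,  # values below 1.0 make sampling less random
)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))

Sampling tends to produce more varied (and less repetitive) text than beam search, at the cost of occasional incoherence.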

Generating an article using AI and translating it into any language (GPT-2 and mBART)

Step 1: pip install transformers

pip install transformers        

Step 2: Import PyTorch and GPT-2

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
        

Step 3: Load the GPT-2 tokenizer and model

# GPT-2 has no dedicated padding token, so we reuse the end-of-sequence token
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)
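
Optionally, if a GPU is available you can move the model onto it with the standard PyTorch pattern below; any input tensors you create later (such as input_ids) would then need .to(device) as well. This is an optional convenience, not part of the original steps.

# Optional: run on a GPU when available (remember to also move input tensors)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)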

Step 4: Set the topic (pick any)

topic = 'Benefits of Sleeping Early'        

Step 5: Encode the input (the topic)


input_ids = tokenizer.encode(topic, return_tensors='pt')        


Step 6: Generate the article

# max_length: maximum number of tokens in the generated article
# num_beams: number of candidate sequences explored in parallel by beam search
# no_repeat_ngram_size: no sequence of 4 tokens may appear twice in the output
#   (e.g. the 4-gram 'benefits of sleeping early' can appear only once)
# early_stopping: stop beam search once enough complete candidates are found
output = model.generate(input_ids, max_length=200, num_beams=30,
                        no_repeat_ngram_size=4, early_stopping=True)

Step 7: Output

print(tokenizer.decode(output[0], skip_special_tokens=True))        

Step 8: Save the output in a variable (‘article_en’ in this case)

article_en = tokenizer.decode(output[0], skip_special_tokens=True)        

Step 9: Import the mBART model and tokenizer


from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX")        

Step 10: Tokenize the input article

model_inputs = tokenizer(article_en, return_tensors="pt")        
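
If the generated article were longer than the model's maximum input length, you could additionally pass truncation=True (a standard Hugging Face tokenizer argument) to clip the input safely:

# Same call with truncation enabled, in case the article exceeds the
# model's maximum input length
model_inputs = tokenizer(article_en, return_tensors="pt", truncation=True)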

Step 11: Generate the Polish translation tokens

# translate from English to Polish
generated_tokens = model.generate(
    **model_inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["pl_PL"]
)
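
The checkpoint supports a fixed set of target languages; if you are unsure which codes are valid, the tokenizer's lang_code_to_id mapping (already used above) lists them all:

# Print every target language code this mBART-50 tokenizer supports,
# e.g. "pl_PL", "hi_IN", "fr_XX"
print(sorted(tokenizer.lang_code_to_id.keys()))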

Step 12: Decode the Polish translation

translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)


Translated into Hindi

# translate from English to Hindi
generated_tokens = model.generate(
    **model_inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"]
)
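
Decoding works exactly as in step 12; the variable name translation_hi below is just an illustrative choice:

# Decode the Hindi tokens, just like the Polish output in step 12
translation_hi = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation_hi)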

#datascience #machinelearning #artificialintelligence

#chatgptai #chatgpt #languagetranslation #mbert #openaichatgpt
