Generating and Translating Articles Using AI: A Step-by-Step Guide

Generating articles using AI has never been easier, thanks to pre-trained models like GPT-2 and mBART. In this post, we'll walk through, step by step, how to use these models to generate an article on a given topic and then translate it into another language of your choice.

GPT-2 (Generative Pre-trained Transformer 2) is a pre-trained transformer-based language model developed by OpenAI. It is trained on a large dataset of web pages and can perform a variety of natural language processing tasks, such as summarization and text generation, without task-specific fine-tuning.

In this guide, we use the GPT-2 model to generate an article on the topic 'Benefits of Sleeping Early'. Because the model is pre-trained on a large dataset of web pages, it can generate coherent and meaningful text on a given topic.

mBART (Multilingual BART) is a pre-trained transformer-based sequence-to-sequence model developed by Facebook AI. It is pre-trained as a denoising autoencoder on monolingual data from many different languages and then fine-tuned for multilingual machine translation. The checkpoint used in this guide, mbart-large-50-one-to-many-mmt, translates from English into 49 other languages.

In this guide, we use the mBART model to translate the generated article from English into another language of your choice. Because the model is pre-trained and fine-tuned on data from many languages, it can translate the text accurately and fluently.

  • The pip install transformers command in step 1 installs the Hugging Face transformers library, which lets you use pre-trained transformer models such as GPT-2 and mBART with just a few lines of code.
  • In step 2, we import PyTorch and GPT-2 by running import torch and from transformers import GPT2LMHeadModel, GPT2Tokenizer. PyTorch is an open-source machine learning library that is used to train and run the GPT-2 model. The GPT2LMHeadModel and GPT2Tokenizer classes from the transformers library are used to load the GPT-2 model and tokenizer respectively.
  • In step 3, we load the GPT-2 tokenizer and model by running tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large") and model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id). The tokenizer encodes the input text and the model generates the output text. We are using the "gpt2-large" checkpoint, which has more parameters (774M) than the smaller GPT-2 checkpoints and generally produces more coherent text. GPT-2 has no dedicated padding token, so we reuse the end-of-sequence token as pad_token_id.
  • In step 4, we set the topic for the article by running topic = 'Benefits of Sleeping Early'. This sets the topic on which the GPT-2 model will generate an article.
  • In step 5, we encode the input topic by running input_ids = tokenizer.encode(topic, return_tensors='pt'). This converts the text of the topic into a numerical format that the GPT-2 model can understand. The return_tensors='pt' argument specifies that the input should be returned as a PyTorch tensor, which is the format that the GPT-2 model requires.
  • In step 6, we generate the article by running output = model.generate(input_ids, max_length=200, num_beams=30, no_repeat_ngram_size=4, early_stopping=True). The model.generate() function generates text from the encoded input. The max_length argument caps the output at 200 tokens (word pieces, not exactly words). The num_beams argument sets the number of candidate sequences that beam search explores in parallel; more beams search more thoroughly at the cost of speed. The no_repeat_ngram_size argument forbids any sequence of 4 tokens from appearing twice in the output, which cuts down on repetition. Setting early_stopping=True ends beam search as soon as enough complete candidates have been found. Beam search is only one decoding strategy; see the sampling sketch after this list for an alternative.
  • In step 7, we print the generated article by running print(tokenizer.decode(output[0], skip_special_tokens=True)). The tokenizer.decode() function converts the numerical output of the model back into text. The skip_special_tokens argument is set to True to remove any special tokens that the model may have added to the output.
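
As an aside to step 6, here is a minimal sketch of sampling-based decoding instead of beam search, assuming the model, tokenizer, and input_ids set up in the steps that follow. The do_sample, top_k, and temperature values below are illustrative choices, not from the original steps.

# A sampling-based alternative to beam search (illustrative values)
sampled = model.generate(
    input_ids,
    max_length=200,
    do_sample=True,   # sample from the token distribution instead of beam search
    top_k=50,         # consider only the 50 most likely next tokens at each step
    temperature=0.7,  # values below 1.0 make sampling less random
)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))

Sampling tends to produce more varied (and less repetitive) text than beam search, at the cost of occasional incoherence.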

Generating an article using AI and translating it into any language (GPT-2 and mBART)

Step 1: pip install transformers

pip install transformers        

Step 2: Import PyTorch and GPT-2

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
        

Step 3: Load the GPT-2 tokenizer and model

# GPT-2 has no dedicated padding token, so we reuse the end-of-sequence token
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)
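
Optionally, if a GPU is available you can move the model onto it with the standard PyTorch pattern below; any input tensors you create later (such as input_ids) would then need .to(device) as well. This is an optional convenience, not part of the original steps.

# Optional: run on a GPU when available (remember to also move input tensors)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)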

Step 4: Set the topic (pick any)

topic = 'Benefits of Sleeping Early'        

Step 5: Encode the input (the topic)


input_ids = tokenizer.encode(topic, return_tensors='pt')        


Step 6: Generate the article

# max_length: maximum number of tokens in the generated article
# num_beams: number of candidate sequences explored in parallel by beam search
# no_repeat_ngram_size: no sequence of 4 tokens may appear twice in the output
#   (e.g. the 4-gram 'benefits of sleeping early' can appear only once)
# early_stopping: stop beam search once enough complete candidates are found
output = model.generate(input_ids, max_length=200, num_beams=30,
                        no_repeat_ngram_size=4, early_stopping=True)

Step 7: Output

print(tokenizer.decode(output[0], skip_special_tokens=True))        

Step 8: Save the output in a variable (‘article_en’ in this case)

article_en = tokenizer.decode(output[0], skip_special_tokens=True)        

Step 9: Import the mBART model and tokenizer


from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX")        

Step 10: Tokenize the input article

model_inputs = tokenizer(article_en, return_tensors="pt")        
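
If the generated article were longer than the model's maximum input length, you could additionally pass truncation=True (a standard Hugging Face tokenizer argument) to clip the input safely:

# Same call with truncation enabled, in case the article exceeds the
# model's maximum input length
model_inputs = tokenizer(article_en, return_tensors="pt", truncation=True)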

Step 11: Generate the Polish translation tokens

# translate from English to Polish
generated_tokens = model.generate(
    **model_inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["pl_PL"]
)
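
The checkpoint supports a fixed set of target languages; if you are unsure which codes are valid, the tokenizer's lang_code_to_id mapping (already used above) lists them all:

# Print every target language code this mBART-50 tokenizer supports,
# e.g. "pl_PL", "hi_IN", "fr_XX"
print(sorted(tokenizer.lang_code_to_id.keys()))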

Step 12: Decode the Polish translation

translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)


Translated into Hindi

# translate from English to Hindi
generated_tokens = model.generate(
    **model_inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"]
)
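
Decoding works exactly as in step 12; the variable name translation_hi below is just an illustrative choice:

# Decode the Hindi tokens, just like the Polish output in step 12
translation_hi = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation_hi)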

#datascience #machinelearning #artificialintelligence

#chatgptai #chatgpt #languagetranslation #mbert #openaichatgpt
