Getting Started with Text Summarization
What is Text Summarization?
“Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).”
— Page 1, Advances in Automatic Text Summarization, 1999.
Across the industry, we often need to read a 50-page document only to realise that just 5 takeaway points are of value. Text Summarization is the task of condensing the document into those 5 takeaway points while preserving the key information and the crux of the content.
In this Information Age, when content is generated every second, manual text summarization is infeasible at scale. Advances in Artificial Intelligence, Deep Learning and Natural Language Processing give us tools to automate the task.
“Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning”
-Text Summarization Techniques: A Brief Survey, 2017
Benefits of Automatic Text Summarization:
1. Summaries reduce reading time and present actionable insights in a concise form.
2. When selecting articles for further analysis, summarization speeds up the process by cutting redundant material.
3. Automatic Text Summarization algorithms are less biased than human summarizers.
4. Using Automatic Text Summarization enables commercial abstract services to increase the number of text documents they can process.
Types of Text Summarization
Based on the type of output, there are two main approaches:
Extractive Summarization
This approach selects the sentences from the text that best summarize the document and arranges them to form the summary. Sentences are selected using a scoring function, and different scoring functions give rise to different extractive summarizers, such as:
1. NLTK Summarizer: Scores sentences based on word frequencies.
2. Gensim Summarizer: Scores sentences using the TextRank algorithm, which builds on Google’s PageRank algorithm.
3. Summa Summarizer: A variation on the Gensim algorithm that refines how the similarity between sentences is scored.
4. Extractive-BERT summarizer: Uses embeddings followed by clustering and chooses the sentences closest to the cluster centres as the summary sentences. It uses coreference resolution to resolve words in the summary that need more context.
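To make the idea of a scoring function concrete, here is a minimal sketch of word-frequency scoring, the idea behind the first approach in the list above. It uses only the Python standard library rather than NLTK, and the stop-word list, the naive sentence splitter and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in",
              "and", "it", "on", "for", "that", "this", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def frequency_summarize(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    if not freq:
        return []
    top = max(freq.values())
    # A sentence's score is the sum of the normalised frequencies of its words.
    scores = [sum(freq[w] / top for w in tokenize(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]


if __name__ == "__main__":
    sample = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
    print(frequency_summarize(sample, num_sentences=1))
```

Because the score is a plain sum, longer sentences tend to rank higher. Dividing by sentence length is one common adjustment.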
Abstractive Summarization
This approach uses Deep Learning-based Natural Language techniques to interpret the text and generate novel sentences. It paraphrases the source in its own words to produce a more coherent summary, closer to what a human would write.
Out-of-the-box implementations include the Abstractive-BERT summarizer and Google’s T5 summarizer. Alternatively, a Recurrent Neural Network encoder-decoder model with Attention Layers can be trained to build a custom abstractive summarizer.
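As a rough illustration of what an attention layer does at a single decoding step, the sketch below computes dot-product attention in plain Python. The article does not say which attention variant is meant, so dot-product scoring is an assumption, and the function names and toy vectors are hypothetical. A real model would learn these vectors and use a Deep Learning framework.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoding step."""
    # Score each encoder state against the current decoder state.
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


if __name__ == "__main__":
    encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values
    decoder_state = [2.0, 0.0]                             # toy values
    weights, context = attend(decoder_state, encoder_states)
    print(weights, context)
```

The weights show which input positions the decoder focuses on when producing the next word of the summary.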
Our Experience with Text Summarization
Having implemented each of these techniques, we observed that abstractive summarizers work well for documents whose crux is a conversation, or whose content reads as elaborate dialogue. They pick up the topic of the discussion and use their natural language understanding to generate novel sentences that express the emotion of the document in a clear, concise and coherent way. However, for documents like resumes, where each statement is itself a summary of experience, extractive summarizers that use algorithms like TextRank to pick the most important sentences do a better job than abstractive summarizers.
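For readers who want to see how a TextRank-style extractive summarizer ranks sentences, here is a minimal standard-library sketch. It is a simplification and not the Gensim or Summa implementation: the word-overlap similarity, the damping factor of 0.85, the fixed iteration count and all function names are assumptions.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Shared-word count, normalised by the log of each sentence's length."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # also avoids a zero denominator, since log(1) == 0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Run a weighted PageRank iteration over the sentence-similarity graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score, proportional
            # to the weight of the edge between j and i.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def textrank_summarize(text, num_sentences=2):
    """Return the top-ranked sentences, kept in their original order."""
    sentences = split_sentences(text)
    scores = textrank_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```

A sentence that shares words with many other sentences accumulates a high score, while a sentence unrelated to the rest of the document stays near the baseline of 1 - damping.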
Summary
Text Summarization is slowly gaining traction within the wider field of Natural Language Processing (NLP), driven by the pace at which textual data is generated and the ever-growing need to condense it for downstream applications such as news digests, report generation, news summarization and headline generation.
Extractive summarizers are easier to implement and preserve the accuracy of the text, because the summary sentences are taken directly from the source. However, paraphrasing and generalization require abstractive methods. As research in Deep Learning and NLP continues, we expect abstractive summarization to advance significantly and soon produce grammatically correct, human-like summaries with greater speed, accuracy and ease of implementation.
You can try it out at https://colab.research.google.com/drive/1IsB4fWe7gt2YgbUjCEiRW36wqr1AS-oM?usp=sharing
You can find more about these at:
https://github.com/anirbansaha96/AI-ML-Playground/blob/master/Google_T5_Abstractive_Summarizer.ipynb
https://github.com/anirbansaha96/AI-ML-Playground/blob/master/abstractive_bert.ipynb