Run parallel, run better
Over the past few months, while working on my master's thesis project, I have faced a problem common to every data scientist: the running time of our code.
We have all heard about the qualities of good software: efficiency, scalability, reusability, reliability, and many others. But how much do we care about them while working on our small projects?
My thesis is about evaluating different Automatic Text Summarization algorithms. These algorithms, from the NLP field, aim to generate a summary of an input text. It sounds very interesting!
About Text Summarization
Text Summarization algorithms can be divided into two categories: Extractive and Abstractive.
- Extractive algorithms compute a summary by selecting the k most important sentences from the input text and merging them together to get the output. An example is TextRank, which is based on the PageRank algorithm developed by Google.
- Abstractive summarization is more sophisticated. It relies on complex NLP approaches and makes heavy use of Neural Networks. The summary generated by these algorithms is a completely novel text, with different words and a different structure from the input, much like a summary written by a human.
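To give a feel for the extractive approach, here is a minimal, toy frequency-based summarizer (not one of the algorithms evaluated in the thesis, just an illustration of "pick the top-k sentences"):

```python
import re
from collections import Counter

def extractive_summary(text, k=2):
    """Toy extractive summarizer: score each sentence by the average
    frequency of its words, then keep the top-k sentences in their
    original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sent):
        toks = re.findall(r"\w+", sent.lower())
        # Average (not sum) so long sentences aren't favoured unfairly
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    top = sorted(sentences, key=score, reverse=True)[:k]
    # Re-emit the selected sentences in their original order
    return " ".join(s for s in sentences if s in top)
```

Real extractive systems like TextRank replace the frequency score with a graph-based sentence ranking, but the select-and-merge structure is the same.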
So my task is to evaluate 8 different Text Summarization algorithms through a metric called ROUGE (without going into depth, it computes a similarity measure between the system-generated summary and a reference human summary).
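For the curious, the simplest ROUGE variant, ROUGE-1 recall, can be sketched in a few lines (real evaluations use a dedicated package and also report ROUGE-2 and ROUGE-L; this is just the idea):

```python
from collections import Counter

def rouge1_recall(system_summary, reference_summary):
    """ROUGE-1 recall: fraction of the reference's unigrams that also
    appear in the system summary, with counts clipped to avoid
    rewarding repeated words."""
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(n, sys_counts[w]) for w, n in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)
```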
Facing the problem
So... 8 algorithms plus the evaluation metric, all of this on the CNN/DailyMail dataset (about 300k entries, each consisting of a text and its human-written summary). That means a lot of time and resources to compute. Sadly, I don't have a super expensive 20-core CPU to run my Python code in acceptable times.
I'll give you some numbers:
- The time to compute 100 entries of the dataset with the Abstractive algorithms is ~360 seconds for both the parallel and the serial version. This is because, as said before, these algorithms use heavy deep learning models that already exploit the full power available on the machine.
- The average CPU load for the Extractive algorithms is ~47% in the serial version, versus ~99% in the parallel one. We put so many more resources to work!
- The average ratio between parallel and serial running time is ~0.6! It means the parallel versions take about 40% less time than the serial ones. The graph below shows the time required to process different dataset sizes: the larger the batch, the bigger the gain from parallel computation.
- Doing some quick math, I can expect the serial version to take about 25 hours on the whole dataset (~300k entries), while the parallel one will achieve the same goal in ~13 hours!
Findings and conclusion
So I saved about 12 hours of waiting for my code to finish. This gave a huge boost to my work, allowing me to run more tests and save a lot of time overall.
Most of the time, we don't care about these little details while developing small projects. But sometimes, as in my case, a few lines of Python code can turn the tables! Luckily, hardware today is very affordable: you can buy a 12-core CPU for less than €200. It would be a waste of resources not to use its full potential.