Run parallel, run better
Over the past few months, while working on my master's thesis project, I have faced a problem common to every data scientist: the running time of our code.
We have all heard about the qualities of good software: efficiency, scalability, reusability, reliability, and many others. But how much do we care about them while working on our small projects?
My thesis is about evaluating different Automatic Text Summarization algorithms. These algorithms, from the NLP field, aim to generate a summary of an input text. It sounds very interesting!
About Text Summarization
Text Summarization algorithms can be divided into two categories: Extractive and Abstractive.
- Extractive algorithms compute a summary by selecting the k most important sentences from the input text and merging them together to get the output. An example is TextRank, which is based on the PageRank algorithm developed by Google.
- Abstractive summarization is more sophisticated. It relies on complex NLP approaches and makes heavy use of Neural Networks. The summary generated by these algorithms is a completely novel text, with different words and a different structure from the input, much like a summary written by a human.
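To give a feel for the extractive approach, here is a minimal, toy frequency-based summarizer (not one of the algorithms evaluated in the thesis, just an illustration of "pick the top-k sentences"):

```python
import re
from collections import Counter

def extractive_summary(text, k=2):
    """Toy extractive summarizer: score each sentence by the average
    frequency of its words, then keep the top-k sentences in their
    original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sent):
        toks = re.findall(r"\w+", sent.lower())
        # Average (not sum) so long sentences aren't favoured unfairly
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    top = sorted(sentences, key=score, reverse=True)[:k]
    # Re-emit the selected sentences in their original order
    return " ".join(s for s in sentences if s in top)
```

Real extractive systems like TextRank replace the frequency score with a graph-based sentence ranking, but the select-and-merge structure is the same.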
So my task is to evaluate 8 different Text Summarization algorithms through a metric called ROUGE (without going into depth, it computes a similarity measure between the system-generated summary and a reference human summary).
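For the curious, the simplest ROUGE variant, ROUGE-1 recall, can be sketched in a few lines (real evaluations use a dedicated package and also report ROUGE-2 and ROUGE-L; this is just the idea):

```python
from collections import Counter

def rouge1_recall(system_summary, reference_summary):
    """ROUGE-1 recall: fraction of the reference's unigrams that also
    appear in the system summary, with counts clipped to avoid
    rewarding repeated words."""
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(n, sys_counts[w]) for w, n in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)
```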
Facing the problem
So... 8 algorithms plus the evaluation metric, all of this on the CNN/DailyMail dataset (about 300k entries, each consisting of a text and its human-written summary). That means a lot of time and resources to compute. Sadly, I don't have a super expensive 20-core CPU to run my Python code in acceptable times.
I'll give you some numbers:
- The time to compute 100 entries of the dataset with the Abstractive algorithms is ~360 seconds for both the parallel and the serial version. This is because, as said before, these algorithms use heavy deep learning models that already exploit the full power available on the machine.
- The average CPU load for the Extractive algorithms is ~47% in the serial version, versus ~99% in the parallel one. We put so many more resources to work!
- The average ratio between parallel and serial running time is ~0.6! It means the parallel versions take about 40% less time than the serial ones. The graph below shows the time required to process different dataset sizes: the larger the batch, the bigger the gain from parallel computation.
- Doing some quick math, I can expect the serial version to take about 25 hours on the whole dataset (~300k entries), while the parallel one will achieve the same goal in ~13 hours!
Findings and conclusion
So I saved about 12 hours of waiting for my code to finish. This gave a huge boost to my work, allowing me to run more tests and save a lot of time overall.
Most of the time, we don't care about these little details while developing small projects. But sometimes, as in my case, a few lines of Python code can turn the tables! Luckily, hardware today is very affordable: you can buy a 12-core CPU for less than €200. It would be a waste of resources not to use its full potential.