There Is No Data Like More Data - Is That So?


A heuristic I lived by at university was "There is no data like more data". The professors hammered this phrase into the students' brains almost every lecture. Especially in Machine Learning, Deep Learning, and the development of Artificial Intelligence, you can hardly avoid hearing that quote.

Yet the question arises: Is it true? Does it really only take masses and masses of data to make your algorithm run better? Well, my experience taught me it is not that simple.

There is no doubt that it is useful to have a decent amount of training data for your algorithm. From that perspective: sure, there is no data like more data. What actually does the trick, however, is using the right data for training. Which raises a different question: What is the right data? Let me illustrate this with a little anecdote.

A while back I worked on a project with a bunch of computational linguists (my kind of people) in which we were expected to build and train an automatic speech recognizer (ASR). The group was recorded while chitchatting and having colloquial conversations in a laid-back atmosphere. The result of editing and preprocessing hours and hours of recordings eventually became our test data, to be run through the trained ASR system.

Here's the catch: The ASR was to be trained on a completely different set of training data: not colloquial speech and conversational small talk, but a corpus built from a renowned news magazine, in other words, a completely different domain. We all assumed that the word error rate (WER, a common metric for analyzing an ASR system's performance) would not score well, but little did we know how badly the system would actually perform at the end of the day.
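For readers unfamiliar with the metric: WER counts the substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. A minimal sketch in Python (the example sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # only deletions left
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # only insertions left
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the")
# over 6 reference words: WER of about 0.33.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 0.66, as in the bets below, means two thirds of the reference words were effectively misrecognized, which is why even that figure counts as a very bad result.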

The results were disappointing. We even started taking bets on whose team would reach a WER below 66 percent (the lower the percentage, the better the system), which is already a really, really bad outcome. Still, some teams hit the 80 percent mark, a devastating result, on the one hand.

On the other hand, we learned a lesson the hard way. Yes, training data is essential, and there should not be too little of it if your system is to get the training it needs to perform appropriately. Nevertheless, it is equally crucial to use domain-specific data that fits the test data the algorithm is expected to work on. There is no use in gathering nonsense data and a bajillion corpora, then blindly applying them for training without scrutinizing the intent. Always make sure to integrate data that actually suits the algorithm's purpose.

Beyond that, data quality is vitally important, too. What purpose does it serve to have masses of training data that misses your test data's point and, on top of that, is poorly preprocessed for the given objective? Right, it would all be in vain.

Key take-away: Make yourself aware of your algorithm's purpose. What is it there for? How will the system interpret the data you feed it? To what domain do your training and test data belong? Is your data preprocessed accordingly, and can the machine properly read the input? If you have clear answers to all of these questions, you can be sure of having the right data for your ML algorithm.


