There Is No Data Like More Data - Is That So?


A heuristic I lived by at university was "There is no data like more data". The professors hammered this phrase into the students' brains almost every lecture. Especially in Machine Learning, Deep Learning, and the development of Artificial Intelligence, you can hardly avoid hearing that quote.

Yet the question arises: Is it true? Does it really only take masses and masses of data to make your algorithm run better? Well, my experience taught me it is not that simple.

There is no doubt that it is useful to have a decent amount of training data for your algorithm. From that perspective: sure, there is no data like more data. What actually does the trick, however, is using the right data for training. Which raises a different question: What is the right data? Let me illustrate this with a little anecdote.

A while back I worked on a project with a bunch of computational linguists (my kind of people) in which we were expected to build and train an automatic speech recognizer (ASR). The group was recorded while chitchatting and having colloquial conversations in a laid-back atmosphere. The result of editing and preprocessing hours and hours of recordings eventually became our test data, to be run through the trained ASR system.

Here's the catch: The ASR was to be trained on a completely different set of training data: not colloquial speech and conversational small talk, but a corpus built from a renowned news magazine, in other words, a completely different domain. We all assumed that the word error rate (WER, a common metric for analyzing an ASR system's performance) would not score well, but little did we know how badly the system would actually perform at the end of the day.
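For readers unfamiliar with the metric: WER counts the substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. A minimal sketch in Python (the example sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # only deletions left
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # only insertions left
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the")
# over 6 reference words: WER of about 0.33.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 0.66, as in the bets below, means two thirds of the reference words were effectively misrecognized, which is why even that figure counts as a very bad result.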

The results were disappointing. We even started taking bets on whose team would reach a WER below 66 percent (the lower the percentage, the better the system), which is already a really, really bad outcome. Still, some teams hit the 80 percent mark, a devastating result, on the one hand.

On the other hand, we learned a lesson the hard way. Yes, training data is essential, and there should not be too little of it if your system is to get the training it needs to perform appropriately. Nevertheless, it is equally crucial to use domain-specific data that fits the test data the algorithm is expected to work on. There is no use in gathering nonsense data and a bajillion corpora, then blindly applying them for training without scrutinizing the intent. Always make sure to integrate data that actually suits the algorithm's purpose.

Beyond that, data quality is vitally important, too. What purpose does it serve to have masses of training data that misses your test data's point and, on top of that, is poorly preprocessed for the given objective? Right, it would all be in vain.

Key take-away: Make yourself aware of your algorithm's purpose. What is it there for? How will the system interpret the data you feed it? To what domain do your training and test data belong? Is your data preprocessed accordingly, and can the machine properly read the input? If you have clear answers to all of these questions, you can be sure of having the right data for your ML algorithm.


