Automated Data Augmentation in Machine Learning

The performance of most machine learning (ML) models, deep neural networks in particular, depends on the quantity and diversity of the training data. Data augmentation is an approach for generating additional data for ML models. Companies use it to reduce their dependence on collecting and preparing training data and to build more accurate models faster.

What is Data Augmentation?

Data augmentation is a set of techniques that increase the amount of data by adding slightly modified copies of existing data or synthetic data newly created from existing data.

Why is Data Augmentation Important Now?

Machine learning applications, especially in deep learning, continue to diversify and grow rapidly. Data augmentation techniques can help address some of the challenges the artificial intelligence field faces.

Data augmentation improves the performance and outcomes of machine learning models by adding new and different examples to training datasets. When the training dataset is rich and sufficient, the model performs better and is more accurate.

Collecting and labeling data for machine learning models can be exhausting and costly. Transforming existing datasets with data augmentation techniques allows companies to reduce these operational costs.

One of the steps in building a data model is cleaning the data, which is necessary for high-accuracy models. However, if cleaning makes the data less representative, the model cannot make good predictions on real-world inputs. Data augmentation techniques make machine learning models more robust by creating variations that the model may see in the real world.

How Much Interest Is There in Data Augmentation?

Interest in data augmentation techniques has been growing over the last five years. One of the reasons is the increasing interest in deep learning models.

How Does it Work?

Data augmentation is a common step in preparing training data for computer vision applications. There are classic and advanced techniques for both image recognition and natural language processing.

For Image Classification and Segmentation

Simple alterations of visual data are a popular form of data augmentation. In addition, generative adversarial networks (GANs) are used to create new synthetic data. Classic image processing operations for data augmentation are:

  • Padding
  • Random rotating
  • Re-scaling
  • Vertical and horizontal flipping
  • Translation (image is moved along X, Y direction)
  • Cropping
  • Zooming
  • Darkening & brightening/color modification
  • Gray scaling
  • Changing contrast
  • Adding noise
  • Random erasing
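
Several of the operations above reduce to simple array manipulations. The following is a minimal sketch using NumPy, assuming images are height x width x channel float arrays with values in [0, 1]. The function names and the random stand-in image are illustrative assumptions, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Stand-in for a real photo: a 32x32 RGB image with values in [0, 1].
image = rng.random((32, 32, 3))


def flip_horizontal(img):
    """Mirror the image left to right."""
    return img[:, ::-1, :]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return img[::-1, :, :]


def pad(img, width):
    """Add a black border of `width` pixels on every side."""
    return np.pad(img, ((width, width), (width, width), (0, 0)), mode="constant")


def random_crop(img, size, rng):
    """Cut out a random size x size window."""
    height, width = img.shape[:2]
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    return img[top:top + size, left:left + size, :]


def adjust_brightness(img, delta):
    """Brighten (delta > 0) or darken (delta < 0), keeping values in [0, 1]."""
    return np.clip(img + delta, 0.0, 1.0)


def to_grayscale(img):
    """Weighted channel sum using the ITU-R BT.601 luma weights."""
    return img @ np.array([0.299, 0.587, 0.114])


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def random_erase(img, size, rng):
    """Blank out a random size x size patch on a copy of the image."""
    out = img.copy()
    height, width = img.shape[:2]
    top = rng.integers(0, height - size + 1)
    left = rng.integers(0, width - size + 1)
    out[top:top + size, left:left + size, :] = 0.0
    return out


# One augmented variant per operation, all derived from the same source image.
variants = {
    "flip_h": flip_horizontal(image),
    "flip_v": flip_vertical(image),
    "padded": pad(image, 4),
    "cropped": random_crop(image, 24, rng),
    "brighter": adjust_brightness(image, 0.2),
    "gray": to_grayscale(image),
    "noisy": add_gaussian_noise(image, 0.05, rng),
    "erased": random_erase(image, 8, rng),
}
```

In practice these operations are usually applied with random parameters each time an image is drawn, so that the model rarely sees exactly the same input twice.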

Advanced models for data augmentation are:

  • Adversarial training/adversarial machine learning: It generates adversarial examples that disrupt a machine learning model and injects them into the training dataset.
  • Generative adversarial networks (GANs): GAN algorithms can learn patterns from input datasets and automatically create new examples that resemble the training data.
  • Neural style transfer: Neural style transfer models can separate style from content and blend a content image with a style image.
  • Reinforcement learning: Reinforcement learning models train software agents to attain their goals and make decisions in a virtual environment.
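
As one illustration of the adversarial training item above, the sketch below generates adversarial examples with the fast gradient sign method (FGSM) and adds them to the training set. FGSM is only one of several ways to build adversarial examples and is chosen here for brevity. The tiny untrained model, the random stand-in images, and the `epsilon` value are all assumptions made for the example, and TensorFlow 2.x is assumed to be installed.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


def fgsm_examples(model, images, labels, epsilon=0.05):
    """Shift each pixel by epsilon in the direction that increases the loss."""
    images = tf.convert_to_tensor(images)
    with tf.GradientTape() as tape:
        tape.watch(images)  # inputs are not variables, so track them explicitly
        loss = loss_fn(labels, model(images, training=False))
    gradient = tape.gradient(loss, images)
    adversarial = images + epsilon * tf.sign(gradient)
    return tf.clip_by_value(adversarial, 0.0, 1.0)


# Hypothetical stand-ins: a tiny untrained classifier and random "images".
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
images = tf.random.uniform((16, 28, 28, 1))
labels = tf.random.uniform((16,), maxval=10, dtype=tf.int32)

adversarial_images = fgsm_examples(model, images, labels)

# Inject the adversarial examples into the training set with their original labels.
train_images = tf.concat([images, adversarial_images], axis=0)
train_labels = tf.concat([labels, labels], axis=0)
```

With a real, trained model the adversarial examples would typically be regenerated during training, since the perturbation that fools the model changes as its weights change.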

Popular open-source Python packages for data augmentation in computer vision are Keras ImageDataGenerator, scikit-image (skimage) and OpenCV.

For Natural Language Processing (NLP)

Data augmentation is not as popular in the NLP domain as in the computer vision domain. Augmenting text data is difficult because of the complexity of language. Common methods for data augmentation in NLP are:

  • Easy Data Augmentation (EDA) operations: synonym replacement, word insertion, word swap and word deletion
  • Back translation
  • Contextualized word embeddings
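
The four EDA operations listed above can be sketched in plain Python. This is a minimal illustration that works on a list of tokens. The `SYNONYMS` table is a hypothetical toy dictionary (a real pipeline would look synonyms up in a thesaurus such as WordNet), and the function names are chosen for this example.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "joyful"],
}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, word in enumerate(out) if word in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    """Insert n synonyms of words from the sentence at random positions."""
    out = list(words)
    with_synonyms = [word for word in words if word in SYNONYMS]
    if not with_synonyms:
        return out
    for _ in range(n):
        synonym = rng.choice(SYNONYMS[rng.choice(with_synonyms)])
        out.insert(rng.randrange(len(out) + 1), synonym)
    return out


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [word for word in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
sentence = "the quick dog is happy today".split()
augmented_sentences = [
    " ".join(synonym_replacement(sentence, 1, rng)),
    " ".join(random_insertion(sentence, 1, rng)),
    " ".join(random_swap(sentence, 1, rng)),
    " ".join(random_deletion(sentence, 0.2, rng)),
]
```

Because these edits can change the meaning of a sentence, they are normally applied sparingly, with only a small share of the words modified per sentence.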

Data Augmentation in TensorFlow and Keras

To augment images when using TensorFlow or Keras as our deep learning framework, we can:

  • Write our own augmentation pipelines or layers using tf.image.
  • Use Keras preprocessing layers.
  • Use ImageDataGenerator.

tf.image

Let’s take a closer look at the first technique: we define a function that visualizes an image, and then apply a flip to that image using tf.image.
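
A minimal sketch of this step, assuming TensorFlow 2.x and, for the plot only, matplotlib. The helper name `visualize` and the random stand-in image are assumptions made for the example; with real data you would load an image file instead.

```python
import numpy as np
import tensorflow as tf


def visualize(original, augmented):
    """Show the original and the augmented image side by side."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for display

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    axes[0].set_title("Original image")
    axes[0].imshow(np.asarray(original))
    axes[1].set_title("Augmented image")
    axes[1].imshow(np.asarray(augmented))
    for axis in axes:
        axis.axis("off")
    plt.show()


# Stand-in for a real photo: a 64x64 RGB image with values in [0, 1].
image = tf.random.uniform((64, 64, 3))

# Mirror the image along its vertical axis (left becomes right).
flipped = tf.image.flip_left_right(image)

# Uncomment to display both images:
# visualize(image, flipped)
```

Other deterministic operations in tf.image, such as tf.image.flip_up_down or tf.image.adjust_brightness, can be passed through the same helper.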

For finer control you can write your own augmentation pipeline. In most cases it is useful to apply augmentations on a whole dataset, not a single image. You can implement it as follows:
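
One way to do this is to map an augmentation function over a tf.data.Dataset. The sketch below assumes TensorFlow 2.4 or later (for tf.data.AUTOTUNE). The `augment` function, the chosen transformations, and the random stand-in data are illustrative assumptions.

```python
import tensorflow as tf


def augment(image, label):
    """Apply random augmentations to one image; the label is left unchanged."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.clip_by_value(image, 0.0, 1.0)  # brightness can leave [0, 1]
    return image, label


# Stand-in data: 8 RGB images of 32x32 pixels with integer class labels.
images = tf.random.uniform((8, 32, 32, 3))
labels = tf.zeros((8,), dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(buffer_size=8)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)
)

for batch_images, batch_labels in dataset:
    print(batch_images.shape, batch_labels.shape)
```

Because the random operations run inside map, each pass over the dataset produces differently augmented images.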

Keras Preprocessing Layers

As mentioned above, Keras has a variety of preprocessing layers that may be used for data augmentation. You can apply them as follows:
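
A minimal sketch, assuming a TensorFlow version in which RandomFlip, RandomRotation and RandomZoom are available directly under tf.keras.layers (2.6 or later). The particular layers, their factors, and the random stand-in batch are illustrative choices.

```python
import tensorflow as tf

# A small augmentation block built from Keras preprocessing layers.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),  # up to 10% of a full turn either way
    tf.keras.layers.RandomZoom(0.1),      # zoom in or out by up to 10%
])

# Stand-in batch: 8 RGB images of 64x64 pixels.
images = tf.random.uniform((8, 64, 64, 3))

# training=True switches the random transformations on.
augmented = data_augmentation(images, training=True)
print(augmented.shape)
```

These layers can also be placed at the start of a model, in which case the augmentation runs only during training and is skipped at inference time.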

Keras ImageDataGenerator

You can also use ImageDataGenerator (tf.keras.preprocessing.image.ImageDataGenerator), which generates batches of tensor image data with real-time data augmentation.
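
A minimal sketch of its use, assuming a TensorFlow release that still ships this class (it is marked as deprecated in recent versions in favor of preprocessing layers) and that SciPy is installed, which the rotation and shift transforms rely on. The parameter values and the random stand-in data are illustrative.

```python
import numpy as np
import tensorflow as tf

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,       # rotate by up to 20 degrees
    width_shift_range=0.1,   # shift horizontally by up to 10% of the width
    height_shift_range=0.1,  # shift vertically by up to 10% of the height
    horizontal_flip=True,
    rescale=1.0 / 255,       # map pixel values from [0, 255] to [0, 1]
)

# Stand-in data: 16 RGB images of 32x32 pixels with binary labels.
images = np.random.randint(0, 256, size=(16, 32, 32, 3)).astype("float32")
labels = np.arange(16) % 2

# flow() yields endlessly; each batch is augmented on the fly.
batches = datagen.flow(images, labels, batch_size=8, shuffle=True, seed=0)
batch_images, batch_labels = next(batches)
print(batch_images.shape, batch_labels.shape)
```

The iterator returned by flow() can be passed directly to model.fit, so augmented batches are produced while the model trains.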

What are the Benefits of Data Augmentation?

Benefits of data augmentation include:

  • Improving model prediction accuracy:

  1. adding more training data to the models
  2. preventing data scarcity for better models
  3. reducing overfitting (a modeling error in which a function corresponds too closely to a limited set of data points) and creating variability in the data
  4. increasing the generalization ability of the models
  5. helping resolve class imbalance issues in classification

  • Reducing costs of collecting and labeling data
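
To illustrate the class imbalance point above, the sketch below oversamples under-represented classes with horizontally flipped copies until every class matches the largest one. It assumes NumPy and images stored as a batch x height x width x channel array. The function name `balance_with_flips` and the stand-in data are assumptions, and a real pipeline would usually mix several random transformations rather than rely on flips alone.

```python
import numpy as np


def balance_with_flips(images, labels, rng):
    """Add flipped copies of minority-class images until all classes are equal in size."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    new_images, new_labels = [images], [labels]
    for cls, count in zip(classes, counts):
        missing = target - count
        if missing == 0:
            continue
        # Sample (with replacement) from this class and flip along the width axis.
        picks = rng.choice(np.flatnonzero(labels == cls), size=missing, replace=True)
        new_images.append(images[picks][:, :, ::-1, :])
        new_labels.append(np.full(missing, cls))
    return np.concatenate(new_images), np.concatenate(new_labels)


rng = np.random.default_rng(seed=0)
# Stand-in data: 10 images of class 0 and only 2 images of class 1.
images = rng.random((12, 8, 8, 3))
labels = np.array([0] * 10 + [1] * 2)

balanced_images, balanced_labels = balance_with_flips(images, labels, rng)
print(np.unique(balanced_labels, return_counts=True))
```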

What are the Challenges of Data Augmentation?

  • Companies need to build evaluation systems for the quality of augmented datasets. As the use of data augmentation methods increases, assessing the quality of their output will be required.
  • The data augmentation field needs new research and studies on creating new/synthetic data for advanced applications. For example, generating high-resolution images with GANs is challenging.
  • If the real dataset contains biases, data augmented from it will contain those biases too. Identifying an optimal data augmentation strategy is therefore important.

Conclusion

To sum up, data augmentation is a technique for artificially expanding the size of a training set by creating modified data from the existing data. In this article, we covered what data augmentation is, which data augmentation techniques exist, and which libraries you can use to apply them.

Happy to be helpful.

Eng. Bilal (EL) JAMAL
