The Great Debate: There is not enough Testing!  Data Augmentation and Machine Learning to Detect COVID-19
Illustration of the ultrastructure of the COVID-19 virus. Credit: CDC/Science Photo Library


1. Introduction

1.1 Background

Coronavirus disease 2019 (COVID-19) is an infectious respiratory illness that can spread from person to person. The viral disease was initially detected in Wuhan, the capital of China’s Hubei Province, in late 2019, and has since driven the ongoing global coronavirus pandemic. Most cases cause only mild symptoms, but some lead to death through pneumonia and/or organ failure.

The virus spreads mainly through close proximity or contact with contaminated surfaces, where it can survive for up to 72 hours. The first three days of symptoms are the most contagious period. COVID-19 is estimated to have a reproduction number of around 2-3, which means that each infected person passes the virus on to another two to three people, and so on.

1.2 Significance

COVID-19 is a viral illness that had never been recognized in people before. The problem with this novel coronavirus is the ambiguity and unknown severity of its symptoms: some patients have a fever, cough, shortness of breath, or pneumonia in both lungs, and some die. Currently, we have no vaccine or dedicated antiviral treatment for COVID-19.

1.3 Question

How can we use artificial intelligence to detect signs of COVID-19 from patients’ respiratory X-rays?

1.4 Hypothesis

Through machine learning, we will be able to detect high-risk COVID-19 patients from X-rays with high confidence, using data augmentation and an image recognition technique called convolutional neural networks.

2. Data

2.1 Data Sources

The Kaggle dataset, titled COVID-19 Chest X-Ray, contains images of COVID-19 cases as well as Middle East Respiratory Syndrome (MERS), Severe Acute Respiratory Syndrome (SARS), and Acute Respiratory Distress Syndrome (ARDS), with well-documented, high-resolution X-rays/CT scans (Bachir, 2020). There are 146 samples, with attributes contained in a separate metadata table that describes the images. “The columns that are included are:

  • Patientid (internal identifier)
  • offset (number of days since the start of symptoms or hospitalization for each image)
  • sex (M, F, or blank)
  • age (years)
  • finding (which Pneumonia)
  • survival (Y or N)
  • view (for example, PA, AP, or L for X-rays and Axial or Coronal for CT scans)
  • modality (CT, X-ray, or something else)
  • date (date the image was acquired)
  • location (hospital name, city, state, country) importance from right to left.
  • filename
  • doi (DOI of the research article)
  • url (URL of the paper or website where the image came from)
  • license
  • clinical notes (about the radiograph in particular, not just the patient)
  • other notes (e.g. credit)

(Bachir, 2020, 1)”

Twenty-three images of normal samples from another Kaggle dataset, titled COVID-19 Xray Dataset (Train & Test Sets), are added to prevent overfitting, since the original data contained only two normal samples; some of these images will also be used later for out-of-sample testing. The only data the model (a CNN) needs is the image and the result (finding). With so few samples, we will need to simulate data by adding transformed copies of the images to the dataset.

2.2 Data Cleaning

We need to drop all of the columns we do not need, leaving only [[‘finding’, ‘filename’]]. Now, let us focus on the finding column and the values we will be comparing. The goal is to differentiate a normal image from COVID-19, but there should be another category for other lung problems. We are going to change [[‘ARDS’, ‘SARS’, ‘Pneumocystis’, ‘Streptococcus’, ‘E. Coli’, ‘Legionella’]], which are other respiratory problems, to Pneumonia. This way, the labels are simplified to only three values.
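As a sketch of this cleaning step in pandas (the tiny example table here is hypothetical; only the column and label names come from the metadata description):

```python
import pandas as pd

# Hypothetical stand-in for the metadata table; only the column names
# and label values follow the dataset description above.
metadata = pd.DataFrame({
    "finding": ["COVID-19", "SARS", "Streptococcus", "No Finding"],
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "age": [54, 61, 47, 33],  # an extra column to be dropped
})

# Keep only the two columns the CNN needs.
df = metadata[["finding", "filename"]].copy()

# Collapse the other respiratory illnesses into a single "Pneumonia" label.
other = ["ARDS", "SARS", "Pneumocystis", "Streptococcus", "E. Coli", "Legionella"]
df["finding"] = df["finding"].replace(other, "Pneumonia")

print(df["finding"].unique())  # only three label values remain
```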


Figure #1: Metadata and finding for each image

2.3 Data Augmentation

Since neural networks learn from the values fed in as input, they require a lot of data, or examples, to be trained. The unfamiliarity with COVID-19 makes sufficient reports hard to come by, so there are not many examples for the model to train on. Data augmentation makes minor alterations to the dataset so that the model treats the altered images as distinct examples, as in the example below from Nanonets.


Figure #2: Visualization of Data Augmentation (Gandhi, 2018, 1)

2.4 Train Test Split

In order to verify the accuracy of our model, we must set a small percentage of the data aside for testing. We will set aside 30% of the samples to ensure that the validation accuracy is as authentic as possible. The split is performed first because we do not want our testing data to be augmented.
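The split might look like this with scikit-learn's `train_test_split`; the arrays here are small random placeholders standing in for the real image arrays and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for the 169 image arrays (X) and their labels (y);
# the real X-ray arrays are built during preprocessing.
X = np.random.rand(169, 10, 10, 3)
y = np.random.randint(0, 3, size=169)

# Hold out 30% for validation, stratified so every class appears in both
# halves. The split happens BEFORE augmentation so test images stay untouched.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
```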

3. Methodology/Procedure

3.1 Preprocessing

In order for our model to train with backpropagation, the images need labels so the model can determine whether its predictions were right or wrong. We will combine the file paths of the images with their labels, indexed [[‘No Finding’, ‘COVID-19’, ‘Pneumonia’]] as [[0, 1, 2]].

Each image is transformed into a NumPy array of pixel values (0-255), with the images resized to (150, 150). Afterwards, we split the image arrays into X and the labels into y.

To normalize the dataset, we divide the image arrays by 255, the maximum pixel value. A LabelEncoder transforms the labels in the y list, which are then one-hot vectorized.
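A sketch of the normalization and encoding steps on placeholder data; note that scikit-learn's `LabelEncoder` assigns integers alphabetically, which may differ from the manual [0, 1, 2] ordering above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fake 150x150 RGB "images" stand in for the resized X-ray arrays.
X = np.random.randint(0, 256, size=(6, 150, 150, 3)).astype("float32")
y = ["No Finding", "COVID-19", "Pneumonia", "COVID-19", "Pneumonia", "COVID-19"]

# Scale pixel values from [0, 255] into [0, 1].
X = X / 255.0

# Integer-encode the three labels (LabelEncoder sorts them alphabetically),
# then one-hot them by indexing rows of an identity matrix.
encoder = LabelEncoder()
y_int = encoder.fit_transform(y)
y_onehot = np.eye(3)[y_int]   # shape (6, 3), one 1 per row
```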

We will be using the Keras data augmentation library to create simple transformations on the data. There will be:

  • Rotation range = 20
  • Width shift range = 0.2
  • Height shift range = 0.2
  • Shear range = 0.2
  • Zoom range = 0.2
  • Horizontal flips = True
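The list above maps directly onto Keras's `ImageDataGenerator`; a minimal sketch (the variable name is my own):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The augmentation settings listed above as a Keras generator.
# rotation_range is in degrees; the shift/shear/zoom ranges are
# fractions of the image size.
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
# Typical use: datagen.flow(X_train, y_train, batch_size=32) yields
# randomly transformed copies of the training images each epoch.
```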

3.2 Convolutional Neural Networks

The benefit of using a CNN is that it learns richer representations of the images’ features. The CNN takes the pixel values as input and extracts features from them to improve classification.


Figure #3: Visualization of Convolutional Neural Networks (Tatan, 2019, 1)

Convolutions slide windows of defined filters across the grid of values to emphasize defining features. Max pooling takes the highest value in each window to keep only the essential features and help prevent overfitting. The ReLU activation function introduces nonlinearity by passing positive values through and returning zero for negative ones; it keeps the weights small and mitigates gradient issues.

In the final fully connected layer, a flattening layer reshapes the outputs, and an activation function called softmax produces the probability of each label. This allows classification of the images for non-binary labels.

3.3 Backpropagation

Backpropagation is a supervised-learning technique for computing the gradient of the loss with respect to the weights. The error values are calculated from the output and fed back through the network to update the weights.
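As a toy illustration of the idea, consider a single weight trained with squared error; all of the numbers here are made up for the example:

```python
# One linear "neuron": pred = w * x, loss = (pred - target)^2.
x, target = 2.0, 10.0
w = 1.0      # initial weight
lr = 0.1     # learning rate

for _ in range(50):
    pred = w * x                     # forward pass
    grad = 2 * (pred - target) * x   # dL/dw, the gradient fed back
    w -= lr * grad                   # gradient-descent update

print(round(w, 3))  # w converges toward 5.0, since 5.0 * 2 = 10
```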

3.4 Model and Tuning

There will be two models whose accuracy we compare to determine which should be deployed. We will apply class weights to the model because the data is heavily skewed toward COVID-19 images. We can adjust the number of nodes and layers to determine which configuration suits the target application. In our problem, signs of COVID-19 are small relative to the images, so we will freeze the weights of the first layers, which identify curves and edges.
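Class weights of the kind described here can be computed with scikit-learn; the class counts below are illustrative, not the dataset's actual counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced labels: mostly COVID-19 (label 1), few normals (0).
y_train = np.array([1] * 90 + [2] * 20 + [0] * 10)

# "balanced" weights each class by n_samples / (n_classes * class_count),
# so rarer classes contribute more to the loss.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1, 2]), y=y_train
)
class_weight = dict(enumerate(weights))
# In Keras, pass class_weight=class_weight to model.fit(...).
```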

The image presented below shows the model and a summary of the CNN. The parameter after each convolution layer is the number of filters; in our case, the first layer has 32 filters with a 3x3 window and ReLU as the activation function. The model processes an image tensor of size (150, 150, 3), where the last parameter defines the color channels. Since we are using RGB input, our channel depth is three, instead of the one used for a grayscale image.
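As a rough outline only (not a copy of the code in Figure #4, which has more layers), a Keras model matching this description could be sketched as:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D,
                                     Flatten, Dropout, Dense)

# Minimal sketch consistent with the text: 32 filters of 3x3 with ReLU
# on a (150, 150, 3) RGB input, pooling, then a 3-way softmax output.
model = Sequential([
    Input(shape=(150, 150, 3)),
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dropout(0.5),
    Dense(3, activation="softmax"),  # one probability per label
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```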


Figure #4: Code for Convolutional Neural model



Figure #5: Summary of the model to break down the layers and parameters

The CNN learns the filter values through backpropagation, so the learning layers are the ones with parameters. The number of parameters is the number of learnable elements of the filters in a layer. For example, for the first conv2d layer we can calculate the number of parameters as ((height of filter * width of filter * number of color channels) + 1) * number of filters; using the formula, we get ((3*3*3)+1)*32 = 896 parameters.
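The formula can be sanity-checked with a few lines of arithmetic (the helper name is my own):

```python
def conv2d_params(filter_h, filter_w, in_channels, n_filters):
    # ((h * w * channels) weights + 1 bias) per filter, times the filter count
    return ((filter_h * filter_w * in_channels) + 1) * n_filters

print(conv2d_params(3, 3, 3, 32))  # 896, matching the first conv2d layer
```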

4. Results and Discussion

4.1 Testing and Validation

After training our model, we will look at the accuracy and losses. The graph below shows accuracy over the epochs of training, comparing the training and validation datasets.

Figure #6: Graph of training and testing accuracy over 25 epochs.


Figure #7: Graph of training and testing loss over 25 epochs.

Even though the training accuracy is not increasing much, we are more concerned with the validation accuracy: we want to know how well our model will perform in the real world. If our model’s training accuracy gets too high, it means the model is overfitting, remembering instead of learning.

After 25 epochs, the model ended with a validation accuracy of 85%, which is not bad considering we held out 30% of the original dataset as testing data.

4.2 Results Using Out-of-Sample Inputs

Let us now look at how well the model performs on out-of-sample images. Since we used categorical cross-entropy for our three categories, the model outputs the probability of each category: No Finding, COVID-19, and Pneumonia, respectively. A preprocessing function, which I named produce, gives the model the same input format as the training input.
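The article names the helper `produce` but not its body; a plausible reconstruction, assuming it mirrors the training-time preprocessing (resize to 150x150 RGB, scale to [0, 1], add a batch dimension):

```python
import numpy as np
from PIL import Image

def produce(path):
    # Assumed implementation: same steps as training-time preprocessing.
    img = Image.open(path).convert("RGB").resize((150, 150))
    arr = np.asarray(img, dtype="float32") / 255.0   # scale to [0, 1]
    return arr.reshape(1, 150, 150, 3)               # batch dim for predict
```

With this shape, `model.predict(produce("xray.jpg"))[0]` would return the three class probabilities.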

The first example is an image diagnosed with COVID-19. Our model predicted the image to show COVID-19 with approximately 75% confidence.


Figure #8: Image of COVID-19 sample with model prediction at the bottom: 75% COVID-19


The second example is an image diagnosed with Streptococcus/Pneumonia. Our model predicted the image to show Pneumonia with approximately 61% confidence.


Figure #9: Image of Streptococcus/ Pneumonia sample with model prediction at the bottom: 61% Pneumonia


The final example is an image diagnosed with No Finding. Our model predicted Pneumonia with approximately 36% confidence, while also assigning a 30.6% probability to No Finding and 33.2% to COVID-19.


Figure #10: Image of No Finding sample with model prediction at the bottom: 36% Pneumonia

4.3 Discussion

The result of 85% accuracy on our 50 test samples supports the hypothesis that we can use machine learning to flag highly suspected COVID-19 X-rays. Since the dataset contained more COVID-19 samples, the model is slightly more biased than we would like, but in our case it is better to have false positives than false negatives. We tried to combat the data imbalance by using class weights, which helped with the overfitting, and the dropout layer in the model also proved necessary.

As we can see from our results, the model was able to differentiate signs of COVID-19 from Pneumonia. To an untrained eye, it is challenging to notice the difference between these two respiratory illnesses, but the model managed it in the results above. Our biggest issue was the lack of normal data, but even with that limitation the project shows how effective data augmentation can be in such scenarios. In the model, there still seems to be a blurred line between normal and disease: it cannot confidently identify a normal X-ray until we provide it more data.

The model works well as a preliminary scan to flag problematic X-rays before experts even diagnose the patients. Convolutional neural networks are the preferred method for analyzing images, and the COVID-19 model does not suggest otherwise. The most effective way to increase our confidence and accuracy would be a built-in feedback channel like that of Zebra Medical Vision, a deep-learning imaging company that improves patient care by identifying patients’ risks of disease (Business Insider, 2019). Zebra maintains a constant feedback loop that lets developers make changes based on the algorithm’s results after deployment. Using this feedback method, we would be able to generate more data while the model is in use.

Throughout this project, we have covered the issue of COVID-19, data cleaning, preprocessing, convolutional neural networks, and evaluating results. This model is by no means ready for deployment and would need further scientific replication, but it is an excellent demonstration of what machine learning can do during stressful times. Our belief was that we could train a model to detect COVID-19 with high accuracy using a CNN and data augmentation, and even though the model is far from deployment, using only 169 image samples we achieved an accuracy of approximately 85%.

The model would need more tuning and greater consistency, because the evaluation graphs show spikes in performance. With adequate tuning and layers, we can find a stable model that can confidently classify COVID-19 as well as other diseases. Hopefully, more data on COVID-19 will become available, because as machine learning engineers we are taught to chase the data. As hospitals and governments release more information on COVID-19, we will be able to predict from X-ray images with increasing precision alongside doctors, analyzing an X-ray in a matter of seconds before a doctor provides a diagnosis. Deep-learning image recognition can be applied to a wide range of imaging tasks, and the COVID-19 datasets let us use machine learning to help frontline healthcare workers be more selective about who needs testing when supplies are limited.


References

Understanding CNN (Convolutional Neural Network)

Vincent Tatan - https://towardsdatascience.com/understanding-cnn-convolutional-neural-network-69fd626ee7d4

A Comprehensive Guide to Convolutional Neural Networks - the ELI5 Way

Sumit Saha - https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

Data Augmentation for Deep Learning

Alexandra Deis - https://towardsdatascience.com/data-augmentation-for-deep-learning-4fe21d1a4eb9

Data Augmentation: How to Use Deep Learning When You Have Limited Data

Arun Gandhi - https://nanonets.com/blog/data-augmentation-how-to-use-deep-learning-when-you-have-limited-data-part-2/

Understanding Hyperparameters Optimization in Deep Learning Models: Concepts and Tools

Jesus Rodriguez - https://towardsdatascience.com/understanding-hyperparameters-optimization-in-deep-learning-models-concepts-and-tools-357002a3338a

Chest X-ray and CT Scan for COVID-19 (Coronavirus)

Rony Kampalath - https://www.verywellhealth.com/medical-imaging-of-covid-19-4801178

COVID-19 Chest X-Ray

Bachir - https://www.kaggle.com/bachrr/covid-chest-xray

Keras and Convolutional Neural Networks (CNNs)

Adrian Rosebrock - https://www.pyimagesearch.com/2018/04/16/keras-and-convolutional-neural-networks-cnns/

COVID-19 Image Data Collection

https://deepai.org/publication/covid-19-image-data-collection

COVID-19 Xray Dataset (Train & Test Sets)

Wei Khoong - https://www.kaggle.com/khoongweihao/covid19-xray-dataset-train-test-sets

COVID-19 Basics

Harvard Health Publishing - https://www.health.harvard.edu/diseases-and-conditions/covid-19-basics

Zebra Medical Vision Joins Forces with Nuance to Bring More AI to Diagnostic Imaging

https://www.businesswire.com/news/home/20191122005492/en/Zebra-Medical-Vision-joins-forces-Nuance-bring

Let’s make this a reality :)


More articles by Kyle Tran
