COVID-19 Unsupervised Learning Model

Principal Component Analysis (PCA) is a machine learning technique that is typically used to reduce feature dimensionality. It is a powerful unsupervised learning algorithm that can both speed up model creation and produce results that are often less sensitive to input noise that is not representative of a true signal. Hence models utilizing PCA can be more stable in real-world conditions.

Background

A typical use case for PCA is a large set of image data. PCA decomposes the images into their most meaningful components by creating a new basis, or frame of reference, to describe those images. In the basis space created by PCA, the images maintain a surprising degree of fidelity to the original data even when subject to a dramatic reduction in dimensionality. This is because the new basis is mathematically "maximally-explainable": each successive basis vector captures as much of the remaining variance in the data as possible, so the first few dimensions carry most of the information.

The result is new data that already has its most meaningful components isolated. Training a neural network on this data is faster than training on the raw data and will typically result in a model that provides more accurate classification.

What may be less well known is that PCA can also be extremely useful in working with and modeling geospatial time series data - such as the spread of a virus over different regions.

In this use case, each geospatial region is analogous to a single image file. The time series at that location is analogous to the contents of the image. Just as PCA can identify meaningful features across a set of images, it can identify the most important aspects of the time series.

Applying PCA to COVID-19

The COVID Tracking Project provides an API to detailed state and county level data. Python was used for running the PCA and for data visualization.

I've based this model on state-level COVID-19 deaths per million population rather than positive case counts, as testing has been both insufficient and inconsistent across geographies. Even death data is likely significantly understated, but I suspect any geographic disparity in the quality of that data is more limited. Analysis is restricted to states with at least 5 cumulative deaths and 0.5 deaths per million as of 3/24 (two weeks ago) and a total population of more than 4 million. This filter results in 19 qualifying states. Each time series is then aligned to begin on the day the state reaches 1 death per million.
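The filtering and alignment steps can be sketched in a few lines of plain Python. This is only an illustration of the rules described above, not the code used for the article: the function names, argument names, and the sample numbers are all assumptions.

```python
def qualifies(cum_deaths, population,
              min_deaths=5, min_rate=0.5, min_population=4_000_000):
    """Apply the three filters to a state's figures as of the cutoff date."""
    rate_per_million = cum_deaths / population * 1_000_000
    return (cum_deaths >= min_deaths
            and rate_per_million >= min_rate
            and population > min_population)


def align_series(deaths_per_million, threshold=1.0):
    """Drop the days before the series first reaches the threshold."""
    for day, value in enumerate(deaths_per_million):
        if value >= threshold:
            return deaths_per_million[day:]
    return []  # the series never reached the threshold


# Made-up cumulative deaths per million, for illustration only.
example = [0.2, 0.6, 1.1, 1.9, 3.4]
aligned = align_series(example)
print(aligned)
```

After alignment, day 0 of every series is the first day on which that state was at or above 1 death per million, so the curves can be compared on a common clock.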

Cumulative state deaths by day since 1 death per million.

Given this, not all the time series are of the same length. While there are techniques to address missing data, I'm restricting this analysis to those 12 days in which we have full data for all 19 States.
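As a rough sketch of that restriction, the aligned series can be cut to the length of the shortest one, which leaves a rectangular table with one row per state. The helper name and the sample values below are hypothetical.

```python
def truncate_to_common_length(aligned_by_state):
    """Keep only the leading days for which every state has data."""
    n_days = min(len(series) for series in aligned_by_state.values())
    truncated = {state: series[:n_days]
                 for state, series in aligned_by_state.items()}
    return truncated, n_days


# Made-up aligned series of different lengths, for illustration only.
sample = {
    "A": [1.0, 1.5, 2.4, 3.9, 6.0],
    "B": [1.2, 1.6, 2.0],
    "C": [1.1, 2.0, 3.5, 5.8],
}
truncated, n_days = truncate_to_common_length(sample)
print(n_days, truncated)
```

With the real data this step yields the 19 x 12 table that is passed to PCA.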

This provides a predictive aspect to the PCA model, as New York and Washington have more history than most other States and we now know that they provide examples of outcomes at the extremes of observable deaths (in the US). So a similarity in the PCA decomposition between another state and New York or Washington may be predictive of whether that state will see future deaths unfold in a pattern analogous to one of those observed extremes.

It should be noted that this is a very small amount of data to run through PCA: 19 time series each with only 12 data points. As a result the first two PCA components explain 99.5% of the variability seen across the data set.

In other words, the observed deaths in 19 States can be represented, capturing over 99% of the variance across the data set, by three time series:

  1. The first time series is just the mean across states - the expected outcome without any geographic variance in the death rate. It's mildly exponential.
  2. The first PCA component is more strongly exponential and represents an acceleration against that mean performance. States with a positive coefficient against PCA1 are experiencing death rates climbing more quickly than typical. States with a negative PCA1 coefficient are seeing a more muted death rate.
  3. While the magnitude of the second PCA component is quite small, it is an interesting feature which possibly represents a second-order delayed impact from social distancing. Given the scale, however, a small negative value in PCA1 is currently much more meaningful than an offsetting PCA2 coefficient. We may see this component become increasingly important if social distancing effects start driving even more disparate impacts across differing States.
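The decomposition into a mean series plus two components can be sketched with NumPy's singular value decomposition. This is a minimal illustration under stated assumptions, not the article's actual code: the function name is hypothetical, and the input is synthetic exponential curves standing in for the real 19 x 12 table.

```python
import numpy as np


def pca_decompose(X, n_components=2):
    """PCA via SVD. X has one row per state and one column per day."""
    mean_series = X.mean(axis=0)              # the mean across states
    centered = X - mean_series                # deviations from that mean
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]            # each row is a series over days
    coefficients = centered @ components.T    # one (PCA1, PCA2) pair per state
    explained = (S ** 2) / np.sum(S ** 2)     # explained variance ratios
    return mean_series, components, coefficients, explained[:n_components]


# Synthetic stand-in data: 19 curves with different growth rates over 12 days.
rng = np.random.default_rng(0)
days = np.arange(12)
growth = rng.uniform(0.1, 0.3, size=19)
X = np.exp(np.outer(growth, days))

mean_series, components, coefficients, explained = pca_decompose(X)
print(explained)
```

One caveat when reading the coefficients: the sign of each component returned by an SVD is arbitrary, so a component may need to be flipped (together with its coefficients) before "positive PCA1" can be read as "accelerating against the mean".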

We can look at where each state sits in an xy-plane defined by its coefficients on the first and second PCA components.

Scatter plot of first and second PCA coefficients for each state.

It's clear from this how much of an outlier New York is, sitting some distance from all the other states in this basis space. Four states that are somewhat closer to New York and also fall in the danger zone of having a positive x-coefficient are: Michigan, Louisiana, Massachusetts, and New Jersey.

Those states have a component in their COVID-19 death time series which is accelerating rather than decelerating relative to the cohort of other states. Indeed, we can see that they are the worst performing states in the days following the 12-day training set.

Time series highlighting the four states most similar to New York in the PCA basis.

What's particularly interesting is that for much of the training period New Jersey actually recorded fewer deaths than average. On day 7, Colorado had a higher death rate than New Jersey. Yet the PCA decomposition was able to isolate the importance of the seemingly small late-stage uptick in New Jersey without being provided any intrinsic knowledge about the exponential nature of viral spread.

Dimensionality, Data Efficiency, and Big Data

While we are working with an extremely small volume of data here, one of the main strengths of PCA is its application to models driven by big data. In this example we've compressed our data set from 228 values to only 74 values while retaining 99.5% of the variance.

  • The input training data consists of 19 States and 12 observed days.
  • The model consists of three time series each with 12 days (3x12=36) and a set of two basis coordinates for each of the 19 States (2x19=38).
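The value counts above, and the reconstruction of the data from the compressed model, can be checked with a short sketch. The function name is hypothetical and the data is synthetic: it is deliberately built from a base curve plus two shapes, so two components reproduce it almost exactly. Real data would leave a small residual.

```python
import numpy as np

n_states, n_days, n_components = 19, 12, 2

# Values in the raw table versus values stored by the PCA model.
original_values = n_states * n_days                      # 19 x 12
model_values = ((n_components + 1) * n_days              # mean + 2 components
                + n_components * n_states)               # 2 coordinates per state


def fit_and_reconstruct(X, n_components):
    """Fit PCA via SVD and rebuild X from the mean, components, and coefficients."""
    mean_series = X.mean(axis=0)
    centered = X - mean_series
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    coefficients = centered @ components.T
    return mean_series + coefficients @ components


# Synthetic data: every row is a base curve plus a mix of two fixed shapes.
rng = np.random.default_rng(1)
days = np.arange(n_days)
base = 1.0 + 0.5 * days
shapes = np.vstack([np.exp(0.3 * days), days ** 2])
weights = rng.uniform(0.0, 1.0, size=(n_states, 2))
X = base + weights @ shapes

X_hat = fit_and_reconstruct(X, n_components)
max_error = np.max(np.abs(X - X_hat))
print(original_values, model_values, max_error)
```

The saving is modest at this scale, but the model size grows with the sum of the two dimensions (days plus states) rather than their product, which is where the benefit for larger data sets comes from.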

Here we can see how well both the exponential growth in New York and the more linear growth in Washington or California are represented within the same model framework.

Comparison between original data set and PCA reconstruction.

This is a relatively simple exercise with limited data, hence the 99.5% explained variance ratio from only 2 components, but I hope it provides some insight into the power of PCA and how it may be used beyond feature similarity and dimensionality reduction.
