Estimation of Coronavirus Evolution in Brazil Using Regression Models
Image from https://solidthinking.com/product/compose/

As a curious Engineer (it may sound redundant...) and motivated by posts from other Engineers (Matheus Torquato and Livio Mariano), I became interested in estimating the evolution of Coronavirus in Brazil from the cases reported in the first 15 days after the disease began to escalate. To do so, I used Altair Compose as the Math tool to perform all calculations and to create all plots.

Exponential Growth

Since the disease spreads quickly, I decided to use only the first 15 days of data, counted from when patient #1 started to transmit it, knowing that the first phase of an epidemic follows exponential growth, which is described by the following equation:

$$N(t) = N_0 \cdot b^{t}$$

Where:

  • N(t) is the number of cases at a given time t
  • N0 is the initial number of cases (i.e., initial value)
  • b is the growth factor, i.e., the factor by which the number of cases is multiplied each day (it reflects how many people each sick person infects)

In practical terms, this means that the number of infected people is itself a factor in its own growth, which leads to those worrying curves with a sharp rise in the number of cases, whose shape looks like this:

[Figure: exponential growth curve of the number of cases over time]
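As a small illustration of the exponential behaviour described above, the sketch below evaluates N(t) = N0 · b^t in plain Python (the article itself used Altair Compose). The values of `n0` and `b` are hypothetical, chosen only to show the shape of the curve; they are not Brazilian data.

```python
def exponential_cases(n0, b, t):
    """Return N(t) = N0 * b**t, the modelled case count after t days."""
    return n0 * b ** t

# Hypothetical values, for illustration only (not real case data).
n0 = 1     # initial number of cases
b = 1.3    # daily growth factor: cases are multiplied by 1.3 each day

cases = [exponential_cases(n0, b, day) for day in range(16)]
for day, n in enumerate(cases):
    print(f"day {day:2d}: {n:8.1f}")

# The count itself drives its own growth: the ratio between consecutive
# days is constant and equal to the growth factor b.
ratios = [cases[i + 1] / cases[i] for i in range(len(cases) - 1)]
```

Because every ratio in `ratios` equals `b`, the daily increase gets larger as the count grows, which is what produces the steep curve.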

Since we have case data on a daily basis, the next step is to compute the growth factor. The method chosen for this was linear regression, a good starting point for statistical modelling due to its simplicity.

Linear Regression

Linear regression is a predictive analysis method that models the relationship between two variables by fitting a linear formula to the data set. The technique can also be used with relationships that are not intrinsically linear, provided the model is first transformed into a linear one. For exponential growth, taking the logarithm of both sides gives:

$$\log N(t) = \log N_0 + t \cdot \log b$$

Once the case counts were transformed to their logarithms, a first-degree polynomial was fitted to them by least squares to estimate the coefficients of the curve. The outcomes of this best fit are the values log(N0) and log(b), which complete the equation of the problem and allow the number of future cases to be predicted by evaluating the polynomial from day 16 onward.
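The log-transform-and-fit step can be sketched as follows. This is a minimal plain-Python version using the closed-form least-squares line, not the Altair Compose script behind the article; the function names are hypothetical, and the data is synthetic, generated from known parameters so the fit can be checked.

```python
import math

def fit_exponential(days, cases):
    """Least-squares fit of log N(t) = log N0 + t * log b.

    Returns the estimated pair (N0, b).
    """
    logs = [math.log(n) for n in cases]
    m = len(days)
    mean_t = sum(days) / m
    mean_y = sum(logs) / m
    sxx = sum((t - mean_t) ** 2 for t in days)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs))
    slope = sxy / sxx                     # estimate of log b
    intercept = mean_y - slope * mean_t   # estimate of log N0
    return math.exp(intercept), math.exp(slope)

# Synthetic data (assumption): 15 days generated from known parameters,
# NOT the reported Brazilian case counts.
true_n0, true_b = 2.0, 1.25
days = list(range(1, 16))
cases = [true_n0 * true_b ** t for t in days]

n0_hat, b_hat = fit_exponential(days, cases)

def predict(t):
    """Evaluate the fitted model N0 * b**t for day t."""
    return n0_hat * b_hat ** t

# Extrapolate from day 16 onward, as described in the text.
forecast = {t: predict(t) for t in range(16, 21)}
```

With real, noisy counts the recovered `n0_hat` and `b_hat` would only approximate the underlying values, and days with zero reported cases would have to be excluded before taking logarithms.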

The data representing the first 15 days may now be visualized along with the linear regression model and the actual data after day 16, as shown below:

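[Figure: reported cases for the first 15 days, the linear regression model, and the actual data after day 16, with the isolation measures marked]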

The isolation measures marked in the plot were primarily taken in the most affected regions (greater São Paulo and greater Rio de Janeiro).

Some Remarks

1) On top of the asymptomatic cases that push the reported statistics down worldwide, the data available in Brazil has another source of uncertainty: hospitals have been undertesting potential cases, unlike in South Korea or Germany. This strategy followed the guidelines of the Brazilian Government and is due to the lack of resources to perform as many tests as possible.

2) Communication may also be an issue in a country of continental size like Brazil, which leads to delays in reported cases and to diverging statistics between the Federal health institution and the State ones.

3) Lockdown and isolation are efficient measures to slow the rate of contagion. Regions like greater São Paulo and greater Rio de Janeiro (which together account for approximately 50% of all cases) started to shut down in the third week of the epidemic, and days later we could notice that, although the picture was still critical, the curves had become less steep.

4) Naturally, continuous exponential growth does not make sense in real life: as individuals infect others, the number of new cases eventually starts to decrease because there are fewer people left to transmit the virus to. The next step would be a different mathematical formulation that takes this factor into account, such as logistic regression.

Logistic Regression

Logistic regression is another formulation of regression analysis that assumes a sigmoid (S-shaped) relationship between two variables, such as the curve below:

[Figure: sigmoid (S-shaped) curve]

In its classical form its output is binary (in this case, the event is infection by Covid-19), and the logistic function behind it has been widely used in epidemiological models, where it describes the number of cases over time. The logistic function is described by the following equation:

$$N(t) = \frac{N_{max}}{1 + e^{-k (t - t_0)}}$$

Where:

  • N(t) is the number of cases at a given time t
  • Nmax is the maximum number of cases
  • t0 is the day of the sigmoid's midpoint (when N(t) reaches half of Nmax)
  • k is the logistic growth rate

From the equation above, we can infer that as days pass, the actual number of cases approaches the maximum number of cases.
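The behaviour of the logistic function can be checked numerically with a short sketch. The parameter values below are hypothetical and serve only to show the midpoint and the saturation at Nmax; they are not fitted to any data.

```python
import math

def logistic_cases(t, n_max, k, t0):
    """Logistic function N(t) = Nmax / (1 + exp(-k * (t - t0)))."""
    return n_max / (1.0 + math.exp(-k * (t - t0)))

# Hypothetical parameters, for illustration only.
n_max, k, t0 = 1000.0, 0.3, 20.0

midpoint = logistic_cases(t0, n_max, k, t0)   # exactly half of n_max
late = logistic_cases(100, n_max, k, t0)      # long after the midpoint

# The curve rises monotonically and flattens as it approaches n_max.
curve = [logistic_cases(t, n_max, k, t0) for t in range(0, 61)]
```

At `t = t0` the exponent is zero, so the function returns `n_max / 2`; for large `t` the exponential term vanishes and the value approaches `n_max`.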

Using the least-squares method again to estimate the optimal parameters, we notice that the logistic model gives a more realistic representation of the epidemic:

[Figure: logistic model fitted to the reported cases]
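The article does not say which least-squares routine was used for the logistic fit, so the sketch below swaps in a deliberately simple approach: scan a list of candidate values for Nmax, linearize the model for each one through log(Nmax/N − 1) = −k·t + k·t0, fit that line by least squares, and keep the candidate with the smallest squared error. The function names, the candidate grid and the synthetic data are all assumptions made for illustration.

```python
import math

def logistic(t, n_max, k, t0):
    return n_max / (1.0 + math.exp(-k * (t - t0)))

def linear_fit(xs, ys):
    """Closed-form least-squares line; returns (slope, intercept)."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_logistic(days, cases, n_max_candidates):
    """Scan Nmax candidates; return the best (Nmax, k, t0) found."""
    best = None
    for n_max in n_max_candidates:
        if n_max <= max(cases):
            continue  # the transform needs N < Nmax for every point
        ys = [math.log(n_max / n - 1.0) for n in cases]
        slope, intercept = linear_fit(days, ys)
        k = -slope
        if k <= 0:
            continue  # not a growing curve
        t0 = intercept / k
        sse = sum((logistic(t, n_max, k, t0) - n) ** 2
                  for t, n in zip(days, cases))
        if best is None or sse < best[0]:
            best = (sse, n_max, k, t0)
    if best is None:
        raise ValueError("no candidate Nmax produced a valid fit")
    return best[1], best[2], best[3]

# Synthetic data (assumption): generated from known parameters,
# NOT the reported Brazilian case counts.
days = list(range(1, 26))
cases = [logistic(t, 5000.0, 0.25, 30.0) for t in days]

n_max_hat, k_hat, t0_hat = fit_logistic(days, cases, range(1000, 10001, 250))
```

A nonlinear least-squares solver would estimate all three parameters at once and is the more usual choice. The scan is only meant to make the idea visible: when all the data lies before the midpoint, as in this synthetic example, the estimate of Nmax is very sensitive to noise.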

According to the fitted equation, the peak in the number of cases will probably happen in less than 2 weeks, which is in line with other predictions posted on LinkedIn and reported by Brazilian news agencies. Let's stay positive, pull our weight and hope for the best. Unfortunately, because of the undertesting described above, it is unlikely that Brazil will have registered fewer than 7000 cases by that time, so this plot is perhaps accurate only from a qualitative standpoint, as an estimate of when the peak is going to occur.

Please feel free to contact me / comment so we can carry on with this discussion with other Math strategies to take further steps in this analysis - it's a good way to have fun in such difficult times for modern society. Thank you for reading this post and stay safe!

Hi Roberta, first of all, nice post. Could you please share a comparison between the estimate from March 28th and the most recent one? I'm curious too...

And here, the logistic regression for Spain and France :)

  • [Figure: logistic regression for Spain and France]

Well done. This is the right time to discuss this topic, while it's not too late. Science and Engineering are crucial in this scenario. The authorities need to act based on evidence and science, not guessing. The analysis you performed using linear and logistic regression is quite interesting for the initial stages of the transmission of an epidemic disease. If we are interested in analysing the entire duration of the COVID-19 spread, we should use epidemiological models such as SIR, SEIR or SEQIJR. The latter, for example, is able to model the evolution of six different categories of a population: susceptible (S), exposed (E), quarantined (Q), infectious (I), isolated (J) and recovered (R). This method is quite complex, as it takes into consideration variables such as the rate of inflow of susceptible individuals into a region or community through birth or migration, the rate of immunization of susceptible individuals, the rate of recovery of isolated individuals, etc. (the list of parameters is huge). This is what we (Centre for Data and Knowledge Integration for Health - CIDACS/Fiocruz Bahia and the Federal University of Bahia) have been doing in order to provide actionable insights to Brazilian authorities. The outputs of these models are being displayed on this dashboard: http://covid19br.org.

More articles by Roberta Varela