How to handle zero-inflated count data with R and Python

Keith McNulty

One common challenge when modeling count outcomes, such as the number of absences in a year, is that the data contain too many zeros for a Poisson or negative binomial (NB) distribution to fit reliably. If this is the case, and if you believe there are underlying structural reasons for the excess zeros, you can split your model into two parts: a binomial logistic regression that models membership of the structural-zero group, and a Poisson or NB regression that models the counts for everyone else (which can still include some zeros). This is called a zero-inflated model. The performance package in R has a neat function, check_zeroinflation, to test your model for zero-inflation, and the pscl package offers zero-inflated models that are easy to implement. In Python, statsmodels also contains zero-inflated Poisson and NB models. #analytics #rstats #python #datascience #peopleanalytics #ai #technology #statistics
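To make the idea concrete, here is a minimal standard-library Python sketch, not the implementation used by performance, pscl, or statsmodels. It assumes an intercept-only model with no predictors, simulated data, and hypothetical helper names (`simulate_zip`, `zero_ratio`, `fit_zip_em`). It first compares the observed zeros with the zeros a plain Poisson would expect, then fits a zero-inflated Poisson with a simple EM loop.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_zip(n, pi_zero, lam, seed=0):
    """Simulate counts: a structural zero with probability pi_zero, else Poisson(lam)."""
    rng = random.Random(seed)
    return [0 if rng.random() < pi_zero else poisson_draw(rng, lam) for _ in range(n)]


def zero_ratio(counts):
    """Observed zeros divided by the zeros expected under an intercept-only Poisson."""
    n = len(counts)
    mean = sum(counts) / n
    expected_zeros = n * math.exp(-mean)
    observed_zeros = sum(1 for y in counts if y == 0)
    return observed_zeros / expected_zeros


def fit_zip_em(counts, n_iter=200):
    """Estimate (pi_zero, lam) of an intercept-only zero-inflated Poisson by EM.

    Assumes the data contain at least one non-zero count.
    """
    n = len(counts)
    n_zero = sum(1 for y in counts if y == 0)
    total = sum(counts)
    pi_zero, lam = 0.5, max(total / n, 1e-8)
    for _ in range(n_iter):
        # E-step: share of observed zeros attributed to the structural-zero group.
        w = pi_zero / (pi_zero + (1 - pi_zero) * math.exp(-lam))
        structural = n_zero * w
        # M-step: update the mixing weight and the Poisson mean of the count group.
        pi_zero = structural / n
        lam = total / (n - structural)
    return pi_zero, lam


# Simulated example (assumed values): 30% structural zeros, Poisson mean 2.5.
counts = simulate_zip(n=5000, pi_zero=0.3, lam=2.5, seed=42)
ratio = zero_ratio(counts)
pi_hat, lam_hat = fit_zip_em(counts)
print(f"observed / expected zeros under Poisson: {ratio:.2f}")
print(f"estimated pi_zero = {pi_hat:.3f}, estimated lambda = {lam_hat:.3f}")
```

A ratio well above 1 suggests the plain Poisson is underfitting zeros. With real data and predictors you would normally reach for a library instead, for example `ZeroInflatedPoisson` in `statsmodels.discrete.count_model`.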


Such an important issue. Was working today with Asmare Gelaw on a zero-inflated time series dataset - before switching our model up we are exploring whether the epoch selection (weekly, fortnightly, monthly) is producing the zeroes, or whether something is structurally different at one of our three sites, which has more zeroes than the others. Time-series work is so nuanced.

Just a little feedback: before using a zero-inflated count model, it's a good idea to take a step back and check whether the excess zeros are mostly produced by the standard count process. Is it just overdispersion that a negative binomial model can handle, or is the underlying data-generating process really bimodal? With the test at the top (check_zeroinflation) motivating your model choice, this reads as if your decision is data-driven rather than problem-driven.
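The distinction raised in this comment can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a formal test: the data are simulated from a negative binomial with no structural zeros, the NB parameters are estimated by simple method of moments, and the helper names (`expected_zero_share_poisson`, `expected_zero_share_nb`) are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def expected_zero_share_poisson(counts):
    """Share of zeros implied by a Poisson with the sample mean."""
    mean = sum(counts) / len(counts)
    return math.exp(-mean)


def expected_zero_share_nb(counts):
    """Share of zeros implied by an NB fitted by method of moments."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((y - mean) ** 2 for y in counts) / (n - 1)
    if var <= mean:
        # No overdispersion detected: fall back to the Poisson value.
        return math.exp(-mean)
    size = mean ** 2 / (var - mean)  # NB dispersion ("size") parameter
    return (size / (size + mean)) ** size


# Simulated overdispersed counts WITHOUT structural zeros (assumed values):
# a gamma-Poisson mixture, which is a negative binomial with size 1 and mean 2.
rng = random.Random(7)
size_true, mean_true = 1.0, 2.0
counts = [
    poisson_draw(rng, rng.gammavariate(size_true, mean_true / size_true))
    for _ in range(5000)
]

observed = sum(1 for y in counts if y == 0) / len(counts)
poisson_ratio = observed / expected_zero_share_poisson(counts)
nb_ratio = observed / expected_zero_share_nb(counts)
print(f"observed / expected zeros, Poisson: {poisson_ratio:.2f}")
print(f"observed / expected zeros, NB:      {nb_ratio:.2f}")
```

Here the Poisson comparison flags many more zeros than expected, while the NB comparison sits near 1. In that situation overdispersion alone accounts for the zeros, and a zero-inflated model would need to be justified by the problem rather than by the data.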

Thanks for sharing! I was actually trying to use the performance package earlier to dig into diagnostics for a Bayesian multivariate GLM (2 Gaussian DVs + 1 Bernoulli DV). I didn’t realize it had this function! It would have been handy in a recent project.


Timely post for me. Long-time follower here, first time reaching out. I have a predictive problem with the data situation you describe, and was considering XGBoost-based modeling, which seems like it may have intrinsic properties for dealing with this kind of data. Does this seem like a reasonable approach, or am I misunderstanding XGBoost, which I have only recently learned about?
