Using Machine Learning To Pick a Control Group
Why would you do a post-hoc analysis?
The gold standard for experimental design involves a test versus control model, where test and control subjects are drawn from the same sample, a priori.
However, this is not always feasible or practical. In continuing medical education, for instance, the aims of the program prevent us from arbitrarily excluding participants, so we end up comparing those who participated with those who did not. In healthcare delivery settings, we can sometimes only retrospectively analyze observations we could not control beforehand.
For those of us without budgets for time travel, there is machine learning.
In this experiment, I will train a model to select a control group based on 60 variables describing healthcare professionals who gave pneumococcal vaccinations in 2020 and 2021.
Pneumococcal Vaccination Rates: Normally distributed?
Physicians' pneumococcal vaccination rates in 2020 and 2021 reveal a multimodal distribution in the absolute change in procedure volume. (*If you want to know how I got to this data set, scroll to the bottom.) In contrast, total procedures and number of patients seen both follow log-normal distributions.
So while randomly selecting physicians might yield a group with a similar distribution of practice sizes, there is a chance they differ in some other respect that contributes to the variation in immunization rates. Our hypothesis is that random selection won't reliably produce groups with similar immunization behavior, and that we'll need ML or other tools to do so.
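If you want to eyeball these shapes yourself, here is a minimal sketch (it assumes the puf_imm table built at the bottom of this article and the Tot_Srvcs column of the aggregate provider file loaded below; that column name is my assumption from the PUF layout):

library(ggplot2)
library(dplyr)

# multimodal: absolute change in pneumococcal vaccination volume, 2020 to 2021
ggplot(puf_imm, aes(x = svc_chg)) +
  geom_histogram(bins = 100) +
  labs(x = "Change in pneumococcal vaccinations, 2020 to 2021")

# roughly log-normal: total services billed (column name assumed)
puf_agg |>
  filter(Tot_Srvcs > 0) |>
  ggplot(aes(x = Tot_Srvcs)) +
  geom_histogram(bins = 100) +
  scale_x_log10() +
  labs(x = "Total services billed (log scale)")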
Selecting Random Controls
Let's say we wanted to know the impact of an education campaign aimed at 4,000 physicians, where only 15% participated. If we compare participants to non-participants, there is a chance their vaccination rates already differed before the intervention. Randomly selecting a control group from the non-participants could then exaggerate the observed effect on the test group, suggesting a greater impact than a larger sample would reveal, or it could just as easily lead to the conclusion that the intervention had no positive benefit at all.
If we look at the graph below, we'll see that the test group indeed has a different distribution of vaccination rates before the intervention from the randomly selected controls. Changing the size of the control pool doesn't guarantee an improvement in how the distributions match.
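Here is a rough sketch of that random draw (this is approximately how the r1 object used in the KS test later could be built; the 600-per-group sizes are illustrative, and the construction is my assumption):

library(dplyr)
library(ggplot2)

set.seed(1)

# randomly label 600 providers "test", then draw 600 random "controls"
# from everyone else
r1 <- puf_imm |>
  mutate(grp = "candidate") |>
  mutate(grp = replace(grp, sample(n(), 600), "test")) |>
  mutate(grp = replace(grp, sample(which(grp == "candidate"), 600), "control")) |>
  filter(grp != "candidate")

# compare pre-intervention (2020) vaccination volumes by group
# (zero counts drop off the log scale)
ggplot(r1, aes(x = srv20, colour = grp)) +
  geom_density() +
  scale_x_log10()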
Preparing the Data to Train the Model
So let's try using ML, instead. We will draw on a dataset with 60+ variables which outline how many patients each healthcare professional saw, the demographics of those patients, the types of procedures and reimbursement for those claims, their specialty, and some other setting details. This is from the Medicare Part B Provider aggregate data from 2021. This data really should be supplemented by some data from the Compare dataset, but for today, let's just stick with this.
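For reference, this is roughly how I pull that file in (the file name below is my assumption for the 2021 by-provider release; adjust it to whatever you downloaded from CMS):

library(dplyr)

# Medicare Physician & Other Practitioners - by Provider, 2021
# (file name assumed)
puf_agg <- read.csv('MUP_PHY_R23_P05_V10_D21_Prov.csv')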
We're going to train the model on a randomly selected group of 600 participants that we'll call the "test" group, and we'll label everyone else "control". We'll then ask the trained model to label the "controls" with a predicted group, and anyone the model predicts to be "test" will be selected as the actual "control" group. If the model is good, there should be very little difference in the distribution of immunization rates between the groups, even though immunization rate is not one of the variables in the model.
There are a couple of machine learning strategies to choose from, and today I'm going to use k-Nearest Neighbors (KNN), which is appropriate for numeric-rich data. So I'm going to take the data on physician practice demographics (puf_agg) and filter it down to just those that did pneumococcal immunizations in 2020 or 2021 (puf_imm), remove geographic data that is not in the 50 states, select a random 600 providers to label "test" and label everyone else "control". I'll then convert all the other key data to numeric formats, including specialty, and remove any data we don't want to include.
Finally, I'll set NAs to zero, which, for our purposes, isn't consequential.
library(dplyr)   # for filter(), mutate(), select()

train <- puf_agg |>
  # keep only providers who billed pneumococcal vaccinations
  filter(Rndrng_NPI %in% puf_imm$Rndrng_NPI) |>
  # drop territories and other non-state FIPS codes
  filter(!Rndrng_Prvdr_State_FIPS %in% c('9A', '9B', '9C', '9D', '9E')) |>
  mutate(
    # randomly label 600 providers "test"; everyone else is "control"
    class = as.factor(
      ifelse(row_number() %in% sample(row_number(), 600),
             "test", "control")),
    # convert the remaining key variables to numeric
    Rndrng_Prvdr_State_FIPS = as.numeric(Rndrng_Prvdr_State_FIPS),
    Rndrng_Prvdr_Zip5 = as.numeric(Rndrng_Prvdr_Zip5),
    Rndrng_Prvdr_Gndr = ifelse(Rndrng_Prvdr_Gndr == "M", 1, 2),
    Rndrng_Prvdr_Type = as.numeric(
      factor(Rndrng_Prvdr_Type,
             levels = unique(Rndrng_Prvdr_Type)))) |>
  # drop columns we don't want the model to use
  select(
    -c(Rndrng_Prvdr_Crdntls,
       Rndrng_Prvdr_State_Abrvtn,
       Rndrng_Prvdr_RUCA_Desc,
       Rndrng_Prvdr_Mdcr_Prtcptg_Ind,
       Drug_Sprsn_Ind,
       Med_Sprsn_Ind))

# set remaining NAs to zero
train[is.na(train)] <- 0
Training and Running the Model
Now we have three things to pass to knn(): the training set, which is the full set of tests and controls, unlabeled; the test set, which is, confusingly, my pool of controls (because I'm trying to predict "tests" within this group); and the column in the training set that holds the labels (the classification data). You'll notice that I don't include the NPI (which carries no meaning) in the training or test dataframes passed to knn(). Finally, I choose a k value of 2, meaning only the 2 nearest neighbors are considered when classifying each point. You may find, with a dataset other than this one, that you need to change k, especially if you are not getting many positive "test" predictions. If you are still getting too few, you may need to reduce the number of variables in your training set, or simply re-run the model.
I can use the resulting predictions to add a new variable, predicted_class, to the set of candidate controls. I can then filter for anyone whose class is not "test" but whose predicted class is "test." These are the providers most similar to the "test" group, and thus they can be used as a very close "control," given how similar they are on the other parameters.
library(class)   # provides knn()

# create the set of control candidates from the data
controls <- train |> filter(class == 'control')

# train and run the model: knn() returns a predicted class
# for each row of the test set (here, the control candidates)
knn_model <-
  knn(train = train |> select(-c(Rndrng_NPI, class)),
      test  = controls |> select(-c(Rndrng_NPI, class)),
      cl    = train$class,
      k     = 2)

# attach the predictions to the candidate controls
controls$predicted_class <- knn_model

# view the results
controls |>
  group_by(class, predicted_class) |>
  summarise(n())

# class     predicted_class     n()
# -------------------------------------
# control   control           40366
# control   test                308
Since this model runs over 40,000+ HCPs, it takes a few minutes. In the end, it identifies just over 300 controls.
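Pulling the matched control group out is then just a filter on the predicted class (a small sketch continuing from the objects above; matched_controls is simply my name for the result):

# keep only the candidates the model thinks look like the test group
matched_controls <- controls |>
  filter(class == 'control', predicted_class == 'test')

nrow(matched_controls)
# [1] 308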
We should then observe that the "pre" (2020) immunization rate distributions are similar between the "test" group and the ML-selected "control" group. And, because we didn't actually run any intervention on the randomly selected "test" group, we shouldn't see any significant difference between the test and control groups in terms of immunization rate change.
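Here is a sketch of how that comparison set (the r2 that shows up in the KS test below) might be assembled from the objects above; the exact joins are an assumption on my part:

library(dplyr)

# test group plus ML-selected controls, joined back to the vaccination counts
r2 <- train |>
  filter(class == 'test') |>
  select(Rndrng_NPI, class) |>
  bind_rows(matched_controls |> select(Rndrng_NPI, class)) |>
  left_join(puf_imm, by = "Rndrng_NPI")

# quick look at the pre (2020) rates and the 2020-to-2021 change by group
r2 |>
  group_by(class) |>
  summarise(median_pre = median(srv20), mean_change = mean(svc_chg))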
Model Validation
If the visual comparison of the distributions isn't convincing, there is a test, the Kolmogorov-Smirnov (KS) test, which I'm told can compare two distributions. If the p-value of this test is low, it suggests the two distributions differ.
In this example, the p-value for r1, the data based on random selection, varied widely across runs, of course, dropping as low as p = 0.2, while r2, the data based on ML selection, got p = 0.9. Neither comparison showed a significant difference between distributions (p ≤ 0.05), but the ML-selected control group matched the test group better than the random one did.
# KS test: test vs. randomly selected controls (2020, pre-period)
ks.test(
  r1$srv20[r1$grp == 'test'],
  r1$srv20[r1$grp == 'control'])

# KS test: test vs. ML-selected controls (2020, pre-period)
ks.test(
  r2$srv20[r2$class == 'test'],
  r2$srv20[r2$class == 'control'])
The real test, of course, is whether or not this gives cleaner discernment of a change in the post-intervention data by selecting an appropriate control group. In situations where I have many controls to select from, I do find this method persuasive for at least arguing that there was some methodology to the selection of controls and that any post-intervention difference between the tests and controls should be attributed to the intervention.
KNN isn't the only ML method out there, and ML of any flavor may not always be appropriate, but it's something I'm testing to see whether it improves the quality of insights into the effectiveness of education or communication.
Vaccination Data
If you made it this far, you were curious about the vaccine data. The Public Use File ("PUF") data not only produces the aggregate file with the parameters we were using to build our model, but it also has data by HCPCS and NPI. I looked at the appropriate HCPCS codes for pneumococcal vaccine administration for 2020 and 2021, added up those claims for each NPI, and subtracted the 2020 total from the 2021 total. These files are enormous, so the data takes a few minutes to assemble.
# 2020 Pneumococcal Vaccines
puf_imm20 <- read.csv('MUP_PHY_R22_P05_V10_D20_Prov_Svc.csv') |>
  filter(HCPCS_Cd %in% c('90670', '90671', '90677', '90732')) |>
  group_by(Rndrng_NPI) |>
  summarise(srv20 = sum(if_else(is.na(Tot_Srvcs), 0, Tot_Srvcs)))

# 2021 Pneumococcal Vaccines
puf_imm21 <- read.csv('MUP_PHY_R23_P05_V10_D21_Prov_Svc.csv') |>
  filter(HCPCS_Cd %in% c('90670', '90671', '90677', '90732')) |>
  group_by(Rndrng_NPI) |>
  summarise(srv21 = sum(if_else(is.na(Tot_Srvcs), 0, Tot_Srvcs)))

# Combined, with the year-over-year difference
puf_imm <- puf_imm20 |>
  full_join(puf_imm21, by = c("Rndrng_NPI" = "Rndrng_NPI")) |>
  mutate(
    srv21 = if_else(is.na(srv21), 0, srv21),
    srv20 = if_else(is.na(srv20), 0, srv20),
    svc_chg = srv21 - srv20) |>
  select(Rndrng_NPI, srv21, srv20, svc_chg)