Using Machine Learning To Pick a Control Group
Why would you do a post-hoc analysis?
The gold standard for experimental design involves a test versus control model, where test and control subjects are drawn from the same sample, a priori.
However, this is not always feasible or practical. In continuing medical education, for instance, the aims of the program prevent us from arbitrarily excluding participants, so we end up comparing those who participated with those who did not. In healthcare delivery settings, we can sometimes only retrospectively analyze observations we could not control beforehand.
For those of us without budgets for time travel, there is machine learning.
In this experiment, I will train a model to select a control group based on 60 variables describing healthcare professionals who gave pneumococcal vaccinations in 2020 and 2021.
Pneumococcal Vaccination Rates: Normally distributed?
Physicians' pneumococcal vaccination rates in 2020 and 2021 reveal a multimodal distribution in the absolute change in procedure volume. (*If you want to know how I got to this data set, scroll to the bottom.) In contrast, total procedures and number of patients seen both follow log-normal distributions.
So while randomly selecting physicians might yield a group with a similar distribution of practice sizes, there is a chance they differ in some other respect that contributes to the variation in immunization rates. Our hypothesis is that random selection won't reliably produce groups with similar immunization behavior, and that we'll need ML or other tools to do so.
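If you want to eyeball these shapes yourself, here is a minimal sketch (it assumes the puf_imm table built at the bottom of this article and the Tot_Srvcs column of the aggregate provider file loaded below; that column name is my assumption from the PUF layout):

library(ggplot2)
library(dplyr)

# multimodal: absolute change in pneumococcal vaccination volume, 2020 to 2021
ggplot(puf_imm, aes(x = svc_chg)) +
  geom_histogram(bins = 100) +
  labs(x = "Change in pneumococcal vaccinations, 2020 to 2021")

# roughly log-normal: total services billed (column name assumed)
puf_agg |>
  filter(Tot_Srvcs > 0) |>
  ggplot(aes(x = Tot_Srvcs)) +
  geom_histogram(bins = 100) +
  scale_x_log10() +
  labs(x = "Total services billed (log scale)")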
Selecting Random Controls
Let's say we wanted to know the impact of an education campaign aimed at 4,000 physicians, where only 15% participated. If we compare participants to non-participants, there is a chance their vaccination rates already differed before the intervention. Randomly selecting a control group from the non-participants could then exaggerate the observed effect on the test group, suggesting a greater impact than a larger sample would reveal, or it could just as easily lead to the conclusion that the intervention had no positive benefit at all.
If we look at the graph below, we'll see that the test group indeed has a different distribution of vaccination rates before the intervention from the randomly selected controls. Changing the size of the control pool doesn't guarantee an improvement in how the distributions match.
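Here is a rough sketch of that random draw (this is approximately how the r1 object used in the KS test later could be built; the 600-per-group sizes are illustrative, and the construction is my assumption):

library(dplyr)
library(ggplot2)

set.seed(1)

# randomly label 600 providers "test", then draw 600 random "controls"
# from everyone else
r1 <- puf_imm |>
  mutate(grp = "candidate") |>
  mutate(grp = replace(grp, sample(n(), 600), "test")) |>
  mutate(grp = replace(grp, sample(which(grp == "candidate"), 600), "control")) |>
  filter(grp != "candidate")

# compare pre-intervention (2020) vaccination volumes by group
# (zero counts drop off the log scale)
ggplot(r1, aes(x = srv20, colour = grp)) +
  geom_density() +
  scale_x_log10()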
Preparing the Data to Train the Model
So let's try using ML, instead. We will draw on a dataset with 60+ variables which outline how many patients each healthcare professional saw, the demographics of those patients, the types of procedures and reimbursement for those claims, their specialty, and some other setting details. This is from the Medicare Part B Provider aggregate data from 2021. This data really should be supplemented by some data from the Compare dataset, but for today, let's just stick with this.
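For reference, this is roughly how I pull that file in (the file name below is my assumption for the 2021 by-provider release; adjust it to whatever you downloaded from CMS):

library(dplyr)

# Medicare Physician & Other Practitioners - by Provider, 2021
# (file name assumed)
puf_agg <- read.csv('MUP_PHY_R23_P05_V10_D21_Prov.csv')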
We're going to train the model on a randomly selected group of 600 participants that we'll call the "test" group, and we'll label everyone else "control". We'll then ask the trained model to label the "controls" with a predicted group, and anyone the model predicts to be "test" will be selected as the actual "control" group. If the model is good, there should be very little difference in the distribution of immunization rates between the groups, even though immunization rate is not one of the variables in the model.
There are a couple of machine learning strategies to choose from, and today I'm going to use k-Nearest Neighbors (KNN), which is appropriate for numeric-rich data. So I'm going to take the data on physician practice demographics (puf_agg) and filter it down to just those that did pneumococcal immunizations in 2020 or 2021 (puf_imm), remove geographic data that is not in the 50 states, select a random 600 providers to label "test" and label everyone else "control". I'll then convert all the other key data to numeric formats, including specialty, and remove any data we don't want to include.
Finally, I'll set NAs to zero, which, for our purposes, isn't consequential.
library(dplyr)   # for filter(), mutate(), select()

train <- puf_agg |>
  # keep only providers who billed pneumococcal vaccinations
  filter(Rndrng_NPI %in% puf_imm$Rndrng_NPI) |>
  # drop territories and other non-state FIPS codes
  filter(!Rndrng_Prvdr_State_FIPS %in% c('9A', '9B', '9C', '9D', '9E')) |>
  mutate(
    # randomly label 600 providers "test"; everyone else is "control"
    class = as.factor(
      ifelse(row_number() %in% sample(row_number(), 600),
             "test", "control")),
    # convert the remaining key variables to numeric
    Rndrng_Prvdr_State_FIPS = as.numeric(Rndrng_Prvdr_State_FIPS),
    Rndrng_Prvdr_Zip5 = as.numeric(Rndrng_Prvdr_Zip5),
    Rndrng_Prvdr_Gndr = ifelse(Rndrng_Prvdr_Gndr == "M", 1, 2),
    Rndrng_Prvdr_Type = as.numeric(
      factor(Rndrng_Prvdr_Type,
             levels = unique(Rndrng_Prvdr_Type)))) |>
  # drop columns we don't want the model to use
  select(
    -c(Rndrng_Prvdr_Crdntls,
       Rndrng_Prvdr_State_Abrvtn,
       Rndrng_Prvdr_RUCA_Desc,
       Rndrng_Prvdr_Mdcr_Prtcptg_Ind,
       Drug_Sprsn_Ind,
       Med_Sprsn_Ind))

# set remaining NAs to zero
train[is.na(train)] <- 0
Training and Running the Model
Now we have three things to pass to knn(): the training set, which is the full set of tests and controls, unlabeled; the test set, which is, confusingly, my pool of controls (because I'm trying to predict "tests" within this group); and the column in the training set that holds the labels (the classification data). You'll notice that I don't include the NPI (which carries no meaning) in the training or test dataframes passed to knn(). Finally, I choose a k value of 2, meaning only the 2 nearest neighbors are considered when classifying each point. You may find, with a dataset other than this one, that you need to change k, especially if you are not getting many positive "test" predictions. If you are still getting too few, you may need to reduce the number of variables in your training set, or simply re-run the model.
I can use the resulting predictions to add a new variable, predicted_class, to the set of candidate controls. I can then filter for anyone whose class is not "test" but whose predicted class is "test." These are the providers most similar to the "test" group, and thus they can be used as a very close "control," given how similar they are on the other parameters.
library(class)   # provides knn()

# create the set of control candidates from the data
controls <- train |> filter(class == 'control')

# train and run the model: knn() returns a predicted class
# for each row of the test set (here, the control candidates)
knn_model <-
  knn(train = train |> select(-c(Rndrng_NPI, class)),
      test  = controls |> select(-c(Rndrng_NPI, class)),
      cl    = train$class,
      k     = 2)

# attach the predictions to the candidate controls
controls$predicted_class <- knn_model

# view the results
controls |>
  group_by(class, predicted_class) |>
  summarise(n())

# class     predicted_class     n()
# -------------------------------------
# control   control           40366
# control   test                308
Since this model runs over 40,000+ HCPs, it takes a few minutes. In the end, it identifies just over 300 controls.
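Pulling the matched control group out is then just a filter on the predicted class (a small sketch continuing from the objects above; matched_controls is simply my name for the result):

# keep only the candidates the model thinks look like the test group
matched_controls <- controls |>
  filter(class == 'control', predicted_class == 'test')

nrow(matched_controls)
# [1] 308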
We should then observe that the "pre" (2020) immunization rate distributions are similar between the "test" group and the ML-selected "control" group. And, because we didn't actually run any intervention on the randomly selected "test" group, we shouldn't see any significant difference between the test and control groups in terms of immunization rate change.
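Here is a sketch of how that comparison set (the r2 that shows up in the KS test below) might be assembled from the objects above; the exact joins are an assumption on my part:

library(dplyr)

# test group plus ML-selected controls, joined back to the vaccination counts
r2 <- train |>
  filter(class == 'test') |>
  select(Rndrng_NPI, class) |>
  bind_rows(matched_controls |> select(Rndrng_NPI, class)) |>
  left_join(puf_imm, by = "Rndrng_NPI")

# quick look at the pre (2020) rates and the 2020-to-2021 change by group
r2 |>
  group_by(class) |>
  summarise(median_pre = median(srv20), mean_change = mean(svc_chg))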
Model Validation
If the visual comparison of the distributions isn't convincing, there is a test, the Kolmogorov-Smirnov (KS) test, which I'm told can compare two distributions. If the p-value of this test is low, it suggests the two distributions differ.
In this example, the p-value for r1, the data based on random selection, varied widely across runs, of course, dropping as low as p = 0.2, while r2, the data based on ML selection, got p = 0.9. Neither comparison showed a significant difference between distributions (p ≤ 0.05), but the ML-selected control group matched the test group better than the random one did.
# KS test: test vs. randomly selected controls (2020, pre-period)
ks.test(
  r1$srv20[r1$grp == 'test'],
  r1$srv20[r1$grp == 'control'])

# KS test: test vs. ML-selected controls (2020, pre-period)
ks.test(
  r2$srv20[r2$class == 'test'],
  r2$srv20[r2$class == 'control'])
The real test, of course, is whether or not this gives cleaner discernment of a change in the post-intervention data by selecting an appropriate control group. In situations where I have many controls to select from, I do find this method persuasive for at least arguing that there was some methodology to the selection of controls and that any post-intervention difference between the tests and controls should be attributed to the intervention.
KNN isn't the only ML method out there, and ML of any flavor may not always be appropriate, but it's something I'm testing to see whether it improves the quality of insights into the effectiveness of education or communication.
Vaccination Data
If you made it this far, you were curious about the vaccine data. The Public Use File ("PUF") data not only produces the aggregate file with the parameters we were using to build our model, but it also has data by HCPCS and NPI. I looked at the appropriate HCPCS codes for pneumococcal vaccine administration for 2020 and 2021, added up those claims for each NPI, and subtracted the 2020 total from the 2021 total. These files are enormous, so the data takes a few minutes to assemble.
# 2020 Pneumococcal Vaccines
puf_imm20 <- read.csv('MUP_PHY_R22_P05_V10_D20_Prov_Svc.csv') |>
  filter(HCPCS_Cd %in% c('90670', '90671', '90677', '90732')) |>
  group_by(Rndrng_NPI) |>
  summarise(srv20 = sum(if_else(is.na(Tot_Srvcs), 0, Tot_Srvcs)))

# 2021 Pneumococcal Vaccines
puf_imm21 <- read.csv('MUP_PHY_R23_P05_V10_D21_Prov_Svc.csv') |>
  filter(HCPCS_Cd %in% c('90670', '90671', '90677', '90732')) |>
  group_by(Rndrng_NPI) |>
  summarise(srv21 = sum(if_else(is.na(Tot_Srvcs), 0, Tot_Srvcs)))

# Combined, with the year-over-year difference
puf_imm <- puf_imm20 |>
  full_join(puf_imm21, by = c("Rndrng_NPI" = "Rndrng_NPI")) |>
  mutate(
    srv21 = if_else(is.na(srv21), 0, srv21),
    srv20 = if_else(is.na(srv20), 0, srv20),
    svc_chg = srv21 - srv20) |>
  select(Rndrng_NPI, srv21, srv20, svc_chg)