How to Know When to Trust Your Models and Your Experts: Bayesian Machine Learning - Part 2

Introduction

In my recent article on Bayesian ML, I explored the problem of interpolation versus extrapolation in Frequentist ML and how Bayesian ML techniques can provide insight into how certain a model is in its predictions. As promised in Part 1, this follow-up article covers more of the relative strengths of Bayesian ML compared to the more broadly used Frequentist approach. To keep the articles from getting too long, I’ve decided to discuss the weaknesses in a Part 3 article. The goal is to help business leaders and ML practitioners decide whether Bayesian ML is the right approach for their teams and their careers.

Uncertainty On Top of Uncertainty - A Real-World Example

The idea of having humans supporting or reviewing model outputs is something everyone has likely heard of in 2025, but when exactly should this be happening? When should we feel comfortable allowing models to automate decisions and processes, and when should we fall back on our human experts to review more complex cases? To illustrate this, let’s consider an example where this comes up quite often: identity fraud detection.

Detecting identity fraud is an interesting and rewarding modeling scenario. In online bootcamps and on Kaggle, fraud detection seems pretty straightforward and is almost always presented as a supervised classification problem. That is, we have a dataset where cases are labeled as “fraud” or “not fraud,” and the model is trained to find the patterns in the data that allow it to label, or classify, other cases as either “fraud” or “not fraud.” This is directionally correct, at least most of the time, but the reality of modeling and detecting fraud in industry is quite a bit more nuanced.

First of all, any time you have consumers applying for credit or depository accounts, there are regulations that must be considered in the models used to automate or aid decisions. Many of these regulations pertain to fairness and transparency in lending practices (more on this in a moment), but there are also protective regulations dictating how exactly fraud can be documented. In the case of identity fraud, it might be more appropriate to say that there are some difficult restrictions on how fraud, or the suspicion of fraud, can be documented in data or communicated by a model.

Financial institutions in many countries are not allowed to explicitly label a case as fraud, even after thorough and specialized investigation. This is because legally at this stage the fraud is only alleged, and in countries like the US, citizens are considered innocent until proven guilty. Therefore, when looking through data on fraud investigation results, you’ll likely see more ambiguous labels, such as “Cleared,” “Cleared - No Investigation,” “Suspicious,” or “Highly Suspicious.” To add more uncertainty on top of uncertainty, fraud detection models or products are not allowed to return concrete determinations like “fraud” or “not fraud.”

This restriction on outputs is handled in a variety of ways by different teams and professionals. In my experience with detecting identity fraud, my models have been configured to return a score indicating how strongly the model suspects fraud is taking place in the context of a new account application. This way, we’re not returning a hard determination, but we are still providing useful guidance that allows the financial institution to make an informed decision about whether they should approve the given application.

At first, I was frustrated by these limitations, but then I realized this was a golden opportunity to create a more expressive, and ultimately more informative, model. If the model outputs were just “fraud” or “not fraud” labels, there would be no way for the model to communicate the level of certainty in a given prediction, which is a significant weakness given all the uncertainty inherent to this modeling scenario.

Configuring the models to provide a score, rather than a deterministic label, was extremely easy, as nearly all models used for binary classification arrive at a prediction by estimating the probability that the given observation is a “positive case.” In the context of fraud detection, a “positive case” would be "fraud." Thus, by returning the model’s estimated probability that the given application was fraudulent, we could return our “suspicion score.” The closer the suspicion score is to 1, the more likely the applicant is committing identity fraud.
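As a minimal sketch of what this looks like in code, assuming a scikit-learn-style classifier and stand-in data (the real features and labels are, of course, far richer):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-in data: in practice X would hold engineered application
# features and y the (fuzzy) investigation outcomes mapped to 0/1.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = (X_train[:, 0] + rng.normal(size=1000) > 1.5).astype(int)

model = GradientBoostingClassifier().fit(X_train, y_train)

# predict_proba returns [P(negative), P(positive)] per row; the positive-class
# column is the estimated probability of fraud, i.e. the "suspicion score".
new_applications = rng.normal(size=(3, 5))
suspicion_scores = model.predict_proba(new_applications)[:, 1]
print(suspicion_scores)  # closer to 1 -> more suspicious
```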

To clarify, there were actually several models, each one specializing in a specific type of identity fraud, such as synthetic identity fraud, third-party fraud, and first-party fraud. It wasn’t as simple as just modeling “fraud.” Furthermore, each of these models was not a single model, but a stacked ensemble of models using diverse heuristics to capture the behaviors of the many different fraudster personas we identified. As you might imagine, there are a lot of people out there committing fraud with varying strategies and at varying scales. There’s really no way to capture all of that with any one model.
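To make the stacking idea concrete, here is a toy sketch; the fraud types, data, and meta-model here are purely illustrative, not the production design:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical sub-model suspicion scores for 500 past applications:
# one column per specialized model (synthetic, third-party, first-party).
sub_scores = rng.uniform(size=(500, 3))
labels = (sub_scores.max(axis=1) + rng.normal(scale=0.2, size=500) > 0.8).astype(int)

# The meta-model learns how to weigh the specialists' opinions into one final score.
meta_model = LogisticRegression().fit(sub_scores, labels)
final_suspicion = meta_model.predict_proba(sub_scores[:5])[:, 1]
print(final_suspicion)
```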

Even with a Frequentist model, this suspicion score, when properly interpreted, could go a long way toward indicating some degree of certainty. For example, if you had a very low suspicion score, p < 0.1, this seems to indicate the model is very certain that there is nothing fraudulent about the given application. On the other hand, if you had a very high suspicion score, p > 0.9, this would indicate that the model is very confident that this application is being submitted by a fraudster.

What about the edge cases? What if the suspicion score is something like 0.45 < p < 0.55? A suspicion score of 0.5 would indicate that the given application seems as likely to be fraudulent as not. In this case, it makes sense to have a member of the Fraud Investigations team give the applicant a more nuanced review. Well, it seems like we have all of our bases covered. Or do we?
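In code, this three-way routing might look something like the following; the cutoffs are the illustrative ones from above, not tuned production values:

```python
def route_by_score(suspicion_score: float) -> str:
    """Route an application based on a point suspicion score alone."""
    if suspicion_score < 0.1:
        return "auto_approve"          # model very confident nothing is wrong
    if suspicion_score > 0.9:
        return "decline_or_escalate"   # model very confident this is fraud
    if 0.45 < suspicion_score < 0.55:
        return "manual_review"         # model is on the fence; send to investigators
    return "standard_processing"       # everything in between follows normal policy

print(route_by_score(0.47))  # -> manual_review
```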

Using Prediction Variance to Know When Human Intervention is Needed

The predictions coming out of our hypothetical Frequentist model are estimated probabilities, but they are still point predictions. There are hundreds of millions of credit profiles in the US alone, and there are so many unique circumstances. Fraudsters are also extremely innovative and clever. We've also covered that current regulations make the process of labeling fraud a little fuzzy. How sure is the model about these probability estimates? With a Frequentist model, we don’t know. To gain some additional insight, we once again turn to Bayesian ML.

From the perspective of Bayesian ML, this probability estimate is itself a random variable with a probability distribution. Just as discussed in Part 1, instead of a point prediction, our model outputs an entire distribution, and we can use the variance of that distribution to quantify how certain the model seems to be in its prediction.
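As a minimal sketch, assume we already have posterior samples of logistic-regression weights from an MCMC library such as PyMC (simulated here as hypothetical stand-ins). Each posterior draw yields its own suspicion score, and together those scores form the prediction distribution:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Hypothetical posterior samples of model parameters (e.g. from MCMC):
# 2000 draws over 5 feature weights plus an intercept.
beta_samples = rng.normal(loc=0.8, scale=0.3, size=(2000, 5))
intercept_samples = rng.normal(loc=-2.0, scale=0.2, size=2000)

# One new application's features.
x_new = np.array([0.4, 1.2, -0.3, 0.9, 0.1])

# One suspicion score per posterior draw -> a full prediction distribution.
p_samples = sigmoid(intercept_samples + beta_samples @ x_new)

print("mean suspicion score:", p_samples.mean())
print("prediction std dev:  ", p_samples.std())
```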

[Figure: example suspicion-score distributions. Both low-mean and high-mean predictions can come with either low or high variance.]

In the above figure, we see several examples of what this might look like. We see that even in cases where the mean estimated probability of fraud is either very low or very high, the uncertainty can also be either very low or very high. To ensure each account application is given the proper amount of care and consideration, we can do some analysis to find a sensible threshold for prediction variance. If the variance of the output distribution is over this threshold, we can flag the application for human intervention. Obviously, there are multiple aspects of the situation to consider, such as the bandwidth of the Fraud Investigations team for manual reviews, but we now have much more information for determining when we should trust the model’s assessment, and when we should have a human intervene to review the case in question.
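Combining the two ideas, the routing logic can first check how sure the model is before acting on the score itself. The variance threshold below is a hypothetical placeholder you would set based on your own analysis and the review team's capacity:

```python
import numpy as np

VARIANCE_THRESHOLD = 0.02  # hypothetical; tune against Fraud Investigations bandwidth

def route_with_uncertainty(p_samples: np.ndarray) -> str:
    """Route an application using the full distribution of suspicion scores."""
    if p_samples.var() > VARIANCE_THRESHOLD:
        return "manual_review"  # model is too uncertain; a human should look at this
    mean_score = p_samples.mean()
    if mean_score > 0.9:
        return "decline_or_escalate"
    if mean_score < 0.1:
        return "auto_approve"
    return "standard_processing"

# e.g. using the p_samples array from the previous sketch:
# route_with_uncertainty(p_samples)
```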

Additional Transparency in Model Predictions

In our example above, we mentioned regulatory requirements for model transparency. Let’s say a financial institution uses one of these models to receive guidance on an application, the model returns a suspicion score of p = 0.87 +/- 0.012, and based on this, the institution denies the application. In compliance with the Fair Credit Reporting Act (FCRA), the institution must now provide an adverse action notice explaining why the application was denied. This is one of the many times in the model’s lifecycle when the model may come under regulatory scrutiny.

When this happens, all aspects of the model will be scrutinized; I could write an entire separate article on the legal and regulatory review processes of models in the fraud detection space. What I will say here is that, in my experience, the additional transparency and accountability around prediction certainty is greatly appreciated by legal teams, who have to think about these regulatory reviews. Regulators will have a much easier time accepting the efficacy of your model when they see that there are guardrails in place ensuring every case is handled as carefully as possible and that the model is trusted only when that trust is warranted.

This is especially relevant to anyone looking to apply ML/AI in high-stakes use cases, such as healthcare and finance, where we’re responsible for people’s lives and livelihoods. Regulatory requirements aside, there are certain outcomes that we as human beings care about deeply, and this extra transparency and accountability for prediction certainty is one of the ways we show our care for these situations. This is one of those things that makes a big difference to customers and shows them they can trust us.

Using the Output Distribution to Tailor Predictions for Specific Needs

The fact that the output of a Bayesian ML model is a distribution and not a single value allows us to use different measures of central tendency, or even entirely different measures from the distribution, to tailor model predictions to the needs of our clients. For example, revisiting our house price prediction example from Part 1, let’s say our client is a real estate investor, and let’s say this investor is a little more risk averse than most investors. No problem! Instead of basing our estimated home value on the expectation of the distribution, we could use the 45th, or maybe even the 40th, percentile of the predicted house price distribution to decrease the risk of overpaying and ending up underwater in a flip.
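For example, given posterior predictive samples of a home's sale price (simulated here as a stand-in), the conservative estimate is just a lower quantile of that distribution:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical posterior predictive samples of one home's sale price.
price_samples = rng.normal(loc=350_000, scale=25_000, size=5_000)

expected_price = price_samples.mean()
conservative_price = np.percentile(price_samples, 40)  # 40th percentile for a risk-averse buyer

print(f"expected value:       ${expected_price:,.0f}")
print(f"risk-averse estimate: ${conservative_price:,.0f}")
```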

Similarly, let’s say we have a financial institution experiencing a heightened occurrence of fraud lately, and they’d like our fraud detection product to be more sensitive so that fraudulent applications don't slip through the cracks. Again, rather than using the expectation of the prediction distribution, using the 55th or 60th percentile of the distribution will decrease the risk of false negatives, especially in cases where model uncertainty is higher, as quantiles are spread further apart in these situations. Taking advantage of the fact that we have access to an entire distribution gives us a tremendous amount of flexibility to tailor our solutions to the needs of the business, or our clients, without the need for ad hoc retrainings or customized one-off deployments for particular clients.
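The fraud-sensitivity version is the mirror image: taking an upper quantile of the suspicion-score distribution pushes uncertain cases toward review or decline, and the wider the posterior, the larger the push. A toy illustration with simulated score distributions:

```python
import numpy as np

def sensitive_score(p_samples: np.ndarray, percentile: float = 60) -> float:
    """Use an upper quantile of the suspicion-score distribution instead of its mean."""
    return float(np.percentile(p_samples, percentile))

rng = np.random.default_rng(3)
narrow = np.clip(rng.normal(0.40, 0.02, 5_000), 0, 1)  # confident prediction
wide = np.clip(rng.normal(0.40, 0.15, 5_000), 0, 1)    # uncertain prediction

# Same mean suspicion, but the uncertain case gets a noticeably higher sensitive score.
print(sensitive_score(narrow), sensitive_score(wide))
```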

Using Prior Probabilities to Insert Domain Expertise Into Your Models and Prevent Overfitting

Many AI/ML practitioners who are in the earlier stages of their careers tend to espouse a modeling philosophy of not imposing any heuristics on the model and allowing it to simply find the truth in the data. In the early years of my career, I was firmly in this camp. It’s an extremely appealing idea. It feels especially scientific and enlightened to acknowledge that everything you think you know about a domain might be wrong, so you should just keep an open mind and follow the data.

Now that I’ve been doing this for longer, I’ve built and deployed a lot more models. I’ve seen projects struggle as model performance seems to stubbornly resist reaching a minimum viable benchmark. I’ve seen client stakeholders test model outputs against intuition formed by years of experience before deciding they can trust what we’ve built. I’ve seen the long, messy grind of figuring out how to make sense of the raw data, how it gets refined as it goes from bronze to silver, gold, and platinum, and the assumptions and expertise that get us there. All of this has taught me something important: even if it were unequivocally the right thing to do, it’s impossible to keep the data and the models completely free of our assumptions, experience, and knowledge.

The most successful models I’ve built in my career have been the ones guided by domain expertise. If you have access to said expertise, you would be a fool not to use it. There are many ways we can use expertise in any modeling scenario, but there is one way that we can only access in the context of Bayesian ML: the Prior. For those who are less familiar, Bayes’ Theorem has four components: the Evidence, the Likelihood, the Prior, and the Posterior. The Prior is a probability mass or density function (depending on whether the variable in question is discrete or continuous, respectively) that expresses our prior knowledge or beliefs.
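For reference, these four pieces fit together as: Posterior = (Likelihood × Prior) / Evidence, or in symbols, P(θ | data) = P(data | θ) · P(θ) / P(data), where θ stands for the model parameters.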

In the context of Bayesian ML, specifically, the Prior represents prior knowledge or beliefs about the model parameters. When used properly, this is an extremely important and rich avenue for expressing domain expertise: it prevents the model from overfitting to the training data and guides it toward the patterns we need it to see. This can also help us create a successful model in situations where we have very limited training data. Frequentist models struggle terribly with small training sets, but an adept AI expert equipped with domain knowledge can use the Prior to guide the model through the training process, even in situations where the available data are extremely limited.
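As a minimal sketch of what this can look like in practice, here is a hypothetical PyMC model where analysts' domain knowledge is encoded directly in the priors. The feature, the prior values, and the data are all illustrative assumptions, not a production spec:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(11)

# Hypothetical small dataset: 200 applications, where the first feature is a
# "phone area code mismatch" flag that analysts believe strongly signals fraud.
X = np.column_stack([rng.integers(0, 2, 200), rng.normal(size=(200, 3))])
y = rng.binomial(1, 0.05 + 0.4 * X[:, 0])

with pm.Model() as fraud_model:
    # Informative prior: analysts expect this effect to be positive and sizeable.
    beta_mismatch = pm.Normal("beta_mismatch", mu=1.5, sigma=0.5)
    # Weakly informative priors where we have no strong opinion.
    beta_other = pm.Normal("beta_other", mu=0.0, sigma=1.0, shape=3)
    intercept = pm.Normal("intercept", mu=-3.0, sigma=1.0)  # prior belief: fraud is rare

    logits = intercept + beta_mismatch * X[:, 0] + pm.math.dot(X[:, 1:], beta_other)
    pm.Bernoulli("label", p=pm.math.sigmoid(logits), observed=y)

    idata = pm.sample()  # with this little data, the priors do a lot of the work
```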

However, using the Prior well requires both a deep understanding of the mathematics of Bayesian statistics and deep knowledge of the domain being modeled. In this way, access to the Prior can be thought of as a double-edged sword.

Stay tuned to explore this, and other potential weaknesses of Bayesian ML, in greater depth in Part 3.
