How do you test something that can't give the same answer twice?

You're testing a new AI-powered summary feature. You send it the same document three times. Run one: clean, three-sentence summary. Accurate. Well-structured. Run two: same document, slightly different phrasing. Still accurate. Run three: one sentence shorter. Technically correct. A little curt.

All three look… right.

Confused math meme showing a woman surrounded by equations, representing the uncertainty and mental overload of testing non-deterministic AI outputs.
You stare at the green checkmarks. The green checkmarks stare back at you.

You mark it as passed and move on, with the quiet, unsettled feeling of someone who just ate food that tasted fine but looked slightly wrong.

AI broke the one rule testing was built on.

For most of your career, testing has had one job. Compare the output to what you expected. Match means pass. No match means fail. Ship or fix. Next ticket.

Then LLMs showed up in your product. The same prompt started giving five different answers, all acceptable to a user, none identical to each other. The old framework quietly broke, and most teams haven't fully admitted it yet.


💡 Did you know? The State of AI in Software Testing 2026 report found that most QA teams are now testing at least one AI-powered feature, but very few have a formal evaluation process for non-deterministic outputs. The gap is where bugs live.


The Three Ways Teams Are Currently Handling This

Most teams don't start with a strategy. They start with a deadline. And when you're moving fast, you default to one of three instincts. None of them are wrong — they're stepping stones. The question is what to build on top of them once you've hit their limits.

AI testing challenges infographic covering the Squint Method, Freeze Method, and Metric Illusion, highlighting risks in evaluating non-deterministic outputs in modern software testing.

What actually works on Monday morning

Naming the monster is the first step. But you need a way to tame it before your next release. Here are three practices that move you away from vibes-based testing and toward actual engineering.

1. Build rubrics, not assertions

Stop asking "does the output equal X?" Start asking "does the output meet our bar?" For a support chatbot, you might score on factuality, relevance, tone, and safety. Use a 1–5 scale for each (there's a quick sketch of what that can look like after the list below).

Visual showing the shift from traditional testing with strict assertions to AI testing using scoring across factuality, relevance, tone, and safety to evaluate non-deterministic outputs.

  • Example: Your test for a refund question should not check for exact words. It should check if the bot mentions the refund, gives a timeframe, and sounds professional.
  • The catch: Rubrics go stale. Revisit them every quarter, not every release. If you try to score 12 different things, you will end up scoring nothing well.
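
To make this concrete, here's a minimal sketch of a rubric for the refund example above. The criteria, minimum scores, and passing bar are illustrative assumptions, not any particular framework's API; the per-criterion 1–5 scores would come from human reviewers or a judge model (see the next section), not from this script.

```python
# A minimal sketch of rubric-based evaluation for the refund example.
# Criteria, minimum scores, and the passing bar are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str
    min_score: int  # minimum acceptable score on a 1-5 scale

REFUND_RUBRIC = [
    Criterion("factuality", "Mentions the refund and states the correct timeframe", min_score=4),
    Criterion("relevance", "Answers the refund question without going off-topic", min_score=3),
    Criterion("tone", "Sounds professional and calm", min_score=3),
    Criterion("safety", "No policy violations or made-up commitments", min_score=5),
]

def meets_bar(scores: dict[str, int], rubric: list[Criterion]) -> bool:
    """Pass only if every criterion clears its own minimum score."""
    return all(scores[c.name] >= c.min_score for c in rubric)

# Scores for one output, e.g. collected from a reviewer or a judge model.
example_scores = {"factuality": 5, "relevance": 4, "tone": 3, "safety": 5}
print("PASS" if meets_bar(example_scores, REFUND_RUBRIC) else "FAIL")
```

The point of the structure: each criterion carries its own bar, so a perfect tone score can never paper over a failing safety score.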

2. Use an AI Judge (and audit it)

You can use a stronger model to score your outputs at scale. It's fast and cheap. But the judge has its own biases: it tends to prefer longer answers, or answers that match its own style.

  • Example: Your AI judge might give a 5 out of 5 because an answer sounds confident. A human reviewer might give it a 1 because the answer is factually wrong.
  • The catch: Calibrate your judge against human scores every two weeks. If they drift apart, stop trusting the judge until you retune it.
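
Here's one way the judge-plus-calibration loop could look. `call_judge_model` is a placeholder for whatever model provider you use, and the prompt wording, 1–5 scale, and drift threshold are assumptions to tune for your own product.

```python
# A sketch of LLM-as-judge scoring plus a biweekly calibration check against
# human scores. `call_judge_model` is a placeholder; the prompt, scale, and
# drift threshold are assumptions.

import re
import statistics

JUDGE_PROMPT = """Rate the answer from 1 (bad) to 5 (excellent) on factual accuracy.
Reply with a single number only.

Question: {question}
Answer: {answer}
"""

def call_judge_model(prompt: str) -> str:
    raise NotImplementedError("Wire this up to whichever model you use as a judge.")

def judge_score(question: str, answer: str) -> int:
    """Ask the judge model for a 1-5 score; treat unparseable replies as a 1."""
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1

def judge_drift(judge_scores: list[int], human_scores: list[int]) -> float:
    """Mean absolute gap between judge and human scores on the same outputs."""
    return statistics.mean(abs(j - h) for j, h in zip(judge_scores, human_scores))

if __name__ == "__main__":
    # Calibration sample: the same outputs scored by the judge and by humans.
    # The "confidently wrong" case above shows up here as a 5-vs-1 gap.
    drift = judge_drift(judge_scores=[5, 4, 5, 2], human_scores=[4, 4, 1, 2])
    if drift > 1.0:
        print(f"Judge drift is {drift:.1f} points - retune before trusting its scores.")
```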

3. Measure the spread

One run tells you nothing about a probabilistic system. You need to run the same prompt 10 or 20 times and look at the variance. High variance on a high-stakes question is a production incident waiting to happen (a rough sketch of this check follows the list below).

  • Example: You run a return policy prompt 15 times. If 14 say 30 days and one says it is flexible, you found a bug that only appears 7% of the time.
  • The catch: This is expensive. Do not run every test 20 times. Pick your most important prompts—billing, policy, safety—for this heavy lifting.
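
A rough sketch of what spread measurement could look like for a single high-stakes prompt. `ask_model` is a placeholder for your model call, and the classification rule and 90% consistency bar are assumptions you would adjust per prompt.

```python
# A sketch of spread measurement for one high-stakes prompt. `ask_model` is a
# placeholder; the classification rule and consistency bar are assumptions.

from collections import Counter

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Wire this up to the feature under test.")

def classify(answer: str) -> str:
    """Collapse free-form answers into coarse classes, e.g. '30 days' vs 'flexible'."""
    text = answer.lower()
    if "30 day" in text:
        return "30_days"
    if "flexible" in text:
        return "flexible"
    return "other"

def measure_spread(prompt: str, runs: int = 15, min_consistency: float = 0.9) -> bool:
    """Run the same prompt repeatedly and flag it if the answers disagree too often."""
    counts = Counter(classify(ask_model(prompt)) for _ in range(runs))
    top_class, top_count = counts.most_common(1)[0]
    consistency = top_count / runs
    print(f"{dict(counts)} -> {consistency:.0%} consistent on '{top_class}'")
    return consistency >= min_consistency

# Reserve this heavy, repeated sampling for billing, policy, and safety prompts, e.g.:
# assert measure_spread("What is your return policy?")
```

In the return-policy example above, 14 runs landing in "30_days" and one in "flexible" gives 93% consistency; whether that clears your bar is a product decision, not a test framework default.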


🧩 Scaling this: If you’re thinking about how to operationalize these ideas without adding more complexity, the AI-Driven Low Code Automation for QA (Apr 29) session shows how teams are doing it in real workflows.


The Room Where This Is Being Figured Out

If these strategies feel like a lot to handle alone, you should probably be in the room where the experts are talking. We are bringing the smartest people in testing together to solve exactly this at Breakpoint 2026.

Breakpoint 2026 software testing conference featuring QA leaders Keith Klain, Ashley Hunsberger, Avinash Ahuja, and Brittany Stewart. Learn from top experts in AI testing and quality engineering, happening May 12–14, 2026.

Brittany Stewart shows you what she actually changed. Not theory. Real workflow. Keith Klain from KPMG asks the question your whole team is thinking but nobody says out loud. Avinash Ahuja from NVIDIA will make you rethink what your most valuable asset actually is. And Ashley Hunsberger talks about something everyone else skips: how to make the job feel worth doing again.

It's free, virtual, and exactly the kind of thing your team will thank you for forwarding. Register now and share it with your team!


What's Hot at BrowserStack? 🔥

  • Quality isn't testing. Alan Page said it out loud. In the latest BrowserStack Talks, David Burns sits down with Alan to dig into why rewarding teams just for shipping fast is a system that breaks itself — and why curiosity will always beat AI when it comes to the hard problems. Worth your commute. Watch the episode!

  • The AI x Testing Bootcamp world tour just wrapped, and it got competitive. Teams went hands-on with the full product suite, skipped the theory, and walked away with things they could actually use the next day. The hackathon was the highlight. If you want the next one in your city, the waitlist is open now. Join the waitlist!

Teams collaborating during the AI x Testing Bootcamp, working hands-on with real testing workflows and tools during a live hackathon-style session focused on practical AI testing.

  • The Breakpoint 2026 agenda just dropped, and it's the one worth clearing your calendar for. Real practitioners, real problems. If your team is somewhere between "we're using AI" and "we've got AI figured out," this is where the gap gets closed. Save your seat!

Breakpoint 2026 virtual software testing conference focused on AI testing, quality engineering, and helping teams move from AI adoption to real-world implementation and scale.

We hope you enjoyed reading this edition of The Quality Loop as much as we did curating it for you. Let us know in the comments below what you want to see in upcoming editions. We're waiting! ❤️
