How do you test something that can't give the same answer twice?
You're testing a new AI-powered summary feature. You send it the same document three times. Run one: clean, three-sentence summary. Accurate. Well-structured. Run two: same document, slightly different phrasing. Still accurate. Run three: one sentence shorter. Technically correct. A little curt.
All three look… right.
You mark it as passed and move on, with the quiet, unsettled feeling of someone who just ate food that tasted fine but looked slightly wrong.
AI broke the one rule testing was built on.
For most of your career, testing has had one job. Compare the output to what you expected. Match means pass. No match means fail. Ship or fix. Next ticket.
Then LLMs showed up in your product. The same prompt started giving five different answers, all acceptable to a user, none identical to each other. The old framework quietly broke, and most teams haven't fully admitted it yet.
💡 Did you know? The State of AI in Software Testing 2026 report found that most QA teams are now testing at least one AI-powered feature, but very few have a formal evaluation process for non-deterministic outputs. The gap is where bugs live.
The Three Ways Teams Are Currently Handling This
Most teams don't start with a strategy. They start with a deadline. And when you're moving fast, you default to one of three instincts. None of them are wrong; they're stepping stones. The question is what to build on top of them once you've hit their limits.
What actually works on Monday morning
Naming the monster is the first step. But you need a way to tame it before your next release. Here are three practices that move you away from vibes-based testing and toward actual engineering.
1. Build rubrics, not assertions
Stop asking "does the output equal X?" Start asking "does the output meet our bar?" For a support chatbot, you might score on factuality, relevance, tone, and safety. Use a 1–5 scale for each.
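Here's a minimal sketch of what that looks like as a test, assuming the four dimensions and thresholds above; the names and cut-offs are illustrative, and how you produce the scores (human review or a judge model) is up to you.

```python
# Rubric-based check: pass only if every dimension clears its minimum,
# instead of asserting an exact output string.
from dataclasses import dataclass

# Illustrative minimums on a 1-5 scale; tune these to your product's bar.
RUBRIC = {
    "factuality": 4,
    "relevance": 4,
    "tone": 3,
    "safety": 5,
}

@dataclass
class RubricScores:
    factuality: int
    relevance: int
    tone: int
    safety: int

def meets_bar(scores: RubricScores) -> bool:
    """True only if every rubric dimension meets its minimum score."""
    return all(
        getattr(scores, dimension) >= minimum
        for dimension, minimum in RUBRIC.items()
    )

# A reply that is accurate and safe but a little curt still passes.
assert meets_bar(RubricScores(factuality=5, relevance=4, tone=3, safety=5))
```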
2. Use an AI Judge (and audit it)
You can use a stronger model to score your outputs at scale; it's fast and cheap. But the judge has its own biases: it tends to favor longer answers, or answers that match its own style. So audit a sample of its verdicts with humans, or those biases quietly become your quality bar.
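A rough sketch of the pattern, under some assumptions: `call_judge_model` is a stand-in for whichever provider or client you actually use, and the judge prompt and 10% human-audit sample are illustrative defaults, not prescriptions.

```python
# LLM-as-judge scoring plus a basic human audit step.
import json
import random

JUDGE_PROMPT = """Score the ANSWER on a 1-5 scale for each rubric dimension.
Return JSON only: {{"factuality": n, "relevance": n, "tone": n, "safety": n}}

QUESTION: {question}
ANSWER: {answer}
"""

def call_judge_model(prompt: str) -> str:
    """Placeholder: replace with a call to your stronger judge model."""
    raise NotImplementedError

def judge(question: str, answer: str) -> dict:
    """Ask the judge model for rubric scores and parse its JSON verdict."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)

def sample_for_human_audit(verdicts: list[dict], rate: float = 0.1) -> list[dict]:
    """Route a random slice of judge verdicts to humans, to catch judge bias
    such as favoring longer answers or its own phrasing."""
    k = max(1, int(len(verdicts) * rate))
    return random.sample(verdicts, k)
```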
3. Measure the spread
One run tells you nothing about a probabilistic system. You need to run the same prompt 10 or 20 times and look at the variance. High variance on a high-stakes question is a production incident waiting to happen.
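A minimal sketch of that loop, assuming you already have a `generate` call for your model and a `score` function (for example, the rubric or judge above); the 20-run default and stability threshold are illustrative.

```python
# Spread check: score repeated runs of the same prompt and flag high variance.
from statistics import mean, pstdev

def measure_spread(prompt: str, generate, score, runs: int = 20) -> dict:
    """Run the prompt `runs` times and summarize the score distribution."""
    scores = [score(generate(prompt)) for _ in range(runs)]
    return {"mean": mean(scores), "stdev": pstdev(scores), "runs": runs}

def is_stable(result: dict, max_stdev: float = 0.5) -> bool:
    """Fail high-stakes prompts whose scores swing too widely between runs."""
    return result["stdev"] <= max_stdev
```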
🧩 Scaling this: If you’re thinking about how to operationalize these ideas without adding more complexity, the AI-Driven Low Code Automation for QA (Apr 29) session shows how teams are doing it in real workflows.
The Room Where This Is Being Figured Out
If these strategies feel like a lot to handle alone, you should probably be in the room where the experts are talking. We are bringing the smartest people in testing together to solve exactly this at Breakpoint 2026.
Brittany Stewart shows you what she actually changed. Not theory. Real workflow. Keith Klain from KPMG asks the question your whole team is thinking but nobody says out loud. Avinash Ahuja from NVIDIA will make you rethink what your most valuable asset actually is. And Ashley Hunsberger talks about something everyone else skips: how to make the job feel worth doing again.
It's free, virtual, and the kind of thing your team will thank you for forwarding. Share it with your team and register now!
What's Hot at BrowserStack? 🔥
We hope you enjoyed reading this edition of The Quality Loop as much as we did curating it for you. Let us know in the comments below what you want to see in upcoming editions. We're waiting! ❤️