How do you test something that can't give the same answer twice?

You're testing a new AI-powered summary feature. You send it the same document three times. Run one: clean, three-sentence summary. Accurate. Well-structured. Run two: same document, slightly different phrasing. Still accurate. Run three: one sentence shorter. Technically correct. A little curt.

All three look… right.

Confused math meme showing a woman surrounded by equations, representing the uncertainty and mental overload of testing non-deterministic AI outputs.
You stare at the green checkmarks. The green checkmarks stare back at you.

You mark it as passed and move on, with the quiet, unsettled feeling of someone who just ate food that tasted fine but looked slightly wrong.

AI broke the one rule testing was built on.

For most of your career, testing has had one job. Compare the output to what you expected. Match means pass. No match means fail. Ship or fix. Next ticket.

Then LLMs showed up in your product. The same prompt started giving five different answers, all acceptable to a user, none identical to each other. The old framework quietly broke, and most teams haven't fully admitted it yet.


💡 Did you know? The State of AI in Software Testing 2026 report found that most QA teams are now testing at least one AI-powered feature, but very few have a formal evaluation process for non-deterministic outputs. The gap is where bugs live.


The Three Ways Teams Are Currently Handling This

Most teams don't start with a strategy. They start with a deadline. And when you're moving fast, you default to one of three instincts. None of them are wrong — they're stepping stones. The question is what to build on top of them once you've hit their limits.

AI testing challenges infographic covering the Squint Method, Freeze Method, and Metric Illusion, highlighting risks in evaluating non-deterministic outputs in modern software testing.

What actually works on Monday morning

Naming the monster is the first step. But you need a way to tame it before your next release. Here are three practices that move you away from vibes-based testing and toward actual engineering.

1. Build rubrics, not assertions

Stop asking "does the output equal X?" Start asking "does the output meet our bar?" For a support chatbot, you might score on factuality, relevance, tone, and safety. Use a 1–5 scale for each (there's a quick sketch of what that can look like after the list below).

Visual showing the shift from traditional testing with strict assertions to AI testing using scoring across factuality, relevance, tone, and safety to evaluate non-deterministic outputs.

  • Example: Your test for a refund question should not check for exact words. It should check if the bot mentions the refund, gives a timeframe, and sounds professional.
  • The catch: Rubrics go stale. Revisit them every quarter, not every release. If you try to score 12 different things, you will end up scoring nothing well.
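
To make this concrete, here's a minimal sketch of a rubric for the refund example above. The criteria, minimum scores, and passing bar are illustrative assumptions, not any particular framework's API; the per-criterion 1–5 scores would come from human reviewers or a judge model (see the next section), not from this script.

```python
# A minimal sketch of rubric-based evaluation for the refund example.
# Criteria, minimum scores, and the passing bar are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str
    min_score: int  # minimum acceptable score on a 1-5 scale

REFUND_RUBRIC = [
    Criterion("factuality", "Mentions the refund and states the correct timeframe", min_score=4),
    Criterion("relevance", "Answers the refund question without going off-topic", min_score=3),
    Criterion("tone", "Sounds professional and calm", min_score=3),
    Criterion("safety", "No policy violations or made-up commitments", min_score=5),
]

def meets_bar(scores: dict[str, int], rubric: list[Criterion]) -> bool:
    """Pass only if every criterion clears its own minimum score."""
    return all(scores[c.name] >= c.min_score for c in rubric)

# Scores for one output, e.g. collected from a reviewer or a judge model.
example_scores = {"factuality": 5, "relevance": 4, "tone": 3, "safety": 5}
print("PASS" if meets_bar(example_scores, REFUND_RUBRIC) else "FAIL")
```

The point of the structure: each criterion carries its own bar, so a perfect tone score can never paper over a failing safety score.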

2. Use an AI Judge (and audit it)

You can use a stronger model to score your outputs at scale. It's fast and cheap. But the judge has its own biases: it tends to prefer longer answers, or answers that match its own style.

  • Example: Your AI judge might give a 5 out of 5 because an answer sounds confident. A human reviewer might give it a 1 because the answer is factually wrong.
  • The catch: Calibrate your judge against human scores every two weeks. If they drift apart, stop trusting the judge until you retune it.
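
Here's one way the judge-plus-calibration loop could look. `call_judge_model` is a placeholder for whatever model provider you use, and the prompt wording, 1–5 scale, and drift threshold are assumptions to tune for your own product.

```python
# A sketch of LLM-as-judge scoring plus a biweekly calibration check against
# human scores. `call_judge_model` is a placeholder; the prompt, scale, and
# drift threshold are assumptions.

import re
import statistics

JUDGE_PROMPT = """Rate the answer from 1 (bad) to 5 (excellent) on factual accuracy.
Reply with a single number only.

Question: {question}
Answer: {answer}
"""

def call_judge_model(prompt: str) -> str:
    raise NotImplementedError("Wire this up to whichever model you use as a judge.")

def judge_score(question: str, answer: str) -> int:
    """Ask the judge model for a 1-5 score; treat unparseable replies as a 1."""
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1

def judge_drift(judge_scores: list[int], human_scores: list[int]) -> float:
    """Mean absolute gap between judge and human scores on the same outputs."""
    return statistics.mean(abs(j - h) for j, h in zip(judge_scores, human_scores))

if __name__ == "__main__":
    # Calibration sample: the same outputs scored by the judge and by humans.
    # The "confidently wrong" case above shows up here as a 5-vs-1 gap.
    drift = judge_drift(judge_scores=[5, 4, 5, 2], human_scores=[4, 4, 1, 2])
    if drift > 1.0:
        print(f"Judge drift is {drift:.1f} points - retune before trusting its scores.")
```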

3. Measure the spread

One run tells you nothing about a probabilistic system. You need to run the same prompt 10 or 20 times and look at the variance. High variance on a high-stakes question is a production incident waiting to happen (a rough sketch of this check follows the list below).

  • Example: You run a return policy prompt 15 times. If 14 say 30 days and one says it is flexible, you found a bug that only appears 7% of the time.
  • The catch: This is expensive. Do not run every test 20 times. Pick your most important prompts—billing, policy, safety—for this heavy lifting.
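
A rough sketch of what spread measurement could look like for a single high-stakes prompt. `ask_model` is a placeholder for your model call, and the classification rule and 90% consistency bar are assumptions you would adjust per prompt.

```python
# A sketch of spread measurement for one high-stakes prompt. `ask_model` is a
# placeholder; the classification rule and consistency bar are assumptions.

from collections import Counter

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Wire this up to the feature under test.")

def classify(answer: str) -> str:
    """Collapse free-form answers into coarse classes, e.g. '30 days' vs 'flexible'."""
    text = answer.lower()
    if "30 day" in text:
        return "30_days"
    if "flexible" in text:
        return "flexible"
    return "other"

def measure_spread(prompt: str, runs: int = 15, min_consistency: float = 0.9) -> bool:
    """Run the same prompt repeatedly and flag it if the answers disagree too often."""
    counts = Counter(classify(ask_model(prompt)) for _ in range(runs))
    top_class, top_count = counts.most_common(1)[0]
    consistency = top_count / runs
    print(f"{dict(counts)} -> {consistency:.0%} consistent on '{top_class}'")
    return consistency >= min_consistency

# Reserve this heavy, repeated sampling for billing, policy, and safety prompts, e.g.:
# assert measure_spread("What is your return policy?")
```

In the return-policy example above, 14 runs landing in "30_days" and one in "flexible" gives 93% consistency; whether that clears your bar is a product decision, not a test framework default.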


🧩 Scaling this: If you’re thinking about how to operationalize these ideas without adding more complexity, the AI-Driven Low Code Automation for QA (Apr 29) session shows how teams are doing it in real workflows.


The Room Where This Is Being Figured Out

If these strategies feel like a lot to handle alone, you should probably be in the room where the experts are talking. We are bringing the smartest people in testing together to solve exactly this at Breakpoint 2026.

Breakpoint 2026 software testing conference featuring QA leaders Keith Klain, Ashley Hunsberger, Avinash Ahuja, and Brittany Stewart. Learn from top experts in AI testing and quality engineering, happening May 12–14, 2026.

Brittany Stewart shows you what she actually changed. Not theory. Real workflow. Keith Klain from KPMG asks the question your whole team is thinking but nobody says out loud. Avinash Ahuja from NVIDIA will make you rethink what your most valuable asset actually is. And Ashley Hunsberger talks about something everyone else skips: how to make the job feel worth doing again.

It's free, virtual, and exactly the kind of thing your team will thank you for forwarding. Register now and share it with your team!


What's Hot at BrowserStack? 🔥

  • Quality isn't testing. Alan Page said it out loud. In the latest BrowserStack Talks, David Burns sits down with Alan to dig into why rewarding teams just for shipping fast is a system that breaks itself — and why curiosity will always beat AI when it comes to the hard problems. Worth your commute. Watch the episode!

  • The AI x Testing Bootcamp world tour just wrapped, and it got competitive. Teams went hands-on with the full product suite, skipped the theory, and walked away with things they could actually use the next day. The hackathon was the highlight. If you want the next one in your city, the waitlist is open now. Join the waitlist!

Teams collaborating during the AI x Testing Bootcamp, working hands-on with real testing workflows and tools during a live hackathon-style session focused on practical AI testing.

  • The Breakpoint 2026 agenda just dropped, and it's the one worth clearing your calendar for. Real practitioners, real problems. If your team is somewhere between "we're using AI" and "we've got AI figured out," this is where the gap gets closed. Save your seat!

Breakpoint 2026 virtual software testing conference focused on AI testing, quality engineering, and helping teams move from AI adoption to real-world implementation and scale.

We hope you enjoyed reading this edition of The Quality Loop as much as we did curating it for you. Let us know in the comments below what you want to see in upcoming editions. We're waiting! ❤️
