Automated LLM Testing Without Human Input

Summary

Automated LLM testing without human input is the practice of having large language models (LLMs) tested and evaluated by other AI agents or automated frameworks rather than by manual human review. It lets developers assess and improve LLM performance, accuracy, and reliability at scale, often using synthetic data and LLM-based judge models.

  • Build automated evaluators: Develop AI-powered tools that can judge the quality, correctness, and intent of LLM responses without needing human-labeled examples.
  • Simulate real users: Use persona-rich AI agents to mimic user interactions and test LLM behavior across different scenarios, helping uncover issues before product launch.
  • Iterate and fine-tune: Continuously improve LLM testing pipelines by generating synthetic test cases, addressing new faults, and refining models based on automated feedback.
Summarized by AI based on LinkedIn member posts

  • Shrey Shah

    AI @ Microsoft | I teach harness engineering | Cursor Ambassador | V0 Ambassador

    16,881 followers

    Test Automation for AI Agents? I didn't believe it until I read this paper. And it completely changed how I think about testing for AI-native products. Here's what blew my mind ↓

    Researchers built something called AgentA/B — an automated A/B testing framework where LLM agents replace human traffic. No users. No analytics delays. Just 1,000+ AI agents navigating real websites like Amazon.

    They:
    → Search like humans
    → Click with intention
    → Filter and purchase
    → All on live DOMs using Selenium + structured reasoning

    These aren't prompt-based toys. They're persona-rich, behavior-driven agents making real choices in real time.

    And the kicker? The results actually matched human behavior. Agents exposed to better UX:
    → Clicked more
    → Used filters more
    → Bought more
    → Spent more ($60.99 vs $55.14)

    One stat was even statistically significant. That's wild for synthetic traffic.

    Here's what this means for SDETs & AI agent devs:
    → You can now test UX before launch
    → Catch issues early, without waiting on live traffic
    → Simulate edge cases and underrepresented users
    → Reduce cost, risk, and iteration cycles

    It's a whole new dimension of test automation. Not just pass/fail, but: did the agent behave like a real user? AgentA/B shows how AI can augment A/B testing — not replace it.

    If you're working on AI agents, this paper is 100% worth the read. Let me know and I'll drop you the link.

    ♻️ Repost this if you're curious about where test automation is going next.

    PS: Want to stay updated on AI x Testing? Follow Shrey Shah to never miss a post.

    #AI #TestAutomation #GenerativeAI #LLM #AIAgents #SDET #LangChain #Selenium #ABTesting #AgentSimulation
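    A minimal sketch of the pattern described above, not the AgentA/B implementation: a persona-conditioned LLM picks the next action from a compact summary of a live DOM, and Selenium executes it. The call_llm() helper, the persona, the example URL, and the CSS selectors are illustrative assumptions.

```python
import json

from selenium import webdriver
from selenium.webdriver.common.by import By


def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its text reply."""
    raise NotImplementedError


PERSONA = "Budget-conscious shopper looking for wireless headphones under $60."

driver = webdriver.Chrome()
driver.get("https://www.example-shop.test/search?q=wireless+headphones")  # illustrative URL

# Give the agent a compact view of the page: visible product cards only.
cards = driver.find_elements(By.CSS_SELECTOR, "[data-testid='product-card']")  # assumed selector
page_summary = "\n".join(card.text[:200] for card in cards[:20])

decision = call_llm(
    f"You are simulating this user: {PERSONA}\n"
    f"Visible products:\n{page_summary}\n"
    'Reply as JSON with keys "action" ("click", "filter", or "stop") and "target".'
)
action = json.loads(decision)

# Execute the chosen action back on the live DOM.
if action["action"] == "click":
    driver.find_element(By.PARTIAL_LINK_TEXT, action["target"]).click()

driver.quit()
```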

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,028 followers

    Researchers from Meta have developed a "Self-Taught Evaluator" that can significantly improve the accuracy of large language models (LLMs) in judging the quality of AI-generated responses—without using any human-labeled data!

    So how do you create a Self-Taught Evaluator without using human-labeled data?

    1. Initialization
       - Start with a large set of human-written user instructions (e.g., from production systems).
       - Select an initial seed large language model (LLM).

    2. Instruction Selection
       - Categorize and filter the instructions using an LLM to select a balanced, challenging subset.
       - Focus on categories like reasoning, coding, etc.

    3. Response Pair Construction
       - For each selected instruction:
         - Generate a high-quality baseline response using the seed LLM.
         - Create a "noisy" version of the original instruction.
         - Generate a response to the noisy instruction, which will serve as a lower-quality response.
       - This creates synthetic preference pairs without human labeling.

    4. Judgment Annotation
       - Use the current LLM-as-a-Judge model to generate multiple reasoning traces and judgments for each example.
       - Apply rejection sampling to keep only correct judgments.
       - If no correct judgment is found, discard the example.

    5. Model Fine-tuning
       - Fine-tune the seed LLM on the collected synthetic judgments.
       - This creates an improved LLM-as-a-Judge model.

    6. Iterative Improvement
       - Repeat steps 4-5 multiple times, using the improved model from each iteration.
       - As the model improves, it should generate more correct judgments, creating a curriculum effect.

    7. Evaluation
       - Test the final model on benchmarks like RewardBench, MT-Bench, etc.
       - Optionally, use majority voting at inference time for improved performance.

    This approach allows the creation of a strong evaluator model without relying on costly human-labeled preference data, while still achieving competitive performance compared to models trained on human annotations.

    What are your thoughts on self-taught AI evaluators? How might this impact the future of AI development?
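    A condensed sketch of steps 3-5 above, assuming generate(), corrupt_instruction(), and judge() are placeholders for your own LLM calls; it mirrors the recipe in the post, not Meta's actual pipeline.

```python
import random


def generate(model, instruction):
    """Placeholder: sample one response from `model` for `instruction`."""
    raise NotImplementedError


def corrupt_instruction(model, instruction):
    """Placeholder: ask `model` for a subtly 'noisy' variant of `instruction`."""
    raise NotImplementedError


def judge(model, instruction, response_a, response_b):
    """Placeholder: return ('A' or 'B', reasoning_trace) from the judge model."""
    raise NotImplementedError


def build_judge_training_data(seed_model, judge_model, instructions, n_samples=8):
    records = []
    for inst in instructions:
        good = generate(seed_model, inst)                                  # baseline response
        bad = generate(seed_model, corrupt_instruction(seed_model, inst))  # lower-quality response
        # Randomize which side holds the preferred answer to dodge position bias.
        a, b, correct = (good, bad, "A") if random.random() < 0.5 else (bad, good, "B")
        traces = [judge(judge_model, inst, a, b) for _ in range(n_samples)]
        kept = [trace for verdict, trace in traces if verdict == correct]  # rejection sampling
        if kept:  # if no sampled judgment is correct, the example is discarded
            records.append({"instruction": inst, "response_a": a, "response_b": b,
                            "judgments": kept})
    return records

# Fine-tune the judge on `records`, then repeat with the improved judge (step 6).
```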

  • Imran Qureshi

    CTO & Chief AI Officer @ b.well Connected Health, former Clarify Health, Health Catalyst & Microsoft

    7,113 followers

    How do you test AI when AI isn't deterministic?

    At b.well, we ran into a problem many teams building AI face:
    👉 How do you reliably test AI answers?

    In traditional software, testing is straightforward. You pass in input → verify a deterministic output. But AI and LLMs don't work that way. Ask the same question twice and you might get:
    ● Different wording
    ● Different structure
    ● Different—but still correct—answers

    So classic assertion-based tests break down.

    Why string matching doesn't work

    One approach is to match on keywords (e.g., "LDL"). But that fails fast:
    ● One response says "LDL"
    ● Another says "low-density lipoprotein"

    Same meaning. Different text. Test fails.

    >> The solution: Use AI to test AI

    When we built the Health SDK for AI (used by customers like ChatGPT), we flipped the model. Instead of forcing deterministic checks, we:
    ● Wrote test plans
    ● Had an LLM execute the plan
    ● Asked another LLM to evaluate the results

    In other words:
    👉 AI evaluates whether AI behaved correctly.

    What that looks like in practice

    We give the LLM a structured plan like:
    ----
    Run this plan. Compare each output to expected results and generate a report card.
    1. Get patient summary
    2. Show all BP readings
    3. Show BP readings from 2024
    4. Export weight readings (last year) as CSV
    5. Show all HbA1c readings
    6. Show cholesterol results from 2024
    7. Get list of visits
    8. Search notes for "LDL"

    Expected results are defined semantically, not textually:
    ● "6 or more BP readings"
    ● "Patient summary mentions hyperlipidemia"
    ● "Visit exists on August 22, 2024"
    ● "LDL mentioned in Feb 27, 2015 note"
    ----
    (Note: This is a simplified test suite. The full test suite tests much more.)

    The evaluator LLM checks intent, correctness, and completeness — not exact wording.

    Why this matters
    ● Works with non-deterministic outputs
    ● Scales across different LLMs
    ● Can be automated via APIs and CI tools (GitHub Actions, etc.)

    Result: AI systems you can actually trust in production.

    We now test AI the same way we use it — by reasoning, not string matching.

    Curious how others are approaching AI testing. How are you validating LLM behavior today?

    #AI #LLMs #AITesting #SoftwareEngineering #HealthTech
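    A rough sketch of the plan-execute-evaluate loop this post describes. The run_agent() and evaluate() helpers are hypothetical stand-ins for the system under test and a separate judge LLM; the steps and expectations are trimmed from the example above, not the real Health SDK suite.

```python
import json


def run_agent(step: str) -> str:
    """Placeholder: ask the system under test (an LLM agent) to perform the step."""
    raise NotImplementedError


def evaluate(step: str, expected: str, actual: str) -> dict:
    """Placeholder: ask a second LLM whether `actual` satisfies `expected` in
    intent, correctness, and completeness; returns {'pass': bool, 'reason': str}."""
    raise NotImplementedError


# Expectations are semantic, not textual (trimmed from the example above).
TEST_PLAN = [
    {"step": "Show all BP readings",
     "expected": "Response lists 6 or more blood pressure readings."},
    {"step": "Get patient summary",
     "expected": "Patient summary mentions hyperlipidemia."},
    {"step": "Search notes for 'LDL'",
     "expected": "LDL is mentioned in the Feb 27, 2015 note."},
]

report_card = []
for case in TEST_PLAN:
    answer = run_agent(case["step"])
    verdict = evaluate(case["step"], case["expected"], answer)
    report_card.append({"step": case["step"], **verdict})

print(json.dumps(report_card, indent=2))
```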

  • Akhil Sharma

    System Design · AI Architecture · Distributed Systems

    24,367 followers

    Your unit tests mean nothing for LLM features.

    assert output == expected

    That line of code — the foundation of every software test you've ever written — is useless the moment your system produces non-deterministic output. And most teams shipping AI features right now have no idea what to replace it with.

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    December 2023. A Chevrolet dealership in California deployed a GPT-4-powered customer service chatbot on their website. Within days, users had prompt-engineered it into agreeing to sell a 2024 Chevy Tahoe — a $58,000 vehicle — for $1. The bot said, and I quote: "that's a legally binding offer — no takesies backsies."

    The screenshots went viral. The model was doing exactly what a poorly evaluated chatbot does: it had no output guardrails, no adversarial testing, and no system checking whether its responses made any sense before they reached customers. This is what happens when you ship an LLM feature with no evaluation pipeline.

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    The most common response from engineers new to LLM work is to reach for BLEU or ROUGE scores. These are the standard NLP metrics — they measure how much the generated text overlaps with a reference answer. They don't work.

    Consider these two responses to the same question:
    Reference: "The server crashed due to a memory leak"
    Generated: "A memory leak caused the application to go down"

    These mean the same thing. A human reads both and nods. ROUGE gives the second one a low score (0.22) because the words barely overlap. The metric is measuring the wrong thing entirely.

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    What actually works: a three-layer stack.

    Layer 1 — Deterministic checks. Free, fast, CI-friendly. Does the response refuse when it shouldn't? Is the JSON valid? Is it hallucinating URLs? These run in milliseconds on every PR. They catch structural failures before anything else.

    Layer 2 — LLM-as-judge. This sounds circular: you're using an AI to evaluate an AI. But it works because evaluation is easier than generation. Use pairwise comparison instead of a 1-5 scale ("which response is better, A or B?") and validate that the judge agrees with humans on 50-100 examples before you trust it.

    Layer 3 — Human review on 2% of traffic. Expensive. Focused on the queries that the automated layers flag as low confidence.

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    The brutal truth: every prompt change you ship is a regression test you didn't run. LLM systems fail silently. Your monitoring shows 200 OK and 120ms latency, while the model has quietly started refusing queries it handled fine last week. You don't find out until a user complains.

    The teams getting this right treat their eval dataset as a first-class artifact alongside their code.

    Full article covers the complete three-layer implementation and prompt regression testing in CI. Link in comments ↓

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    #SystemDesign #AIEngineering #LLM #MachineLearning
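    A Layer-1 sketch along the lines described above: cheap deterministic checks that can run on every PR. The refusal markers, the allowed-domain list, and check_layer1() are illustrative defaults, not a standard; tune them to your own product.

```python
import json
import re

ALLOWED_DOMAINS = {"docs.example.com", "support.example.com"}  # assumption: your own doc hosts
REFUSAL_MARKERS = ("i can't help with that", "i am unable to assist")  # illustrative phrases


def check_layer1(response_text: str) -> list[str]:
    """Return a list of deterministic failures; an empty list means the check passes."""
    failures = []

    # 1. Structural validity: this feature is supposed to return JSON.
    try:
        json.loads(response_text)
    except json.JSONDecodeError:
        failures.append("response is not valid JSON")

    # 2. Unexpected refusal on a benign query.
    if any(marker in response_text.lower() for marker in REFUSAL_MARKERS):
        failures.append("refused a query it should handle")

    # 3. Hallucinated URLs: every link must point at a domain we actually own.
    for domain in re.findall(r'https?://([^/\s"]+)', response_text):
        if domain not in ALLOWED_DOMAINS:
            failures.append(f"unknown domain in URL: {domain}")

    return failures


def test_benign_query_layer1():
    # Stand-in for a real model call; in CI this would come from the LLM feature.
    response = '{"answer": "See https://docs.example.com/setup for details."}'
    assert check_layer1(response) == []
```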

  • Nathan Benaich

    investing

    51,172 followers

    Mutation-Guided LLM-based Test Generation at Meta

    As a next step to last year's super cool Meta paper on LLMs generating tests, here we have it: testing has finally moved beyond mere coverage. The guarantees are a lot stronger too, because the automated compliance hardener always gives examples of the specific kinds of faults its tests will find (rather than just claiming more line coverage, which it also delivers anyway).

    Abstract: "This paper describes Meta's ACH system for mutation-guided LLM-based test generation. ACH generates relatively few mutants (aka simulated faults), compared to traditional mutation testing. Instead, it focuses on generating currently undetected faults that are specific to an issue of concern. From these currently uncaught faults, ACH generates tests that can catch them, thereby 'killing' the mutants and consequently hardening the platform against regressions. We use privacy concerns to illustrate our approach, but ACH can harden code against any type of regression. In total, ACH was applied to 10,795 Android Kotlin classes in 7 software platforms deployed by Meta, from which it generated 9,095 mutants and 571 privacy-hardening test cases. ACH also deploys an LLM-based equivalent mutant detection agent that achieves a precision of 0.79 and a recall of 0.47 (rising to 0.95 and 0.96 with simple preprocessing). ACH was used by Messenger and WhatsApp test-athons where engineers accepted 73% of its tests, judging 36% to be privacy relevant. We conclude that ACH hardens code against specific concerns and that, even when its tests do not directly tackle the specific concern, engineers find them useful for their other benefits."

    https://lnkd.in/dyAn3G_k
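    A schematic of the mutation-guided loop the abstract describes, assuming hypothetical llm(), apply_patch(), and run_test_suite() helpers; it illustrates the general idea only and is not Meta's ACH code.

```python
def llm(prompt):
    """Placeholder for a chat-completion call returning text."""
    raise NotImplementedError


def apply_patch(source, patch):
    """Placeholder: apply a unified diff to `source` and return the mutated code."""
    raise NotImplementedError


def run_test_suite(source):
    """Placeholder: build and run the existing tests; True means all pass."""
    raise NotImplementedError


def propose_hardening_test(source, concern="user data may leak into log statements"):
    # 1. Generate a concern-specific mutant (a simulated fault).
    patch = llm(f"Introduce a subtle bug related to '{concern}' into this code:\n"
                f"{source}\nReturn only a unified diff.")
    mutated = apply_patch(source, patch)

    # 2. Only *undetected* faults matter: if the current suite already fails on
    #    the mutant, it is caught and we discard it.
    if not run_test_suite(mutated):
        return None

    # 3. Ask for a test that kills the surviving mutant: it must fail on the
    #    mutated code and pass on the original.
    return llm(f"Write a unit test that fails on this buggy version:\n{mutated}\n\n"
               f"but passes on the original:\n{source}")

# The generated test then goes to engineers for review, as in the test-athons above.
```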

  • Rudina Seseri

    Venture Capital | Technology | Board Director

    20,452 followers

    For years, fine-tuning LLMs has required large amounts of data and human oversight. Small improvements can disrupt existing systems, requiring humans to go through and flag errors in order to fit the model to pre-existing workflows. This might work for smaller use cases, but it is clearly unsustainable at scale. However, recent research suggests that everything may be about to change. I have been particularly excited about two papers from Anthropic and Massachusetts Institute of Technology, which propose new methods that enable LLMs to reflect on their own outputs and refine performance without waiting for humans. Instead of passively waiting for correction, these models create an internal feedback loop, learning from their own reasoning in a way that could match, or even exceed, traditional supervised training in certain tasks. If these approaches mature, they could fundamentally reshape enterprise AI adoption. From chatbots that continually adjust their tone to better serve customers to research assistants that independently refine complex analyses, the potential applications are vast. In today’s AI Atlas, I explore how these breakthroughs work, where they could make the most immediate impact, and what limitations we still need to overcome.

  • Paul Iusztin

    Senior AI Engineer • Founder @ Decoding AI • Author @ LLM Engineer’s Handbook ~ I ship AI products and teach you about the process.

    98,323 followers

    LLM systems don't fail silently. They fail invisibly. No trace, no metrics, no alerts - just wrong answers and confused users.

    That's why we architected a complete observability pipeline in the Second Brain AI Assistant course. Powered by Opik from Comet, it covers two key layers:

    1. Prompt Monitoring
    → Tracks full prompt traces (inputs, outputs, system prompts, latencies)
    → Visualizes chain execution flows and step-level timing
    → Captures metadata like model IDs, retrieval config, prompt templates, token count, and costs

    Latency metrics like:
    - Time to First Token (TTFT)
    - Tokens per Second (TPS)
    - Total response time
    ...are logged and analyzed across stages (pre-gen, gen, post-gen). So when your agent misbehaves, you can see exactly where and why.

    2. Evaluation for Agentic RAG
    → Runs automated tests on the agent's responses
    → Uses LLM judges + custom heuristics (hallucination, relevance, structure)
    → Works offline (during dev) and post-deployment (on real prod samples)
    → Fully CI/CD-ready with performance alerts and eval dashboards

    It's like integration testing, but for your RAG + agent stack.

    The best part?
    → You can compare multiple versions side-by-side
    → Run scheduled eval jobs on live data
    → Catch quality regressions before your users do

    This is Lesson 6 of the course (and it might be the most important one). Because if your system can't measure itself, it can't improve.

    🔗 Full breakdown here: https://lnkd.in/dA465E_J
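    A minimal sketch of the latency metrics called out above (TTFT, TPS, total time), measured around a generic streaming call. stream_completion() is a placeholder generator; this is plain instrumentation, not the Opik API.

```python
import time
from typing import Iterator


def stream_completion(prompt: str) -> Iterator[str]:
    """Placeholder: yield tokens from your model/provider as they arrive."""
    raise NotImplementedError


def timed_generation(prompt: str) -> dict:
    start = time.perf_counter()
    first_token_at = None
    tokens = []

    for token in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # marks Time to First Token
        tokens.append(token)

    end = time.perf_counter()
    return {
        "ttft_s": (first_token_at - start) if first_token_at else None,
        "tokens_per_s": len(tokens) / max(end - first_token_at, 1e-9) if first_token_at else 0.0,
        "total_s": end - start,
        "output": "".join(tokens),
    }

# Log this dict (plus model id, prompt template, retrieval config) per stage so
# regressions show up at the step level, not just per request.
```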

  • Brianna Bentler

    I help owners and coaches start with AI | AI news you can use | Women in AI

    15,083 followers

    Stop debating if your AI is "good." Measure it. LLM-as-a-Judge is how operators do it at scale. It turns fuzzy reviews into consistent scores. So you can ship, improve, and prove ROI.

    What it is: an LLM that grades other outputs.
    When to use: subjective, multi-criteria, high volume.
    When not to: clear ground truth or legal high-stakes.

    Three ways to score, pick one:
    ✅ Single-output with a reference for grounding.
    ✅ Single-output without a reference for style fit.
    ✅ Pairwise comparisons for ranking variants.

    Bias is real. Plan for it. Length, order, authority, and self-favor creep in. Mitigate with controls, not vibes.
    ✅ Randomize candidate order.
    ✅ Cap and normalize length.
    ✅ Hide sources and identities.
    ✅ Use 3 judges and average.

    Make the judge predictable. Write criteria in plain language. Force a strict JSON schema for scores. Reject outputs that break the schema. Require a brief rationale with evidence.

    Then validate the validator. Test on easy, tricky, and adversarial cases. Track precision, recall, AUROC, and agreement. Run it next to humans and compare.

    Scale without breaking the bank. Use a small evaluation model for real-time checks. Spot-audit with a larger model weekly.

    Operators: start with one workflow this week. Ship the judge, log every decision, improve weekly. Save this and share with one teammate who owns QA.
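    A sketch of the judge hygiene described above: randomize candidate order, force a strict JSON schema, reject anything that breaks it, and keep the rationale. call_judge() is a placeholder for the (ideally small) evaluation model.

```python
import json
import random


def call_judge(prompt: str) -> str:
    """Placeholder: return the judge model's raw text reply."""
    raise NotImplementedError


def judge_pair(question: str, response_1: str, response_2: str) -> dict:
    # Randomize candidate order so position bias averages out across a dataset.
    swapped = random.random() < 0.5
    a, b = (response_2, response_1) if swapped else (response_1, response_2)

    raw = call_judge(
        f"Question: {question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
        'Reply with JSON only: {"winner": "A" or "B", "rationale": "<one sentence citing evidence>"}'
    )

    # Enforce the schema: malformed output is rejected, not silently coerced.
    try:
        verdict = json.loads(raw)
        assert verdict["winner"] in ("A", "B") and verdict["rationale"].strip()
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError, AssertionError):
        return {"winner": None, "rationale": "rejected: schema violation"}

    # Undo the swap so callers always see results in the original order.
    winner = verdict["winner"]
    if swapped:
        winner = "B" if winner == "A" else "A"
    return {"winner": winner, "rationale": verdict["rationale"]}
```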

  • Hao Hoang

    Daily AI Interview Questions | Senior AI Researcher & Engineer | ML, LLMs, NLP, DL, CV, ML Systems | 56k+ AI Community

    55,203 followers

    AI's biggest learning challenge isn't solving problems; it's knowing what's "right" without a human answer key. What if a model could generate its own high-quality supervision, even when all of its initial attempts are flawed? A new paper from Meta & the University of Oxford shows how.

    Scaling human annotation for complex, non-verifiable tasks (like clinical advice or creative writing) is always a bottleneck. Current "AI judge" methods are known to be inconsistent and biased, stalling progress in nuanced domains.

    The paper, "Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision," tackles this head-on. The approach is elegantly simple yet powerful: an LLM generates a group of parallel answers ("rollouts") for a prompt. A frozen "anchor" model then analyzes these rollouts, not to pick the best one, but to synthesize a new, superior reference by reconciling contradictions and omissions. This synthesized answer becomes the teacher.

    The results: Using CaT as a training signal with reinforcement learning (CaT-RL), Llama 3.1 8B's performance jumped by up to +33% on the MATH-500 benchmark and +30% on the non-verifiable HealthBench dataset. Crucially, the model proves it's not just selecting the majority vote; it's creating a genuinely new, more accurate answer.

    #AI #MachineLearning #ReinforcementLearning #LLM #Research #Innovation
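    A sketch of the Compute-as-Teacher idea as summarized above: sample parallel rollouts, then have a frozen anchor model synthesize a new reference that reconciles them rather than just picking a winner. policy_generate() and anchor_generate() are placeholders; this mirrors the description, not the paper's implementation.

```python
def policy_generate(prompt: str) -> str:
    """Placeholder: one sampled answer ("rollout") from the current policy model."""
    raise NotImplementedError


def anchor_generate(prompt: str) -> str:
    """Placeholder: a call to the frozen anchor model."""
    raise NotImplementedError


def synthesize_reference(prompt: str, n_rollouts: int = 8) -> str:
    rollouts = [policy_generate(prompt) for _ in range(n_rollouts)]
    numbered = "\n\n".join(f"Attempt {i + 1}:\n{r}" for i, r in enumerate(rollouts))
    # The anchor reconciles contradictions and omissions instead of voting.
    return anchor_generate(
        f"Task: {prompt}\n\nHere are {n_rollouts} independent attempts:\n{numbered}\n\n"
        "Write a single improved answer that resolves their contradictions and adds "
        "details any attempt missed. Do not simply copy one attempt."
    )

# The synthesized reference then serves as the training target, e.g. as the
# reward signal for RL in the CaT-RL setup mentioned above.
```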

  • Artem Golubev

    Co-Founder and CEO of testRigor, the #1 Generative AI-based Test Automation Tool

    35,951 followers

    If your product has an LLM feature, "expected = exact string" is the wrong assertion. That's the first reason teams think AI features are "impossible to automate." They're not impossible. You just need different checks.

    Instead of asserting exact phrasing, assert behavior:
    Intent: did it answer the question?
    Constraints: did it stay within policy (no PII, no disallowed content)?
    Structure: did it return the right format (JSON, bullets, fields present)?
    Grounding: did it reference the right sources when required?
    Boundaries: did it refuse when it should refuse?

    In other words: test outcomes and invariants, not words. The teams that get this right treat AI features like any other system with variability: you test the contract, not the implementation.

    This is also why I'm a fan of writing tests around user journeys and outcomes. When the goal is explicit, automation becomes much easier even when the output isn't identical every run.

    How are you testing AI features today: golden datasets, rubric-based checks, or human review?
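    A sketch of "test the contract, not the wording": structure and constraint checks stay deterministic, while intent and boundary checks go to a judge LLM. The required fields, PII patterns, and llm_judge() helper are illustrative assumptions.

```python
import json
import re

REQUIRED_FIELDS = {"answer", "sources"}        # assumed structure contract
PII_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]      # e.g. US-SSN-shaped strings; extend per policy


def llm_judge(question: str, answer: str, check: str) -> bool:
    """Placeholder: ask an evaluator LLM a yes/no question about the answer."""
    raise NotImplementedError


def assert_behavior(question: str, raw_response: str, allowed_sources: set[str]) -> None:
    # Structure: right format, required fields present.
    data = json.loads(raw_response)
    assert REQUIRED_FIELDS <= data.keys(), "missing required fields"

    # Constraints: stays within policy (no PII-shaped strings in the answer).
    assert not any(re.search(p, data["answer"]) for p in PII_PATTERNS), "possible PII"

    # Grounding: only cites sources allowed for this query.
    assert set(data["sources"]) <= allowed_sources, "cited an unexpected source"

    # Intent and boundaries: semantic checks delegated to a judge model.
    assert llm_judge(question, data["answer"], "Does this actually answer the question?")
    assert not llm_judge(question, data["answer"], "Should this question have been refused?")
```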
