Your unit tests mean nothing for LLM features.

assert output == expected

That line of code — the foundation of every software test you've ever written — is useless the moment your system produces non-deterministic output. And most teams shipping AI features right now have no idea what to replace it with.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

December 2023. A Chevrolet dealership in California deployed a GPT-4-powered customer service chatbot on its website. Within days, users had prompt-engineered it into agreeing to sell a 2024 Chevy Tahoe — a $58,000 vehicle — for $1. The bot said, and I quote: "that's a legally binding offer — no takesies backsies."

The screenshots went viral. The model was doing exactly what a poorly evaluated chatbot does: it had no output guardrails, no adversarial testing, and no system checking whether its responses made any sense before they reached customers.

This is what happens when you ship an LLM feature with no evaluation pipeline.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The most common response from engineers new to LLM work is to reach for BLEU or ROUGE scores. These are the standard NLP metrics — they measure how much the generated text overlaps with a reference answer.

They don't work. Consider these two responses to the same question:

Reference: "The server crashed due to a memory leak"
Generated: "A memory leak caused the application to go down"

These mean the same thing. A human reads both and nods. ROUGE gives the second one a score of just 0.22, because the words barely overlap. The metric is measuring the wrong thing entirely.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What actually works: a three-layer stack.

Layer 1 — Deterministic checks. Free, fast, CI-friendly. Does the response refuse when it shouldn't? Is the JSON valid? Is it hallucinating URLs? These run in milliseconds on every PR and catch structural failures before anything else.

Layer 2 — LLM-as-judge. This sounds circular: you're using an AI to evaluate an AI. But it works because evaluation is easier than generation. Use pairwise comparison instead of a 1-5 scale ("which response is better, A or B?") and validate that the judge agrees with humans on 50-100 examples before you trust it.

Layer 3 — Human review on 2% of traffic. Expensive, so focus it on the queries that the automated layers flag as low confidence.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The brutal truth: every prompt change you ship is a regression test you didn't run.

LLM systems fail silently. Your monitoring shows 200 OK and 120ms latency while the model has quietly started refusing queries it handled fine last week. You don't find out until a user complains.

The teams getting this right treat their eval dataset as a first-class artifact alongside their code.

Full article — the complete three-layer implementation and prompt regression testing in CI. Link in comments ↓

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

#SystemDesign #AIEngineering #LLM #MachineLearning
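To make Layer 1 concrete, here is a minimal sketch of deterministic checks in Python, assuming the response is a plain string; the allow-listed domains and refusal markers are illustrative placeholders, not from the original post.

```python
import json
import re

# Illustrative allow-list; in a real pipeline this would come from your own config.
KNOWN_DOMAINS = {"docs.example.com", "support.example.com"}

REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "as an ai")


def check_valid_json(response: str) -> bool:
    """Layer 1 structural check: does the output parse as JSON at all?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def check_no_unexpected_refusal(response: str) -> bool:
    """Layer 1: flag responses that refuse a query they should have answered."""
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)


def check_no_hallucinated_urls(response: str) -> bool:
    """Layer 1: every URL in the response must point at a domain we actually own."""
    domains = re.findall(r"https?://([^/\s]+)", response)
    return all(domain in KNOWN_DOMAINS for domain in domains)


def run_layer1(response: str) -> dict:
    """Run all deterministic checks; cheap enough to run on every PR."""
    return {
        "valid_json": check_valid_json(response),
        "no_refusal": check_no_unexpected_refusal(response),
        "no_fake_urls": check_no_hallucinated_urls(response),
    }
```

Checks like these run in milliseconds, so they can gate every pull request before the slower judge and human layers are involved.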
Why LLM Code Needs More Than Unit Tests
Summary
Large language model (LLM) code needs more than unit tests because its outputs are unpredictable and often nuanced, making traditional software testing insufficient for ensuring safe, accurate, and reliable results. Unlike standard code, LLMs can generate responses that sound plausible but may be incorrect, misleading, or unsafe, so a mix of automated and human evaluation is crucial for maintaining quality.
- Expand your evaluation: Supplement unit tests with automated checks, AI-based reviews, and occasional human scrutiny to catch subtle failures and unpredictable responses.
- Monitor real-world outputs: Regularly review your LLM’s responses in production, especially across unusual or edge-case inputs, to identify issues that standard tests may miss.
- Document and track changes: Carefully record prompt adjustments and test results, using regression tests to spot unexpected behavior and maintain consistent performance over time.
-
We've all shipped an LLM feature that "felt right" in dev, only to watch it break in production. Why? Because human "eyeballing" isn't a scalable evaluation strategy.

The real challenge in building robust AI isn't just getting an LLM to generate an output. It's ensuring the output is right, safe, formatted, and useful, consistently, across thousands of diverse user inputs.

This is where evaluation metrics become non-negotiable. Think of them as the sophisticated unit tests and integration tests for your LLM's brain. You need to move beyond "does it work?" to "how well does it work, and why?" This is precisely what Comet's Opik is designed for. It provides the framework to rigorously grade your LLM's performance, turning subjective feelings into objective data.

Here's how we approach it, as shown in the cheat sheet below:

1./ Heuristic Metrics => the 'Linters' & 'Unit Tests'
- These are your non-negotiable, deterministic sanity checks.
- They are low-cost, fast, and catch objective failures.
- Your pipeline should fail here first.
▫️ Is it valid? → IsJson, RegexMatch
▫️ Is it faithful? → Contains, Equals
▫️ Is it close? → Levenshtein

2./ LLM-as-a-Judge => the 'Peer Review'
- This is for everything that "looks right" but might be subtly wrong.
- These metrics evaluate quality and nuance where statistical rules fail.
- They answer the hard, subjective questions.
▫️ Is it true? → Hallucination
▫️ Is it relevant? → AnswerRelevance
▫️ Is it helpful? → Usefulness

3./ G-Eval => the dynamic 'Judge-Builder'
- G-Eval is a task-agnostic LLM-as-a-Judge.
- You define custom evaluation criteria in plain English (e.g., "Is the tone professional but not robotic?").
- It then uses chain-of-thought reasoning internally to analyze the output and produce a human-aligned score for those criteria.
- This lets you test specific business logic without writing new code.

4./ Custom Metrics
- For everything else.
- This is where you write your own Python code to create a metric (a sketch follows below).
- It's for when you need to check an output against a live internal API, a proprietary database, or any other logic that only your system knows.

Take a look at the cheat sheet for a quick breakdown. Which metric are you implementing first for your current LLM project?

♻️ Don't forget to repost.
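As a concrete illustration of the "custom metrics" bucket, here is a minimal sketch of a code-based metric written in plain Python; the price table and the JSON shape are invented for the example and are not Opik's API.

```python
import json

# Stand-in for a proprietary source of truth (in practice this might be a live internal API).
PRICE_TABLE = {"basic": 9.0, "pro": 29.0, "enterprise": 299.0}


def price_consistency_metric(llm_output: str) -> float:
    """Custom code-based metric: 1.0 if the quoted price matches our own data, else 0.0.

    Assumes the model was asked to answer as JSON like {"plan": "pro", "price": 29.0}.
    """
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        return 0.0  # structurally broken output fails outright

    plan = data.get("plan")
    price = data.get("price")
    if plan not in PRICE_TABLE or not isinstance(price, (int, float)):
        return 0.0  # hallucinated plan name or missing/odd price field
    return 1.0 if abs(PRICE_TABLE[plan] - float(price)) < 1e-6 else 0.0


# A hallucinated discount is caught deterministically:
assert price_consistency_metric('{"plan": "pro", "price": 1.0}') == 0.0
assert price_consistency_metric('{"plan": "pro", "price": 29.0}') == 1.0
```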
-
The scariest bug in AI isn't in your code. It's in the outputs that sound right — but aren't.

I was chatting with a team last week that deployed a customer-facing AI assistant. Everything looked great on paper. Good latency, tight integration, clean UX.

Then a client asked the bot a routine question. The bot responded… confidently. Persuasively. It even cited internal documentation.

One problem: that doc didn't exist. The model had hallucinated the whole thing. Nobody noticed until a human flagged it after the assistant had been live for 10 days.

Here's the thing: traditional QA misses this. You can't lint hallucinations. There's no stack trace for made-up facts. You don't get an exception when an LLM gives the wrong tone or crosses a safety line. You only know when someone's already frustrated, misled, or exposed.

If you're building with LLMs, testing can't be an afterthought. You need to evaluate outputs the way your users will — across edge cases, stress scenarios, and weird real-world prompts. And the earlier you build this into your process, the less painful (and expensive) the fallout will be.

What's the strangest or most dangerous LLM failure you've seen in production? I'm collecting war stories — drop them in the comments 👇
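You can't lint hallucinations in general, but for a retrieval-backed assistant you can at least verify that every document the model cites actually exists in the corpus it was given. A minimal sketch, assuming responses cite documents by IDs like KB-1234; the corpus and the pattern are illustrative assumptions.

```python
import re

# Illustrative corpus of internal docs the assistant is allowed to cite, keyed by ID.
CORPUS = {
    "KB-1042": "Password reset policy",
    "KB-2210": "Refund workflow",
}

CITATION_PATTERN = re.compile(r"\bKB-\d{4}\b")


def hallucinated_citations(response: str) -> list[str]:
    """Return any cited document IDs that do not exist in the corpus."""
    cited = set(CITATION_PATTERN.findall(response))
    return sorted(cited - CORPUS.keys())


# This would have flagged the made-up doc on day one instead of day ten.
print(hallucinated_citations("Per KB-9999, refunds are processed within 30 days."))  # ['KB-9999']
```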
-
LLMs are great for data processing, but using new techniques doesn't mean you get to abandon old best practices. The precision and accuracy of LLMs still need to be monitored and maintained, just like with any other AI model.

Tips for maintaining accuracy and precision with LLMs (a minimal regression-harness sketch follows below):

• Define within your team EXACTLY what the desired output looks like. Any area of ambiguity should be resolved with a concrete answer. Even if the business "doesn't care," you should define a behavior. Letting the LLM make these decisions for you leads to high-variance, low-precision models that are difficult to monitor.

• Understand that the most gorgeously written, seemingly clear and concise prompts can still produce trash. LLMs are not people and don't follow directions like people do. You have to test your prompts over and over and over, no matter how good they look.

• Make small prompt changes and carefully monitor each change. Changes should be version tracked and vetted by other developers.

• A small change in one part of the prompt can cause seemingly unrelated regressions (again, LLMs are not people). Regression tests are essential for EVERY change. Organize a list of test case inputs, including those that demonstrate previously fixed bugs, and test your prompt against them.

• Test cases should include "controls" where the prompt has historically performed well. Any change to the control output should be studied, and any incorrect change is a test failure.

• Regression tests should have a single documented bug and clearly defined success/failure metrics: "If the output contains A, then pass. If the output contains B, then fail." This makes it easy to quickly mark regression tests as pass/fail (ideally, automating this process). If a different failure/bug is noted, it should still be fixed, but separately, and pulled out into its own test.

Any other tips for working with LLMs and data processing?
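Here is a minimal sketch of that contains-based regression harness; the test cases and the `run_prompt` hook are hypothetical placeholders for however you invoke your model.

```python
from dataclasses import dataclass


@dataclass
class RegressionCase:
    name: str               # one documented bug (or control) per case
    input_text: str
    must_contain: str       # "if the output contains A, then pass"
    must_not_contain: str   # "if the output contains B, then fail"


# Hypothetical cases: one previously fixed bug plus one control that has always worked.
CASES = [
    RegressionCase("bug-142-wrong-currency", "What is the total for 3 seats?", "USD", "EUR"),
    RegressionCase("control-simple-greeting", "Hello", "help", "refund"),
]


def run_prompt(prompt: str, input_text: str) -> str:
    """Placeholder for your actual model call."""
    raise NotImplementedError


def run_regressions(prompt: str) -> list[str]:
    """Return the names of failing cases; run this for EVERY prompt change."""
    failures = []
    for case in CASES:
        output = run_prompt(prompt, case.input_text)
        if case.must_contain not in output or case.must_not_contain in output:
            failures.append(case.name)
    return failures
```

Because every case reduces to a contains check, marking pass/fail is trivial to automate and the whole suite can run on each versioned prompt change.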
-
LLMs just commoditized code generation. So what's actually scarce now?

I've been reading two pieces that point at the same thing from different angles.

Ben Redmond argues single agent runs are variance-limited. You spin up one Claude Code session and you might hit a peak. Or you might land somewhere mediocre. Pure luck. His solution: parallel agents exploring the same problem, then synthesizing toward the best answer. 3/5 agents converge on the same approach? High confidence. All diverge? Tighten your constraints.

Simon Willison comes at it differently. He's seeing junior devs drop giant untested PRs on reviewers and expect code review to do the validation work. His take is blunt: your job is to deliver code you have proven to work. Manual testing. Automated tests. Evidence in the PR.

Here's the thread connecting them: the raw output is cheap. Anyone can prompt an LLM for a 1,000-line patch. What's scarce is the verification loop. Redmond builds it through parallel exploration and convergence. Willison builds it through testing discipline and human accountability.

Both are saying the same thing: generation is table stakes. The new engineering skill is building systems that let you trust what the AI wrote.

I think this points at something bigger. The senior engineer of 2025 isn't the best coder. They're the best orchestrator: spinning up parallel explorations, validating convergence, providing the accountability a computer never can. Testing becomes the product, not the afterthought.

And the real bottleneck isn't compute or model quality. It's organizational trust in AI-generated work.

What verification systems are you building into your workflow?
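One way to make the convergence idea testable is a simple agreement check over N independent runs; the string normalization and the 3-of-5 threshold below are illustrative assumptions, not Redmond's actual implementation.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude normalization so superficially different phrasings can still match."""
    return " ".join(answer.lower().split())


def check_convergence(candidates: list[str], threshold: float = 0.6) -> tuple[bool, str | None]:
    """Return (converged, majority_answer) for a batch of parallel agent outputs.

    With 5 runs and threshold=0.6, at least 3 runs must agree to count as converged;
    otherwise tighten the constraints and rerun.
    """
    counts = Counter(normalize(c) for c in candidates)
    answer, votes = counts.most_common(1)[0]
    converged = votes / len(candidates) >= threshold
    return converged, (answer if converged else None)
```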
-
Most AI code isn't broken. It's just broken enough to break you.

LLMs sound confident. They move fast. Their code looks perfect… until it runs. Then come the silent bugs and missed edge cases.

Here are 8 principles from Simon Willison that stop the bugs before they stop your team:

🔸 LLMs are junior developers, not autonomous agents
↳ They need structure, supervision, and review. You wouldn't ship a junior's code without checking it. Don't ship an LLM's code without testing it thoroughly.

🔸 Context quality determines output quality
↳ The difference between usable and unusable code often comes down to context. Include requirements, constraints, edge cases, and error-handling needs. Specificity here prevents hours of debugging later.

🔸 Knowledge cutoffs matter
↳ GPT-4 was trained up to October 2023, Claude 3.5 up to April 2024. LLMs won't know the latest changes to libraries or APIs, so verify against current docs every time.

🔸 Use iterative refinement
↳ Start with a broad prompt: "What are my implementation options?" Then narrow it: "Implement option 2 using these parameters." Then polish: "Add robust error handling and tests." This mirrors how senior developers already think (a small sketch of this staged loop follows below).

🔸 Test every generated line
↳ LLMs are confident, even when wrong. They excel at writing syntactically correct code with subtle logical flaws. Assume nothing works until it's tested.

🔸 Leverage safe execution environments
↳ Tools like Claude Artifacts and ChatGPT Code Interpreter let you run code in a sandbox. Validate before you deploy. This step prevents production incidents.

🔸 Embrace 'vibe-coding' for discovery
↳ Use vibe-coding to test ideas, experiment, and learn system boundaries. That experimentation leads to sharper production use.

🔸 LLMs amplify existing expertise
↳ They make experienced developers faster. They don't replace core understanding. If you're not leveling up alongside your tools, you're falling behind.

The engineers getting the most out of AI aren't asking it to code. They're treating it like a teammate with limits.

What's your most effective LLM workflow?

♻️ Repost to help your team use AI more strategically
➕ Follow me, Sairam, for practical AI engineering insights
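The iterative-refinement principle maps naturally onto a staged prompting loop; `call_llm` below is a stand-in for whichever model client you actually use, and the prompts are only a sketch.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your model client (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError


def iterative_refinement(task: str) -> str:
    # Stage 1: broad -- surface the design space before committing to anything.
    options = call_llm(f"What are my implementation options for: {task}?")

    # Stage 2: narrow -- pick one option and ask for a concrete implementation.
    draft = call_llm(
        "Given these options:\n"
        f"{options}\n"
        "Implement option 2 with clear function boundaries and docstrings."
    )

    # Stage 3: polish -- harden the draft before anyone reviews or ships it.
    return call_llm(f"Add robust error handling and unit tests to this code:\n{draft}")
```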
-
Nearly 50 years ago, E.W. Dijkstra wrote "On the foolishness of natural language programming", arguing that computers can only be instructed reliably through formal, mechanically checkable symbols. His point was that natural language is powerful, but inherently ambiguous, high-entropy, contextual, and filled with unstated assumptions. It is a lossy, heavily compressed medium that relies on shared context more than explicit structure (hence the importance of shared "world models"). These qualities make it a poor foundation for correctness.

History bears this out. Societies advanced when they moved from purely verbal reasoning to precise formal notation. The so-called "burden" of formalism was never a drawback but a safety net that made reliable reasoning possible.

Fast-forward to the era of LLM-powered software development. LLMs are extraordinary high-level interfaces, but they are not a substitute for formal reasoning. They generate plausible code via learned statistical correlations, not by reasoning about or verifying intent. Ambiguities do not vanish; they are resolved probabilistically, often in ways that appear correct but are not. Instead of writing precise programs, we now spend time crafting prompts, validating outputs, and debugging mismatches between intent and behaviour. Precision is still required, but it is simply deferred (which, as Dijkstra observed, imposes cost elsewhere in a system or process).

As it was 50 years ago, it seems the real guardrails remain unchanged:
1. clear specifications
2. types and invariants
3. tests and static analysis
4. formal verification where correctness truly matters

Natural-language prompts are draft specifications, not executable truths. LLMs accelerate exploration and reduce friction, but correctness still comes from disciplined formalism. And when correctness truly matters, we inevitably return to mathematics: type theory, where programs correspond to proofs and many illegal states can be made unrepresentable; category theory, where composition and structure matter; and formal proof systems, which make reasoning mechanically checkable. LLMs can assist in navigating these spaces, but verification still lives in formal systems that admit no ambiguity.

This matters in safety-critical domains such as critical infrastructure, finance, aviation, healthcare, and national-scale AI platforms, where errors are not recoverable by iteration. As we move toward increasingly agentic and autonomous systems, the future of the field will not be defined by more fluent interfaces, but by how rigorously those interfaces are grounded in formal semantics, verified composition, and mathematically enforced constraints.

I think this is where the field is probably heading: LLMs as generative tools, with formal methods as the substrate of trust.

https://lnkd.in/gRNy3p6d
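A small illustration of guardrail 2 (types and invariants) applied to LLM output: parse the model's answer into a type whose constructor rejects illegal states, so a bad value fails loudly instead of flowing downstream. The Quote schema and its bounds are assumptions for the example, not part of the original post.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """The invariants live in the type: a Quote with a nonsensical price cannot be constructed."""
    vehicle: str
    price_usd: float

    def __post_init__(self) -> None:
        if not self.vehicle.strip():
            raise ValueError("vehicle must be non-empty")
        if not (1_000 <= self.price_usd <= 500_000):
            raise ValueError(f"price out of allowed range: {self.price_usd}")


def parse_quote(llm_output: str) -> Quote:
    """Ambiguous natural language in, formally constrained value out -- or an exception."""
    data = json.loads(llm_output)
    return Quote(vehicle=str(data["vehicle"]), price_usd=float(data["price_usd"]))


# The $1 Tahoe from the dealership story never makes it past the constructor:
# parse_quote('{"vehicle": "2024 Chevy Tahoe", "price_usd": 1}')  # raises ValueError
```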
-
The hype around large language models has made people forget one thing: these models don't understand truth, they predict it.

An LLM will confidently tell you the wrong answer with perfect grammar. It'll cite fake papers, create biased logic, and still sound like an expert. That's why "trust" without validation isn't innovation, it's negligence.

Good engineers don't just prompt and pray. They test, verify, and question the output as much as they would question human code. AI is a collaborator, not a replacement for critical thinking. You can't automate judgment, ethics, or context. If you remove the human from the loop, you remove accountability too.

How does your team verify AI-generated results before pushing to production?

#softwareengineer #llms #ai #prompts #softwareengineering #faithwilkinsel
-
Too many teams treat testing as a metric rather than an opportunity.

A developer is told to write tests, so they do the bare minimum to hit the required coverage percentage. A function runs inside a unit test, the coverage tool marks it as covered, and the developer moves on. The percentage goes up, leadership is satisfied, and the codebase is left with the illusion of quality.

But what was actually tested? Too often, the answer is: almost nothing. The logic was executed, but its behavior was never challenged. The function was called, but its failure modes were ignored. The edge cases, error handling, and real-world complexity were never explored. The opportunity to truly exercise the code and ensure it works in every scenario was completely missed.

This is a systemic failure in how organizations think about testing. Instead of seeing unit, integration, and end-to-end (E2E) testing as distinct silos, they should recognize that all testing is just exercising the same code. The farther you get from the code, the harder and more expensive it becomes to test. If logic is effectively tested at the unit and integration level, it does not suddenly behave differently at the E2E level. Software is a rational system. A well-tested function does not magically start failing in production unless something external—such as infrastructure or dependencies—introduces instability.

When developers treat unit and integration testing as a checkbox exercise, they push the real burden of testing downstream. Bugs that should have been caught in milliseconds by a unit test are now caught minutes or hours later in an integration test, or even days later during E2E testing. Some are not caught at all until they reach production. Organizations then spend exponentially more time and money debugging issues that should never have existed in the first place.

The best engineering teams do not chase code coverage numbers. They see testing as an opportunity to build confidence in their software at the lowest possible level. They write tests that ask hard questions of the code, not just ones that execute it. They recognize that when testing is done well at the unit and integration level, their E2E tests become simpler and more reliable—not a desperate last line of defense against failures that should have been prevented.

But the very best testers go even further. They recognize the system for what it truly is—a beautiful, interconnected mosaic of logic, data, and dependencies. They do not just react to failures at the UX/UI layer, desperately trying to stop an avalanche of possible combinations. They seek to understand and control the system itself, shaping it in a way that prevents those avalanches from happening in the first place.

Organizations that embrace this mindset build more stable systems, ship with more confidence, and spend less time firefighting production issues.

#SoftwareTesting #QualityEngineering
-
Your LLM app isn't broken because of the model. It's broken because you never measured it. AI evals!!

Most teams do the same thing:
→ Build it
→ Test it on 5 examples
→ Demo goes perfectly
→ Ship it
→ Pray

Then 3 weeks in, a user screenshots your chatbot confidently hallucinating your own product pricing.

Here's the eval stack that actually works (a minimal sketch follows below):

1/ Golden dataset first. Even 20 hand-crafted examples with validated answers are enough to start. Quality over quantity. This is your source of truth.

2/ Two types of evaluators — both are required. LLM-as-judge for subjective signals (hallucination, relevance, tone). Code-based evals for structural checks (did the JSON parse? is the number in range?). One without the other is incomplete.

3/ Never use 1–10 scores. LLMs can't score consistently at that granularity across runs. Use binary (correct/incorrect) or multi-class (relevant/partially relevant/irrelevant). You can average those. You can't trust a score of 7.2.

4/ Wire evals to CI/CD. Every prompt change, model swap, or retrieval tweak runs against your golden dataset before it ships. This is your gate. LLM evaluations are your new unit tests.

5/ Add guardrails last, not first. Don't block everything. Over-indexing on guards kills user intent. Start with PII removal, jailbreak detection, and hallucination prevention. Add more when production tells you to.

Your app can degrade with zero code changes. Model updates and input drift happen silently. Run your evals on a schedule, not just on deploys.

Measure it. Or be surprised by it.

What's your current eval setup? Drop it in the comments.

Read the full blog and follow me Priyanka for more ↓
https://lnkd.in/gsjnbubY

#LLMOps #AIEngineering #MachineLearning #GenerativeAI #MLOps #SoftwareEngineering #AIProductDevelopment #evals #aievals
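Here is a minimal sketch of points 1-4 wired together: a tiny golden dataset, a code-based structural check plus a binary LLM-as-judge verdict, and a pass-rate gate you could run in CI. The dataset, the threshold, and the `judge_is_correct` hook are assumptions for illustration, not taken from the linked post.

```python
import re

# 1/ Golden dataset: even a handful of hand-validated examples is enough to start.
GOLDEN = [
    {"question": "What does the pro plan cost per month?", "reference": "$29 per month"},
    {"question": "Which regions do you support?", "reference": "US, EU, and APAC"},
]


def structural_check(output: str) -> bool:
    """2/ Code-based evaluator: non-empty answer, and any quoted price is in a sane range."""
    if not output.strip():
        return False
    prices = [float(p) for p in re.findall(r"\$(\d+(?:\.\d+)?)", output)]
    return all(1 <= p <= 10_000 for p in prices)


def judge_is_correct(question: str, reference: str, output: str) -> bool:
    """2/ + 3/ LLM-as-judge evaluator, forced to a binary verdict rather than a 1-10 score."""
    raise NotImplementedError  # placeholder for your judge-model call


def eval_gate(generate, pass_rate_threshold: float = 0.9) -> bool:
    """4/ CI gate: every prompt change, model swap, or retrieval tweak must clear this."""
    passed = 0
    for ex in GOLDEN:
        output = generate(ex["question"])
        if structural_check(output) and judge_is_correct(ex["question"], ex["reference"], output):
            passed += 1
    return passed / len(GOLDEN) >= pass_rate_threshold
```

Because both evaluators return booleans, the pass rate is a simple average you can trust, and the same gate can run on a schedule to catch silent drift between deploys.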