🎯 I've been preaching this for two years now, and after a lot of careful experimenting, I've decided to go all-in: from now on, large language models will be my weapon of choice for localization QA and LQA.

Despite the elegance of our standardized approaches to quality management, there are serious problems with the current algorithms for QA and LQA. Are we still measuring quality the right way? Absolutely not.

❌ Traditional QA generates an overwhelming number of false positives (and false negatives), lacks context awareness, and offers little to no customization. It is designed to compare dots and commas between source and target — nothing more.

❌ LQA is beautifully standardized but fully human-driven, hence time-consuming and very sensitive to bias. It doesn't resolve cases where translators and reviewers fail to agree on what a translation should look like.

✅ With LLMs, there are many opportunities to repurpose quality management in our industry:

🔹 We can define exactly what needs to be checked.
🔹 We can determine the criteria for considering something a translation mistake.
🔹 We can provide examples of best practices vs. what to avoid.
🔹 We can identify errors that often slip through the cracks of traditional QA methods.
🔹 We can incorporate resources typically reserved for human linguists, like style guides.
🔹 We can specify the required output format: an error description, a quality score, a "pass" vs. "fail" flag, an automated correction, etc.
🔹 Most importantly, we can specify what "quality" exactly entails for each business case, content type, and target audience. Quality is in the eye of the beholder.

Yes, I'm infinitely optimistic, but also realistic: LLM-driven quality management won't always be perfect either. However, when a far-from-perfect algorithm hasn't evolved in 20+ years, it's time for a heavy redesign.

#translation #localization #internationalization #genai #llms #tms
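To make the last two bullets concrete, here is a minimal sketch of what an LLM-driven LQA check could look like: a custom definition of what counts as an error, a style-guide excerpt, and a structured output format. The OpenAI SDK, the model name, the prompt wording, the example segment, and the JSON fields are illustrative assumptions, not a reference implementation.

```python
# A minimal sketch: LLM-based LQA for a single segment, with a custom
# definition of "error" and a structured output format.
# Assumptions: the OpenAI Python SDK, the "gpt-4o" model name, and the
# JSON fields below are illustrative choices, not a fixed standard.
import json
from openai import OpenAI

client = OpenAI()

LQA_PROMPT = """You are a localization QA reviewer for marketing content aimed at a general audience.
Check the target translation against the source and the style guide excerpt.
Only report real translation errors (mistranslation, omission, register, terminology);
ignore purely mechanical differences in punctuation or spacing.

Source (en): {source}
Target (nl): {target}
Style guide excerpt: {style_guide}

Respond in JSON with the fields:
  "errors": list of objects with "description" and "severity" (minor/major/critical),
  "score": integer 0-100,
  "verdict": "pass" or "fail",
  "suggested_correction": corrected target text or null.
"""

def run_lqa(source: str, target: str, style_guide: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of model
        messages=[{"role": "user", "content": LQA_PROMPT.format(
            source=source, target=target, style_guide=style_guide)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Hypothetical example segment and style rule, purely for illustration.
result = run_lqa(
    source="Sign up today and get 20% off your first order.",
    target="Meld u vandaag aan en krijg 20% korting op uw eerste bestelling.",
    style_guide="Use informal 'je' rather than formal 'u' for consumer campaigns.",
)
print(result["verdict"], result["errors"])
```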
LLM Applications in Resolving Quality Issues
Explore top LinkedIn content from expert professionals.
Summary
LLM applications in resolving quality issues involve using large language models—advanced AI tools that process and generate human-like text—to spot, analyze, and fix problems in data, translations, code, and other business workflows. These models help automate quality checks, reduce human bias, and make it easier to customize solutions for specific needs.
- Automate error detection: Use LLMs to quickly identify mistakes in data and content, saving time and reducing manual effort.
- Improve review transparency: Let LLMs generate clear explanations for quality issues and suggest ways to address them, making decision-making more straightforward.
- Customize quality standards: Tailor LLM prompts and evaluation criteria to fit your business goals, content types, and target audiences for more relevant results.
-
Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning, and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains.

You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

"Here's code intended for task X: [previously generated code]. Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it."

Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements.

This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions. And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases, or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications' results. If you're interested in learning more about reflection, I recommend:

- Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
- Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

[Original text: https://lnkd.in/g4bTuWtU ]
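As a rough illustration of that generate/critique/rewrite loop, here is a minimal sketch in Python. The OpenAI SDK, the model name, the prompts, and the fixed number of rounds are illustrative assumptions; the post itself is not tied to any particular framework.

```python
# A rough sketch of a Reflection loop: generate code, ask the model to
# critique it, then ask it to rewrite the code using its own feedback.
# Assumptions: OpenAI Python SDK, "gpt-4o" model, illustrative prompts.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def reflect_and_improve(task: str, rounds: int = 2) -> str:
    # Step 1: generate a first draft directly.
    draft = ask(f"Write Python code for the following task:\n{task}")
    for _ in range(rounds):
        # Step 2: ask the model to critique its own output.
        critique = ask(
            f"Here is code intended for this task: {task}\n\n{draft}\n\n"
            "Check the code carefully for correctness, style, and efficiency, "
            "and give constructive criticism for how to improve it."
        )
        # Step 3: ask it to rewrite the code using that feedback.
        draft = ask(
            f"Task: {task}\n\nCurrent code:\n{draft}\n\n"
            f"Reviewer feedback:\n{critique}\n\n"
            "Rewrite the code, applying the feedback. Return only the code."
        )
    return draft

print(reflect_and_improve("Parse a CSV file and return the average of the 'latency_ms' column."))
```

The same loop extends naturally to the tool-assisted variant: instead of (or in addition to) the critique prompt, run the draft through unit tests and feed the failures back in as the reviewer feedback.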
-
A few years ago, we found ourselves knee-deep in a project aimed at streamlining data pipelines for a recommendation engine. Sounds exciting, right? Well, not when unreliable data from knowledge graphs kept throwing a wrench in the works. Identifying and fixing errors felt like searching for a needle in a haystack—think endless manual effort and tools that left much to be desired. That experience taught me a vital lesson: the accuracy of knowledge graphs isn't just important, it's foundational for AI-driven solutions.

Fast forward to today: I came across a paper that felt like a time machine delivering the solution I wished we had back then: Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs. The innovative MAKGED framework combines multi-agent collaboration and large language models (LLMs) to tackle error detection in a way that's not just accurate but also transparent and scalable. It's a game-changer, not only outperforming state-of-the-art methods but also unlocking new possibilities for industrial applications where data quality is non-negotiable.

If you've ever wrestled with knowledge graph challenges, I highly recommend diving into this paper. It's an inspiring reminder of what's possible when innovation meets real-world challenges.

#GenAI #LLM #knowledgegraphs #reviewswithsrujana #dataquality #LLMS
-
When you're building high-quality production software, writing code is always the last step. Your first step is to understand the problem. Then ideate solutions, consider alternatives, analyze tradeoffs, and refine your exploration into a concrete plan. Only after all this upfront design and planning work do you start manually typing code with your fingers.

Before you think about the code to write, you think about how changes affect the overall architecture of the system, creating clean abstractions, handling edge cases, ensuring correctness and reliability. Typing is just the thing that happens after all of that planning.

LLMs help with this because they solve for mental fatigue, not just coding speed. LLMs are cognitive power tools. Mental fatigue is a real issue for humans. Most of the software we use today isn't as good as it could be because humans had to make realistic tradeoffs based on the bandwidth they had left. We know how to make our software higher quality, but we assign it to tech debt meant to be revisited later.

LLMs let us build higher-quality systems with less effort because we can use them to explore multiple approaches to a problem deeply, iterate more, test more, refactor more, and still have mental energy left to make critical decisions.

I don't jump straight into asking an LLM to code. I explain the problem, give my initial ideas, and prompt it: "Propose an approach first. Show alternatives to my solution, highlight tradeoffs. Do not write code until I approve." Then I review the proposal, poke holes in it, and iterate. All through natural conversation with the LLM.

This is far more impactful than just increased coding speed. We are not machines; our mental energy is finite. When done right, LLMs let us dedicate more brain power to higher-leverage concerns for longer, and this leads to better quality from deep due diligence.

This is an excerpt from my article "Writing High Quality Production Code With LLMs Is A Solved Problem"; full article here: https://lnkd.in/dEj7QDuc
-
Here is how you can test your applications using an LLM. We call this "LLM as a Judge", and it's much easier to implement than most people think. Here is how to do it:

(LLM-as-a-judge is one of the topics I teach in my cohort. The next iteration starts in August. You can join at ml.school.)

We want to use an LLM to test the quality of responses from an application. There are 3 scenarios in one of the attached pictures:

1. Choose the best of two responses
2. Assess specific qualities of a response
3. Evaluate the response based on additional context

I'm also attaching three example prompts to test each of the scenarios. These prompts are a big part of a successful judge, and you'll spend most of your time iterating on them.

Here is the process to create a judge:

1. Start with a labeled dataset
2. Design your evaluation prompt
3. Test it on the dataset
4. Iteratively refine it until you are happy with it

Evaluating an answer is usually easier than producing that answer in the first place, so you can use a smaller/cheaper model to build the judge than the one you are evaluating. But you can also use the same model, or even a stronger model than the one you are evaluating.

My recommendation: build the judge using the same model your application uses. When you have the judge working as intended, replace it with a smaller or cheaper model and see if you can achieve the same performance. Repeat until satisfied.

When your judge is ready, use it to evaluate a percentage of outputs to detect drift and track any trends over time.

Advantages:
• Produces high-quality evaluations closely matching human judgment
• Simple to set up; no reference answers needed
• Flexible: you can evaluate anything
• Scalable: can handle multiple evaluations very fast
• Easy to adjust as criteria change

Disadvantages:
• Probabilistic: different prompts can lead to different outputs
• May suffer from self-bias, first-position bias, or verbosity bias
• May introduce privacy risks
• Slower/more expensive than rule-based evaluations
• Requires effort to prepare and run

Final tip: do not use opaque judges (pre-built judges whose inner workings you can't see). Any change in the judge's model or prompt will change its results. If you can't see how the judge works, you can't interpret its results.
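As one possible shape for scenario 2 (assessing specific qualities of a response), here is a minimal sketch of a judge. The OpenAI SDK, the model names, the rubric, and the JSON fields are illustrative assumptions rather than the prompts referenced in the post.

```python
# A minimal sketch of an LLM-as-a-judge check: score a single response
# against a small rubric and return structured results.
# Assumptions: OpenAI Python SDK, illustrative model names, made-up rubric.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating the answer of a customer-support assistant.

Question: {question}
Answer: {answer}

Rate the answer on each criterion from 1 (poor) to 5 (excellent):
- correctness: is the information accurate?
- completeness: does it fully address the question?
- tone: is it polite and professional?

Respond as JSON: {{"correctness": int, "completeness": int, "tone": int, "explanation": str}}
"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> dict:
    # Start with the same model the application uses; later swap in a
    # smaller/cheaper model and check that the judgments still match.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

scores = judge(
    question="How do I reset my password?",
    answer="Go to Settings > Security > Reset password, then follow the email link.",
)
print(scores)
```

In practice you would run this judge against the labeled dataset first, compare its scores to the labels, and keep iterating on the prompt before trusting it on live traffic.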
-
"Quality starts before code exists", This is how AI can be used to reimagine the Testing workflow Most teams start testing after the build. But using AI, we can start it in design phase Stage - 1: WHAT: Interactions, font-size, contrast, accessibility checks etc. can be validated using GPT-4o / Claude / Gemini (LLM design review prompts) - WAVE (accessibility validation) How we use them: Design files → exported automatically → checked by accessibility scanners → run through LLM agents to evaluate interaction states, spacing, labels, copy clarity, and UX risks. Stage - 2: Tools: • LLMs (GPT-4o / Claude 3.5 Sonnet) for requirement parsing • Figma API + OCR/vision models for flow extraction • GitHub Copilot for converting scenarios to code skeletons • TestRail / Zephyr for structured test storage How we use them: PRDs + user stories + Figma flows → AI generates: ✔ functional tests ✔ negative tests ✔ boundary cases ✔ data permutations SDETs then refine domain logic instead of writing from scratch. Stage - 3: Tools: • SonarQube + Semgrep (static checks) • LLM test reviewers (custom prompt agents) • GitHub PR integration How we use them: Every test case or automation file passes through: SonarQube: static rule checks LLM quality gate that flags: - missing assertions - incomplete edge coverage - ambiguous expected outcomes - inconsistent naming or structure We focus on strategy -> AI handles structural review. Stage - 4: Tools: • Playwright, WebDriver + REST Assured • GitHub Copilot for scaffold generation • OpenAPI/Swagger + AI for API test generation How we use them: Engineers describe intent → Copilot generates: ✔ Page objects / fixtures ✔ API client definitions ✔ Custom commands ✔ Assertion scaffolding SDETs optimise logic instead of writing boilerplate. THE RESULT - Test design time reduced 60% - Visual regressions detected with near-pixel accuracy - Review overhead for SDETs significantly reduced - AI hasn’t replaced SDETs. It removed mechanical work so humans can focus on: • investigation • creativity • user empathy • product risk understanding -x-x- Learn & Implement the fundamentals required to become a Full Stack SDET in 2026: https://lnkd.in/gcFkyxaK #japneetsachdeva
-
We love to focus on models and algorithms, but data quality makes the real difference when training LLMs! Here's a practical guide for debugging your LLM's training dataset.

Developing an LLM. When training an LLM, we follow an iterative, two-step process:
1. Train our model
2. Evaluate our model
Each time we train a new model, we perform some intervention. Usually, this intervention is data-related. We keep everything the same, tweak our data, and see if performance improves.

Data curation strategies. There are two ways we can approach tweaking our data:
- Data-focused curation: directly look at the data and analyze its properties to find (and debug) existing issues.
- Model-focused curation: train an LLM over our data, find issues in its output, and use these issues to find corresponding problems in the data.
Data-focused curation does not require training a model, which makes it useful in the early phases of developing an LLM. But we should use both of these strategies in tandem.

Data-focused curation. To gain a deep understanding of our data, we need to start with manual inspection. Although this process is tedious, it's extremely important and done by all effective researchers; the more data you manually inspect, the better. As we inspect, we will begin to notice (and in some cases fix) issues and patterns in our data. To scale this curation process beyond our own judgement, however, we use automated techniques based either on heuristics or on other machine learning models (e.g., fastText models or LLM-as-a-Judge-style models). (A small sketch of heuristic filtering follows this post.)

Model-focused curation. Once we have started training LLMs over our data, we can use these LLMs to debug issues in the dataset. The idea of model-focused curation is simple, we just:
- Identify problematic or incorrect outputs produced by our model.
- Search for instances of training data that may lead to these outputs.
The identification of problematic outputs is handled through our evaluation system. We can have humans (even ourselves!) identify poor outputs via manual inspection, or efficiently find low-scoring outputs via our automatic evaluation setup.

OLMoTrace. Once problematic outputs are identified, we can use standard search techniques to match outputs to training data. However, researchers have developed specialized techniques for this purpose as well. For example, OLMoTrace uses a specialized span-matching algorithm to efficiently trace model outputs over pre-training-scale data.
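As a small illustration of heuristic, data-focused curation, here is a sketch that flags suspicious documents in a JSONL dataset before any model is trained. The file name, the "text" field, and every threshold are assumed for illustration; real pipelines combine many more filters and add classifier-based ones (fastText, LLM-as-a-Judge) on top.

```python
# A minimal sketch of heuristic data curation: scan a text dataset and flag
# documents that look problematic, so they can be inspected or dropped.
# Assumptions: "train.jsonl" with one {"text": ...} record per line, and
# illustrative thresholds; neither is a fixed standard.
import json
import re

def flag_issues(text: str) -> list[str]:
    issues = []
    words = text.split()
    if len(words) < 20:
        issues.append("too short")
    if words and len(set(words)) / len(words) < 0.3:
        issues.append("highly repetitive")
    if re.search(r"lorem ipsum|<html", text, re.IGNORECASE):
        issues.append("boilerplate or markup residue")
    if sum(c.isalpha() for c in text) / max(len(text), 1) < 0.6:
        issues.append("low alphabetic ratio (tables, code dumps, noise)")
    return issues

with open("train.jsonl") as f:  # hypothetical dataset layout
    for i, line in enumerate(f):
        doc = json.loads(line)
        issues = flag_issues(doc["text"])
        if issues:
            print(f"doc {i}: {', '.join(issues)}")
```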
-
NotebookLM: "This academic paper introduces a novel healthcare agent designed to enhance the capabilities of large language models (LLMs) in complex real-world medical consultations, addressing issues with inquiry quality and safety. The agent's architecture comprises three essential modules: a Dialogue component for safe, effective conversations, a Memory component to store patient history and ongoing dialogue, and a Processing component for generating consultation reports. Through evaluations using both medical professional assessments and an automated system, the researchers demonstrated that this modular framework significantly improves overall LLM performance across critical metrics like inquiry quality, response accuracy, and safety, highlighting a new paradigm for applying LLMs in healthcare without relying solely on expensive, one-time fine-tuning." From the source: "With the impressive success of large language models (LLMs), there hasbeen increasing attention on their application in medical consultations2–4. However, applying general LLMs to medical consultations in real-world scenarios presents several significant challenges. First, they cannot effectively guide patients through the step-by-step process of articulating their medical conditions and relevant information, a crucial element of real-world doctor-patient dialogue. Second, they lack the necessary strategies and safeguards to manage medical ethics and safety issues, putting patients at risk of serious consequences. Third, they cannot store consultation conversations and retrieve medical histories." "Both automated and doctor evaluations confirm that our healthcare agent significantly enhances general LLMs across seven key metrics: inquiry proactivity, inquiry relevance, conversational fluency, accuracy, helpfulness, harmfulness reduction, and self-awareness. Our ablation studies attribute these improvements to the synergistic design of the agent’s modular architecture. The Function module enables versatile scenario handling, with its Inquiry sub-module substantially improving information-gathering capabilities. The Safety module enhances response accuracy and self-awareness while minimizing potential harm. The Doctor module provides human-in-the-loop refinement, while the Memory component leverages both current and historical patient information to improve diagnostic accuracy and recommendation quality." see also: https://lnkd.in/ehhvWmyR https://lnkd.in/e52qiArV
-
Achieving 3x-25x Performance Gains for High-Quality, AI-Powered Data Analysis

Asking complex data questions in plain English and getting precise answers feels like magic, but it's technically challenging. One of my jobs is analyzing the health of numerous programs. To make that easier, we are building an AI app with Sapient Slingshot that answers natural language queries by generating and executing code on project/program health data. The challenge is that this process needs to be both fast and reliable. We started with gemini-2.5-pro, but 50+ second response times and inconsistent results made it unsuitable for interactive use. Our goal: reduce latency without sacrificing accuracy.

The New Bottleneck: Tuning "Think Time"
Traditional optimization targets code execution, but in AI apps, the real bottleneck is LLM "think time", i.e. the delay in generating correct code on the fly. Here are some techniques we used to cut think time while maintaining output quality:

① Context-Rich Prompts
Accuracy starts with context. We dynamically create prompts for each query:
➜ Pre-processing logic: we pre-generate any code that doesn't need "intelligence" so the LLM doesn't have to.
➜ Dynamic data-awareness: prompts include full schema, sample data, and value stats to give the model a full view.
➜ Domain templates: we tailor prompts for specific ontology like "Client satisfaction", "Cycle Time", or "Quality".
This reduces errors and latency, improving codegen quality from the first try.

② Structured Code Generation
Even with great context, LLMs can output messy code. We guide query structure explicitly:
➜ Simple queries: direct the LLM to generate a single-line chained pandas expression.
➜ Complex queries: direct the LLM to generate two lines, one for processing, one for the final result.
Clear patterns ensure clean, reliable output.

③ Two-Tiered Caching for Speed
Once accuracy was reliable, we tackled speed with intelligent caching:
➜ Tier 1: Helper Cache – 3x Faster
⊙ Find a semantically similar past query
⊙ Use a faster model (e.g. gemini-2.5-flash)
⊙ Include the past query and code as a one-shot prompt
This cut response times from 50+s to <15s while maintaining accuracy.
➜ Tier 2: Lightning Cache – 25x Faster
⊙ Detect duplicates for exact or near matches
⊙ Reuse validated code
⊙ Execute instantly, skipping the LLM
This brought response times to ~2 seconds for repeated queries.

④ Advanced Memory Architecture
➜ Graph Memory (Neo4j via Graphiti): stores query history, code, and relationships for fast, structured retrieval.
➜ High-Quality Embeddings: we use BAAI/bge-large-en-v1.5 to match queries by true meaning.
➜ Conversational Context: full session history is stored, so prompts reflect recent interactions, enabling seamless follow-ups.

By combining rich context, structured code, caching, and smart memory, we can build AI systems that deliver natural language querying with the speed and reliability that we, as users, expect of it.
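As a rough sketch of the two-tiered caching idea, here is what the lookup path could look like. The sentence-transformers usage of BAAI/bge-large-en-v1.5 matches the embedding model named above, but the in-memory cache structure, the 0.9 similarity threshold, and generate_code() are illustrative assumptions; the real system stores history in graph memory.

```python
# A rough sketch of two-tiered caching: an exact-match "lightning" cache that
# reuses validated code, and a semantic "helper" cache that feeds a similar
# past query into a faster model as a one-shot example.
# Assumptions: sentence-transformers, an in-memory cache, a 0.9 threshold,
# and generate_code() standing in for the actual LLM call.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
cache: list[dict] = []  # entries: {"query": str, "embedding": np.ndarray, "code": str}

def generate_code(query: str, example: dict | None) -> str:
    # Stand-in for the LLM call: a faster model (e.g. gemini-2.5-flash) when a
    # one-shot example is available, a stronger model otherwise.
    return f"# pandas code generated for: {query}"

def lookup(query: str, threshold: float = 0.9):
    # Tier 2 ("lightning"): exact/near-exact match -> reuse validated code, skip the LLM.
    for entry in cache:
        if entry["query"].strip().lower() == query.strip().lower():
            return "exact", entry
    # Tier 1 ("helper"): semantically similar past query -> one-shot example for a faster model.
    if cache:
        q = embedder.encode(query, normalize_embeddings=True)
        sims = [float(np.dot(q, entry["embedding"])) for entry in cache]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return "similar", cache[best]
    return "miss", None

def answer(query: str) -> str:
    kind, entry = lookup(query)
    if kind == "exact":
        return entry["code"]                    # ~25x: no LLM call at all
    code = generate_code(query, example=entry)  # ~3x with a one-shot example; slowest on a miss
    cache.append({"query": query,
                  "embedding": embedder.encode(query, normalize_embeddings=True),
                  "code": code})
    return code

print(answer("Average cycle time per program for Q3"))
print(answer("Average cycle time per program for Q3"))  # second call hits the lightning cache
```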
-
LLMs make a TON of mistakes, but with 1. good documentation, 2. good code review, and 3. the best models available, you can flawlessly accomplish very large changes, FASTER and BETTER than a human.

Here's a real example. At Formation, we have Session Studio: our live session environment. It's a real-time system with video, audio, chat, reactions, slides, hand-raising, polls, collaborative coding pads… the works. We recently changed the definitions of participant roles. It was a deep permission and behavior refactor across a complex, real-time surface area with dozens of flags and conditional checks. The kind of change that's easy to partially ship and quietly break production.

Here's how I used AI to pull it off:
1. Full System Audit: Codex generated a ~1,300-line audit of the entire current state, every permission path, flag, edge case, and role interaction.
2. Proposed Redesign: Codex then wrote a second document detailing every change required to support the new role definitions.
3. Engineering Plan: Using "plan mode" first, Claude merged both documents into a structured engineering spec with clear implementation phases.
4. "Adversarial" Iteration: Claude and Codex iterated on the docs, flagging inconsistencies, ambiguities, and decisions that required human judgment. I acted as editor-in-chief, resolving tradeoffs and clarifying intent.
5. Phased Execution (8 Phases): For each phase, Claude implemented, Codex reviewed, Claude fixed... repeat until clean, then a final Claude review.

Total time: ~24 hours of async back-and-forth.

The key insight: LLMs are unreliable in isolation. They're extremely powerful inside a system of documentation, review, and phased execution.