Building Trust In Software

Explore top LinkedIn content from expert professionals.

  • Vin Vashishta

    AI Strategist | Monetizing Data & AI For The Global 2K Since 2012 | 3X Founder | Best-Selling Author

    209,648 followers

    ChatGPT’s new reasoning models are hallucinating more often than previous generations, and the cause is still under investigation. It appears that adding the ability to break tasks down degraded OpenAI’s LLM reliability. On the PersonQA and SimpleQA benchmarks, OpenAI’s new reasoning models (o3 and o4-mini) hallucinated between 33% and 79% of the time. The problem likely affects Google’s and DeepSeek’s reasoning models as well.

    It may be a cascading failure: each additional call to the LLM amplifies minor issues and inaccuracies as more steps are completed, and piling all those small inaccuracies on top of each other produces more noticeable errors. In any case, reasoning models fail at rates that make them unusable for consumer-facing products. It’s another setback to productizing LLMs, and there’s no timeline for when reliability will improve.

    For now, small language models (SLMs) are the best option for generative AI products. They cost less and are easier to put guardrails around. Post-training SLMs with domain-specific data helps them achieve higher reliability than LLMs. SLMs lack the horizontal breadth of knowledge but make up for it with vertical depth, enabling a narrow set of capabilities. They can support a few workflows well, but don’t generalize the way LLMs are intended to.

    LLMs, however, don’t meet the reliability requirements for most use cases. When they generalize, users can’t trust the output, so they can’t be integrated into AI products, especially agents that take action independently. And as Anthropic recently discovered, we can’t trust LLMs’ explanations of how they arrived at the answers they generate: models will often provide an explanation that doesn’t fit the reality of their internal processes.

    The fact that LLMs are unexplainable and unreliable means they aren’t ready for prime time. That doesn’t mean the technology is useless, and it’s essential not to overlook what does work (SLMs) just because some things don’t.
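
    What “easier to put guardrails around” looks like in practice can be very simple. A minimal sketch, assuming a classification-style workflow; `call_slm` and the label set are hypothetical placeholders, not any specific vendor’s API:

    ```python
    # Constrain a small model to a closed label set; reject anything else.
    # `call_slm` is a hypothetical stand-in for your inference client.

    ALLOWED_INTENTS = {"refund", "order_status", "cancel", "other"}

    def call_slm(prompt: str) -> str:
        # Placeholder: swap in your actual SLM call.
        return "refund"

    def classify_intent(message: str) -> str:
        raw = call_slm(
            "Classify the customer message into exactly one of "
            f"{sorted(ALLOWED_INTENTS)}. Reply with the label only.\n\n{message}"
        )
        label = raw.strip().lower()
        # A narrow output contract is cheap to verify: anything outside
        # the label set is mapped to "other" instead of flowing downstream.
        return label if label in ALLOWED_INTENTS else "other"

    print(classify_intent("I want my money back"))  # refund
    ```

    The narrower the output contract, the cheaper it is to verify, which is the practical sense in which SLM-backed features are easier to guard.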

  • Peiru Teo

    CEO @ KeyReply | Hiring for GTM & AI Engineers | NYC & Singapore

    8,586 followers

    It shouldn’t surprise people that LLMs are not fully deterministic; at scale, they can’t be. Even when you set temperature to zero, fix the seed, and send the exact same prompt, you can still get different outputs in production.

    There’s a common misconception that nondeterminism in LLMs comes only from sampling strategies. In reality, part of the variability comes from how inference is engineered at scale. In production systems, requests are often batched together to optimize throughput and cost. Depending on traffic patterns, your prompt may be grouped differently at different times. That changes how certain low-level numerical operations are executed on hardware. And because floating-point arithmetic is not perfectly associative, tiny numerical differences can accumulate and lead to different token choices. The model weights haven’t changed, and neither has the prompt. But the serving context has.

    Enterprise teams often evaluate models assuming reproducibility is guaranteed if parameters are fixed. But reliability in LLM systems is not only a modeling problem. It is a systems engineering problem. You can push toward stricter determinism, but doing so may require architectural trade-offs in latency, cost, or scaling flexibility.

    The point is not that LLMs are unreliable, but that nondeterminism is part of the stack. If you are deploying AI in production, you need to understand where it enters, and design your evaluation, monitoring, and governance around it.
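
    You can see the numerical effect described above without any model at all: summing the same floating-point values in a different order, which is effectively what a different batch grouping does, can change the result. A quick illustration in Python:

    ```python
    import random

    # Floating-point addition is not associative: (a + b) + c can differ
    # from a + (b + c). Reordering a reduction, as different batch
    # groupings do on GPU hardware, can change the result slightly.
    random.seed(0)
    values = [random.uniform(-1e8, 1e8) for _ in range(10_000)] + [1e-6] * 10_000

    forward = sum(values)
    backward = sum(reversed(values))

    print(forward == backward)      # typically False
    print(abs(forward - backward))  # a tiny but nonzero difference
    ```

    When such a tiny difference lands near a tie between two candidate tokens, greedy decoding can flip, and every token generated after that point can diverge.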

  • Sneha Vijaykumar

    Data Scientist @ Takeda | Ex-Shell | Gen AI | LLM | RAG | AI Agents | Azure | NLP | AWS

    25,180 followers

    If I had to make LLM systems reliable in production, I wouldn’t start by adding more prompts. I’d focus on mastering these ideas:

    • Grounding outputs back to source data
    • Designing clear input and output contracts
    • Detecting when the model is uncertain
    • Validating structured outputs before use
    • Isolating failures so one bad call doesn’t break the system
    • Adding checkpoints instead of long fragile chains
    • Building retries with intent, not blind loops
    • Logging decisions, not just final answers
    • Evaluating behavior over time, not one-off responses

    None of this shows up in demos. All of it shows up in real systems. Most LLM failures aren’t “model issues”. They’re engineering discipline issues. If you care about deploying GenAI beyond notebooks, these are the skills that actually matter.

    #LLM #GenAI #AIEngineering #ProductionAI #SystemsDesign #Interviews #AI #Jobs Follow Sneha Vijaykumar for more... 😊
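
    Two of these ideas, validating structured outputs and retrying with intent, fit in a few lines. A minimal sketch, assuming a JSON output contract; `call_llm` is a hypothetical placeholder for your model client:

    ```python
    import json

    def call_llm(prompt: str) -> str:
        # Placeholder: swap in your actual model client.
        return '{"sentiment": "positive", "confidence": 0.92}'

    REQUIRED_KEYS = {"sentiment", "confidence"}

    def extract_sentiment(text: str, max_retries: int = 2) -> dict:
        prompt = f"Return JSON with keys {sorted(REQUIRED_KEYS)} for: {text}"
        last_error = None
        for attempt in range(1 + max_retries):
            raw = call_llm(prompt)
            try:
                data = json.loads(raw)
                missing = REQUIRED_KEYS - data.keys()
                if missing:
                    raise ValueError(f"missing keys: {missing}")
                return data  # validated before any downstream use
            except (json.JSONDecodeError, ValueError) as err:
                last_error = err
                # Retry with intent: tell the model what was wrong,
                # rather than blindly resending the same prompt.
                prompt += f"\nYour previous reply was invalid ({err}). Fix it."
        raise RuntimeError(f"no valid output after retries: {last_error}")

    print(extract_sentiment("Loved the product, shipping was fast."))
    ```

    The failure path matters as much as the happy path: a bounded retry with feedback, then a hard error, keeps one bad call from silently corrupting the system.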

  • Div Rakesh

    Bridging AI Depth & Business Strategy | Technology Transformation Leader | VP Data & AI | Ex-Fractal | Ex-IBMer | Ex-Infosion

    4,552 followers

    Large Language Models are powerful, but they’re not software factories. They can generate code, automate documentation, and accelerate workflows, but…

    👉 They lack system integration
    👉 They don’t guarantee compliance & governance
    👉 They can’t ensure enterprise-grade reliability

    Enterprises don’t just need “working code.” They need secure, scalable, and maintainable systems, something LLMs alone can’t deliver (yet).

    💡 The real opportunity: using LLMs within a broader AI + engineering strategy, not as a replacement.
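
    One concrete version of “LLMs within a broader engineering strategy” is to gate generated code behind the team’s own test suite. A minimal sketch; the pytest-based gate and function names are illustrative choices, not any specific product’s workflow:

    ```python
    import subprocess
    import tempfile
    from pathlib import Path

    # Generated code must pass the team's own tests before it goes
    # anywhere; it never ships on the model's say-so alone.

    def gate_generated_code(code: str, tests: str) -> bool:
        """Return True only if the generated module passes the test suite."""
        with tempfile.TemporaryDirectory() as tmp:
            (Path(tmp) / "generated.py").write_text(code)
            (Path(tmp) / "test_generated.py").write_text(tests)
            result = subprocess.run(
                ["python", "-m", "pytest", tmp], capture_output=True, text=True
            )
            return result.returncode == 0  # ship only on green

    # "Working code" is accepted or rejected by tests, not by how
    # plausible it looks. (Requires pytest to be installed.)
    llm_output = "def add(a, b):\n    return a + b\n"
    suite = "from generated import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
    print(gate_generated_code(llm_output, suite))
    ```

    Real pipelines would add linting, security scanning, and human review on top; the point is that the checks live outside the model.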

  • Maria Palma

    General Partner at Freestyle Capital. Investing in amazing technical founders. Writing on Substack @unconstrained

    6,780 followers

    When Probabilities Compound: Why Agent Accuracy Breaks Down

    Here’s the obvious thing about LLMs that I still think isn’t talked about enough.

    In traditional software, you can run the same input a million times and get the exact same output. That’s determinism. CPUs are the archetype here: perfectly predictable, clockwork precise.

    LLMs don’t work that way. They’re probabilistic. Every output is a weighted guess over possible tokens. You can tune the randomness (temperature), but even at zero, small differences in context or prompt can shift results. GPUs, built for parallel matrix multiplications, are what make this possible at scale, but they’re also part of the probabilistic paradigm that’s replacing deterministic computation in many workflows. Many people I talk to every day in AI still haven’t wrapped their heads around this. As an Industrial Engineer by degree, the statistics hit me in the face.

    Now add agents into the mix. Those deep in AI know this intimately, but newer founders and builders in the agentic space are learning it the hard way. One LLM call → slight uncertainty. Chain 5–10 LLM calls across an agent workflow → you’re compounding that uncertainty. It’s like multiplying probabilities less than 1 together: the overall accuracy drops fast. You have errors compounding.

    This matters if you’re building with multi-step reasoning, tool use, or autonomous agents:

    • Your workflow is only as reliable as the weakest probabilistic link
    • Guardrails, verification, and redundancy aren’t “nice-to-haves”; they’re architecture
    • The longer your chain of calls, the more you need to design for failure modes

    Probabilistic systems open up new possibilities that deterministic systems never could. But if you don’t understand how probabilities compound, you’ll overestimate what’s possible and ship something brittle. To me, this is what squares the disconnect I’m hearing in the market, where in many ways we are “ahead” of where we thought we might be with agents, and in many ways we are “behind.” As VCs, we’re watching the founders who design for this reality, not against it. They’re the ones building AI systems that will stand up in production.

    For entertainment value and a reminder: three screenshots below, courtesy of a friend, all wrong but presented by Google Gemini as the answer to a simple question. Some are wrong in plain sight; for others (the tallest-building one, which is WAY off) you have to know the correct answer to spot it. We still aren’t that accurate on a single LLM call, let alone a daisy chain of agents.

    💭 Curious: How are you mitigating compounded uncertainty in your LLM workflows? What deterministic tools are you adding in to improve accuracy?
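
    The arithmetic behind “multiplying probabilities less than 1 together” is worth seeing concretely. A small sketch in Python; the 95% per-step accuracy is an illustrative assumption, not a measured figure:

    ```python
    # How per-step accuracy compounds across a chain of LLM calls.
    # The 95% per-step figure is an illustrative assumption, and
    # failures are treated as independent.

    def chain_success(per_step_accuracy: float, steps: int) -> float:
        """Probability that every step in the chain succeeds."""
        return per_step_accuracy ** steps

    for steps in (1, 5, 10, 20):
        print(f"{steps:>2} steps at 95% each -> {chain_success(0.95, steps):.1%}")

    # Output:
    #  1 steps at 95% each -> 95.0%
    #  5 steps at 95% each -> 77.4%
    # 10 steps at 95% each -> 59.9%
    # 20 steps at 95% each -> 35.8%
    ```

    This sketch assumes independent failures; adding a verification or guardrail step after each call, even an imperfect one, raises the effective per-step number and therefore bends the whole curve.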

  • Joe Woodham

    Senior product designers embedded in 7 days, not 12 weeks. No ramp-up. No risk. Proven across 100+ product teams.

    23,460 followers

    The problem isn’t your components. It’s how people feel using them.

    A design system isn’t there to look clean. It’s there to create confidence. Because trust is the real output of any system.

    Here’s how it usually breaks:

    – Everyone has access, but no one feels ownership
    – Updates happen quietly, then break delivery
    – Components exist, but don’t reflect product goals
    – People double-check decisions instead of moving forward

    The result? A system that slows you down instead of speeding you up.

    What you need instead:

    1.) Predictable decisions → Create decision patterns, not approval chains. Map the 3–4 recurring design choices your team makes every sprint and document how they’re decided. When everyone knows the process, they stop waiting for permission.

    2.) Visible ownership → Name who maintains what and make it public. Every component, rule, and doc should have an owner in Figma or Notion. Ownership builds accountability, and accountability builds trust.

    3.) Change rhythm → Treat updates like releases, not surprises. Announce system changes with short “release notes.”

    4.) Alignment to product priorities → Link design debt to business impact. When the system evolves around product goals, not designer preferences, it becomes a tool for delivery, not decoration.

    5.) Cross-discipline check-ins → Reflect, don’t inspect. Stop reviewing pixels. Start reviewing how the system actually supported delivery this sprint.

    Design systems aren’t about consistency. They’re about trust: in the tools, in the process, and in each other.

    If this resonated, share it with someone leading a complex team. Follow Joe Woodham for weekly insights on design leadership, systems thinking, and what actually scales.

  • Yesterday I started my second year with HLB International 🎉 A reflection point, and I thought about the fundamental shift in how we need to think about our business environment. The old way of describing our world, VUCA (Volatile, Uncertain, Complex, Ambiguous), has evolved into BANI: Brittle, Anxious, Nonlinear, Incomprehensible.

    What This Means in Plain Terms for Leaders….

    Brittle means systems that look strong can suddenly break when pushed too far. Think of how one supply chain disruption can halt an entire operation. Anxious reflects the constant worry we all feel when making decisions with limited information against a backdrop of unpredictable geopolitics. Nonlinear simply means small actions can create huge results, or huge efforts might barely move the needle. Incomprehensible acknowledges that some situations are just too complex to fully understand, no matter how much data we gather.

    So to share my reflections… I see four straightforward principles that will guide us as leaders:

    1. Plan for breaks, not just bends. Instead of trying to build perfect systems, create backup plans and quick recovery options for when things inevitably go wrong.

    2. Turn worry into preparation. Rather than ignoring the anxiety we all feel, acknowledge it and use it to prepare better for various outcomes.

    3. Test small before going big. Since we can’t always predict what will work, run small experiments first, then quickly scale up what shows promise.

    4. Be comfortable not knowing everything. Some problems won’t have clear answers. Make the best decisions with the information available and adjust as we learn.

    How This Helps Our Clients

    Our clients face these same challenges. Our job isn’t just solving today’s problems; it’s helping clients build simple, flexible approaches that work when tomorrow brings unexpected changes.

    As I look ahead to my second year, I’m excited to guide our people through this changing landscape. We won’t have all the answers, but we’ll ask the right questions and build a network that thrives on change rather than just surviving it. The most successful organisations won’t be those with the most complex strategies, but those who can adapt quickly when things don’t go as planned.

  • Sana Khalid

    Startup Ecosystem Builder | Turnaround Operator | Leadership Intelligence & Exec Search

    23,662 followers

    Every time I see an organisation drowning in “people problems,” the diagnosis is the same: the system got bigger than the trust holding it together.

    And when trust collapsed, leaders compensated with:

    - more process
    - more tools
    - more meetings
    - more dashboards
    - more policies

    All of which made the system even heavier. And the trust only got thinner.

    In a conversation yesterday, a founder said (paraphrasing): “I used to know the health of the business by just walking around the office or talking to three people. Now I need three tools and a weekly report, and I still don’t know.”

    The fix isn’t more sophisticated systems. It’s this: Shrink (the surface area). Reinforce (the trust). Grow (again).

    In practice, that means:

    - Clear expectations on behavior, not just outcomes. What does ownership actually look like here? What pace are we working at? Name it explicitly.
    - Fast feedback loops. Not annual surveys. Daily or weekly pulse checks. Can someone say “this isn’t working” without it becoming too big to handle?
    - Fewer moving parts. Cut the tools, meetings, and processes that exist because someone thought they should or because ‘that’s how we do it’, not because they’re load-bearing.
    - Safety with consequences. Psychological safety isn’t the absence of accountability. It’s knowing you won’t be punished for honesty, but you will be held to your commitments.
    - Calm at the top. If leadership is chaotic, the whole system absorbs that chaos.

    I’ve seen this pattern across a range of companies:

    - A 200-person tech company that rolled out OKRs and hired a VP of Operations, only to feel more confused
    - Coworking spaces that collapsed under expansion
    - A family-owned business where every initiative required three approval layers that didn’t trust each other

    In almost every case, the solution wasn’t to start with new tools or frameworks (they have their place and time). It was rebuilding trust in a small core group first, then expanding from there.

    If you can relate to the ‘messy middle’, ask yourself whether you have a scaling problem or a trust bandwidth problem. Can the people who need to work together do so without a meeting? Without a tool? Without checking with someone first? If not, the system is probably bigger than the trust holding it together.

    I believe the future of work isn’t just about remote vs. in-office, flat vs. hierarchical, fast vs. sustainable. It’s about building systems tight enough to carry weight, then replicating them, not just inflating them.

  • Raphaël MANSUY

    Data Engineering | DataScience | AI & Innovation | Author | Follow me for deep dives on AI & data-engineering

    33,995 followers

    🚨 Reality Check: Your AI agent isn’t unreliable because it’s “not smart enough”. It’s drowning in instruction overload.

    A groundbreaking paper just revealed something every production engineer suspects but nobody talks about: LLMs have hard cognitive limits.

    The Hidden Problem:
    • Your agent works great with 10 instructions
    • Add compliance rules, style guides, error handling → 50+ instructions
    • Production requires hundreds of simultaneous constraints
    • Result: exponential reliability decay nobody saw coming

    What the Research Revealed (IFScale benchmark, 20 SOTA models):

    📊 Performance Cliffs at Scale:
    • Even GPT-4.1 and Gemini 2.5 Pro: only 68% accuracy at 500 instructions
    • Three distinct failure patterns:
      - Threshold decay: sharp drop after a critical density (Gemini 2.5 Pro)
      - Linear decay: steady degradation (GPT-4.1, Claude Sonnet)
      - Exponential decay: rapid collapse (Llama-4 Scout)

    🎯 Systematic Blind Spots:
    • Primacy bias: early instructions are followed 2-3x more reliably than later ones
    • Error evolution: low load = modification errors, high load = complete omission
    • Reasoning tax: o3-class models maintain accuracy but suffer 5-10x latency hits

    👉 Why This Destroys Agent Reliability: If your agent needs to follow 100 instructions simultaneously:
    • 80% accuracy per instruction = 0.8^100 ≈ 0.00000002% success rate
    • Add compound failures across multi-step workflows
    • Result: agents that work in demos but fail in production

    The Agent Reliability Formula: Agent Success Rate = (Per-Instruction Accuracy)^(Total Instructions)

    Production-Ready Strategies:
    🎯 1. Instruction Hierarchy: place critical constraints early (primacy bias advantage)
    ⚡ 2. Cognitive Load Testing: use tools like IFScale to map your model’s degradation curve
    🔧 3. Decomposition Over Density: break complex agents into focused micro-agents (3-10 instructions each)
    🎯 4. Error Type Monitoring: track modification vs omission errors to distinguish capacity from attention failures

    The Bottom Line: LLMs aren’t infinitely elastic reasoning engines. They’re sophisticated pattern matchers with predictable failure modes under cognitive load.

    Real-world impact:
    • 500-instruction agents: 68% accuracy ceiling
    • Multi-step workflows: compound failures
    • Production systems: reliability becomes mathematically impossible

    The Open Question: Should we build “smarter” models or engineer systems that respect cognitive boundaries? My take: the future belongs to architectures that decompose complexity, not models that brute-force through it.

    What’s your experience with instruction overload in production agents? 👇
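
    The reliability formula above makes the case for decomposition directly. A small sketch of the arithmetic; the per-instruction accuracies are illustrative assumptions, and instruction failures are treated as independent:

    ```python
    # Comparing a monolithic agent against decomposed micro-agents using
    # the formula: success = per_instruction_accuracy ** n_instructions.
    # Accuracy figures are illustrative assumptions.

    def success_rate(per_instruction_accuracy: float, n_instructions: int) -> float:
        return per_instruction_accuracy ** n_instructions

    # One agent juggling 100 instructions at once.
    monolith = success_rate(0.98, 100)

    # The same 100 instructions split across 10 micro-agents of 10 each.
    # A smaller instruction set is assumed to raise per-instruction
    # accuracy (here from 0.98 to 0.995).
    micro_agent = success_rate(0.995, 10)
    pipeline = micro_agent ** 10  # all 10 micro-agents must succeed

    print(f"monolith (100 instructions @ 98%): {monolith:.1%}")    # ~13.3%
    print(f"one micro-agent (10 @ 99.5%):      {micro_agent:.1%}") # ~95.1%
    print(f"pipeline of 10 micro-agents:       {pipeline:.1%}")    # ~60.6%
    ```

    Decomposition only pays off if the smaller instruction budget actually buys back per-instruction accuracy; that is exactly the assumption worth testing against your own model’s degradation curve.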

  • George Hurn-Maloney

    Co-Founder @ Fastino

    8,080 followers

    We published a case study on LLM inadequacy in healthcare last week. This week, a Nature Medicine article reinforced our findings.

    Luc Rocher and colleagues from the Oxford Internet Institute, University of Oxford published an article in Nature Medicine testing GPT-4o, Llama 3, and Command R+ with 1,298 people across 10 medical scenarios. The results reveal what the authors call a “translation gap.”

    When the researchers fed the models clean, structured data in the form of Standardized Medical Scenarios (SMS), the models identified medical conditions with an average of 94.9% accuracy. But when the same models had to identify conditions in a chatbot setting, with less structured data and more “noise,” accuracy fell to 34.9%. Participants who used a chatbot identified conditions in less than 34.5% of cases, and the right course of action in less than 44.2%.

    The researchers found that the LLMs were highly sensitive to user bias and tended to agree with the user’s assessment of the situation significantly more often than they should. This is unsurprising, given recent findings about LLM sycophancy. They also found that in chatbot scenarios, the LLMs were sensitive to even very slight variations in how users phrased questions, demonstrating overall brittleness and unreliability in medical language generation.

    The Nature study shows exactly why this matters: LLMs are excellent encoders of medical knowledge but poor generators in practice. It underscores one of the most critical success patterns we’re seeing in AI right now: model architectures must be matched to their downstream tasks. Fastino Labs’s GLiNER2 excels at encoding and extracting information, not generating erroneous advice.

    Links to the Nature Medicine paper and our blog post below.
    🔗 Nature Medicine paper: https://lnkd.in/gesYWrVw
    🔗 Blog: https://lnkd.in/gcNmnA8T
