Show, Don't Tell: The Power of Sample Implementation for Coding Agents
Really, this title should read “Show and Tell,” but more on that later. If you stick with me until the end, you’ll see what I mean.
There are so many opinions being shared about AI in software engineering, but not much real data. To cut through the noise, our team at HTD has been running experiments to figure out what actually works. I'll be sharing results as we go, including what worked, what didn't, and what surprised us.
Recently, the HTD Labs team ran an experiment to answer a question that sounds simple but turns out to be surprisingly poorly understood: what drives code quality when AI agents generate software components?
Some people may suggest that “code quality” does not matter in a world where agents generate all the code. Who cares about readability when the bots generate everything?
We take a more measured view, particularly when building software for critical systems such as healthcare and medical devices: code quality is multi-dimensional and will always matter. Readability, traceability, security, runtime performance, accessibility, and regulatory compliance are just some of the dimensions that matter in production software systems. That is why, as a firm, we are betting on quality as a foundation.
There’s also a practical reason: high-quality code that is readable and semantically well structured makes automated code generation more efficient. LLMs produce output that is mathematically closest, in a high-dimensional space, to an input that has been transformed and constrained to the same output dimensionality. It stands to reason, then, that dimensionally consistent inputs are hugely beneficial for generating high-quality outputs.
In our experiment we gave a coding agent a user story and supporting context to generate a React component. In some cases, the context was limited to well specified markdown guidance for implementation, in this case specified for ADA compliance. In other cases, the context included that same guidance along with executable code examples.
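To make the two kinds of context concrete, here is a rough sketch of the difference. The markdown guidance states rules in prose, for example that every input must have a programmatically associated label and that error messages must be announced to assistive technology. An executable example encodes those same rules as working code the agent can compose from. The component below is illustrative only; names like TextField are hypothetical rather than the actual fixtures we used.

```tsx
// Illustrative sample implementation: accessibility behavior expressed as
// working code rather than described in prose. TextField is a hypothetical
// name, not the component from our experiment.
import React, { useId } from "react";

interface TextFieldProps {
  label: string;
  value: string;
  onChange: (value: string) => void;
  error?: string;
  required?: boolean;
}

export function TextField({ label, value, onChange, error, required = false }: TextFieldProps) {
  const inputId = useId();
  const errorId = useId();

  return (
    <div>
      {/* The label is programmatically associated with the input, not merely adjacent to it. */}
      <label htmlFor={inputId}>
        {label}
        {required && <span aria-hidden="true"> *</span>}
      </label>
      <input
        id={inputId}
        value={value}
        required={required}
        aria-required={required}
        // Validation state is exposed to assistive technology, not only styled visually.
        aria-invalid={error ? true : undefined}
        aria-describedby={error ? errorId : undefined}
        onChange={(e) => onChange(e.target.value)}
      />
      {/* The error message is announced by screen readers when it appears. */}
      {error && (
        <p id={errorId} role="alert">
          {error}
        </p>
      )}
    </div>
  );
}
```

When a file like this sits in the agent's context, the accessibility attributes are not rules waiting to be interpreted; they are patterns available to be copied.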
While we expected that guidance provided to the agent together with a sample implementation would perform better than guidance alone, we were surprised by just how much of a difference it made.
The implication about what AI agents are actually doing with the information you provide is clear: agents appear to compose from examples far more effectively than they interpret and follow written rules.
This makes sense once you recognize that LLMs (even “reasoning” LLMs) do not reason in the sense of formal logic. Instead, as noted above, they generate a result that honors a constraint function and sits closest, in a computed multi-dimensional space, to the input they were given. It follows that generation will be more effective when the input provides both an instruction and a prototype.
The documentation most engineering teams have built in the past was designed for human developers. Style guides, compliance checklists, and architecture decision records all assume a reader who can interpret an abstract rule and apply it to a specific situation. That's a sound assumption when the reader is a person. Our data suggests it's a flawed one when the reader is an AI coding agent.
Memorization vs. Intelligence
The common mental model is that AI agents read documentation, understand what's required, and reason about implementation. What we observed in our experiment looks closer to pattern composition. The agent looks at the working code in its context and produces output that mirrors those patterns. When the examples include accessibility attributes, the output includes them. When accessibility exists only as a checklist in a separate document, it doesn't reliably transfer even though that document was in the same context window.
LLMs process natural language and therefore prose context matters for scoping tasks and conveying business logic. But when it comes to generating code that conforms to specific technical standards, pattern composition appears to dominate over rule interpretation by a wide margin.
The key takeaway is not that LLMs are unintelligent, but rather that the behavior we observed looks much more like sophisticated composition from examples than reasoning.
The practical consequence here is that most organizations' documentation investments need a parallel layer they don't currently have. The prose standards still serve human team members and still help scope what the AI builds. But those same standards also need to be expressed as executable code embedded directly in the patterns the AI will compose from.
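One way to picture that parallel layer, as a sketch rather than a prescription of specific tooling: keep each prose standard next to a reference implementation that embodies it, and hand the agent both when it generates a component. The file paths and the buildAgentContext helper below are hypothetical, assumed only for illustration.

```ts
// Hypothetical sketch: pair each human-readable standard with an executable
// pattern, and assemble both into the agent's context. Paths and names are
// illustrative assumptions, not an existing tool.
import { readFileSync } from "node:fs";

interface Standard {
  name: string;
  guidancePath: string; // prose rules, written for humans
  patternPath: string;  // working code that embodies those rules
}

const standards: Standard[] = [
  {
    name: "ADA-compliant form fields",
    guidancePath: "standards/accessibility/form-fields.md",
    patternPath: "patterns/accessibility/TextField.tsx",
  },
];

// Build a prompt that gives the agent both the rule and a pattern to compose from.
export function buildAgentContext(userStory: string, selected: Standard[]): string {
  const sections = selected.map((s) => {
    const guidance = readFileSync(s.guidancePath, "utf8");
    const pattern = readFileSync(s.patternPath, "utf8");
    return [
      `Standard: ${s.name}`,
      "Guidance (for scoping and intent):",
      guidance,
      "Reference implementation (compose from this):",
      pattern,
    ].join("\n\n");
  });

  return ["User story:", userStory, ...sections].join("\n\n");
}
```

The helper is trivial by design; the real work is maintaining patterns that genuinely embody the standards, because those patterns are what the agent will reproduce.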
This is a clear signal from one experiment, not a universal law. But if the actual mechanism is pattern composition rather than rule interpretation, then a lot of the current assumptions about AI-assisted development, such as the tooling and the way teams prepare context, may be built on a misunderstanding of what the AI is doing with the information we give it.
If you're investing in AI-assisted development, it's worth asking: are your standards expressed as rules for humans to read, or as patterns for agents to compose from?
As we run more experiments, we'll keep sharing what we find.