Go back to coding

The "Micro-app builders", or non-tech "vibe coders" population, that I introduced previously, come from very different backgrounds:

  • Tech-trained professionals who didn’t pursue a software development career but understand the fundamentals of programming
  • Business users with no understanding of variables, functions, or memory allocation
  • And many shades of gray in between

"Vibe coders" is definitely not a persona.

I belong to the first group, and I believe this is where the opportunities are greatest.

I recently built a prototype to try to automate agentic evaluations for (agentic) micro-app builders... I know! Finding app devs who actually evaluate their agentic apps is hard. Never mind builders who might not even know evals exist 🤣

But the learning opportunity was too great.

Evaluations for the masses

Given an agent description, the app produces a synthetic dataset, has the user validate it, and runs the evaluation. It then analyzes the outliers and recommends edge cases to improve the agent the builder is working on.
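Here is a minimal sketch of that flow. Everything in it (function names, the exact-match grader, the number of generated cases) is my own illustration, not the actual prototype:

```python
# Hypothetical sketch of the evaluation flow described above; all names are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    input: str
    expected: str

@dataclass
class Result:
    case: Case
    output: str
    score: float

def generate_synthetic_cases(agent_description: str) -> list[Case]:
    # Placeholder: a real version would prompt an LLM with the agent description.
    return [Case(input=f"sample request #{i}", expected="other") for i in range(5)]

def user_validate(cases: list[Case]) -> list[Case]:
    # Placeholder: the builder reviews and corrects the generated cases here.
    return cases

def run_evaluation(agent: Callable[[str], str], cases: list[Case]) -> list[Result]:
    results = []
    for case in cases:
        output = agent(case.input)
        # Exact-match grader, just as an example; richer graders come up below.
        results.append(Result(case, output, 1.0 if output == case.expected else 0.0))
    return results

def recommend_edge_cases(results: list[Result]) -> list[str]:
    # Placeholder: analyze the outliers and suggest edge cases to add.
    return [f"Add a variant of: {r.case.input}" for r in results if r.score < 1.0]
```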

It is quite easy to talk about these topics in an abstract way and think you've got it.

But building, just like writing, forces you to be sharper and effectively improve your understanding of any topic.

In this particular example, it made me:

  • Fight with ambiguous cases I didn't think of
  • Realize the flaws of the agentic workflow I was trying to evaluate
  • Better grasp what testing criteria exist and what graders are
  • Understand the differences between them
  • Realize the specificity of the evaluation of different types of agents
  • Etc.

Take the example of a classification agent that is supposed to tell you whether a customer request is about returning an item, cancelling a subscription, or something else. Its output should be "return_item", "cancel_subscription" or "other". Hence, if the output is not exactly one of these three, something doesn’t work.

So over the life of the product, you are basically going to monitor that error rate against a ground-truth dataset of human-validated entries.
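A toy sketch of that error rate, with made-up data and the three labels from the example:

```python
# Error rate of a classification agent against a human-validated ground truth.
VALID_LABELS = {"return_item", "cancel_subscription", "other"}

# (agent_output, ground_truth) pairs from the validated dataset -- made-up data.
predictions = [
    ("return_item", "return_item"),
    ("cancel subscription", "cancel_subscription"),  # formatting drift: counts as an error
    ("other", "cancel_subscription"),                # wrong label: counts as an error
]

errors = sum(
    1 for output, truth in predictions
    if output not in VALID_LABELS or output != truth
)
error_rate = errors / len(predictions)
print(f"error rate: {error_rate:.0%}")  # 67% in this toy example
```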

That’s easy (to understand, at least; making it work reliably in practice is another story, despite the apparent simplicity).

Now take the example of a very different agent: a recommendation agent. It provides you with a list of options for what to buy next, given your user profile, context, available catalog, and constraints.

In recommendation systems, there is rarely a single “correct” answer.

Outputs are open-ended, catalogs evolve, and multiple options can be equally valid. What matters is not correctness, but ordering: does the agent consistently bring the best options to the top?

This is a quality gradient, not a binary truth.

An evaluation of that gradient could be done via a multi-objective evaluation with a composite score:

Score = a × relevance + b × margin − c × risk
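As a tiny illustration (the weights below are arbitrary, which is precisely the problem raised next):

```python
# Multi-objective composite score for one recommended item.
# The weights a, b, c are arbitrary here -- choosing them is a product policy decision.
def composite_score(relevance: float, margin: float, risk: float,
                    a: float = 0.6, b: float = 0.3, c: float = 0.1) -> float:
    return a * relevance + b * margin - c * risk

print(composite_score(relevance=0.9, margin=0.4, risk=0.2))  # 0.64
```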

In that case the builder’s issue is: "If there’s no single ground truth, how can I tell whether my agent is getting better?"

Problem: the choice of weights is a product policy decision, and it can easily mask trade-offs.

So over the life of the product, there is no equivalent of the classification agent's error rate to monitor. A more meaningful indicator could be something like: "Is there at least one relevant answer in the top K results provided by the agent?"
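That indicator is often called hit@K. A minimal sketch, with made-up data:

```python
# hit@K: fraction of queries where at least one relevant item appears in the top K results.
def hit_at_k(ranked_results: list[list[str]], relevant_sets: list[set[str]], k: int = 3) -> float:
    hits = sum(
        1 for ranked, relevant in zip(ranked_results, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_results)

# Two queries: the first has a relevant item in the top 3, the second does not.
ranked = [["sku_12", "sku_7", "sku_3"], ["sku_9", "sku_1", "sku_4"]]
relevant = [{"sku_7"}, {"sku_22"}]
print(hit_at_k(ranked, relevant, k=3))  # 0.5
```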

You can quantify this with something equivalent to an Elo score, the rating used for chess players.
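For the Elo-style part, comparing two agent versions (or two rankings) head to head, the standard chess update rule looks like this. The K-factor and starting ratings below are the usual defaults, not something from my prototype:

```python
# Elo-style rating update after a pairwise comparison between two agent versions.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Version A's recommendations were judged better than version B's on one query.
print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)
```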

At that point you realize that the initial approach was naive and that to support all types of agents, some more work is going to be needed 🙃

Conclusion

There are levels of understanding you can surely reach intellectually.

But there is a huge difference between understanding and realizing.

So my advice for 2026: start building something, anything. I guarantee you the journey will be worth it, even if you end up throwing away what you coded 😀
