Go back to coding
The "Micro-app builders", or non-tech "vibe coders" population, that I introduced previously, come from very different backgrounds:
"Vibe coders" is definitely not a persona.
I belong to the first group, and I believe this is where the opportunities are greatest.
I recently built a prototype to try to automate agentic evaluations for (agentic) micro-app builders... I know! Finding app developers who actually evaluate their agentic apps is hard. Never mind builders who might not even know evals exist 🤣
But the learning opportunity was too great.
Evaluations for the masses
Given an agent description, the app produces a synthetic dataset, has the user validate it, and runs the evaluation. It then analyzes the outliers and provides recommendations for edge cases to improve the agent the builder is working on.
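The loop described above can be sketched roughly as follows. Everything here is illustrative: the function names and stub implementations are hypothetical stand-ins, not the prototype's actual code.

```python
# Hypothetical sketch of the evaluation loop: generate synthetic cases,
# let the user validate them, run the agent, analyze the results.
# All names and stubs below are illustrative, not the real implementation.

def generate_dataset(agent_description):
    # Stand-in for LLM-driven synthetic data generation.
    return [{"input": f"case {i} for: {agent_description}"} for i in range(3)]

def evaluate_agent(agent_description, agent, user_validate, analyze_outliers):
    # Keep only the cases the user confirmed as realistic.
    dataset = [c for c in generate_dataset(agent_description) if user_validate(c)]
    # Run the agent on each validated case.
    results = [(case, agent(case["input"])) for case in dataset]
    # Turn outliers into edge-case recommendations for the builder.
    return analyze_outliers(results)

# Usage with trivial stand-ins:
report = evaluate_agent(
    "classify support requests",
    agent=lambda text: "other",
    user_validate=lambda case: True,
    analyze_outliers=lambda results: {"n_cases": len(results)},
)
```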
It is quite easy to talk about these topics in an abstract way and think you've got it.
But building, just like writing, forces you to be sharper and effectively improve your understanding of any topic.
Here is what building it made me realize in this particular case.
Take the example of a classification agent that is supposed to tell you whether a customer request is about returning an item, cancelling a subscription, or something else. Its output should be "return_item", "cancel_subscription", or "other". Hence, if the output is not exactly one of these three, something doesn't work.
So over the life of the product, you basically monitor that error rate against a ground-truth dataset of human-validated entries.
That's easy (to understand, that is; making it work reliably in practice is another matter, despite the apparent simplicity).
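The check above fits in a few lines. This is a minimal sketch, assuming the output counts as an error either when it is not one of the three valid labels or when it disagrees with the human-validated label:

```python
# Minimal validity + error-rate check for the classification agent.
VALID_LABELS = {"return_item", "cancel_subscription", "other"}

def error_rate(predictions, ground_truth):
    """Fraction of outputs that are invalid or disagree with the
    human-validated label."""
    errors = sum(
        1 for pred, truth in zip(predictions, ground_truth)
        if pred not in VALID_LABELS or pred != truth
    )
    return errors / len(predictions)

preds = ["return_item", "REFUND", "other", "cancel_subscription"]
truth = ["return_item", "return_item", "other", "other"]
print(error_rate(preds, truth))  # 0.5: one invalid label, one wrong label
```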
Now take a very different agent: a recommendation agent. It provides you with a list of options for what to buy next, given your user profile, context, available catalog, and constraints.
In recommendation systems, there is rarely a single “correct” answer.
Outputs are open-ended, catalogs evolve, and multiple options can be equally valid. What matters is not correctness, but ordering: does the agent consistently bring the best options to the top?
This is a quality gradient, not a binary truth.
An evaluation of that gradient could be done via a multi-objective evaluation with a composite score:
Score = a · relevance + b · margin − c · risk
In that case, the builder's issue is: "If there's no single ground truth, how can I tell whether my agent is getting better?"
Problem: the choice of weights is a product-policy decision and can easily mask trade-offs.
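The masking effect is easy to demonstrate. This is a sketch of the composite score with illustrative weights (the values of a, b, c are hypothetical, not recommendations): two recommendations with very different risk/relevance profiles can land on the same score.

```python
# Composite score as in the formula above; a, b, c are illustrative
# product-policy weights, not recommended values.
def composite_score(relevance, margin, risk, a=0.6, b=0.3, c=0.1):
    return a * relevance + b * margin - c * risk

# Two very different recommendations end up with (nearly) the same score,
# which is exactly how a single scalar masks the trade-off:
highly_relevant = composite_score(relevance=0.9, margin=0.2, risk=0.0)  # ≈ 0.60
high_margin     = composite_score(relevance=0.5, margin=1.0, risk=0.0)  # ≈ 0.60
```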
So over the life of the product, there is no equivalent of the classification agent's error rate to monitor. A more meaningful indicator could be something like: "Is there at least one relevant answer in the top K results provided by the agent?"
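That indicator is commonly known as hit rate at K. A minimal sketch, assuming you have human-labelled sets of acceptable items per query:

```python
# Hit rate @ K: fraction of queries where at least one relevant item
# appears in the agent's top-K results.
def hit_at_k(ranked_lists, relevant_sets, k=3):
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_lists)

ranked   = [["a", "b", "c", "d"], ["x", "y", "z"]]
relevant = [{"c"}, {"q"}]
print(hit_at_k(ranked, relevant, k=3))  # 0.5: first query hits, second misses
```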
You can quantify relative quality with something like the Elo rating system used to rank chess players.
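The idea is to treat agent versions like players: have a judge (human or LLM) pick the winner of pairwise output comparisons, then update ratings with the standard Elo formula. A minimal sketch:

```python
# Standard Elo update, applied to agent versions instead of chess players.
def elo_update(rating_a, rating_b, a_won, k=32):
    # Expected score of A given the current rating gap.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# If version B beats version A, B's rating rises and A's falls:
a, b = elo_update(1000, 1000, a_won=False)  # a=984.0, b=1016.0
```

Running many such pairwise judgments gives you a trend line ("is the agent getting better?") even when no single output is the "correct" one.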
At that point you realize that the initial approach was naive and that to support all types of agents, some more work is going to be needed 🙃
Conclusion
There are levels of understanding you can certainly reach intellectually.
But there is a huge difference between understanding and realizing.
So my advice for 2026: start building something, anything. I guarantee you the journey will be worth it, even if you end up throwing away what you coded 😀