Go back to coding

The "Micro-app builders", or non-tech "vibe coders" population, that I introduced previously, come from very different backgrounds:

  • Tech-trained professionals who didn’t pursue a software development career but understand the fundamentals of programming
  • Business users with no understanding of variables, functions, or memory allocation
  • And many shades of gray in between

"Vibe coders" is definitely not a persona.

I belong to the first group, and I believe this is where the opportunities are greatest.

I recently built a prototype to try to automate agentic evaluations for (agentic) micro-app builders... I know! Finding app devs who actually evaluate their agentic apps is hard. Never mind builders who might not even know evals exist 🤣

But the learning opportunity was too great.

Evaluations for the masses

Given an agent description, the app produces a synthetic dataset, has the user validate it, and runs the evaluation. It then analyzes the outliers and recommends edge cases to improve the agent the builder is working on.
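Here is a minimal sketch of that flow. Everything in it (function names, the exact-match grader, the number of generated cases) is my own illustration, not the actual prototype:

```python
# Hypothetical sketch of the evaluation flow described above; all names are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    input: str
    expected: str

@dataclass
class Result:
    case: Case
    output: str
    score: float

def generate_synthetic_cases(agent_description: str) -> list[Case]:
    # Placeholder: a real version would prompt an LLM with the agent description.
    return [Case(input=f"sample request #{i}", expected="other") for i in range(5)]

def user_validate(cases: list[Case]) -> list[Case]:
    # Placeholder: the builder reviews and corrects the generated cases here.
    return cases

def run_evaluation(agent: Callable[[str], str], cases: list[Case]) -> list[Result]:
    results = []
    for case in cases:
        output = agent(case.input)
        # Exact-match grader, just as an example; richer graders come up below.
        results.append(Result(case, output, 1.0 if output == case.expected else 0.0))
    return results

def recommend_edge_cases(results: list[Result]) -> list[str]:
    # Placeholder: analyze the outliers and suggest edge cases to add.
    return [f"Add a variant of: {r.case.input}" for r in results if r.score < 1.0]
```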

It is quite easy to talk about these topics in an abstract way and think you've got it.

But building, just like writing, forces you to be sharper and effectively improve your understanding of any topic.

In this particular example, it made me:

  • Fight with ambiguous cases I didn't think of
  • Realize the flaws of the agentic workflow I was trying to evaluate
  • Better grasp what testing criteria exist and what graders are
  • Understand the differences between them
  • Realize the specificity of the evaluation of different types of agents
  • Etc.

Take the example of a classification agent that is supposed to tell you whether a customer request is about returning an item, cancelling a subscription, or something else. Its output should be "return_item", "cancel_subscription" or "other". Hence, if the output is not exactly one of these three, something doesn’t work.

So over the life of the product, you are basically going to monitor that error rate against a ground-truth dataset of human-validated entries.
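A toy sketch of that error rate, with made-up data and the three labels from the example:

```python
# Error rate of a classification agent against a human-validated ground truth.
VALID_LABELS = {"return_item", "cancel_subscription", "other"}

# (agent_output, ground_truth) pairs from the validated dataset -- made-up data.
predictions = [
    ("return_item", "return_item"),
    ("cancel subscription", "cancel_subscription"),  # formatting drift: counts as an error
    ("other", "cancel_subscription"),                # wrong label: counts as an error
]

errors = sum(
    1 for output, truth in predictions
    if output not in VALID_LABELS or output != truth
)
error_rate = errors / len(predictions)
print(f"error rate: {error_rate:.0%}")  # 67% in this toy example
```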

That’s easy (to understand, at least; making it work reliably in practice is another story, despite the apparent simplicity).

Now take the example of a very different agent: a recommendation agent. It provides you with a list of options for what to buy next, given your user profile, context, available catalog, and constraints.

In recommendation systems, there is rarely a single “correct” answer.

Outputs are open-ended, catalogs evolve, and multiple options can be equally valid. What matters is not correctness, but ordering: does the agent consistently bring the best options to the top?

This is a quality gradient, not a binary truth.

An evaluation of that gradient could be done via a multi-objective evaluation with a composite score:

Score = a × relevance + b × margin − c × risk
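As a tiny illustration (the weights below are arbitrary, which is precisely the problem raised next):

```python
# Multi-objective composite score for one recommended item.
# The weights a, b, c are arbitrary here -- choosing them is a product policy decision.
def composite_score(relevance: float, margin: float, risk: float,
                    a: float = 0.6, b: float = 0.3, c: float = 0.1) -> float:
    return a * relevance + b * margin - c * risk

print(composite_score(relevance=0.9, margin=0.4, risk=0.2))  # 0.64
```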

In that case the builder’s issue is: "If there’s no single ground truth, how can I tell whether my agent is getting better?"

Problem: the choice of weights is a product policy decision, and it can easily mask trade-offs.

So over the life of the product, there is no equivalent of the classification agent's error rate to monitor. A more meaningful indicator could be something like: "Is there at least one relevant answer in the top K results provided by the agent?"
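That indicator is often called hit@K. A minimal sketch, with made-up data:

```python
# hit@K: fraction of queries where at least one relevant item appears in the top K results.
def hit_at_k(ranked_results: list[list[str]], relevant_sets: list[set[str]], k: int = 3) -> float:
    hits = sum(
        1 for ranked, relevant in zip(ranked_results, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_results)

# Two queries: the first has a relevant item in the top 3, the second does not.
ranked = [["sku_12", "sku_7", "sku_3"], ["sku_9", "sku_1", "sku_4"]]
relevant = [{"sku_7"}, {"sku_22"}]
print(hit_at_k(ranked, relevant, k=3))  # 0.5
```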

You can quantify this with something equivalent to an Elo score, the rating used for chess players.
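For the Elo-style part, comparing two agent versions (or two rankings) head to head, the standard chess update rule looks like this. The K-factor and starting ratings below are the usual defaults, not something from my prototype:

```python
# Elo-style rating update after a pairwise comparison between two agent versions.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Version A's recommendations were judged better than version B's on one query.
print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)
```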

At that point you realize that the initial approach was naive and that to support all types of agents, some more work is going to be needed 🙃

Conclusion

There are levels of understanding you can surely reach intellectually.

But there is a huge difference between understanding and realizing.

So my advice for 2026: start building something, anything. I guarantee you the journey will be worth it, even if you end up throwing away what you coded 😀
