Prompting LLMs to Code
Coding with LLMs, vibe coding, prompt engineering, and context engineering are emerging development paradigms. Not new, but evolving and maturing.
I spent the last 7 months tinkering with a simple app idea across various LLMs. I wanted to check, exhaustively, how good or bad the code generated by LLMs is. The quality was supposed to match what a developer with 3–5 years of experience would build. Or at least that’s what I presumed.
TL;DR: Vibe coding, or prompt-engineered coding, works well for a simple, single-purpose task with no remote/API calls and few non-functional requirements. For everything else, as of now, it is still better to code by hand, unless you are using context engineering with an agentic approach.
If you are still with me, let’s get into the details. It’s going to be a long post. 😊
My app idea was simple: a Role-Based-Access-Control (RBAC) enabled web application with two roles, admin and user. The app would have 5 users, 1 admin and 4 normal users, and:

- A login screen with a company logo (image to be provided by me).
- For the admin: a screen listing all users, a landing page with a user-specific message showing login time and the number of concurrent logins, and a reporting page that calls an API to pull data from a local PGSQL DB.
- For normal users: a custom message with their name and the time they last logged in, plus an API-backed reporting page that generates a report from the same local PGSQL DB.
So, in all: a web app, 5 APIs (2 for RBAC, 2 for data, 1 for concurrent-login details), local DB insert scripts, and an environment setup script.
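To make the RBAC part of the spec concrete, here is a minimal sketch of the role check at the heart of such an app. This is an illustration only, not code any of the LLMs produced: the in-memory `USERS` dict stands in for the PGSQL-backed user table, and the names are hypothetical.

```python
from functools import wraps

# Toy in-memory user store standing in for the PGSQL-backed one
# described in the spec; names and roles are illustrative only.
USERS = {
    "alice": {"role": "admin"},
    "bob": {"role": "user"},
}

def require_role(role):
    """Decorator that rejects callers whose role doesn't match."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(username, *args, **kwargs):
            user = USERS.get(username)
            if user is None or user["role"] != role:
                raise PermissionError(f"{username} lacks role '{role}'")
            return fn(username, *args, **kwargs)
        return wrapper
    return decorator

@require_role("admin")
def list_all_users(username):
    """The admin-only 'view all users' screen from the spec."""
    return sorted(USERS)
```

In a real app the decorator would wrap route handlers and the role lookup would hit the database, but the shape of the check is the same.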
I tried 4 LLMs: OpenAI o3, Claude, Qwen, and Gemini 2.5.
None of them could produce even a decent web app with API calls in the first few iterations, even after repeated prompt optimizations.
I worried that my prompts were wrong. That perception “prompted” me to take a Udemy course on Prompt Engineering to learn more about it. 😊 And it helped to a great extent. The results you will see below are all from after the course.
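The biggest practical lesson from the course was structure: every prompt should state a role, the task, explicit constraints, and the expected output format. A hypothetical helper showing that structure (the section labels and wording are my own, not from the course):

```python
def build_code_prompt(task, constraints, output_format):
    """Assemble a structured code-generation prompt:
    explicit role, task, constraints, and output format."""
    sections = [
        "ROLE: You are a senior full-stack developer.",
        f"TASK: {task}",
        "CONSTRAINTS:",
        *[f"- {c}" for c in constraints],
        f"OUTPUT FORMAT: {output_format}",
    ]
    return "\n".join(sections)

prompt = build_code_prompt(
    task="Generate a login endpoint with role-based access control.",
    constraints=[
        "Use parameterized SQL queries against PostgreSQL.",
        "Return JSON errors with appropriate HTTP status codes.",
    ],
    output_format="One fenced code block per file, file path as a comment.",
)
```

Spelling out constraints like this is what cut down the worst of the hallucinated libraries and missing error handling in my runs.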
But it still did not generate output of the quality promised by many blogs and YouTubers. This led me to agentic AI approaches. Aided by an agentic approach, the quality of the output improved vastly, though at a significantly increased cost of operation.
Let’s walk through the outcome of the iterations, organized by the aspects examined and the tools used.
General observations and findings for the entire process:
Adoption of the AGENTIC approach
Adoption of the CONTEXT engineering
First: OpenAI
Second: Claude
Third: Qwen via Huggingface
Fourth: Gemini 2.5 with Vertex AI
As you can see, the code quality and the ability to adhere to NFRs are poor. It’s not satisfactory for most of the iterations.
So I switched to the agentic approach. As mentioned earlier, it improved the quality by leaps and bounds.
First: OpenAI o3 + AutoGen
Second: Claude + LangGraph / Claude Code CLI
Third: Qwen via Huggingface + CrewAI
Fourth: Gemini 2.5 with Vertex AI + A2A + LangGraph
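What all four agentic setups above share is a generate → verify → repair loop: the model’s output is actually executed or tested, and failures are fed back as new context instead of relying on a human to re-prompt. A toy, framework-free sketch of that loop (the `fake_llm` function is a stand-in for a real model call, and its canned behavior is purely illustrative):

```python
def fake_llm(prompt):
    """Stand-in for a real LLM call. Pretends the model fixes
    its code once it sees the error message in the prompt."""
    if "NameError" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a + c"  # buggy first draft

def verify(code):
    """Run the generated code against a check; return the error
    message on failure, or None on success."""
    env = {}
    try:
        exec(code, env)
        assert env["add"](2, 3) == 5
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def agent_loop(task, max_iters=3):
    """Generate, verify, and feed failures back as context."""
    prompt = task
    for _ in range(max_iters):
        code = fake_llm(prompt)
        error = verify(code)
        if error is None:
            return code
        prompt = f"{task}\nPrevious attempt failed: {error}"
    raise RuntimeError("no working code produced")
```

AutoGen, LangGraph, and CrewAI each wrap this basic idea in their own abstractions (conversations, state graphs, crews), but the feedback loop is, in my experience, where the quality jump comes from.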
These results do not show the effect of context engineering (CE). CE reduced the number of iterations, and combined with the agentic approach, the results were, well, scarily good.
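For anyone unfamiliar with the term: by context engineering I mean curating what the model sees alongside the prompt, rather than tweaking the prompt wording itself. A minimal illustration of assembling such a bundle; the artifact labels and contents here are hypothetical examples, not my actual project files:

```python
def build_context(artifacts):
    """Concatenate labeled project artifacts (schema, conventions,
    task) into one context block to send with the prompt."""
    parts = []
    for label, content in artifacts.items():
        parts.append(f"=== {label} ===\n{content.strip()}")
    return "\n\n".join(parts)

context = build_context({
    "DB schema": "CREATE TABLE users (id serial, name text, role text);",
    "Coding conventions": "Use type hints; no raw SQL string formatting.",
    "Task": "Add the concurrent-logins API described in the spec.",
})
```

Giving the model the real schema and conventions up front is what stopped it from inventing table names and styles that then had to be fixed over several iterations.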
I’ll end the article with the hope that you, the readers, will add more to this. My approach came purely from a learning perspective, and so it may contain errors, gaps, and room for improvement.
Any questions, clarifications, or comments will be highly appreciated.
Tanmoy Roy Great breakdown. This aligns with what we’re seeing: prompting hits limits fast, especially for production-grade apps. Agentic + context engineering unlocks real value, but comes with cost tradeoffs.
Helpful insight, Tanmoy. BTW, did you try Google AI Studio? I felt it is good for building prototypes and small projects (web apps) compared to ChatGPT and Copilot agents. It supports prompt-engineered coding and uses the Gemini LLM family.
Thanks for sharing, Tanmoy
Never really went with vibe coding. But I’m getting good results with Claude. The selection of libraries is something we have to control in the rules, I’m guessing. Agentic is working out. I’m just dipping my toes into context engineering and it’s promising. I know you are surprised ;) BTW, which version of Claude did you use? 3.5 was good, 3.7 sucked, and 4 I’m still playing with. The jury is out, though people say it’s really good.