Prompting LLM to Code

Coding with LLMs, vibe coding, prompt engineering, context engineering: these are the emerging development paradigms. Not new, but evolving and maturing.

I spent the last 7 months tinkering with a simple app idea across various LLMs. I wanted to check, exhaustively, how good or bad the code generated by LLMs really is. The quality, I presumed, should be what a developer with 3 to 5 years of experience would produce.

TL;DR: Vibe coding or prompt-engineered coding works well for a simple, unidimensional task with no remote/API calls and light non-functional requirements. For everything else, as of now, it is still better to code by hand, unless you are using context engineering with an agentic approach.

If you are still with me, let’s get into the details. It’s going to be a long post. 😊

My app idea was simple: a Role-Based Access Control (RBAC) enabled web application with two roles, admin and user, and 5 users, of which 1 is an admin and the rest are normal users. The app needed:

  • A login screen with a company logo (image to be provided by me).
  • For the admin: a screen showing all users, a landing page with a user-specific message showing login time and the number of concurrent logins, and a reporting page that calls an API to pull data from a local PostgreSQL DB.
  • For normal users: a custom message with their name and the last time they logged in, plus an API-backed reporting page that generates a report from the same local PostgreSQL DB.

So, in all: a web app, 5 APIs (2 for RBAC, 2 for data, 1 for concurrent-login details), local DB insert scripts, and an environment setup script.
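
To make the target concrete, here is a minimal sketch of the RBAC shape I was asking the LLMs to produce. It is illustrative only: the in-memory user table, the header-based identity, and the endpoint names are my stand-ins, and real authentication plus the PostgreSQL calls are left out.

```python
# Minimal RBAC sketch in FastAPI (illustrative; not generated output).
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Stand-in for the 5-user table: 1 admin + 4 normal users.
USERS = {"alice": "admin", "bob": "user", "carol": "user",
         "dan": "user", "eve": "user"}

def require_role(role: str):
    """Dependency factory: reject callers whose role does not match."""
    def checker(x_user: str = Header(...)):
        if USERS.get(x_user) != role:
            raise HTTPException(status_code=403, detail="Forbidden")
        return x_user
    return checker

@app.get("/admin/users")
def list_users(user: str = Depends(require_role("admin"))):
    # Admin-only: view all users on one screen.
    return USERS

@app.get("/report")
def report(user: str = Depends(require_role("user"))):
    # Normal users: reporting endpoint that would query the local PostgreSQL DB.
    return {"user": user, "report": "..."}
```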

I tried 4 LLMs: OpenAI o3, Claude, Qwen, and Gemini 2.5.

None of these LLMs could produce even a decent web app with API calls in the first few iterations, despite repeated prompt optimizations.

I worried that my prompts were wrong. That perception “prompted” me to take a Udemy course on Prompt Engineering to learn more about it. 😊 And it helped to a great extent. The results you see below are from after the course.

But it still was not generating output of the quality promised by many blogs and YouTubers. That led me to agentic AI approaches. Aided by agents, the quality of the output improved vastly, though at a significantly increased cost of operation.

Let’s examine the outcome of the iterations along the following aspects, with the tools used for each:

  • NFR – Non-Functional Requirements
  • Security – OWASP Top 10 requirements
  • Code Quality – SonarQube with standard rules
  • Code Complexity – SonarQube
  • SAST – SonarQube
  • DAST – OWASP ZAP (see the sketch after this list)
  • UX – How intuitively the pages are connected and arranged
  • UI – How good the project structure, reusability, and asset segregation are
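
To give a sense of the DAST pass: each generated app was scanned with ZAP. The sketch below is a hedged reconstruction rather than my exact harness; it assumes a local ZAP daemon on 127.0.0.1:8080 and the zapv2 Python client, and the target URL and API key are placeholders.

```python
# Hedged DAST sketch: spider + active scan via a local ZAP daemon.
# Assumes ZAP is already running on 127.0.0.1:8080; values are placeholders.
import time
from zapv2 import ZAPv2

target = "http://127.0.0.1:8000"  # the generated app under test
zap = ZAPv2(apikey="changeme",
            proxies={"http": "http://127.0.0.1:8080",
                     "https": "http://127.0.0.1:8080"})

# Crawl the app first so the active scanner has URLs to attack.
scan_id = zap.spider.scan(target)
while int(zap.spider.status(scan_id)) < 100:
    time.sleep(2)

# Active scan: ZAP probes the discovered URLs for vulnerabilities.
scan_id = zap.ascan.scan(target)
while int(zap.ascan.status(scan_id)) < 100:
    time.sleep(5)

# Treat any high-risk alert as a DAST failure.
highs = [a for a in zap.core.alerts(baseurl=target) if a["risk"] == "High"]
print(f"{len(highs)} high-risk findings")
```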

General observations and findings for the entire process:

  • This is purely a personal endeavor. What started as a learning exercise in prompt engineering moved to agentic approaches and later to context engineering.
  • I used the standard code-quality indexes/setup locally. Each code-generation and quality-check process was isolated.
  • Kubernetes / Docker were not used.
  • I am personally a fan of Gitea and have used it here.
  • APIs are written in Python + FastAPI. Sorry Java folks!
  • Qwen’s code referenced some Chinese university libraries, which had to be changed manually.
  • In pure developer experience for setting up the environment and access, OpenAI leads by miles. Google’s approach is strangely inadequate. Claude is somewhere in between.
  • Response times are almost identical for OpenAI and Gemini. Claude needs improvement.

Adoption of the AGENTIC approach

  • The agentic approach is well supported by Gemini and OpenAI. With Claude it was a bit odd; maybe my program was not set up properly, so I am not detailing it further.
  • The agentic approach is costly: at least 2–3 times more than a pure LLM-based approach.
  • To contain the cost, I used a local Ollama-hosted reasoning model to decompose the tasks before handing them over to an online LLM (see the sketch after this list).
  • Maybe, just maybe, if there were a way to call an LLM API at a lower cost just for the reasoning step, it would help enterprise operations.
  • I used DAG-based and dynamic-decomposition patterns for all these operations.
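
Here is a rough sketch of that decomposition pattern. It is a hedged reconstruction: the model names and prompts are placeholders, the JSON parsing is optimistic, and error handling is omitted.

```python
# Cost-saving pattern: reason locally (free), generate remotely (paid).
# Model names and prompts are placeholders; JSON parsing is optimistic.
import json
import ollama                # local Ollama client
from openai import OpenAI    # online LLM client

def decompose_locally(requirement: str) -> list[str]:
    """Ask a local reasoning model to split a requirement into subtasks."""
    resp = ollama.chat(
        model="deepseek-r1:8b",  # any local reasoning model
        messages=[{"role": "user",
                   "content": "Decompose this requirement into a JSON array "
                              f"of ordered coding subtasks:\n{requirement}"}],
    )
    return json.loads(resp["message"]["content"])

def implement_remotely(subtask: str) -> str:
    """Send only the focused subtask to the paid online LLM."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="o3",    # placeholder; any online model
        messages=[{"role": "user",
                   "content": f"Write the Python code for: {subtask}"}],
    )
    return resp.choices[0].message.content

requirement = "RBAC web app: 2 roles, 5 users, reporting APIs over PostgreSQL"
for task in decompose_locally(requirement):
    print(implement_remotely(task))
```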

Adoption of CONTEXT engineering

  • It’s new, so it will take some time to mature. But I find it far more exciting and better aligned with enterprise needs than prompting.
  • I found it significantly more capable than prompting, and coupled with the agentic approach, the possibilities are truly endless.
  • I will detail the context engineering approach in a separate post.

First: OpenAI

[Results image: OpenAI]

Second: Claude

[Results image: Claude]

Third: Qwen via Huggingface

[Results image: Qwen]

Fourth: Gemini 2.5 with Vertex AI

[Results image: Gemini 2.5]

As you can see, the code quality and the adherence to NFRs are poor; unsatisfactory for most of the iterations.

So I switched to the agentic approach. As mentioned earlier, it improved the quality by leaps and bounds.

First: OpenAI o3 + AutoGen

[Results image: OpenAI o3 + AutoGen]
Quite clearly, I could achieve similar product quality in half the iterations compared with prompt engineering.
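
For context, the AutoGen runs were shaped roughly like the sketch below: an assistant agent that writes code and a user-proxy agent that executes it locally and feeds results back. This assumes pyautogen's classic API; the model name, prompts, and work directory are placeholders.

```python
# Rough shape of the o3 + AutoGen loop (pyautogen classic API; placeholders).
import os
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "o3",
                               "api_key": os.environ["OPENAI_API_KEY"]}]}

# The assistant writes code; the proxy runs it and reports errors back.
coder = AssistantAgent(
    name="coder",
    system_message="You write FastAPI endpoints that satisfy the given spec.",
    llm_config=llm_config,
)
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "generated_app", "use_docker": False},
)

user_proxy.initiate_chat(
    coder,
    message="Build the admin-only /admin/users endpoint with RBAC.",
)
```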

Second: Claude + LangGraph / Claude Code CLI

[Results image: Claude + LangGraph / Claude Code CLI]
Here, the first iteration itself produced great quality. Most surprisingly, the app passed both SAST and DAST.
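
The LangGraph side was essentially a small plan-then-code graph. The sketch below shows the shape only: the node bodies are stubbed where the real runs called Claude, and the state fields and names are placeholders.

```python
# Shape of the plan -> code LangGraph pipeline (node bodies stubbed;
# the real runs called Claude inside each node).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    spec: str
    plan: str
    code: str

def plan_node(state: State) -> dict:
    # Would call the LLM to turn the spec into an ordered task plan.
    return {"plan": f"steps for: {state['spec']}"}

def code_node(state: State) -> dict:
    # Would call the LLM to implement the plan.
    return {"code": f"code implementing: {state['plan']}"}

graph = StateGraph(State)
graph.add_node("plan", plan_node)
graph.add_node("code", code_node)
graph.set_entry_point("plan")
graph.add_edge("plan", "code")
graph.add_edge("code", END)

app = graph.compile()
print(app.invoke({"spec": "RBAC web app with admin and user roles"}))
```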

Third: Qwen via Huggingface + CrewAI

[Results image: Qwen + CrewAI]
The DAST failure is consistent.

Fourth: Gemini 2.5 with Vertex AI + A2A + LangGraph

[Results image: Gemini 2.5 + A2A + LangGraph]
Output quality seems the best among the LLMs I checked. It also passed both SAST and DAST.

These results do not show the effect of context engineering (CE). CE reduced the iterations further, and combined with the agentic approach, the results were, well, scarily good.

I will end the article with the hope that you, the readers, will add more to this. My approach came from a pure learning angle and can therefore have errors, gaps, and room for improvement.

Any questions, clarifications, or comments will be highly appreciated.

