Improving Code Generation Workflows
Examples of Code and Their Quality Evaluation

In a previous article, Coding with Automated AI Evolution, I discussed a framework and tool I created to help manage the development of code generated by a GenAI bot. Since then I've added features that track what the bot is doing and improve insight into what the generated code was doing as the AI iterated on the prompted problem.

One of the inherent problems with code generation, whether by human or machine, is incorrect code. While I'd love to have the bandwidth to use metrics like Pass@k, doing so is tough because:

  • I don't have the compute resources to do a reasonable number of runs.
  • The problems I'm interested in tend not to be trivial ones with easy automated verification.

In fact, one of the larger goals of the tool is improving automated verification. But for more complicated problems, neither the framework nor my example context can automatically include all the necessary checks beyond basic syntax and solution convergence. That means the output is generally very difficult to verify automatically.

As a result, the tool has been moving toward improving human inspection and verification of the solutions.

Some of the new features include:

  • The background changes to indicate the state of the tool. Before executing, the background is white; while the bot is working, it is green; when the bot is done, it turns a light yellow/ivory. Since a run takes a while, this makes it easy to tell at a glance whether the bot thinks it is finished.
  • Added a Code Result textbox that shows the result of the executed code. Sometimes the result is null, but other runs return more meaningful information. When writing the prompt, it can help to direct the bot to return meaningful results from the Immediately Invoked Function Expression (IIFE). This lets you quickly see what was last returned to the bot.
  • Added a Code Log text field that the generated code populates with status updates. The Framework Context tells the bot about it and instructs it to have the generated code post progress messages there.
  • Added a Dynamic UI div, with instructions in the Framework Context that any created UI elements should be created as children of that div.
  • Added a Follow Up prompt for subsequent updates when more are needed. Follow-up requests include the context generated by Ollama as well as the follow-up prompt. It made sense to break this out because additional context accumulates during cycling and is difficult to manage in a fully conversational exchange with the user.
  • Eval'd code does not automatically show up in Chrome's Sources panel, but a nice hack fixes that: appending a "//# sourceURL=" comment with a unique identifier makes the script appear there. I used the following to make each iteration easily findable:

js_exec += `\n//# sourceURL=js_execute-${$("#cycles").text()}.js`;

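For context, the sourceURL tag slots into the eval step along these lines (a simplified sketch; `runGenerated` and the cycle argument are illustrative names, not the tool's actual identifiers):

```javascript
// Simplified sketch of the eval step. Names here (runGenerated, cycle)
// are illustrative, not the tool's actual identifiers.
function runGenerated(jsCode, cycle) {
  // Tag the code so Chrome's Sources panel lists it as a named
  // script, one entry per iteration.
  const tagged = jsCode + `\n//# sourceURL=js_execute-${cycle}.js`;
  // Indirect eval keeps the code running in the global scope.
  return (0, eval)(tagged);
}
```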
It's a pretty basic UI. Here is what it looks like without a prompt before any execution:

Could this be the ugliest UI ever conceived?


A Positive Case

Here is an example running the following prompt:

write js code that scans the local network for http servers. Scan all http port values between 0 and 12,000. Draw two progress bars to the canvas indicating what percentage of servers has been scanned and how far through the ports the current server is. Also show the current server IP address in the canvas. Show the final list of ip addresses and port that have http servers listening. Try to limit js memory usage.
Primitive HTTP-based Network Security Scanner.

In the example above, it does a pretty good job, scanning about 100 targets per second. At that rate, the 3+ million targets should take around 8.5 hours, and I was pretty happy with the UI it created. Actual progress was much slower, though; it's likely the tab gets throttled or put to sleep when running in the background.
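The back-of-the-envelope estimate, assuming a /24 subnet (256 hosts) and the 12,000-port range from the prompt:

```javascript
// Rough scan-time estimate: 256 hosts x 12,000 ports at ~100 scans/s.
const hosts = 256;           // assuming a /24 local subnet
const portsPerHost = 12000;  // ports 0-12,000 from the prompt
const scansPerSecond = 100;  // observed throughput

const totalTargets = hosts * portsPerHost;           // ~3.07 million
const hoursToFinish = totalTargets / scansPerSecond / 3600;
console.log(totalTargets, hoursToFinish.toFixed(1)); // 3072000 "8.5"
```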

I also noticed Chrome blocks some outgoing ports. I hadn't known about this; documentation can be found under ERR_UNSAFE_PORT.
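Since the browser refuses connections to those restricted ports outright, a scanner could skip them up front instead of wasting a request per port. The set below is a partial, illustrative subset; the authoritative list lives in Chromium's source (net/base/port_util.cc):

```javascript
// Partial, illustrative subset of Chrome's restricted ports; the full
// list is in Chromium's net/base/port_util.cc.
const UNSAFE_PORTS = new Set([
  1, 7, 9, 13, 21, 22, 23, 25, 37, 53, 79, 110, 119, 143,
  513, 514, 515, 587, 993, 995, 6000, 6667,
]);

// Filter scan targets up front instead of burning a request on each
// port the browser will refuse with ERR_UNSAFE_PORT.
function scannablePorts(maxPort) {
  const ports = [];
  for (let p = 1; p <= maxPort; p++) {
    if (!UNSAFE_PORTS.has(p)) ports.push(p);
  }
  return ports;
}
```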

A Tougher Case

While the last example generated good output, the following prompt confused the bot, and with the initial tool implementation it would not converge to a solution.

Prompt:

write js code to implement Conway's Game of Life and display in the canvas. The grid should have have a slider glider gun in the lower left corner shooting gliders to the upper right at a 45 degree angle. The grid should be large enough to so that the gliders have at least 20 squares before hitting the end. Add a button along side the canvas to reinitialize and restart the game.

Unfortunately, the bot misunderstood the communication protocol and, rather than emitting JS in a JSON object, emitted a full web page. Of course, trying to execute a full web page as JavaScript is not going to work. In other cases it was able to reason its way out of this, but here it just kept trying the same thing over and over. The worst part is that the bot mentioned many of the right concepts but seemed to ignore that part of its reasoning in its repeated misguided responses. Here are some snippets directly from the bot (please excuse the \n's):

  1. 'We need to respond with JS code implementing Game of Life, but we must avoid error "Cannot read properties of null (reading \'js_execute\')". The environment is likely evaluating code and expecting a property js_execute on some object. We want to avoid that error by ensuring that the object isn\'t null.
  2. So we need to prevent that.\n\nMaybe we should wrap our entire code in an IIFE that returns an object with js_execute property. Or we can just assign a variable:\n\n```\nvar life = (function() {\n // ...\n return { js_execute: function() {} };\n})();\n```\n\nBut if the sandbox still uses null, we need to ensure that the returned object is not null.
  3. Thus we need to create the container and not rely on the environment creating it.

#1 is correct, but the protocol indicates the code should be in the js_execute member of a JSON object in the response. #2 indicates it thinks the executed code should return the JSON object, which is backwards. Perhaps a clarification in the "api_context" would help. To be honest, this is the first time I saw it fail, and it failed pretty spectacularly in the way it got stuck looping and consuming memory.
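One cheap guard would be validating the reply before anything reaches eval, and feeding the bot a pointed error when the shape is wrong. A sketch, assuming the js_execute-member protocol described above (the actual wiring in the tool may differ):

```javascript
// Sketch of a response guard: the protocol expects a JSON object with
// the code in a js_execute member. Reject anything else before eval
// and hand the bot a specific correction.
function extractCode(reply) {
  let obj;
  try {
    obj = JSON.parse(reply);
  } catch {
    return { error: "Reply was not JSON. Respond with a JSON object whose js_execute member holds the code." };
  }
  if (!obj || typeof obj.js_execute !== "string") {
    return { error: "JSON object is missing a string js_execute member." };
  }
  return { code: obj.js_execute };
}
```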

Here is a screenshot of it being stuck looping:

25 Cycles!

It was stuck for 25 cycles before I put a breakpoint in the debugger.

Here is a dump from the Chrome performance monitor:

13.5GB!!!

It was using 13.4GB. A heap snapshot showed only 44MB in use, and the Ollama context size was 78,632 numbers, so it's not that. There must be a leak somewhere, but I'll have to inspect the code to find it. Maybe a good task for ChatGPT!

Regardless, if this were using resources that cost money, an escape hatch should probably be put in both the prompt and the query() <-> eval() loop.
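Such an escape hatch could be as simple as hard caps on cycles and estimated spend, checked each pass through the loop. A minimal sketch; the limits and per-call cost here are illustrative:

```javascript
// Sketch of an escape hatch for the query() <-> eval() loop: hard caps
// on cycle count and estimated spend, checked on every pass. The
// default limits and cost figure are illustrative, not the tool's.
function makeCycleGuard({ maxCycles = 10, maxCostUsd = 1.0, costPerCall = 0.13 } = {}) {
  let cycles = 0;
  return function allowAnotherCycle() {
    cycles += 1;
    if (cycles > maxCycles) return false;            // stuck-loop cap
    if (cycles * costPerCall > maxCostUsd) return false; // budget cap
    return true;
  };
}
```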

Some tweaks to the Framework Context seemed to help a lot. However, one thing the bot really struggled with was the direction of the glider gun: only about 10% of attempts got the direction correct, even with follow-up prompts. So I removed that requirement from the test case.

The final prompt that gave more consistent results was:

write js code to implement Conway's Game of Life and display in a canvas in dyn-ui. The grid should have have a glider gun in the center. Add a button along side the canvas to start the game. Clicking on the button should also reset the game state back to the beginning. Include a text entry box that allows specifying how many generations per second should occur. Be sure the grid is 200x200 cells.
A Successful Result

Evaluating Success Rates

Common automated code-generation measurements often use simple pass/fail criteria, but I think a finer-grained metric with case-specific scoring is more useful. For example, for the Game of Life case, I'd employ the following scoring methodology:

  • Points are given for achieving different criteria with subsequent feedback or iterations.
  • Validation areas are defined; completing a task in an earlier iteration earns more points, with points diminishing each iteration (score decay).

Here is a proposal for the Game of Life problem:

Scoring Rules

For this problem, the least aggressive decay is 50% on a 10pt score, which limits the number of scoring iterations to 4 before no more points can be gained. Note that this means a solution could be found automatically many iterations in, with everything decayed away except the "No Follow Up" criterion. This penalizes work that eventually got there but took many iterations. It could be adjusted if eventual solutions mattered more than efficiency, but for this problem, experience shows that if a solution isn't found in the first few iterations, eventual success is unlikely.
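The decay rule can be sketched as follows. The cutoff once a criterion's decayed value drops below one point is my reading of the "limits the number of iterations to 4" note above:

```javascript
// Sketch of the decayed scoring: full points on iteration 1, multiplied
// by the decay factor each later iteration. Once a criterion's decayed
// value drops below one point it is worthless, which for a 10pt
// criterion at 50% decay happens after iteration 4.
function decayedScore(basePoints, iteration, decay = 0.5) {
  const score = basePoints * Math.pow(decay, iteration - 1);
  return score >= 1 ? score : 0;
}
```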

Test Results

For gpt-oss:20b, the results (for 6 runs) were:

  • Mean Score 41.5 (69%)
  • Median Score 37.5 (63%)
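Those percentages imply a maximum possible score of 60, which is an inference from the numbers rather than something stated outright (it's consistent with a median of 60 reading as 100% in the later run):

```javascript
// Percentages follow from an assumed 60-point maximum (inferred: a
// later median score of 60 is reported as 100%).
const MAX_SCORE = 60;
const pct = (score) => Math.round((score / MAX_SCORE) * 100);
console.log(pct(41.5), pct(37.5)); // 69 63
```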

Here is the score distribution:

Test Results for gpt-oss:20b across 6 runs

I broke down and bought some Claude credits to run as a comparison; I had read that Claude currently leads for code generation. I refactored the code so either Ollama or Claude could be plugged in, and ran against claude-sonnet-4-5. The Claude calls during my development cost $2.13; a full request looks to cost about $0.13, which means it took roughly 16 invocations to get things working.

I could have saved money by using a simpler question and setup context, but I was interested in seeing how it behaved during debugging. I also used streaming plus thinking mode to improve my insight. I'm not sure whether streaming uses more tokens, but thinking mode certainly does, since there are configuration limits for both max tokens and a separate thinking budget. My configuration allowed 32K tokens and 16K thinking tokens.
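For reference, the request-body shape for those settings looks roughly like this. Field names follow the public Anthropic Messages API docs; the numbers mirror the configuration described above, and the message content is a placeholder:

```javascript
// Sketch of the Anthropic Messages API request body with streaming and
// extended thinking enabled. Field names follow the public API docs;
// the message content is a placeholder.
const requestBody = {
  model: "claude-sonnet-4-5",
  max_tokens: 32000,  // overall output cap
  stream: true,       // stream tokens as they arrive
  thinking: { type: "enabled", budget_tokens: 16000 }, // thinking budget
  messages: [{ role: "user", content: "..." }],
};

// The thinking budget must fit inside the overall token cap.
console.assert(requestBody.thinking.budget_tokens < requestBody.max_tokens);
```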

Claude seemed about 5x faster and gave much better results; a typical run took 55 seconds to reach a solution. During this exercise, I realized the context protocol I devised is a sort of very basic MCP. Claude did much better at understanding my basic MCP and was able to get everything right in a single request. This shouldn't be too surprising: while the parameter count isn't public, the model is estimated to have hundreds of billions of parameters.

Once I had the context being managed properly, it gave a working solution immediately.

First try!

All six runs were able to get things working. On one run, the generation rate was not functional. A single simple follow up prompt fixed the issue.

For claude-sonnet-4-5, the results (also 6 runs) were:

  • Mean Score 57.8 (96%)
  • Median Score 60 (100%)

Here is the score distribution:

Almost Perfect!

We can see that Claude is a huge improvement over the much smaller gpt-oss model. And this shouldn't be too surprising: there is serious investment in training, and probably in inference, behind it.

The tool was helpful while evaluating these two models and services, providing basic improvements for evaluating generated code, executing it, and watching what happens across a couple of different models. The code did balloon, though, from the original ~180 lines in a single file to several files totaling about 400 lines.

Potential Next Steps

I've considered some other improvements as I went through this.

Real MCP

I think it might be more efficient to implement a subset of MCP, both as an MCP server and an MCP client. Defining an MCP JS interface that facilitates communication between evaluated JS code and the AI would be interesting; requests could then leverage that system locally. This brings me to a related idea.
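MCP messages are JSON-RPC 2.0 under the hood, so a minimal client-side request builder for a tool call could look like the sketch below. The tool name is purely illustrative:

```javascript
// MCP messages are JSON-RPC 2.0; a minimal client-side builder for a
// tools/call request. The tool name passed in is illustrative.
let nextId = 1;
function mcpToolCall(toolName, args) {
  return {
    jsonrpc: "2.0",
    id: nextId++, // every request gets a unique id for response matching
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}
```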

JS Process Isolation

Eval'ing the generated JS in the same sandbox as the tool is neat, but in cases like the HTTP scanner and Game of Life, things can go wrong, and having your development environment go down with the code you are generating is not optimal. However, I really don't want to rely on an external code interpreter like NodeJS or Python.

I think spawning a new tab with a small WebSocket shim would be good. I could write a little code that runs a basic MCP server on the other side of a WebSocket; the tool would spawn the tab with a localhost channel to communicate over. On the tool side, a client would dispatch MCP requests to the new page instance, giving the generated code freer rein over the DOM, etc. Two-way MCP signaling over the WebSocket might be cool as well.
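The tool-side piece of that could be a small dispatcher that routes incoming MCP-style requests from the WebSocket to handlers by method name. A sketch under those assumptions; the handler names are illustrative:

```javascript
// Sketch of the tool-side dispatcher: MCP-style JSON-RPC requests
// arriving over the WebSocket get routed to handlers by method name.
// Handler names registered by the caller are illustrative.
function makeDispatcher(handlers) {
  return function dispatch(rawMessage) {
    const msg = JSON.parse(rawMessage);
    const handler = handlers[msg.method];
    if (!handler) {
      // Standard JSON-RPC "method not found" error code.
      return { jsonrpc: "2.0", id: msg.id, error: { code: -32601, message: "Method not found" } };
    }
    return { jsonrpc: "2.0", id: msg.id, result: handler(msg.params) };
  };
}
```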

Wrapping up

I had fun working on this and it was interesting to stumble upon some ideas already widely in use (mostly MCP). I'll leave you with an animation of a Game of Life anomaly I encountered during testing of the ChatGPT client. I especially like the movement of the empty areas in the grid as generations progress.

Is it random?

Code for the App can be found here:

https://github.com/casten/Cycler

Data for the test runs can be found here:

https://docs.google.com/spreadsheets/...

