Improving Code Generation Workflows
In a previous article, Coding with Automated AI Evolution, I discussed a framework and tool I created to help manage the development of code generated by a GenAI bot. Since then I've added more features that help track what the bot is doing, and improved insight into what the code was doing as the AI iterated on solving the prompted problem.
One of the inherent problems with generating code, whether by human or machine, is generating incorrect code. While I'd love to have the bandwidth to use metrics like Pass@k, doing so is tough because automated verification is the hard part.
In fact, one of the larger goals of the tool is to improve and automate verification. For more complicated problems, my example framework can't automatically include all the necessary checks beyond basic syntax and solution convergence, which means the output is generally very difficult to verify automatically.
As a result, the tool has been moving toward improving human inspection and verification of the solutions.
Some of the new features improve visibility into what the bot is doing. For example, each round of generated code is tagged with a sourceURL so that it shows up as a named script in the Chrome debugger:

```js
// Name each eval'd script after the current cycle count so it appears
// as js_execute-<cycle>.js in the DevTools Sources panel.
js_exec += `//# sourceURL=js_execute-${$("#cycles").text()}.js`
```
It's a pretty basic UI. Here is what it looks like with no prompt entered, before any execution:
A Positive Case
Here is an example running the following prompt:
write js code that scans the local network for http servers. Scan all http port values between 0 and 12,000. Draw two progress bars to the canvas indicating what percentage of servers has been scanned and how far through the ports the current server is. Also show the current server IP address in the canvas. Show the final list of ip addresses and port that have http servers listening. Try to limit js memory usage.
In the example above, it does a pretty good job, scanning about 100 targets per second, and I was pretty happy with the UI it created. The math checks out: assuming a /24 network, 254 hosts × 12,001 ports is just over 3 million probes, which at 100 probes/s comes to roughly 8.5 hours. Actual progress was much slower than that, though; it's likely that when the tab runs in the background it goes to sleep or is throttled.
I also noticed that Chrome blocks some outgoing ports. I hadn't known about this; some documentation can be found here: ERR_UNSAFE_PORT
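For reference, a single browser-side probe along these lines might look like the sketch below. This is not the bot's actual output, just a minimal illustration: a no-cors fetch with an AbortController timeout, where a resolved promise suggests a listener and a rejection covers refused connections, timeouts, and Chrome's unsafe-port blocks.

```js
// Minimal sketch of one probe (not the generated code itself).
// A no-cors fetch returns an opaque response we can't read, but
// whether the promise resolves at all hints that an HTTP server
// answered on host:port.
async function probe(host, port, timeoutMs = 1000) {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    await fetch(`http://${host}:${port}/`, { mode: "no-cors", signal: ctrl.signal });
    return true;   // something spoke HTTP here
  } catch (e) {
    return false;  // refused, timed out, or blocked (e.g. ERR_UNSAFE_PORT)
  } finally {
    clearTimeout(timer);
  }
}
```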
A Tougher Case
While the last example generated good output, the bot got confused on the following one and would not converge to a solution with the initial tool implementation.
Prompt:
write js code to implement Conway's Game of Life and display in the canvas. The grid should have have a slider glider gun in the lower left corner shooting gliders to the upper right at a 45 degree angle. The grid should be large enough to so that the gliders have at least 20 squares before hitting the end. Add a button along side the canvas to reinitialize and restart the game.
Unfortunately, the bot misunderstood the communication protocol: rather than emitting JS in a JSON object, it emitted a full web page. Of course, trying to execute a full web page as JavaScript is not going to work. In other cases it was able to reason its way out of this, but here it just kept trying the same thing over and over again. The worst part is that the bot mentioned many of the right concepts, but seemed to ignore that part of its reasoning in its repeated misguided responses. Here are some snippets directly from the bot (please excuse the \n's):
#1 is correct, but the protocol indicates the code should be in the js_execute member of a JSON object in the response. #2 indicates it thinks the code should return the JSON object, which is backwards. Perhaps an improvement would be a clarification in the "api_context". To be honest, this is the first time I saw that fail, and it failed pretty spectacularly in the way it got stuck looping and consuming memory.
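For context, here is roughly what the tool side does with a well-formed response; the variable names are illustrative, but js_execute is the real member name:

```js
// What the tool expects back from the model: a JSON object carrying
// the code to run in its js_execute member, which the tool eval's.
const reply = JSON.parse(botResponse); // fails outright on a raw HTML page
eval(reply.js_execute);                // runs the generated code in the tab
```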
Here is a screenshot of it being stuck looping:
It was stuck for 25 cycles before I put a breakpoint in the debugger.
Here is a dump from the Chrome performance monitor:
It was using 13.4GB, yet taking a heap snapshot showed only 44MB in use. Checking the Ollama context showed a context size of 78,632 entries, so it's not that. There must be a leak somewhere, but I'll have to inspect the code to find it. Maybe a good task for ChatGPT!
Regardless, if this were using resources that cost money, an escape hatch should probably be built into both the prompt and the query() <-> eval() loop.
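A sketch of what that escape hatch might look like, with the cycle budget and the tryEval() helper invented for illustration:

```js
// Illustrative guard for the query() <-> eval() loop: give up after a
// fixed cycle budget, or sooner if the bot starts repeating itself.
const MAX_CYCLES = 10;               // hypothetical budget
let lastReply = null;
for (let cycle = 0; cycle < MAX_CYCLES; cycle++) {
  const reply = await query(prompt); // ask the model for the next attempt
  if (reply === lastReply) break;    // verbatim repeat: stop burning tokens
  lastReply = reply;
  if (tryEval(reply)) break;         // tryEval is hypothetical: parse, eval, verify
}
```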
Some tweaks to the Framework Context seemed to help a lot. However, one issue the bot really struggled with was the direction of the glider gun: only about 10% of attempts got the direction correct, even with follow-up prompts. So I removed that requirement from the test case.
The final prompt that gave more consistent results was:
write js code to implement Conway's Game of Life and display in a canvas in dyn-ui. The grid should have have a glider gun in the center. Add a button along side the canvas to start the game. Clicking on the button should also reset the game state back to the beginning. Include a text entry box that allows specifying how many generations per second should occur. Be sure the grid is 200x200 cells.
Evaluating Success Rates
Common automated code generation measurements often use simple pass/fail criteria, but I think a finer-grained metric with case-specific scoring is more useful. For example, for the Game of Life case, I'd employ the following scoring methodology:
Here is a proposal for the Game of Life problem:
For this problem, the least aggressive decay is 50% on a 10-point criterion, which limits the number of iterations to 4 before no more points can be gained (after four halvings, 10 × 0.5⁴ ≈ 0.6, less than a single point). Note that this means a solution could be found automatically many iterations in, with everything decayed away except the "No Follow Up" criteria. This penalizes work that eventually got there but took many iterations. That could be adjusted if eventual solutions were more important than finding them efficiently, but for this problem, experience shows that if it doesn't find the solution in the first few iterations, eventual success is unlikely.
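The decay arithmetic is simple to express in code. Here is a sketch, assuming fractional points truncate to whole points; the criterion values would come from the scoring table above:

```js
// Score for a criterion worth `points`, awarded at iteration `i`,
// decaying by `decay` each iteration. With points = 10 and decay = 0.5:
// i=0 -> 10, i=1 -> 5, i=2 -> 2, i=3 -> 1, i=4 -> 0 (nothing left to gain).
function decayedScore(points, decay, i) {
  return Math.floor(points * Math.pow(decay, i));
}
```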
Test Results
For gpt-oss:20b, the results (for 6 runs) were:
Here is the score distribution:
I broke down and bought some Claude credits to run a comparison; I had read that Claude was currently in the lead for code generation. I refactored the code so that either Ollama or Claude could be plugged in, and ran against claude-sonnet-4-5. The calls to Claude during my development cost me $2.13, and a full request looks to cost about $0.13, so it took me about 16 invocations to get it working.
I could have saved money by using a simpler question and setup context, but I was interested in seeing how it behaved during debugging. I also used streaming plus thinking mode to improve my insight into what it was doing. I'm not sure whether streaming uses more tokens, but thinking mode certainly does, and there are configuration limits for both max tokens and a "thinking" budget. My configuration was 32K max tokens with a 16K thinking-token budget.
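For reference, the relevant knobs in the Anthropic SDK look roughly like this; it's a sketch rather than my actual client code, and the surrounding plumbing (the prompt variable, output handling) is illustrative:

```js
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Streaming plus extended thinking, using the budgets from my config:
// 32K max output tokens, of which up to 16K may be thinking tokens.
const stream = client.messages.stream({
  model: "claude-sonnet-4-5",
  max_tokens: 32000,
  thinking: { type: "enabled", budget_tokens: 16000 },
  messages: [{ role: "user", content: prompt }],
});

let output = "";
stream.on("text", (t) => { output += t; }); // watch the answer arrive as it streams
const finalMessage = await stream.finalMessage();
```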
Claude seemed 5x faster and gave much better results; a typical example took 55 seconds to get to the solution. During this exercise, I realized the context protocol I had devised was a sort of very basic MCP. Claude was much better at understanding my basic MCP and was able to get everything right in a single request. This shouldn't be too surprising: while the parameter count is not public, it is estimated to be hundreds of billions of parameters.
Once I had the context being managed properly, it gave a working solution immediately.
All six runs got things working. On one run, the generations-per-second control was not functional; a single simple follow-up prompt fixed it.
For claude-sonnet-4-5, the results (also 6 runs) were:
Here is the score distribution:
We can see that Claude is a huge improvement over the much smaller gpt-oss model, and this shouldn't be too surprising: there is serious investment in training, and probably in inference as well.
The tool was helpful while evaluating these two models and services. It provides some basic improvements for evaluating generated code, executing it, and watching what is happening across a couple of different models. Along the way, the code ballooned from the original ~180 lines in a single file to several files totaling about 400 lines.
Potential Next Steps
I've considered some other improvements as I went through this.
Real MCP
I think it might be more efficient to implement a subset of MCP, both as an MCP server and as an MCP client. Defining an MCP JS interface that facilitates communication between the evaluated JS code and the AI would be interesting; requests could then leverage that system locally.
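MCP is JSON-RPC 2.0 under the hood, so a useful subset is fairly small. A tool invocation from the client side would look something like the sketch below; "tools/call" is the real MCP method name, while the run_js tool and its arguments are invented for illustration.

```js
// An MCP tool call is a JSON-RPC 2.0 request. "tools/call" is actual
// MCP; the "run_js" tool name and arguments are hypothetical.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "run_js",
    arguments: { source: "2 + 2" },
  },
};
```

This brings me to a related idea.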
JS Process Isolation
Eval'ing the generated js in the same sandbox as the tool is neat, but in cases like both the HTTP scanner and Game of Life, things can go wrong, and having your development environment go down with the code you are generating is not optimal. However, I really don't want to rely on an external code interpreter like NodeJS or Python.
I think spawning a new tab with a small WebSocket shim would be good. I could write a little code that just has a basic MCP server on the other side of a WebSocket. The tab could then be spawned by the tool with a localhost channel to communicate over. Finally, on the tool side, I'd have the client dispatch the MCP requests for the new webpage instance and give it freer rein over the DOM, etc. Two-way MCP signaling over the WebSocket might be cool also.
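Here is a rough sketch of that layout, assuming a hypothetical runner.html shim page and a local relay listening on ws://localhost:8765 (both names invented for illustration):

```js
// Tool side: spawn the isolated tab. If the generated code wedges or
// crashes it, the tool's own tab survives.
const runnerTab = window.open("runner.html", "_blank");

// Shim side (inside runner.html): receive code over the WebSocket,
// eval it, and report the result or error back through the relay.
const ws = new WebSocket("ws://localhost:8765");
ws.onmessage = (msg) => {
  const { id, js_execute } = JSON.parse(msg.data);
  let result, error;
  try { result = String(eval(js_execute)); }
  catch (e) { error = String(e); }
  ws.send(JSON.stringify({ id, result, error }));
};
```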
Wrapping up
I had fun working on this and it was interesting to stumble upon some ideas already widely in use (mostly MCP). I'll leave you with an animation of a Game of Life anomaly I encountered during testing of the ChatGPT client. I especially like the movement of the empty areas in the grid as generations progress.
Code for the App can be found here:
Data for the test runs can be found here: