Damla Ikbal H.’s Post

Everyone's generating code with LLMs. Almost nobody is systematically checking it.

I ran into this while translating Python to C++. The translation part was easy. Knowing whether the output was correct? That was the real problem. I haven't written C++ in a while, so I might not even recognize a wrong answer.

So instead of hoping the output is correct, I added a verification layer: 3 agents, each with a different job.

→ Agent 1 (Gemini 2.5 Flash): translates the Python to C++
→ Agent 2 (GPT-5 Mini): reads the original Python and generates test expectations
→ Agent 3 (GPT-5 Mini): evaluates the C++ against those expectations and flags issues

If the evaluation fails, the issues are fed back to the translator, which retries (up to 3 rounds) until the C++ passes or the best effort is returned.

One deliberate choice: the translator and the evaluator use different LLMs. When the same model both translates and evaluates, it tends to confirm its own mistakes. Gemini translates, GPT evaluates. A genuine second opinion.

The whole verification is static analysis: no compiler, no execution. The evaluator reads the C++ and reasons about correctness. For deterministic math code, this works surprisingly well. For anything more complex, the lack of execution is a real gap, and actually running the code is the obvious next step.

It started as a course exercise where you translate Python to C++ and manually compile to verify. I wanted to automate the part where you stare at the output and hope.

Repo: https://lnkd.in/g3tWFUPZ
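If you're wondering how much orchestration this takes: not much. Here's a minimal sketch of the loop. The call_llm helper, model strings, and prompts are placeholders for whatever client and wording you use, not the repo's actual API.

# Sketch of the translate -> expectations -> evaluate -> retry loop.
# call_llm(model, prompt) -> str is a placeholder for your provider SDK calls.

MAX_ROUNDS = 3

def translate_with_verification(python_src: str, call_llm) -> str:
    # Agent 2: derive test expectations from the original Python (done once).
    expectations = call_llm(
        "gpt-5-mini",
        f"Read this Python and list concrete input/output expectations:\n{python_src}",
    )

    feedback = ""
    cpp = ""
    for _ in range(MAX_ROUNDS):
        # Agent 1: translator (deliberately a different model than the evaluator).
        cpp = call_llm(
            "gemini-2.5-flash",
            f"Translate this Python to C++:\n{python_src}\n\n"
            f"Address these issues from the previous attempt, if any:\n{feedback}",
        )
        # Agent 3: static evaluation -- reads the C++, no compile, no execution.
        verdict = call_llm(
            "gpt-5-mini",
            f"Expectations:\n{expectations}\n\nC++:\n{cpp}\n\n"
            "Reply PASS if it meets the expectations, otherwise list concrete issues.",
        )
        if verdict.strip().upper().startswith("PASS"):
            return cpp
        feedback = verdict  # failed evaluation feeds back into the next translation
    return cpp  # best effort after 3 rounds

The evaluator's verdict is the only thing driving the retry, so the translator and evaluator stay fully decoupled, and swapping either model is a one-string change.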
