Pat Reilly’s Post

When can we confidently stop hitting “accept” with our coding agents? Bespoke agent harnesses may be the answer, and they are why the concept of long-running agents has always fascinated me. As an experiment, #CortexCode and I have been tinkering with ByteDance’s DeerFlow 🦌, recently the #1 trending repo on GitHub, to see what it actually takes to turn a "vibe" into a reliable engineering process.

Bespoke Agent Harnesses

Rather than just asking an LLM to "write a script," I wanted to build an agent harness that provides the constraints, tools, and evaluation logic an agent needs to operate safely and effectively over long periods. I built VibeFlow 🕶️ as a thought experiment to highlight two core shifts in how we may be building with AI going forward:

1. From One-Off Prompts to Long-Running Agents ⏳
Software engineering isn't a single "ask and receive" transaction. It's a process. VibeFlow supports long-path tasks: background runs that can persist for minutes or hours, navigating complex task trees, researching library docs via MCP (Model Context Protocol), and self-correcting when they hit a wall.

2. Metric-Driven Improvements (The Ratchet) 📈
Autonomous agents are generative and prone to "quality drift." To counter this, I implemented ratchet logic. The harness:
- Evaluates the agent's output against inferred acceptance criteria, using explicit or evaluative checks
- Stores progress with checkpoints
- Tracks real-time metrics: test pass rates, API latency, and security findings

The Rule: The project only advances if the metrics improve. If a run introduces a regression, the harness rolls back the state. Progress becomes a one-way street.

Why Snowflake?

I’ve forked the orchestration to work directly with the Snowflake Cortex REST API. I needed an environment where data governance and enterprise-grade inference are the standard. Given some time, I could also reach out to my Cortex Agents via MCP to leverage their expertise.
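
To make the ratchet rule concrete, here is a minimal Python sketch of the idea. It is not VibeFlow's actual code: the names `Metrics`, `improves`, and `Ratchet` are hypothetical, and the acceptance policy (no metric may regress and at least one must strictly improve) is my assumption about what "the metrics improve" could mean in practice.

```python
import copy
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Metrics:
    """Hypothetical metric snapshot for one run."""
    test_pass_rate: float    # fraction of tests passing, higher is better
    api_latency_ms: float    # lower is better
    security_findings: int   # lower is better


def improves(new: Metrics, old: Metrics) -> bool:
    """Assumed policy: nothing regresses and at least one metric gets better."""
    no_regression = (
        new.test_pass_rate >= old.test_pass_rate
        and new.api_latency_ms <= old.api_latency_ms
        and new.security_findings <= old.security_findings
    )
    # If nothing regressed and the snapshots differ, something strictly improved.
    return no_regression and new != old


class Ratchet:
    """Keeps the last accepted state as a checkpoint and only moves forward."""

    def __init__(self, state: dict, metrics: Metrics) -> None:
        self.checkpoint = copy.deepcopy(state)
        self.best = metrics
        self.history: List[Tuple[Metrics, bool]] = []

    def attempt(
        self,
        run: Callable[[dict], dict],
        evaluate: Callable[[dict], Metrics],
    ) -> bool:
        # The run works on a copy, so a rejected run cannot touch the checkpoint.
        candidate = run(copy.deepcopy(self.checkpoint))
        measured = evaluate(candidate)
        accepted = improves(measured, self.best)
        if accepted:
            # Advance: the candidate becomes the new checkpoint.
            self.checkpoint = copy.deepcopy(candidate)
            self.best = measured
        # Otherwise the candidate is dropped, which is the rollback.
        self.history.append((measured, accepted))
        return accepted
```

In this sketch, "rolling back" simply means discarding the candidate state and keeping the previous checkpoint. A real harness would more likely checkpoint a git commit or workspace snapshot than an in-memory dict, and might weight metrics differently.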
The Takeaway

Soon, many "human-in-the-loop" processes may end up being "judge-of-the-metrics" processes instead. Eventually, we won’t just be checking code; we’ll design the harnesses that keep agents on track. Many of us are using off-the-shelf harnesses like Claude Code; some of us may choose to build our own. I’ll leave a link to the GitHub repo in the comments!

#Snowflake #CortexCode #AgentHarness
