From human knowledge work to agent work, powered by verifiable closed loops

When we look at the various concepts surrounding AI today—Autoresearch, Harness, Agentic Workflow, Multi-agent—they may seem like different approaches on the surface. But structurally, they are all converging toward the same idea. In the end, there is only one core question: can we build a verifiable evaluation structure and run a closed loop on top of it?

What Andrej Karpathy demonstrated with autoresearch is actually quite simple. Generate code, run it, evaluate it using validation loss, and keep or discard the result depending on whether it improves. Then repeat. Claude Code and recent agentic coding approaches follow the same pattern: they plan, execute, test, and refine, looping through the process autonomously. A harness is fundamentally the same idea. It is not just a collection of prompts, but a constrained execution environment with enforced evaluation. Concepts like sub-agents and agent teams are, in essence, engineering techniques for making this loop more effective and efficient. At the center of all of these ideas lies a single structure, the closed loop: Generate → Evaluate → Improve → Repeat.

This structure matters because it transforms AI from a generator into an optimizer. Generating something once is already a solved problem. What actually matters in real work is reaching a required level of quality. And that is only possible when there is a verifiable evaluation function—a reward signal the AI can optimize against without relying on human judgment. Autoresearch works precisely because validation loss provides such a clear and objective function.
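
To see the distinction concretely: a generator is a single call, while an optimizer is a search over many calls, steered by the score. Here is a minimal sketch of the pattern; generate, evaluate, and budget are placeholders, not any particular API:

# Generator vs. Optimizer

def optimize(generate, evaluate, budget=10):
    best, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = generate(best)    # propose, conditioned on the best so far
        score = evaluate(candidate)   # verifiable reward signal, no human judgment
        if score > best_score:        # keep only improvements
            best, best_score = candidate, score
    return best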

The next question, then, is how this concept can be applied to general office work—the next frontier after software. Let’s take service planning as an example. Traditionally, the instruction is vague: “write a good service plan.” The problem is that the evaluation criteria are unclear, so humans rely on intuition to judge quality. In this setting, AI can generate a draft, but it cannot iteratively improve it in a meaningful way.

To make service planning truly work with AI, the first step is to explicitly define how the plan will be evaluated. In other words, we must transform a service plan from something that is “read” into something that is testable.

For example, a service plan can be scored with the following evaluation function:

# Service Planning Evaluation Function

Score = sum of:
- Problem clarity (0–5)        # Clarity of the problem being solved
- User value (0–5)             # Magnitude and specificity of user value
- Business impact (0–5)        # Impact on revenue or key metrics
- Execution feasibility (0–5)  # Practical feasibility of implementation
- Metric definition (0–5)      # Presence of measurable KPIs

Reject conditions:
- Target user is not defined
- No KPI is specified
- Core functionality is unclear

Acceptance:
- Total score ≥ 15 (out of 25)
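
A rubric like this can be implemented directly. Below is a minimal sketch in Python; the plan is assumed to be a dict, and the field names (target_user, kpis, core_features) and the grade_criterion helper are hypothetical stand-ins for however a real plan document is structured and judged:

# Service Plan Evaluation (sketch; field names are hypothetical)

CRITERIA = [
    "problem_clarity", "user_value", "business_impact",
    "execution_feasibility", "metric_definition",
]

def grade_criterion(plan, criterion):
    # Placeholder grader: full marks if the section exists at all. In
    # practice this is where the real rubric lives: a checklist over the
    # document, or an LLM judging the section against the criterion.
    return 5 if plan.get(criterion) else 0

def evaluate_plan(plan):
    """Return (score, reasons): the total across criteria, plus feedback."""
    reasons = []

    # Reject conditions fail the plan outright, regardless of score.
    if not plan.get("target_user"):
        reasons.append("Target user is not defined")
    if not plan.get("kpis"):
        reasons.append("No KPI is specified")
    if not plan.get("core_features"):
        reasons.append("Core functionality is unclear")
    if reasons:
        return 0, reasons

    total = 0
    for criterion in CRITERIA:
        points = grade_criterion(plan, criterion)   # 0–5 per the rubric
        total += points
        if points < 3:
            reasons.append(f"Weak on {criterion} ({points}/5)")
    return total, reasons                           # accept when total >= 15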

At this point, service planning becomes a completely different problem. AI is no longer just “writing” a plan—it can now iteratively improve it based on the score. If the first draft is penalized for missing KPIs, the next iteration adds them. If the user definition is vague, it gets refined. The system naturally evolves by removing rejection reasons one by one.

This structure can be mapped to Karpathy’s autoresearch. In code, it looks like this:

# Autoresearch Loop

best_loss = float("inf")

while True:
    modify("train.py")              # Generate: propose a code change
    result = run_training()         # Execute
    loss = evaluate_loss(result)    # Evaluate: validation loss

    if loss < best_loss:            # improved: lower loss is better
        best_loss = loss
        commit()                    # keep the change
    else:
        revert()                    # discard it

And when applied to service planning:

# Service Planning Loop

threshold = 15                             # acceptance bar from the rubric

draft = generate_plan()                    # Generate the first draft

while True:
    score, reasons = evaluate_plan(draft)  # Evaluate

    if score >= threshold:
        break                              # good enough to ship

    draft = improve_plan(draft, reasons)   # Improve, then loop back

The mapping is straightforward:

  • train.py ↔ service plan
  • validation loss ↔ evaluation score
  • code modification ↔ document revision
  • experiment ↔ review

In other words, the same closed loop that works in software can be directly applied to service planning and other forms of knowledge work.
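
To make that concrete, here is a toy end-to-end run. It reuses the evaluate_plan sketch above and replaces the LLM calls with stubs: generate_plan returns a deliberately empty draft, and improve_plan just fills in whichever fields the reasons flag:

# Toy End-to-End Run (stubs stand in for LLM calls)

def generate_plan():
    # Deliberately weak first draft: trips every reject condition.
    return {}

def improve_plan(draft, reasons):
    # Trivial patcher keyed off the evaluator's feedback. In a real loop an
    # LLM would rewrite the draft; here we just fill the flagged fields.
    fixes = {
        "Target user is not defined": ("target_user", "first-time sellers"),
        "No KPI is specified": ("kpis", ["listing-to-sale conversion rate"]),
        "Core functionality is unclear": ("core_features", ["guided listing flow"]),
    }
    for reason in reasons:
        if reason in fixes:
            field, value = fixes[reason]
            draft[field] = value
        elif reason.startswith("Weak on "):
            criterion = reason.split()[2]       # e.g. "problem_clarity"
            draft[criterion] = "expanded section addressing " + criterion
    return draft

draft = generate_plan()
while True:
    score, reasons = evaluate_plan(draft)      # from the sketch above
    if score >= 15:                             # acceptance bar from the rubric
        break
    draft = improve_plan(draft, reasons)

print(score, draft)                             # converges in three passes

The stubs are deliberately dumb; the point is only that the loop terminates by removing rejection reasons and weak sections one at a time, exactly as described above.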

The key question then becomes: how well can we define the evaluation criteria? Every company already has standards for what makes a good plan, a good report, or a good decision. Sometimes these are even publicly documented. But in most cases, they exist as tacit knowledge, interpreted differently by different individuals. The moment we make these criteria explicit, quantifiable, and testable, that work becomes something AI can optimize through a loop.

The direction forward is clear. Instead of a structure where agents produce work and humans review it, we are moving toward a structure where agents complete the work end-to-end. Humans will define the evaluation criteria and design the environment in which those criteria are enforced. Just as software was the first domain AI could transform because it was inherently testable, the next competitive frontier will be how much of digital work we can make testable—how much of it we can bring into a closed loop.
