From human knowledge work to agent work, powered by verifiable closed loops

When we look at the various concepts surrounding AI today—Autoresearch, Harness, Agentic Workflow, Multi-agent—they may seem like different approaches on the surface. But structurally, they are all converging toward the same idea. In the end, there is only one core question: can we build a verifiable evaluation structure and run a closed loop on top of it?

What Andrej Karpathy demonstrated with autoresearch is actually quite simple. Generate code, run it, evaluate it using validation loss, and keep or discard the result depending on whether it improves. Then repeat. Claude Code and recent agentic coding approaches follow the same pattern: they plan, execute, test, and refine, looping through the process autonomously. A harness is fundamentally the same idea. It is not just a collection of prompts, but a constrained execution environment with enforced evaluation. Concepts like sub-agents and agent teams are, in essence, engineering techniques for making this loop more effective and efficient. At the center of all of these ideas lies a single structure, the closed loop: Generate → Evaluate → Improve → Repeat.

This structure matters because it transforms AI from a generator into an optimizer. Generating something once is already a solved problem. What actually matters in real work is reaching a required level of quality. And that is only possible when there is a verifiable evaluation function—a reward signal the AI can optimize against without relying on human judgment. Autoresearch works precisely because validation loss provides such a clear and objective function.
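
To see the distinction concretely: a generator is a single call, while an optimizer is a search over many calls, steered by the score. Here is a minimal sketch of the pattern; generate, evaluate, and budget are placeholders, not any particular API:

# Generator vs. Optimizer

def optimize(generate, evaluate, budget=10):
    best, best_score = None, float("-inf")
    for _ in range(budget):
        candidate = generate(best)    # propose, conditioned on the best so far
        score = evaluate(candidate)   # verifiable reward signal, no human judgment
        if score > best_score:        # keep only improvements
            best, best_score = candidate, score
    return best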

The next question, then, is how this concept can be applied to general office work—the next frontier after software. Let’s take service planning as an example. Traditionally, the instruction is vague: “write a good service plan.” The problem is that the evaluation criteria are unclear, so humans rely on intuition to judge quality. In this setting, AI can generate a draft, but it cannot iteratively improve it in a meaningful way.

To make service planning truly work with AI, the first step is to explicitly define how the plan will be evaluated. In other words, we must transform a service plan from something that is “read” into something that is testable.

For example, a service plan can be scored with the following evaluation function:

# Service Planning Evaluation Function

Score = sum of:
- Problem clarity (0–5)        # Clarity of the problem being solved
- User value (0–5)             # Magnitude and specificity of user value
- Business impact (0–5)        # Impact on revenue or key metrics
- Execution feasibility (0–5)  # Practical feasibility of implementation
- Metric definition (0–5)      # Presence of measurable KPIs

Reject conditions:
- Target user is not defined
- No KPI is specified
- Core functionality is unclear

Acceptance:
- Total score ≥ 15 (out of 25)
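
A rubric like this can be implemented directly. Below is a minimal sketch in Python; the plan is assumed to be a dict, and the field names (target_user, kpis, core_features) and the grade_criterion helper are hypothetical stand-ins for however a real plan document is structured and judged:

# Service Plan Evaluation (sketch; field names are hypothetical)

CRITERIA = [
    "problem_clarity", "user_value", "business_impact",
    "execution_feasibility", "metric_definition",
]

def grade_criterion(plan, criterion):
    # Placeholder grader: full marks if the section exists at all. In
    # practice this is where the real rubric lives: a checklist over the
    # document, or an LLM judging the section against the criterion.
    return 5 if plan.get(criterion) else 0

def evaluate_plan(plan):
    """Return (score, reasons): the total across criteria, plus feedback."""
    reasons = []

    # Reject conditions fail the plan outright, regardless of score.
    if not plan.get("target_user"):
        reasons.append("Target user is not defined")
    if not plan.get("kpis"):
        reasons.append("No KPI is specified")
    if not plan.get("core_features"):
        reasons.append("Core functionality is unclear")
    if reasons:
        return 0, reasons

    total = 0
    for criterion in CRITERIA:
        points = grade_criterion(plan, criterion)   # 0–5 per the rubric
        total += points
        if points < 3:
            reasons.append(f"Weak on {criterion} ({points}/5)")
    return total, reasons                           # accept when total >= 15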

At this point, service planning becomes a completely different problem. AI is no longer just “writing” a plan—it can now iteratively improve it based on the score. If the first draft is penalized for missing KPIs, the next iteration adds them. If the user definition is vague, it gets refined. The system naturally evolves by removing rejection reasons one by one.

This structure can be mapped to Karpathy’s autoresearch. In code, it looks like this:

# Autoresearch Loop

best_loss = float("inf")

while True:
    modify("train.py")              # Generate: propose a code change
    result = run_training()         # Execute
    loss = evaluate_loss(result)    # Evaluate: validation loss

    if loss < best_loss:            # improved: lower loss is better
        best_loss = loss
        commit()                    # keep the change
    else:
        revert()                    # discard it

And when applied to service planning:

# Service Planning Loop

threshold = 15                             # acceptance bar from the rubric

draft = generate_plan()                    # Generate the first draft

while True:
    score, reasons = evaluate_plan(draft)  # Evaluate

    if score >= threshold:
        break                              # good enough to ship

    draft = improve_plan(draft, reasons)   # Improve, then loop back

The mapping is straightforward:

  • train.py ↔ service plan
  • validation loss ↔ evaluation score
  • code modification ↔ document revision
  • experiment ↔ review

In other words, the same closed loop that works in software can be directly applied to service planning and other forms of knowledge work.
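
To make that concrete, here is a toy end-to-end run. It reuses the evaluate_plan sketch above and replaces the LLM calls with stubs: generate_plan returns a deliberately empty draft, and improve_plan just fills in whichever fields the reasons flag:

# Toy End-to-End Run (stubs stand in for LLM calls)

def generate_plan():
    # Deliberately weak first draft: trips every reject condition.
    return {}

def improve_plan(draft, reasons):
    # Trivial patcher keyed off the evaluator's feedback. In a real loop an
    # LLM would rewrite the draft; here we just fill the flagged fields.
    fixes = {
        "Target user is not defined": ("target_user", "first-time sellers"),
        "No KPI is specified": ("kpis", ["listing-to-sale conversion rate"]),
        "Core functionality is unclear": ("core_features", ["guided listing flow"]),
    }
    for reason in reasons:
        if reason in fixes:
            field, value = fixes[reason]
            draft[field] = value
        elif reason.startswith("Weak on "):
            criterion = reason.split()[2]       # e.g. "problem_clarity"
            draft[criterion] = "expanded section addressing " + criterion
    return draft

draft = generate_plan()
while True:
    score, reasons = evaluate_plan(draft)      # from the sketch above
    if score >= 15:                             # acceptance bar from the rubric
        break
    draft = improve_plan(draft, reasons)

print(score, draft)                             # converges in three passes

The stubs are deliberately dumb; the point is only that the loop terminates by removing rejection reasons and weak sections one at a time, exactly as described above.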

The key question then becomes: how well can we define the evaluation criteria? Every company already has standards for what makes a good plan, a good report, or a good decision. Sometimes these are even publicly documented. But in most cases, they exist as tacit knowledge, interpreted differently by different individuals. The moment we make these criteria explicit, quantifiable, and testable, that work becomes something AI can optimize through a loop.

The direction forward is clear. Instead of a structure where agents produce work and humans review it, we are moving toward a structure where agents complete the work end-to-end. Humans will define the evaluation criteria and design the environment in which those criteria are enforced. Just as software was the first domain AI could transform because it was inherently testable, the next competitive frontier will be how much of digital work we can make testable—how much of it we can bring into a closed loop.
