Addy Osmani’s Post

Long-running AI agents: What changes when your agent runs for days? My latest free write-up: https://lnkd.in/gZwaubjg ✍

Today's agents are increasingly capable but have a ceiling - they often run for only minutes at a time. What about agents that run for hours or days? They can own larger features, execute bigger migrations that have been on the backlog for six quarters, or complete an overnight research sweep.

But to cross that threshold, every engineering team eventually hits three distinct walls:

1️⃣ Finite context (even 1M tokens fill up, and context rot sets in early)
2️⃣ No persistent state (starting a new session is like a shift change with amnesia)
3️⃣ No self-verification (models skew positive and grade their own homework too generously)

Across the industry - from the architectures emerging at Anthropic and Cursor to what we are building at Google - there is a rapid convergence on how to break through these walls. It requires moving away from the simple chat loop and fundamentally redesigning how agency works.

In the full post, I unpack the engineering behind this shift, including:

- Why you must decouple the "brain," the "hands," and the "session."
- How to force state to live outside the model's context window.
- The patterns that separate working, resilient agents from fragile demos.

If you are moving beyond the initial novelty of vibe coding and getting serious about agentic engineering, the hardest problems aren't just in the model anymore - they are in the state, sessions, and structured handoffs wrapped around it. Dive into the link above to see how to actually build these systems today.

#ai #programming #softwareengineering
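To make "state lives outside the context window" concrete, here is a minimal Python sketch, not taken from the linked write-up. It assumes the agent's progress fits in a small JSON file; `AgentState`, `save_state`, `load_state`, and `run_session` are hypothetical names, and the "work" for each step is a placeholder. Each session reloads the checkpoint, skips what is already done, and stops after a fixed step budget, so a fresh session picks up where the last one left off.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field, asdict


@dataclass
class AgentState:
    """Hypothetical progress record kept on disk, not in the model's context."""
    goal: str
    completed_steps: list = field(default_factory=list)


def save_state(state, path):
    # Write to a temp file in the same directory, then rename atomically,
    # so a crash mid-write never leaves a half-written checkpoint.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(asdict(state), f)
    os.replace(tmp_path, path)


def load_state(path, goal):
    # A brand-new run starts empty; any later session resumes from disk.
    if not os.path.exists(path):
        return AgentState(goal=goal)
    with open(path, encoding="utf-8") as f:
        return AgentState(**json.load(f))


def run_session(path, goal, plan, max_steps):
    """Run at most max_steps new steps, checkpointing after each one."""
    state = load_state(path, goal)
    done_this_session = 0
    for step in plan:
        if step in state.completed_steps:
            continue  # finished in an earlier session
        if done_this_session >= max_steps:
            break  # session budget exhausted; the next session continues
        # Placeholder: a real agent would perform the step here.
        state.completed_steps.append(step)
        save_state(state, path)
        done_this_session += 1
    return state
```

Calling `run_session` twice with the same path and a three-step plan, with `max_steps=2`, completes two steps in the first session and the remaining one in the second, without either session needing the other's conversation history.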

In my experience, getting a multi-step agent workflow to run autonomously for more than an hour often requires human checks every few steps. That constant intervention highlights how much effort goes into compensating for context decay and state loss today. Addy Osmani

Days? You know, chattel slavery was formally abolished. Poor agents 🤣

Thank you for the great read. Is there any difference between the articles on Substack and those on your blog?

Super interesting Addy. Good read 🙌

The three walls are real. We're hitting all of them. But there's a fourth worth naming: environment drift. Your agent made decisions at hour 1 based on state that doesn't exist at hour 8. Another engineer pushed a migration. A schema changed. An upstream service shifted behavior. Context rot is about the model's window filling up. Drift is about the world moving underneath the agent while it's still running. Both will kill a long-running agent in a real production system. Only one is being talked about. The fix isn't just externalizing state ... it's building agents that treat their own prior decisions as potentially stale and know when to re-verify against live reality instead of cached assumptions from six hours ago.
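One way to picture "treat prior decisions as potentially stale" is the minimal Python sketch below. It assumes each decision can store a fingerprint of the observation it was based on, and that the agent can cheaply re-read that observation before acting. `fingerprint`, `Decision`, `decide`, and `is_stale` are hypothetical names for illustration, not part of any framework mentioned in this thread.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(observation):
    # Canonical JSON (sorted keys) so equal observations hash identically.
    canonical = json.dumps(observation, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class Decision:
    action: str
    basis: str  # fingerprint of the world state seen when the decision was made


def decide(action, observation):
    """Record a decision together with a fingerprint of what justified it."""
    return Decision(action=action, basis=fingerprint(observation))


def is_stale(decision, read_live):
    """Re-read live state and compare it with the decision's recorded basis."""
    return fingerprint(read_live()) != decision.basis
```

For example, a decision made against the schema `{"users": ["id", "email"]}` reports as stale once another engineer's migration adds a column, which is the agent's cue to re-plan rather than act on the hour-1 assumption.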

This is the paradigm shift we’ve been waiting for. Moving beyond the novelty of agents into serious agentic engineering requires exactly this kind of structural redesign. The idea of agents having ‘amnesia’ between sessions is the biggest hurdle for production-grade tools. Great breakdown on why we need to move the state outside the model’s immediate context!

Great breakdown of the three walls. There is a fourth one that hits hardest in regulated environments — no pre-execution proof. Long running agents that operate for hours or days across multiple sessions create a compounding audit problem. By the time something goes wrong there is no cryptographic record of what the agent was authorized to do at the start of each session. The logs exist but they are post-hoc. An auditor or regulator wants proof that predates the action not a reconstruction of what happened. The PocketOS incident last week is the short session version of this problem. A long running agent that operates overnight across six quarters of migration work makes the audit surface enormous. Delegation receipts solve this — signed authorization before each session starts, published to a tamper-evident log before the agent touches anything. Works with the decoupled brain and hands architecture you described because the receipt is session-scoped not context-window-scoped. Filed an IETF Internet-Draft on this last week — draft-nelson-agent-delegation-receipts-04. cloud.authproof.dev
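As a rough illustration of the general idea (authorization signed and logged before a session starts), here is a minimal Python sketch. It is not the format defined in the Internet-Draft mentioned above: the field names, the HMAC shared-key signature, and the in-memory hash-chained list are all assumptions made for this example. A real deployment would more likely use asymmetric signatures and an external append-only log.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first log entry


def _payload(body):
    # Canonical bytes of a receipt body (everything except the signature).
    return json.dumps(body, sort_keys=True).encode("utf-8")


def entry_hash(receipt):
    return hashlib.sha256(_payload(receipt)).hexdigest()


def append_receipt(log, key, session_id, scope):
    """Sign a session-scoped authorization and chain it to the previous entry."""
    prev_hash = entry_hash(log[-1]) if log else GENESIS
    body = {"session_id": session_id, "scope": sorted(scope), "prev_hash": prev_hash}
    signature = hmac.new(key, _payload(body), hashlib.sha256).hexdigest()
    receipt = dict(body, signature=signature)
    log.append(receipt)
    return receipt


def verify_log(log, key):
    """Check every signature and every link in the hash chain."""
    prev_hash = GENESIS
    for receipt in log:
        body = {k: v for k, v in receipt.items() if k != "signature"}
        expected = hmac.new(key, _payload(body), hashlib.sha256).hexdigest()
        if receipt["prev_hash"] != prev_hash:
            return False
        if not hmac.compare_digest(expected, receipt["signature"]):
            return False
        prev_hash = entry_hash(receipt)
    return True
```

The agent would call `append_receipt` before touching anything in a session; editing an earlier receipt's scope afterwards makes `verify_log` fail, because both the signature and the next entry's `prev_hash` stop matching.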

Where do you think the real bottleneck shows up first in long-running agents - memory, planning, or reliability over time?

Most agents today run for minutes. The real unlock is agents that run for days: owning full features, completing six-quarter backlog migrations, running overnight research sweeps. But three walls stop almost every team: context fills up, state resets between sessions like amnesia, and models grade their own work too generously. Decoupling the brain, hands, and session is where serious agentic engineering actually starts. Worth reading the full breakdown.

Thank you for writing down what we agent builders have been experiencing for the last few months. I have been personally diving into how enterprise SRE teams will run long-running agents toward the end goal of self-driving production. They treat cost and security concerns as being as important as accuracy, and highlight that these are blockers to even getting started. Excited about the future we build for them.
