Addy Osmani’s Post

Long-running AI agents: What changes when your agent runs for days? My latest free write-up: https://lnkd.in/gZwaubjg ✍

Today's agents are increasingly capable but have a ceiling: they often run for minutes, not hours. What about agents that run for hours or days? They can own larger features, execute bigger migrations that have been on the backlog for six quarters, or complete an overnight research sweep.

But to cross that threshold, every engineering team eventually hits three distinct walls:
1️⃣ Finite context (even 1M tokens fill up, and context rot sets in early)
2️⃣ No persistent state (starting a new session is like a shift change with amnesia)
3️⃣ No self-verification (models skew positive and grade their own homework too generously)

Across the industry, from the architectures emerging at Anthropic and Cursor to what we are building at Google, there is a rapid convergence on how to break through these walls. It requires moving away from the simple chat loop and fundamentally redesigning how agency works.

In the full post, I unpack the engineering behind this shift, including:
- Why you must decouple the "brain," the "hands," and the "session."
- How to force state to live outside the model's context window.
- The patterns that separate working, resilient agents from fragile demos.

If you are moving beyond the initial novelty of vibe coding and getting serious about agentic engineering, the hardest problems aren't just in the model anymore. They are in the state, sessions, and structured handoffs wrapped around it. Dive into the link above to see how to actually build these systems today. #ai #programming #softwareengineering
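One way to picture the second wall, as a minimal sketch and not the design from the linked write-up: keep the agent's progress in a file that every new session reads first and rewrites after each step. The `AgentState` fields, the `run_session` helper, and the JSON layout are all hypothetical, and the model or tool call is replaced by a placeholder.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class AgentState:
    """Progress that must survive a session boundary (hypothetical schema)."""
    goal: str
    done: list[str] = field(default_factory=list)
    todo: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)


def save_state(path: Path, state: AgentState) -> None:
    # Write to a temp file, then rename, so a crash never leaves a half-written file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(state), indent=2))
    tmp.replace(path)


def load_state(path: Path, goal: str, todo: list[str]) -> AgentState:
    # A new session starts from the file, not from chat history.
    if path.exists():
        return AgentState(**json.loads(path.read_text()))
    return AgentState(goal=goal, todo=list(todo))


def run_session(path: Path, goal: str, todo: list[str], max_steps: int) -> AgentState:
    """Run up to max_steps tasks, checkpointing after each one."""
    state = load_state(path, goal, todo)
    for _ in range(max_steps):
        if not state.todo:
            break
        task = state.todo.pop(0)
        # Placeholder for the real model/tool call (assumption).
        state.done.append(task)
        state.notes.append(f"finished {task}")
        save_state(path, state)
    return state
```

Calling `run_session` twice against the same path resumes where the first call stopped, which is the "shift change without amnesia" behaviour the post asks for; the context window only ever needs the current file contents.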

21 Comments

AUROBINDA MONDAL 2d

In my experience, getting a multi-step agent workflow to run autonomously for more than an hour often requires human checks every few steps. That constant intervention highlights how much effort goes into compensating for context decay and state loss today. Addy Osmani

Sergei Chukh 2d

Days? You know, chattel slavery was formally abolished. Poor agents 🤣

Максим Иванов 2d

Thank you for the great read. Is there any difference between the articles on Substack and the ones on your blog?

Kasper Filstrup 2d

Super interesting Addy. Good read 🙌

Jono Herrington 2d

The three walls are real. We're hitting all of them. But there's a fourth worth naming: environment drift. Your agent made decisions at hour 1 based on state that doesn't exist at hour 8. Another engineer pushed a migration. A schema changed. An upstream service shifted behavior. Context rot is about the model's window filling up. Drift is about the world moving underneath the agent while it's still running. Both will kill a long-running agent in a real production system. Only one is being talked about. The fix isn't just externalizing state ... it's building agents that treat their own prior decisions as potentially stale and know when to re-verify against live reality instead of cached assumptions from six hours ago.
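A rough sketch of the "treat prior decisions as potentially stale" idea, under my own assumptions: each decision records a fingerprint of the input it depended on, and before acting the agent re-reads that input and compares. `Decision`, `stale_decisions`, and the `read_live` callback are hypothetical names, not something from the comment or the post.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def fingerprint(text: str) -> str:
    """Stable hash of whatever the agent looked at (schema, file, API response)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Decision:
    summary: str  # what the agent decided
    source: str   # name of the input the decision depended on
    seen: str     # fingerprint of that input at decision time


def stale_decisions(decisions: list[Decision],
                    read_live: Callable[[str], str]) -> list[Decision]:
    """Return the decisions whose inputs no longer match live reality."""
    return [d for d in decisions if fingerprint(read_live(d.source)) != d.seen]


# Usage: the "world" here is a plain dict standing in for a live system.
world = {"schema": "users(id, email)"}
decision = Decision("add index on email", "schema", fingerprint(world["schema"]))
world["schema"] = "users(id, email_address)"  # someone pushed a migration
needs_review = stale_decisions([decision], world.__getitem__)
```

Anything returned by `stale_decisions` would be re-planned before execution instead of trusted from hours-old assumptions.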

Mahdi Khakbazan 1d

This is the paradigm shift we’ve been waiting for. Moving beyond the novelty of agents into serious agentic engineering requires exactly this kind of structural redesign. The idea of agents having ‘amnesia’ between sessions is the biggest hurdle for production-grade tools. Great breakdown on why we need to move the state outside the model’s immediate context!

Ryan Nelson 18h

Great breakdown of the three walls. There is a fourth one that hits hardest in regulated environments: no pre-execution proof. Long-running agents that operate for hours or days across multiple sessions create a compounding audit problem. By the time something goes wrong, there is no cryptographic record of what the agent was authorized to do at the start of each session. The logs exist, but they are post hoc. An auditor or regulator wants proof that predates the action, not a reconstruction of what happened. The PocketOS incident last week is the short-session version of this problem. A long-running agent that operates overnight across six quarters of migration work makes the audit surface enormous. Delegation receipts solve this: signed authorization before each session starts, published to a tamper-evident log before the agent touches anything. This works with the decoupled brain-and-hands architecture you described because the receipt is session-scoped, not context-window-scoped. Filed an IETF Internet-Draft on this last week: draft-nelson-agent-delegation-receipts-04. cloud.authproof.dev
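To make "signed authorization in a tamper-evident log" concrete, here is an illustrative sketch only. It is not the format defined in the Internet-Draft named above; the field names, the HMAC signature, and the hash-chained list are my own assumptions. HMAC uses a shared secret, so a real deployment that needs third-party verification would likely use asymmetric signatures instead.

```python
import hashlib
import hmac
import json


def sign_receipt(key: bytes, session_id: str, scope: list[str], issued_at: int) -> dict:
    """Create a receipt stating what a session may do, signed before it starts."""
    body = {"session_id": session_id, "scope": sorted(scope), "issued_at": issued_at}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    body["sig"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return body


def verify_receipt(key: bytes, receipt: dict) -> bool:
    """Recompute the signature over every field except the signature itself."""
    body = {k: v for k, v in receipt.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt.get("sig", ""))


def _entry_hash(prev: str, receipt: dict) -> str:
    material = prev + json.dumps(receipt, sort_keys=True)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], receipt: dict) -> None:
    """Chain each entry to the previous one so later edits are detectable."""
    prev = log[-1]["entry_hash"] if log else "0" * 64
    log.append({"receipt": receipt, "prev": prev,
                "entry_hash": _entry_hash(prev, receipt)})


def log_is_intact(log: list[dict]) -> bool:
    """Walk the chain and recompute every hash."""
    prev = "0" * 64
    for entry in log:
        if entry["prev"] != prev or entry["entry_hash"] != _entry_hash(prev, entry["receipt"]):
            return False
        prev = entry["entry_hash"]
    return True
```

The ordering is the point: the receipt is appended before the session touches anything, so the record predates the action instead of reconstructing it afterwards.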

Mikita Aliaksandrovich 10h

Where do you think the real bottleneck shows up first in long-running agents - memory, planning, or reliability over time?

Asfandyar Fakher 2d

Most agents today run for minutes. The real unlock is agents that run for days owning full features, completing six-quarter backlog migrations, running overnight research sweeps. But three walls stop almost every team: context fills up, state resets between sessions like amnesia, and models grade their own work too generously. Decoupling the brain, hands, and session is where serious agentic engineering actually starts. Worth reading the full breakdown.

Sid Choudhury 1d

Thank you for writing down what we agent builders have been experiencing for the last few months. I have been personally diving into how enterprise SRE teams will run long-running agents towards the end goal of self-driving production. They treat cost and security concerns as important as accuracy and highlight that these are blockers to even get started. Excited about the future we build for them.

More Relevant Posts

Durga Prasad Dunga
6d Edited
Generative AI - What I Wish I Knew Before Building My First AI App

When I built my first AI-powered application, I thought the hard part would be the model. Pick the right LLM, write a clever prompt, and watch the magic happen. I was wrong about almost everything. Here is what actually matters, and what tripped me up along the way.

The concept: Building with LLMs is less like programming and more like managing a brilliant but unpredictable colleague. You cannot unit test your way to confidence the same way you do with traditional software. The outputs are non-deterministic. The same input can give you different results on Tuesday than it did on Monday.

Why this matters: Most developers approach their first AI app with a software engineering mindset that assumes predictable behavior. When the model "hallucinates" or drifts in quality, they panic. They rewrite prompts obsessively. They bolt on layers of complexity trying to force determinism. I did all of this.

A real example from my own experience: I was building a content generation tool. My first version had one massive prompt trying to do everything: tone, structure, length, topic adherence. It worked maybe 60 percent of the time. The fix was not a better prompt. It was breaking the task into smaller, verifiable steps. Generate an outline. Validate the outline against constraints. Then expand each section. Suddenly I could catch problems between steps instead of staring at a final output wondering where things went sideways.

The key thing most people get wrong: They over-invest in prompt engineering and under-invest in everything around the model. Error handling, fallback strategies, output validation, cost monitoring, latency budgets. The model call is maybe 20 percent of your application. The other 80 percent is the unglamorous scaffolding that makes it production-ready.

A few things I would tell myself on day one:
Start with the simplest model that could work. You can always upgrade later.
Log everything. You will need those logs when debugging outputs three weeks from now.
Build for failure. The API will time out. The response will be malformed. Have a plan.
Measure cost per request early, not after your first invoice.

What surprised you most when you built your first AI-powered app? I am genuinely curious whether others hit the same walls I did. #GenAI #LLM #AI #MachineLearning #ArtificialIntelligence
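The outline, validate, expand flow described in this post could look roughly like the sketch below. It is not the author's tool: the function names, the section limits, and the "every outline must mention the topic" rule are assumptions for illustration, and the two model calls are passed in as plain callables so the control flow can be run without an API.

```python
from typing import Callable


def validate_outline(outline: list[str], min_sections: int, max_sections: int,
                     required: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the outline passes."""
    problems = []
    if not min_sections <= len(outline) <= max_sections:
        problems.append(
            f"expected {min_sections}-{max_sections} sections, got {len(outline)}")
    lowered = [heading.lower() for heading in outline]
    for term in required:
        if not any(term.lower() in heading for heading in lowered):
            problems.append(f"missing required topic: {term}")
    return problems


def generate_article(topic: str,
                     make_outline: Callable[[str], list[str]],
                     expand_section: Callable[[str], str],
                     max_attempts: int = 3) -> list[str]:
    """Retry the cheap outline step until it validates, then expand each section."""
    problems: list[str] = []
    for _ in range(max_attempts):
        outline = make_outline(topic)
        # Constraints here (2-6 sections, topic mentioned) are illustrative.
        problems = validate_outline(outline, 2, 6, required=[topic])
        if not problems:
            return [expand_section(heading) for heading in outline]
    raise ValueError(f"outline rejected after {max_attempts} attempts: {problems}")
```

Because validation sits between the two model calls, a bad outline is caught and retried before any expensive expansion runs, and the error message says which constraint failed.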
Somanath Sahoo
3w
Is Vibe Coding the best way to build software… or the fastest way to create technical debt? Build fast using intuition and AI assistance, then refine with proper engineering.
🔹 Start without overplanning
🔹 Use AI to generate, debug, and iterate
🔹 Convert ideas → working prototype quickly
Works best for:
• MVPs
• Learning
• Rapid experimentation
Not enough for production alone: it requires refactoring, testing, and design later. #AI #ArtificialIntelligence #GenerativeAI #AITrends #AIDevelopment
KodeMaster AI

1,135 followers
1w
Anyone can ask an AI to write a function. In 2026, the market doesn't need more "prompt engineers." It needs System Orchestrators. The trap: - Copy-pasting code you don't understand. - Building fragile apps that break under load. - Relying on AI to do the thinking, not just the typing. The solution: - Master the architecture. - Understand the trade-offs. - Own the entire system lifecycle. Engineering isn't about generating lines of code. It’s about building resilient, scalable systems that solve real problems. At KodeMaster AI, we push you beyond the prompt. 🚀 🛠️ Build in your own editor. 📈 Get instant feedback on your logic. 🧠 Master complexity analysis to see if your code actually scales. Don't just watch tutorials. Don't just paste from a chat window. Start building for the real world. Stop prompting. Start orchestrating. #SoftwareEngineering #TechCareer #LearnToCode #AI #KodeMasterAI #DevTips
Aleksandar Perisic
2w
I spent last week trying to give an AI agent a permanent brain. Not a chatbot. A coding agent that remembers what it learned, why it made decisions, and what failed. Across sessions, across days. Here's what I actually learned:

→ Context windows are not memory. You can stuff 200K tokens into a prompt. That's not remembering, that's cramming before an exam. Real memory needs structure: what the agent must always know, what failed and why, what decisions were made and the reasoning behind them.

→ The folder matters more than the model. I designed a directory structure that acts as a "project brain", sitting right next to the agent's config. Stages, persistent truths, failure logs, decision records. The agent reads its state file on startup and knows exactly where things stand. No re-explaining.

→ There are two schools of AI memory right now. One approach: auto-index everything into markdown files using lifecycle hooks. Zero-token overhead, git-friendly, fully readable. The other: store memories in a database, expose them via tools, let the agent decide what's relevant. The second approach uses a trick called progressive disclosure. Inject an 800-token index at session start instead of 35,000 tokens of raw history. The agent fetches only what it needs. ~920 tokens, near 100% relevance.

→ Multi-agent setups need a single source of truth. I built a 3-agent pipeline (researcher → writer → editor) and the thing that made it work wasn't the agents. It was one shared config file that defined voice, rules, and constraints. Without it, each agent drifts into its own style within 2 runs.

→ The real unlock is hooks, not prompts. Claude Code has 26 lifecycle hook events. PreToolUse, PostToolUse, SessionStart. You can intercept, validate, and inject context at every stage. This is where agent architecture actually lives. Not in the system prompt.

I'm nowhere near done. But the gap between "AI assistant" and "AI that actually learns" is smaller than I thought. It's an engineering problem, not a model problem. #buildinpublic #claudecode #contextengineering
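The progressive disclosure idea above can be sketched in a few lines, assuming an in-memory store in place of the database the post mentions. `Memory`, `MemoryStore`, `index`, and `fetch` are hypothetical names, and this example makes no claim about the token counts quoted in the post.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    id: str
    summary: str  # one line, cheap enough to inject at every session start
    body: str     # full detail, loaded only when the agent asks for it


class MemoryStore:
    def __init__(self) -> None:
        self._items: dict[str, Memory] = {}

    def add(self, memory: Memory) -> None:
        self._items[memory.id] = memory

    def index(self) -> str:
        """Compact index to place in the prompt instead of raw history."""
        return "\n".join(f"{m.id}: {m.summary}" for m in self._items.values())

    def fetch(self, memory_id: str) -> str:
        """Tool the agent calls when an index line looks relevant."""
        return self._items[memory_id].body


# Usage: the prompt carries two short lines; the long bodies stay outside it.
store = MemoryStore()
store.add(Memory("d1", "Chose SQLite over Postgres", "x" * 5000))
store.add(Memory("f1", "Retry loop caused duplicate writes", "y" * 5000))
session_preamble = store.index()
detail = store.fetch("f1")
```

The design choice is that the index grows with the number of memories while the cost of each memory's detail is paid only on the sessions that actually need it.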

Brandon Sickler
1w Edited
The most interesting part of building with AI isn’t the generation. It’s the debugging. I’ve been using AI as a learning lab while building out an inventory management tool, and it’s honestly been a great way to sharpen my own logic. It’s incredibly satisfying to catch a flaw in the AI's reasoning before it becomes a problem. For example, it recently suggested using a simple array for a collection that I know will grow to 50,000+ items. I knew that would eventually tank the app's performance, so I stepped in and swapped it for a more efficient data structure to keep lookups fast. I’m seeing similar things with database schema logic and data parsing. These moments are what turn coding into architecting. It’s a solid reminder that regardless of the tools we use, the efficiency gains only matter if you're applying your own critical thinking. It’s about catching technical debt before it even has a chance to start. #NHLA #PennState #SoftwareEngineering #BuildingInPublic #TechnicalDebt
Aniekutmfon E
2w
4 habits that made me a significantly better engineer this year:

1. Stop treating AI like a search engine. The biggest shift for me was moving from "give me the code" to "here's the context, the constraints, and what I've already tried." The output quality difference is night and day.

2. Keep your context tight, not long. I used to dump everything into the prompt. Now I share only what's relevant to the specific task. Shorter, sharper context = better reasoning from the model every single time.

3. Always review the plan before the code. For anything touching more than one file, I ask for a plan first. It catches misunderstandings before they become bugs. This alone has saved me hours of backtracking.

4. Use AI for the boring 80%, own the critical 20%. Boilerplate, test scaffolding, first-draft APIs, documentation. Let AI handle those. System design decisions, production edge cases, security tradeoffs. That stays with you.

The engineers winning with AI right now are not the ones prompting the hardest. They are the ones who know exactly where human judgment still matters. What habit has changed how you work most this year? #SoftwareEngineering #AIEngineering #DeveloperProductivity #BuildInPublic #TechLeadership #CodingWithAI #DevWorkflow
Ix

66 followers
6d
63% of developers spend more than 30 minutes a day just searching for answers inside their own codebase. Not searching the internet, not debugging, just trying to understand what's already there. At 30 minutes a day over a five-day week, that's at least 2.5 hours a week per developer spent not building anything, just navigating a system that was never fully mapped in the first place. The bigger the codebase gets, the worse it becomes: more history, more decisions nobody documented, more context that lives in someone's head or nowhere at all. We keep thinking AI will fix this, but if AI has never actually seen your system, just your code, is it actually solving the right problem? How much of your week goes to understanding vs. actually building? #devtools #softwareengineering #AI #techleadership #engineeringteams
Nikhil Sehgal
3w
I wasted almost two hours last month watching an AI agent confidently refactor the wrong part of a codebase. Not because the model was bad. Because I gave it zero context about how that repo actually works. That one experience changed how I set up every project now. Before I run any AI agent on a codebase, I create one file. I call it AGENTS.md. It sits in the root of the repo and answers four things: → What does this service actually do? → What conventions does this codebase follow? → What commands does the agent need to know? → What are the common mistakes to avoid here? Two pages. Plain markdown. That's it. The difference in output quality is not subtle. The agent stops guessing and starts contributing. The mental model I use: imagine a strong engineer joining your team tomorrow with zero context. What's the first doc you'd hand them? Write that doc. Give it to your agent. What does your current AI setup look like when you start a new session? Curious how others are handling this. #AIEngineering #DeveloperProductivity #SoftwareEngineering #AItools #CodingWithAI #TechLeadership #BuildInPublic #DevTools
Padding Media

702 followers
2w
In 2026, being a "senior" developer isn't just about how much syntax you know; it's about how effectively you can orchestrate AI to handle the heavy lifting. If you’re still manually refactoring across ten different files, you’re leaving hours of productivity on the table. We’ve moved past simple chat boxes into the era of Agentic IDEs and CLI-native intelligence. Here are the 5 tools currently defining the high-performance stack: Which of these has made the biggest dent in your workflow lately? Let’s discuss in the comments. 👇 #softwareengineering #ai #webdev #productivity #cursor #techtrends2026 #uiux
Tausif Ahmed
2d
AI just crossed a line most people didn’t notice. GPT-5.4 scored 75% on a benchmark that simulates actual desktop work. Switching between apps, filling forms, compiling reports. The human baseline on that same benchmark? 72.4%. The AI is now operating computers better than the average person using them. I’ve been building enterprise AI systems for a few years now and honestly, this one hit different. We’re not talking about a chatbot that answers questions anymore. We’re talking about an agent that opens Slack, reads a thread, updates a spreadsheet, drafts the follow-up email, and sends it. Without you touching a single thing. Stripe already has AI agents generating over 1,300 pull requests per week in production. Not prototypes. Not demos. Actual production code, reviewed and merged. And IDC says 80% of enterprise apps will have agents embedded in them by the end of this year. Here’s what’s wild to me as an AI engineer: the hardest part of building these systems is no longer the LLM. It’s the orchestration. How do you give an agent the right tools? How do you keep it from doing something unexpected at step 7 of a 12-step workflow? How do you log what it did and why? These are the real engineering problems of 2026. Not prompt engineering. Agent reliability in production. If you’re building products right now and you haven’t thought about where an AI agent fits, you’re probably already behind. #GenerativeAI #AIAgents #LLM #AIEngineering #ArtificialIntelligence #Python
