Anthropic Just Dropped the New Blueprint for Long-Running AI Agents.

The AI Automators

16m 59s · 3,316 words · ~17 min read
Transcript source

YouTube auto captions

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:00]Yesterday, Anthropic published a fascinating article on harness design for long running agents. And there are some interesting insights and honest admissions from the team that really can help us all when we're building specialized agent systems. In the post, they demonstrated how they built a 2D retro game engine over a six-hour autonomous coding session. And they also built a digital audio workstation in browser over a four-hour period. And while these examples of long running autonomous agents are specific to coding, these principles apply to all types of specialized agent systems like compliance audits, risk analysis, content pipelines, and impact assessments. Last week, I published a video on this channel where I built a specialized agent harness into a custom Python and React app. And a custom harness like this could definitely be improved with some of the insights from Anthropic's blog post. And if you're not really sure what a harness is, it's essentially just the software and structure that wraps an AI model to keep it on track. It's essentially an orchestration layer that includes the prompts, the tools, the feedback loops, the constraints, the validation. So it's everything around the model that turns it into a reliable system. One analogy for an agent harness is that of a car, where the model is the engine, and the car itself is the harness. So without the car, the engine just sits there revving and you're not really getting anywhere. While actually a better analogy is that of a horse and an actual harness. A wild horse has raw power, but it'll go wherever it wants. While the harness allows you to control the power, set it in a direction and get where you want to go. And one of the key insights from this post is that for long-running complex tasks, the harness design is as important as the model itself. Yesterday's article builds on their previous post from last November, which talks about effective harnesses for long running agents. 
And this article tackles the fundamental problem that the entire industry is trying to solve, which is how do you give an AI agent a complex goal and let it work away over hours or even days to achieve that? And this is where real value can be created. So from a dev perspective, it could be one-shotting a large feature or even an app itself. But in other industries, it could be a case of carrying out compliance audits that would normally take a full month's worth of effort. And the problem to be solved without a harness is that an AI agent might try to one-shot an entire app build in a single go. It might run out of context halfway through. It might leave work half-finished and undocumented. Or it might declare that the job is done early just to finish up. So Anthropic's original solution was a two-part system, an initializer agent that set up the environment and broke the project into features. It then created a progress tracking file and then a coding agent worked one feature at a time, committing to Git after each chunk, and then it would leave clear artifacts for the next coding agent to take over. So that way you're decomposing the work, you're making incremental progress and you're handing off context cleanly. And it's fair to say that Anthropic were not the first to come up with this idea of long-running agent harnesses. Geoffrey Huntley came up with the Ralph Wiggum loop a few months before, which essentially allows you to run an agent within a loop and to check the output against something that can't actually lie. So that could be a linter or a type checker, and it essentially keeps the loop progressing until it's done. So having these explicit stop conditions means that you can loop over and over with an agent to really make sure that it's actually finished the job. And then that gets more powerful if you bundle it with the likes of spec-driven development.
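To make the idea of looping against something that can't lie a bit more concrete, here is a minimal sketch in Python. It is not Geoffrey Huntley's or Anthropic's code. The names `run_until_verified`, `fake_agent`, and `syntax_check` are made up for illustration, the agent is a stand-in function rather than a real model call, and Python's built-in `compile` plays the role of the linter or type checker.

```python
from typing import Callable, Optional, Tuple


def run_until_verified(
    agent_step: Callable[[str], str],
    check: Callable[[str], Tuple[bool, str]],
    task: str,
    max_iterations: int = 10,
) -> Tuple[str, int]:
    """Re-run the agent until an external check passes or the budget is spent."""
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        # After a failure, the checker's message is fed back into the next prompt.
        prompt = task if feedback is None else f"{task}\n\nFix this failure:\n{feedback}"
        output = agent_step(prompt)
        passed, feedback = check(output)
        if passed:  # explicit stop condition decided by the checker, not the agent
            return output, iteration
    raise RuntimeError(f"check still failing after {max_iterations} iterations")


def fake_agent(prompt: str) -> str:
    """Hypothetical stand-in for a model call: broken first, fixed once given feedback."""
    if "Fix this failure" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b)\n    return a + b\n"  # missing colon


def syntax_check(source: str) -> Tuple[bool, str]:
    """A validator that cannot be talked into approving: the Python compiler."""
    try:
        compile(source, "<agent-output>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"


result, iterations = run_until_verified(fake_agent, syntax_check, "Write add(a, b).")
```

In this toy run the first output fails the syntax check and the second passes, so the loop stops after two iterations. The design point is that the agent never gets to decide for itself that the job is finished.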
Frameworks like BMAD, Spec Kit, or OpenSpec allow you to create structured requirements before the actual dev begins. And that way the agent isn't looping in isolation, it's working against a predefined plan. So these frameworks solve the problem of an agent underscoping the work to be done. But outside of external hard validation, it is the agent that's actually self-evaluating its work. So against that backdrop, even with these approaches of an agent harness and a Ralph Wiggum loop, Anthropic observed two common failure modes when agents executed against these types of tasks. And interestingly, these failure modes apply whether you're building an app or you're carrying out more general purpose work like a research pipeline or a content pipeline. And the first is what's known as context anxiety. As the context window fills up, the models don't just lose coherence, they actually change their behavior. They start wrapping up the conversation prematurely, they rush through steps and declare that things are done when they're not actually done. And you've probably noticed this yourself if you're chatting to an LLM in a single context window over a long time period, it eventually gets shorter and shorter with you. And there is a technique called context compaction, where the actual conversational thread is compacted and summarized to leave more room, more space for usable context. But Anthropic found that even with context compaction, models like Sonnet 4.5 still showed context anxiety and tried to finish early. And the reason for that is you're not starting with a clean slate. And that is why their original solution last November was a context reset. You start with a fresh context window, read the latest feature from the progress file, test features that were previously built and then build out your specific task. Once you're finished, trigger the structured handoff and then start with the next agent with a clean slate.
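The context reset described above can be sketched roughly like this. It is only an illustration under assumptions: the JSON layout of the progress file, the `handoff_note` field, and the functions `next_feature`, `run_session`, and `fake_build` are all hypothetical, and the "build" step is a stand-in for a real agent session. The step of re-testing previously built features is left out to keep the sketch short.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def next_feature(progress: dict) -> Optional[dict]:
    """Return the first feature not yet marked done, or None when all are done."""
    for feature in progress["features"]:
        if not feature["done"]:
            return feature
    return None


def run_session(progress_path: Path, build: Callable[[str], str]) -> bool:
    """One fresh-context session: read state from disk, build one feature, hand off."""
    progress = json.loads(progress_path.read_text())
    feature = next_feature(progress)
    if feature is None:
        return False  # nothing left to build
    # The only context carried into this session is what the progress file holds.
    done_names = [f["name"] for f in progress["features"] if f["done"]]
    context = f"Already built: {done_names}. Build next: {feature['name']}."
    feature["handoff_note"] = build(context)
    feature["done"] = True
    # Structured handoff: persist state so the next session can start clean.
    progress_path.write_text(json.dumps(progress, indent=2))
    return True


def fake_build(context: str) -> str:
    """Hypothetical stand-in for an agent session working on one feature."""
    return f"built with a context of {len(context)} characters"


with tempfile.TemporaryDirectory() as workdir:
    path = Path(workdir) / "progress.json"
    names = ("level editor", "sprite editor", "test mode")
    path.write_text(json.dumps({"features": [{"name": n, "done": False} for n in names]}))
    sessions = 0
    while run_session(path, fake_build):
        sessions += 1
    final_state = json.loads(path.read_text())
```

Each call to `run_session` starts from nothing but the file on disk, which is the whole point of a reset: no session inherits a half-full context window from the one before it.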
And as I mentioned, they found that Sonnet 4.5 exhibited these context anxiety symptoms and that's why they built this context reset system. Whereas Opus 4.5 does not have this problem to the same extent. And interestingly, in this blog post, when they moved to Opus 4.6, they found they didn't need to context reset at all. They were able to rely on context compaction and not have these anxiety symptoms where the LLM was trying to quit early. And what's interesting is Anthropic recently brought out Opus 4.6 with the 1 million token context window. And they claim that the retrieval quality generally holds up over the longer context. However, to be honest, I would take a bit of a cynical view on this. It's in Anthropic's best interest to process tokens. So I'm sure they're delighted if you're sending in a 1 million token request every time, even if some of that's going to be cached. So I definitely don't think it's the end of context resets the way they've been designed today. The second failure mode is quite interesting, which is poor self-evaluation. Anthropic haven't talked about this before, but essentially if you ask an agent to evaluate its own work, it's likely going to praise it. And as the engineer said, even if the quality is obviously mediocre to a human observer. So I think these are some interesting admissions. They talk about how Claude often produced outputs from a front-end design perspective that were bland at best. And when they were building this system, they penalized highly generic AI slop patterns. And this idea of self-evaluation is a tricky subject because it's different for subjective versus objective tasks. In this article, they were focused on front-end design, trying to make subjective quality gradable. And that's the real challenge even for non-coding AI use cases. How can you evaluate the quality of the writing style of an AI or the visual design of a graphic or the professionalism of a legal analysis, for example?
And I think that's why AI coding use cases have taken off so much in the last 12 months, because you have verifiable outputs where you can run linters, you can run type checkers, regression tests, browser tests, and essentially the AI can iterate on its own output. So this idea of making subjective qualities gradable means that an AI can actually evaluate them in a more objective way. Which then leads us to the main solution that Anthropic came to in this post, which is the idea of adversarial evaluation. So inspired by GAN networks where you would have a generator and a discriminator, here we have a generator agent which creates the code, creates the content, and you would have an evaluator agent whose job it is to judge the work, and then ideally grade it so that it's somewhat objective, and then they can send back that feedback to the main agent. And like a GAN network, the idea here is the tension between these agents should improve the quality. And they found that it was far harder in isolation to get the likes of a generator agent to be more skeptical about the work that they just created, versus having a dedicated QA or evaluator agent whose system prompt is dedicated to being skeptical about the work that it was just handed. And then once the generator agent has been given the feedback from the evaluator, it has something concrete to then iterate on. So I know what you're thinking, this is absolutely not new, multi-agent systems have existed for years now at this point. And the idea of an evaluator agent is no different to LLM-as-a-judge, for example. But I think the difference here is you essentially have an evaluator wired into a production loop, rather than used as an ad hoc evaluation tool, let's say. And from looking at the blog post, it really wasn't plain sailing, and another interesting admission, out of the box, Claude is a poor QA agent.
And they talk about how in early runs, Claude identified legitimate issues and then talked itself into deciding they weren't a big deal and approved the work anyway. And how it also tended to test superficially rather than probing edge cases, so more subtle bugs often slipped through.
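The generator and evaluator pairing described above might be wired together along these lines. This is a rough sketch, not Anthropic's harness: `Verdict`, `generate_and_evaluate`, `fake_generator`, and `strict_evaluator` are hypothetical names, and both agents are plain functions standing in for model calls with their own system prompts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    approved: bool
    feedback: str


def generate_and_evaluate(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Verdict],
    brief: str,
    max_rounds: int = 5,
) -> Tuple[str, List[Verdict]]:
    """Alternate between a generator and a separate, skeptical evaluator."""
    feedback = ""
    draft = ""
    history: List[Verdict] = []
    for _ in range(max_rounds):
        draft = generate(brief, feedback)
        verdict = evaluate(draft)  # the generator never grades its own work
        history.append(verdict)
        if verdict.approved:
            break
        feedback = verdict.feedback  # concrete feedback for the next round
    return draft, history


def fake_generator(brief: str, feedback: str) -> str:
    """Hypothetical generator: only covers the edge case once told about it."""
    draft = f"Implementation of: {brief}"
    if "empty input" in feedback:
        draft += " (handles empty input)"
    return draft


def strict_evaluator(draft: str) -> Verdict:
    """Hypothetical evaluator that probes an edge case instead of waving work through."""
    if "handles empty input" not in draft:
        return Verdict(False, "Fails on empty input; add handling and resubmit.")
    return Verdict(True, "")


final_draft, rounds = generate_and_evaluate(fake_generator, strict_evaluator, "CSV parser")
```

Here the first draft is rejected and the second is approved. Keeping the verdict as structured data rather than free text is one way to stop an evaluator from noting a problem and then approving anyway, since approval is a single explicit field.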

[9:04]So the evaluator agent took multiple rounds of iteration and refinement. It really wasn't a plug and play solution. And having gone through that experience, they found that there were three things that were required to really make an evaluator agent work. And the first one I've already touched on, which is the idea of making subjective quality gradable. From a front-end design perspective, as opposed to just asking is this design beautiful? Instead, it's does this follow our principles for good design, and then they define the principles. So the grading criteria they created, one was on design quality, another on originality, to avoid the AI slop, another on craft, which is technical execution, and another on functionality. The second learning was the need to weight the criteria towards the model's capabilities. As they found that Opus scored well on two out of those four criteria, but struggled on the other two. And through iteration, they ended up weighting those criteria more heavily to avoid AI slop patterns. The third learning was that you have to let the evaluator interact with the output. So they used the Playwright MCP so that the evaluator has tools it can use to actually navigate the app, screenshot it, and test it like a real user. And that's not to say there's not still room for more deterministic testing. In my last video, I talked about how Stripe created their concept of Minions, where each code change results in testing against a subset of their 3 million tests in their test suite. That's not being prompted by an evaluator agent, that is hard baked, hard coded into their deployment system. So within the case study, they carried out three experiments. The first one was on front-end design, where they prompted the model to create a website for a Dutch art museum.
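As a small worked example of the weighted grading criteria mentioned above, here is a sketch in Python. The four criteria names come from the video, but the weights, the 0 to 10 scale, and the sample scores are invented purely for illustration, and which two criteria get the heavier weights is an assumption as well.

```python
from typing import Dict

# Hypothetical weights: the two criteria assumed to be the model's weak spots
# count for more, so polished but generic output cannot score well.
WEIGHTS: Dict[str, float] = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (0 to 10) into one weighted grade."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Invented example scores for two candidate designs.
generic_but_polished = {"design_quality": 4, "originality": 2, "craft": 9, "functionality": 9}
distinctive = {"design_quality": 8, "originality": 8, "craft": 6, "functionality": 6}

generic_grade = weighted_score(generic_but_polished)
distinctive_grade = weighted_score(distinctive)
```

With equal weights the two designs would average 6.0 and 7.0, a fairly narrow gap. With these weights the generic design drops to roughly 4.8 while the distinctive one rises to roughly 7.4, which is the effect of weighting the criteria towards where the model struggles.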
And as you can see here, these are some of the earlier iterations, and after 10 rounds of feedback, they ended up with something that was completely unique, which is essentially a 3D room with a checkered floor. And they talked about how that was kind of a creative leap that I hadn't seen before from a single pass generation. Which is interesting when it comes to iterating on something that is so subjective like front-end design. So this is what that design harness looked like. It was essentially a single kind of one-sentence prompt, and then that went to a generator agent, which created HTML, CSS, JavaScript, goes to the evaluator agent, who has the ability to interact with it via Playwright MCP, and then after between 5 and 15 iterations of feedback, you end up with your finished product. And obviously, all of these harnesses were built on the Claude Agent SDK. But that's not to say that you can't use the learnings of how you create these effective harnesses, and build them on competitor models or local open source models. So the second experiment was to scale up to full-stack coding. So here they introduce the planner agent. And the prompt here was to create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode. And they ran this twice, once with a solo harness and once with a full harness of the planner, generator and evaluator. Obviously, a lot more expensive and a lot longer to run the full harness. But they found that from the solo run, which was a lot cheaper and faster, the game basically didn't really work. You know, it looked okay, but it was not actually playable in any sense. Whereas the game built by the full harness actually was functional. So within this version of the harness, that went to the planner agent, which then dramatically expanded out the spec and it split it into sprints. Now we're using Opus 4.5 here, which is important.
But within each sprint, there was a contract negotiation between the generator agent and the evaluator agent. It was defining the definition of done upfront before it actually started building anything. So that way the generator agent couldn't move the goalposts halfway through the build to say, ah, it's done. So then the evaluator agent evaluated once per sprint, and then there was a context reset and a handoff to the next sprint. So not having a planner agent would mean that you would dramatically underscope the project, because you've only sent in a single sentence of a game that you want built. And without the evaluator agent, the generator would just over-approve all its work and quit early. Then during these experiments, Opus 4.6 was released. So they created a second version of the harness, and they decided to simplify things. Because to build effective agents, you should always look to find the simplest solution possible, and not actually overcomplicate or over-engineer it. So they had some trial and error of trying to simplify the architecture, and they ended up with this. They removed the sprints, they removed the contract negotiation, they removed the context resets, so they relied on context compaction. So again, we had a single sentence prompt to build an app, went to the planner agent, expanded it into a much larger spec, and then that full spec was passed to the generator agent to essentially one-shot the entire app in one continuous session, and it was relying on the Claude Agent SDK, which has context compaction built into it. So the evaluator agent in this case only ran at the end of the full build, and that would then provide feedback so that the generator could then iterate on that feedback. So I think there is a level of Anthropic just showing off how good Opus 4.6 is, that it can build an entire app, that you don't need these harnesses. And at a 1 million token context window, it's not exactly going to be cheap.
But they do seem convinced that the context rot problem and the context anxiety problem don't exist in this version of the model. So for this version two of the harness, the prompt was to build a fully featured DAW in the browser using the Web Audio API. So this is a digital audio workstation, and here is the phase-by-phase breakdown. So the planner agent took 5 minutes and 50 cents, and then the first full build took 2 hours and $71. The evaluator or the QA agent took 10 minutes essentially, and then another hour for the second build, and then 10 minutes for the third build. All in all around 4 hours total, $125. And of course, all relying on the Claude Agent SDK, which has context compaction so that you can keep it running over the long term. Now, of course, you could build this yourself, you don't need to use the Claude Agent SDK. You could build the functionality that you need for these agents to run over the long horizon. And so this is the finished product. They did want a fully functional digital audio workstation. I don't think this is fully functional in the sense that as they say, it's far from a professional music production program. That being said, it cost $100 to create, so maybe that's not to be expected either. But it does seem to have some of the key features that you would actually need. And this brings about one of the big take-home points from this article, which is the idea of a harness evolution. Because every component in a harness essentially encodes an assumption that the model can't actually carry out that task itself. And the context reset is a perfect example of that. Sonnet 4.5 had context anxiety, so we had to build an entire handoff of context in a harness. So what Anthropic are saying is that those assumptions go stale as the models improve, as they saw with the leap to Opus 4.6. And they also found that the evaluator's worth depended on how good the model was as well.
So if you were really stretching the model to its limits, it really is important to have an evaluator to make sure it's done everything it needs to do. Whereas if you're asking the model to do something that's very much in its wheelhouse, you might not need an evaluator agent at all. And I suppose this is the real work of harness and context engineering because it's not ever a one-shot setup, you do need to refine and iterate as you go. As I mentioned earlier, we're in the middle of an AI builder series on this channel, where we build out a full-stack AI agent platform. And it's not using the Claude Agent SDK, it's a completely custom Python and React app with a Supabase backend. And last week we built a specialized contract review harness into this system. I'll leave a link for it in the card above. And if you're interested in going beyond the theory and actually building these harnesses for yourself, then check out the link in the description to our community, the AI Automators, where you can access our full AI Builder course and code base. We have a private community here of serious builders all creating specialized harnesses and advanced RAG systems. We'd love to see you here, so check out the link below. Thanks so much for watching and I'll see you in the next one.