The 5 Levels of AI Coding (Why Most of You Won't Make It Past Level 2)

AI News & Strategy Daily | Nate B Jones

25m 54s · 3,599 words · ~18 min read
Transcript source: YouTube auto captions

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.


[0:00] 90% of Claude Code was written by Claude Code. Codex is releasing features entirely written by Codex. And yet, most developers using AI empirically get slower, at least at first. The gap between these two facts is where the future of software lives.

Imagine hearing this at work: code must not be written by humans, code must not even be reviewed by humans. Those are the first two principles of a real production software team called StrongDM and their software factory. They're just three engineers. No one writes code, no one reviews code. The system is a set of AI agents orchestrated by markdown specification files. The system is designed to take a specification, build the software, test the software against real behavior scenarios, and independently ship it. All the humans do is write the specs and evaluate the outcomes. The machines do absolutely everything in between.

As I was saying, meanwhile 90%, and yes, it's true, over at Anthropic 90% of Claude Code's codebase was written by Claude Code itself. Boris Cherny, who leads the Claude Code project at Anthropic, hasn't personally written code in months. And Anthropic's leadership is now estimating that functionally 100%, the entirety of code produced at the company, is AI generated.

And yet, at the same time, in the same industry, with us here on the same planet, a rigorous 2025 randomized controlled trial by METR found that experienced open source developers using AI tools took 19% longer to complete tasks than developers working without them. There is a mystery here. They're not going faster, they're going slower. And here's the part that should really unsettle you. Those developers are bad at estimation. They believed AI had made them 24% faster. They were wrong not just about the direction, but about the magnitude of the change. Three teams are running lights-out software factories. The rest of the industry is getting measurably slower. Just a few teams around tech are running truly lights-out software factories.
The rest of the industry tends to get measurably slower while convincing themselves and everyone around them with press releases that they're speeding up. The distance between these two realities is the most important gap in tech right now, and almost nobody is talking honestly about it and what it takes to cross it. That is what this video is about.

Dan Shapiro, the CEO over at Glowforge and a veteran of multiple companies built on the boundary between software and physical products, just published a framework earlier this year in 2026 that maps where the industry stands. He calls it the five levels of vibe coding, and the name is deliberately informal because the underlying reality is what matters.

Level zero is what he calls spicy autocomplete. You type the code, the AI suggests the next line, you accept or reject. This is GitHub Copilot in its original format, just a faster tab key. The human is really writing the software here, and the AI is just reducing the keystrokes and the effort your fingers have to make.

Level one is coding intern. You hand the AI a discrete, well-scoped task, like write this function or build this component or refactor this module. You then review, as the human, everything that comes back. The AI handles the tasks; the human handles the architecture, the judgment, and the integration. Do you see the pattern here? Do you see how the human is stepping back more and more through these levels? Let's keep going.

Level two is the junior developer. The AI handles multi-file changes. It can navigate a codebase, it can understand dependencies, it can build features that span modules. You're reviewing more complicated output, but you as a human are still reading all of the code.
Shapiro estimates that 90% of developers who say they are AI native are operating at this level, and I think, from what I've seen, he's right. Software developers who operate here think they're farther along than they are. Let's move on.

Level three, the developer is now the manager. This is where the relationship starts to flip. This is where it gets interesting. You're no longer writing code and having the AI help, you're simply directing the AI and reviewing what it produces. Your day is deciding what to read, what to approve, and what to reject, but at the feature level, at the PR level. The model is doing the implementation, the model is submitting PRs for your review, and you have to have the judgment. Almost everybody tops out here right now. Most developers, Shapiro says, hit that ceiling at level three because they are struggling with the psychological difficulty of letting go of the code. But there are more levels, and this is where it gets spicy and exciting.

Level four is the developer as the product manager. You write a specification, you leave, you come back hours later and check whether the tests pass. You're not really reading the code anymore, you're just evaluating the outcomes. The code is a black box. You care whether it works, but because you have written your evals so completely, you don't have to worry too much about how it's written if it passes. This requires a level of trust both in the system and in your ability to write specs, and that quality of spec writing almost nobody has developed well yet.

Level five, the dark factory. This is effectively a black box that turns specs into software. It is where the industry is going. No human writes the code. No human even reviews the code. The factory runs autonomously with the lights off. Specification goes in, working software comes out. And, you know, Shapiro is correct. Almost nobody on the planet operates at this level.
The rest of the industry is mostly between level one and level three, and most of them are treating AI kind of like a junior developer. I like this framework because it gives us really honest language for a conversation that's been drowning in hype. When a vendor tells you their tool writes code for you, they often mean level one. When a startup says they're doing agentic software development, they often mean level two or three. But when StrongDM says their code must not be written by humans, they really do mean level five, the dark factory, and they actually operate there.

The gap between marketing language and operating reality is enormous. And collapsing that gap into what is actually going on on the ground requires changes that go way beyond picking a better AI tool. So many people look at this problem and think, this is a tool problem. It's not a tool problem, it's a people problem. So what does level five software development actually look like?
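
As an illustration only, the five levels described above can be written down as a small lookup table. This is a minimal sketch and not Shapiro's own formalization: the `Level` class, its field names, and the single boolean `human_reads_code` are assumptions that compress the spoken descriptions into one column each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    number: int
    name: str
    ai_scope: str           # what the AI produces at this level
    human_role: str         # what the human still does
    human_reads_code: bool  # simplification: does a human still read the code?


# Hypothetical encoding of the levels as described in the transcript.
LEVELS = [
    Level(0, "spicy autocomplete", "next-line suggestions", "writes the software", True),
    Level(1, "coding intern", "discrete, well-scoped tasks", "architecture, judgment, integration", True),
    Level(2, "junior developer", "multi-file changes and features", "reads all of the code", True),
    Level(3, "developer as manager", "implementation and PRs", "directs and reviews at the PR level", True),
    Level(4, "developer as product manager", "everything between spec and tests", "writes specs, checks outcomes", False),
    Level(5, "dark factory", "spec in, working software out", "writes specs, evaluates outcomes", False),
]

# The levels at which the code has become a black box to the humans.
black_box_levels = [level.number for level in LEVELS if not level.human_reads_code]
print(black_box_levels)
```

Reading the table top to bottom shows the pattern the talk points at: the AI's scope widens at every step, and the line where humans stop reading code falls between level three and level four.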

[7:31] I think StrongDM's software factory is the most thoroughly documented example of level five in production. Simon Willison, one of the most careful and credible observers in the developer tooling space, calls StrongDM's software factory, quote, the most ambitious form of AI assisted software development that I've seen yet. The details are really worth digging into here because they reveal what it looks like to run a dark factory for software on today's agents. And as we have this discussion, I want you to keep in mind that for most of us listening, we are getting to time travel. We are seeing how a bold vision for the future can be translated into reality with today's agents and today's agentic harnesses. It is only going to get easier as we go into 2026, which is one of the reasons I think this is going to be a massive center of gravity for future agentic software development practices. We are all going to level five.

So, what does StrongDM do? The team is three people: Justin McCarthy, the CTO, Jay Taylor, and Navan Chauhan. They've been running the factory since July of last year, actually. And the inflection point they identify is Claude 3.5 Sonnet, which shipped in the fall of 2024. That's when long-horizon agentic coding started compounding correctness more than compounding errors. Give them credit for thinking ahead. Almost no one was thinking in terms of dark factories that far back. But they found that 3.5 Sonnet could sustain coherent work across sessions, long enough that the output was reliable, and it wasn't just a flash in the pan, it wasn't just demo worthy. And so they built around it.

The factory runs on an open source coding agent called Attractor. The repo is just three markdown specification files, and that's it. That's the agent. The specifications describe what the software should do. The agent reads them, it writes the code, and it tests it. And here's where it gets really interesting.
And where most people's mental model really starts to break down. StrongDM doesn't actually use traditional software tests. They use what they call scenarios. And the distinction is important. Tests typically live inside the codebase. The AI agent can read them, which means the AI agent can, intentionally or not, optimize for passing the tests rather than building correct software. It's the same problem as teaching to the test in education. You can get perfect scores and shallow understanding.

Scenarios are different. Scenarios live outside the codebase. They're behavioral specifications that describe what the software should do from an external perspective, stored separately so the agent cannot see them during development. They function as a holdout set, the same concept that machine learning uses to prevent overfitting. The agent builds the software, and the scenarios evaluate whether the software actually works. The agent never sees the evaluation criteria. It can't game the system.

This is really a new idea in software development, and I don't see it implemented very frequently yet. But it solves a problem that nobody was thinking about when all the code was written by humans. When humans write code, we don't tend to worry about the developer gaming their own test suite unless incentives are really, really skewed at that organization, and then you have bigger problems. When AI writes the code, optimizing for test passage is the default behavior unless you deliberately architect around it. And it's one of the most important differences to really understand as you start to think about AI as a code builder. StrongDM architected around that with external scenarios.

The other major piece of the architecture is what StrongDM calls their digital twin universe: behavioral clones of every external service the software interacts with. A simulated Jira, a simulated Slack, Google Docs, Google Drive, Google Sheets.
The AI agents develop against these digital twins, which means they can run full integration testing scenarios without ever touching real production systems, real APIs, or real data. It's a complete simulated environment, purpose built for autonomous software. And the output is real. CXDB, their AI context store, has 16,000 lines of Rust, nine and a half thousand lines of Go, and 700 lines of TypeScript. It's shipped, it's in production, it works, it's real software, and it's built by agents end to end.

And then the metric that tells you how seriously they take it. They say, if you haven't spent a thousand per human engineer, your software factory has room for improvement. I think they're right. That's not a joke. A thousand dollars per engineer per day enables AI agents to run at a volume that makes the cost of compute meaningful if you are giving them a mission to build software that has real scale and real utility in production use cases. And it's often still cheaper than the humans they're replacing.

Let's hop over and look at what the hyperscalers are doing. The self-referential loop has taken hold at both Anthropic and OpenAI, and it's stranger than the hype might make it sound. Codex 5.3 is the first frontier AI model that was instrumental in creating itself. And that's not a metaphor. Earlier builds of Codex would analyze training logs, would flag failing tests, and might suggest fixes to training scripts. But this model shipped as a direct product of its own predecessor's coding labor. OpenAI reported a 25% speed improvement and 93% fewer wasted tokens in the effort to build Codex 5.3. And those improvements came in part from the model identifying its own efficiencies during the build process. Isn't that wild?

Claude Code is doing something similar. 90% of the code in Claude Code, including the tool itself, was built by Claude Code, and that number is rapidly converging toward 100%. Boris Cherny isn't joking when he talks about not writing code in the last few months.
He's simply saying his role has shifted to specification, to direction, to judgment. Anthropic is estimating all of their company moving to entirely AI generated code about now. Everyone at Anthropic is architecting, and the machines are implementing. And the downstream numbers tell the same story. When I made a video on Cowork and talked about how it was written in 10 days by four engineers, what I want you to remember is it wasn't just four engineers hyper-typing so that they could get that out super fast and write every line by hand. No, no, no. They were directing machines to build the code for Cowork, and that's why it was so fast.

4% of public commits on GitHub are now directly authored by Claude Code, a number that Anthropic thinks will exceed 20% by the end of this year. I think they're probably right. Claude Code by itself has hit a billion dollar run rate just six months since launch. This is all real today, in February of 2026. The tools are building themselves, they're improving themselves, they're enabling us to go faster at improving themselves. And that means the next generation is going to be faster and better than it would have been otherwise, and we're going to keep compounding. The feedback loop on AI has closed. And the question is not whether we're going to start using AI to improve AI. The question is how fast that loop is going to accelerate and what it means for the 40 or 50 million of us around the world who currently build software for a living.

This is true for vendors as much as it's true for software developers, and I don't think we talk about that enough. Because the gap between what's possible at the frontier in February of 2026 and what tends to happen in practice and what vendors want to sell has never been wider. That METR study, a randomized controlled trial by the way, not a survey, found that open source developers using AI coding tools completed their tasks 19% slower. We talked about that, right?
The researchers controlled for task difficulty, they controlled for developer experience, they controlled even for tool familiarity, and none of it mattered. AI made even experienced developers slower. Why? In a world where Cowork can ship that fast, why? Because the workflow disruption outweighed the generation speed. Developers spent time evaluating AI suggestions, correcting almost-right code, context switching between their own mental model and the model's output, and debugging really subtle errors introduced by generated code that looked correct but wasn't. 46% of developers in broader surveys say they don't fully trust AI generated code. These guys aren't Luddites, right? These are experienced engineers running into a consistent problem. The AI is fast, but it struggles with the reliability to trust without what they view as vital human review.

And this irony is the J curve that adoption researchers keep identifying. When you bolt an AI coding assistant onto an existing workflow, productivity dips before it gets better. It goes down like the bottom of a J. Sometimes for a while, sometimes for months. And the dip happens because the tool changes the workflow, but the workflow has not been redesigned around the tool explicitly. And so you're kind of running a new engine on an old transmission. The gears are going to grind. Most organizations are sitting in the bottom of that J curve right now, and many of them are interpreting the dip as evidence that AI tools don't work, that the vendors did not tell them the truth, and that the evidence that their workflows haven't adapted is really evidence that AI is hype and not real.

I think GitHub Copilot might be the clearest illustration of this. It has 20 million users, 42% market share among AI coding tools, apparently. And lab studies show 55% faster code completion on isolated tasks. I'm sure that makes the people driving GitHub Copilot happy in their slide decks, but in production the story is much more complicated.
There are larger pull requests, there are higher review costs. There are more security vulnerabilities introduced by generated code, and developers are wrestling with how to do it well. One senior engineer put it really sharply: Copilot makes writing code cheaper, but owning it more expensive. And that is actually a very common sentiment I've heard across a lot of engineers in the industry. Not just for Copilot, but for AI generated code in general.

The organizations that are seeing significant, call it 25, 30% or more, productivity gains with AI are not the ones that just installed Copilot, had a one-day seminar, and called it done. They're the ones that thought carefully, went back to the whiteboard, and redesigned their entire development workflow around AI capabilities. Changing how they write their specs, changing how they review their code, changing what they expect from junior versus senior engineers, changing their CI/CD pipelines to catch the new category of errors that AI generated code introduces. End-to-end process transformation. It's not about tool adoption.

An end-to-end transformation is hard, it's sometimes politically contentious, it's expensive, it's slow, and most companies don't have the stomach for it. Which is why most companies are stuck at the bottom of the J curve, which is why the gap between frontier teams and everyone else is not just widening, it's accelerating rapidly, because those teams on the edge that are running dark factories are positioned to gain the most as tools like Opus 4.6 and Codex 5.3 enable widespread agentic powers for every software engineer on the planet. 95% of those software engineers don't know what to do with that. It's the ones that are actually operating at level four, level five, that truly get the multiplicative value of these tools. So if this is a politically contentious problem, if this is not just a tool problem but a people problem, we need to look at the nature of our software organizations.
Most software organizations were designed to facilitate people building software. Every process, every ceremony, every role, they exist because humans building software in teams need coordination structures. Stand-up meetings exist because developers working on the same codebase have got to synchronize every single day. Sprint planning exists because humans can only hold a certain number of tasks in working memory, and then they need a regular cadence to reprioritize. Code review exists because humans make mistakes that other humans can catch. QA teams exist because the people who build software can't evaluate it objectively. You get the idea. Every one of these structures is a response to a human limitation. And when the human is no longer the one writing the code, the structures, they're not optional, they're friction.

So what does sprint planning look like when the implementation happens in hours, not weeks? What does code review look like when no human wrote the code and no human can really review the diff that AI produced in 20 minutes, because it's going to produce another one in 20 more minutes? What does a QA team do when the AI already tested against scenarios it was never shown? StrongDM's three-person team doesn't have sprints. They don't have standups, they don't have a Jira board. They write specs and they evaluate outcomes. That is it. The entire coordination layer that constitutes the operating system of a modern software organization, the layer that most engineering managers spend 60% of their time maintaining, is just deleted. It does not exist. Not because it was eliminated as a cost saving measure, but because it no longer serves a purpose.

This is the structural shift that's harder to see than the tech shift, and it might matter more. The question is becoming what happens to the organizational structures that were built for a world where humans write code. What happens to the engineering manager whose primary value is coordination?
What happens to the scrum master, the release manager, the technical program manager whose job is to make sure a dozen teams ship on time? Look, those roles don't disappear overnight, but the center of gravity is shifting. The engineering manager's value is moving from "coordinate the team building the feature" to "define the specification clearly enough that agents build the feature." The program manager's value is moving from "track dependencies between human teams" to "architect the pipeline of specs that flow through the factory." The skills that matter are shifting very rapidly from coordination to articulation.
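
The hold-out scenario idea described in the StrongDM section can be sketched in a few lines. This is a hedged illustration, not StrongDM's or Attractor's actual implementation: the `Scenario` type, the `SPEC` string, the `agent_build` stand-in, and the `slugify` task are all hypothetical. The one point it tries to show is the separation itself: the builder receives only the spec, and the evaluator owns scenarios the builder never sees.

```python
from typing import Callable, NamedTuple


class Scenario(NamedTuple):
    description: str
    given: str
    expected: str


# Visible to the agent: only the specification (hypothetical example task).
SPEC = "slugify(title): lowercase the title and join its words with hyphens."

# Hidden from the agent: held-out behavioral scenarios owned by the evaluator.
HELD_OUT = [
    Scenario("simple title", "Hello World", "hello-world"),
    Scenario("extra whitespace", "  Dark   Factory ", "dark-factory"),
]


def agent_build(spec: str) -> Callable[[str], str]:
    """Stand-in for a coding agent: it is handed the spec and nothing else."""
    def slugify(title: str) -> str:
        return "-".join(title.lower().split())
    return slugify


def evaluate(candidate: Callable[[str], str], scenarios: list[Scenario]) -> float:
    """Return the fraction of held-out scenarios the candidate satisfies."""
    passed = sum(candidate(s.given) == s.expected for s in scenarios)
    return passed / len(scenarios)


candidate = agent_build(SPEC)
score = evaluate(candidate, HELD_OUT)
print(f"held-out pass rate: {score:.0%}")
```

Because `agent_build` has no access to `HELD_OUT`, a passing score cannot come from tuning against the evaluation data, which is the same reason a hold-out set is kept away from training in machine learning. In a real pipeline the separation would have to be enforced by the environment (separate storage and permissions), not just by function arguments.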