No Vibes Allowed: Solving Hard Problems in Complex Codebases – Dex Horthy, HumanLayer

AI Engineer

20m 5s · 4,494 words · ~23 min read

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:20] Hi everybody. How y'all doing? It's exciting. I'm Dex. As they said in the great intro, I've been hacking on agents for a while. Our talk "12-Factor Agents" at AI Engineer in June was one of the top talks of all time, I think top eight or something. One of the best ones from AI Engineer in June. May or may not have said something about context engineering.

Why am I here today? What am I here to talk about? I want to talk about one of my favorite talks from AI Engineer in June. I know we all got the update from Yegor yesterday, but they wouldn't let me change my slides, so this is going to be about what Yegor talked about in June. Basically, they surveyed 100,000 developers across all company sizes and found that most of the time you use AI for software engineering, you're doing a lot of rework, a lot of codebase churn. And it doesn't really work well for complex tasks or brownfield codebases. You can see in the chart that you are shipping a lot more, but a lot of it is just reworking the slop that you shipped last week. The other side was that if you're doing a greenfield little Vercel dashboard, something like this, then it's going to work great. If you're going into a ten-year-old Java codebase, maybe not so much.

This matched my experience personally, and from talking to a lot of founders and great engineers: too much slop, tech debt factory, this just isn't going to work for our codebase. Like, maybe someday when the models get better. But that's what context engineering is all about. How can we get the most out of today's models? How do we manage our context windows? We talked about this in August. I have to confess something. The first time I used Claude Code, I was not impressed. It was like, okay, this is a little bit better, I get it, I like the UX.
But since then, we as a team figured something out, and we were actually able to get two to three times more throughput. We were shipping so much that we had no choice but to change the way we collaborated. We rewired everything about how we build software. It was a team of three, it took eight weeks, and it was really freaking hard. But now that we've solved it, we're never going back. This is the whole no-slop thing. I think we got somewhere with this. It went super viral on Hacker News in September. We have thousands of folks who have gone onto GitHub and grabbed our research-plan-implement prompt system.

So the goals here, which we kind of backed our way into: we need AI that can work well in brownfield codebases and that can solve complex problems. No slop, right? No more slop. And we had to maintain mental alignment. I'll talk a little bit more about what that means in a minute. And of course, as with everything, we want to spend as many tokens as possible. What we can offload meaningfully to the AI is really, really important. Super high leverage.

So, this is advanced context engineering for coding agents. I'll start by framing this. The most naive way to use a coding agent is to ask it for something, then tell it why it's wrong and re-steer it, and ask and ask and ask until you run out of context, or you give up, or you cry. We can be a little bit smarter about this. Most people discover pretty early in their AI exploration that if you start a conversation and you get off track, it might be better to just start a new context window. You say, okay, we went down that path, let's start again: same prompt, same task, but this time we're going to go down this path, and don't go over there, because that doesn't work. So how do you know when it's time to start over? If you see this, it's probably time to start over, right?
This is what Claude says when you tell it it's screwing up. So we can be even smarter about this. We can do what I call intentional compaction. Basically, whether you're on track or not, you can take your existing context window and ask the agent to compress it down into a markdown file. You can review this, you can tag it, and then when the new agent starts, it gets straight to work instead of having to do all that searching and codebase understanding and getting caught up.

What goes into compaction? Well, the question is, what takes up space in your context window? It's looking for files, it's understanding code flow, it's editing files, it's test and build output. And if you have one of those MCPs that's dumping JSON and a bunch of UUIDs into your context window, God help you. So what should we compact? I'll get more into the specifics later. But this is a really good compaction: this is exactly what we're working on, the exact files and line numbers that matter to the problem that we're solving.

Why are we so obsessed with context? Because LLMs are pure func... I actually got roasted on YouTube for this one. They're not pure functions, because they're non-deterministic, but they are stateless. And the only way to get better performance out of an LLM is to put better tokens in, and then you get better tokens out. On every turn of the loop, when Claude or any coding agent is picking the next tool, there could be hundreds of right next steps and hundreds of wrong next steps. But the only thing that influences what comes out next is what is in the conversation so far. So we're going to optimize this context window for correctness, completeness, size, and a little bit of trajectory. And the trajectory one is interesting, because a lot of people say, well, I told the agent to do something, and it did something wrong.
So I corrected it, and I yelled at it, and then it did something wrong again, and then I yelled at it. And then the LLM is looking at this conversation and says, okay, cool, I did something wrong and the human yelled at me, and then I did something wrong and the human yelled at me. So the most likely next token in this conversation is: I'd better do something wrong so the human can yell at me again. So be mindful of your trajectory.

If you were going to invert this, the worst thing you can have is incorrect information, then missing information, and then just too much noise. If you like equations, there's a dumb equation if you want to think about it this way. Geoff Huntley did a lot of research on coding agents, and he put it really well: the more you use the context window, the worse the outcomes you'll get.

This leads to a very, very academic concept called the dumb zone. So you have your context window. You have 168,000 tokens, roughly. Some are reserved for output and compaction, and this varies by model, but we'll use Claude Code as an example here. Around the 40% line is where you're going to start to see some diminishing returns, depending on your task. If you have too many MCPs in your coding agent, you are doing all your work in the dumb zone and you're never going to get good results. People have talked about this, so I'm not going to talk about that one. Your mileage may vary. The 40% depends on how complex the task is, but it's a good guideline.

So, back to compaction, or as I will call it from now on, cleverly avoiding the dumb zone. We can do subagents. If you have a frontend subagent, and a backend subagent, and a QA subagent, and a data scientist subagent, please stop. Subagents are not for anthropomorphizing roles. They are for controlling context.
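
Editor's sketch: the 40% guideline and the idea of compacting into a markdown file can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the speaker's tooling. The 168,000-token window and the 40% line are the rough figures quoted in the talk; the names `smart_zone_limit`, `should_compact`, `FileRef`, and `Compaction`, and the example file paths, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Rough figures quoted in the talk; both vary by model and by task.
CONTEXT_WINDOW_TOKENS = 168_000
SMART_ZONE_PERCENT = 40


def smart_zone_limit(window_tokens: int = CONTEXT_WINDOW_TOKENS,
                     percent: int = SMART_ZONE_PERCENT) -> int:
    """Token count at which diminishing returns are assumed to start."""
    return window_tokens * percent // 100


def should_compact(used_tokens: int) -> bool:
    """True once the conversation has reached the assumed 'dumb zone' line."""
    return used_tokens >= smart_zone_limit()


@dataclass
class FileRef:
    """One file region that matters to the problem (hypothetical structure)."""
    path: str
    start_line: int
    end_line: int
    note: str


@dataclass
class Compaction:
    """What a fresh agent needs in order to get straight to work."""
    goal: str
    status: str
    files: List[FileRef] = field(default_factory=list)
    dead_ends: List[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        lines = ["# Goal", self.goal, "", "## Status", self.status,
                 "", "## Relevant files"]
        for ref in self.files:
            lines.append(
                f"- `{ref.path}:{ref.start_line}-{ref.end_line}`: {ref.note}"
            )
        if self.dead_ends:
            lines += ["", "## Approaches that did not work"]
            lines += [f"- {item}" for item in self.dead_ends]
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    summary = Compaction(
        goal="Fix the off-by-one in the parser",        # illustrative task
        status="Root cause located, fix not yet written",
        files=[FileRef("src/parser.rs", 10, 42, "entry point")],
        dead_ends=["Patching the lexer instead of the parser"],
    )
    if should_compact(used_tokens=70_000):
        print(summary.to_markdown())
```

The markdown deliberately carries only the goal, the current status, exact files and line ranges, and known dead ends, which mirrors what the talk describes as a good compaction.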
So if you want to find how something works in a large codebase, you can steer the coding agent to do this if it supports subagents, or you can build your own subagent system. Basically, you say, hey, go find how this works. And it can fork out a new context window that is going to do all that reading and searching and finding, reading entire files and understanding the codebase, and then just return a really succinct message back up to the parent agent, like, hey, the file you want is here. The parent agent can read that one file and get straight to work. This is really powerful. If you wield these correctly, you can get good responses like this, and then you can manage your context really well.

What works even better than subagents, or as a layer on top of subagents, is a workflow I call frequent intentional compaction. We're going to talk about research, plan, implement in a minute, but the point is you're constantly keeping your context window small. You're building your entire workflow around context management. It comes in three phases: research, plan, implement. And we're going to try to stay in the smart zone the whole time.

Research is all about understanding how the system works, finding the right files, and staying objective. Here's a prompt you can use to do research. Here's the output of a research prompt. These are all open source, so you can go grab them and play with them yourself. In planning, you're going to outline the exact steps. You're going to include file names, lines, and snippets. You're going to be very explicit about how we're going to test things after every change. Here's a good planning prompt. Here's one of our plans. It's got actual code snippets in it. And then we're going to implement. If you've read one of these plans, you can see very easily how the dumbest model in the world is probably not going to screw this up.
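
Editor's sketch: the three phases can be read as three separate context windows, where each phase receives only the compacted artifact of the previous one. The code below is a minimal illustration of that shape, not the open source prompts mentioned in the talk. `run_agent` is a hypothetical callable that stands for "start a fresh agent session with this prompt and return its final text", `review` is a hypothetical hook where a human reads and possibly edits each artifact, and the prompt wording is invented for the example.

```python
from typing import Callable, Dict

# Hypothetical interfaces, assumed for illustration only.
Agent = Callable[[str], str]         # prompt -> final text, fresh context per call
Review = Callable[[str, str], str]   # (phase name, artifact) -> approved artifact


def no_review(phase: str, artifact: str) -> str:
    """Default hook: pass the artifact through unchanged."""
    return artifact


def research_plan_implement(task: str,
                            run_agent: Agent,
                            review: Review = no_review) -> Dict[str, str]:
    # Phase 1: understand how the system works; output is a research document.
    research = review("research", run_agent(
        "PHASE: research\n"
        f"Task: {task}\n"
        "Describe how the relevant code works today. "
        "Cite exact files and line numbers. Stay objective."
    ))

    # Phase 2: a new context that sees the research document, not the searching.
    plan = review("plan", run_agent(
        "PHASE: plan\n"
        f"Task: {task}\n\n"
        f"Research:\n{research}\n\n"
        "Outline exact steps with file names, line numbers, code snippets, "
        "and how to test after every change."
    ))

    # Phase 3: a new context that sees only the approved plan.
    result = run_agent(
        "PHASE: implement\n"
        f"Plan:\n{plan}\n\n"
        "Follow the plan step by step and run the tests it names."
    )
    return {"research": research, "plan": plan, "result": result}


if __name__ == "__main__":
    def echo_agent(prompt: str) -> str:
        # Stand-in for a real coding agent: reports which phase it was given.
        return f"<{prompt.splitlines()[0]} output>"

    artifacts = research_plan_implement("fix the parser bug", echo_agent)
    print(artifacts["result"])
```

The `review` hook is where the human reads the research and the plan before anything is implemented. Passing only the previous artifact forward is what keeps each phase's context small.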
So we just go through and run the plan, and we keep the context low. Here's the planning prompt. Like I said, it's the least exciting part of the process.

I wanted to put this into practice. It was working for us. I do a podcast with my buddy Vaibhav, who's the CEO of a company called Boundary ML. And I said, hey, I'm going to try to one-shot a fix to your 300,000-line Rust codebase for a programming language. The whole episode goes into it; it's like an hour and a half. I'm not going to talk through it right now, but we built a bunch of research, then we threw it out because it was bad, and then we made a plan without research and with research, and compared all the results. It's a fun time. That was Monday night. By Tuesday morning, we were on the show, and the CTO had seen the PR and didn't realize I was doing it as a bit for a podcast, and basically was like, yeah, this looks good, we'll get it in the next release. I think he was a little confused. Here's the plan. But anyway, confirmed: works in brownfield codebases, and no slop.

But I wanted to see if we could solve complex problems. Vaibhav was still a little skeptical. We sat down for like seven hours on a Saturday, and we shipped 35,000 lines of code to BAML. One of the PRs got merged about a week later. I will say, some of this is codegen. You update your behavior, the old golden files update, and stuff. But we shipped a lot of code that day. He estimates it was about one to two weeks of work in seven hours. So cool, we can solve complex problems.

There are limits to this. I sat down with my buddy Blake. We tried to remove Hadoop dependencies from Parquet Java. If you know what Parquet Java is, I'm sorry for whatever happened to you to get you to this point in your career. It did not go well. Here are the plans. Here's the research.
At a certain point, we threw everything out and actually went back to the whiteboard. Once we had learned where all the footguns were, we went back to, okay, how is this actually going to fit together? And this brings me to a really interesting point that Jake's going to talk about later. Do not outsource the thinking. AI cannot replace thinking. It can only amplify the thinking you have done, or the lack of thinking you have done.

So people ask, Dex, this is spec-driven development, right? No. Spec-driven development is broken. Not the idea, but the phrase. It's not well defined. This is Birgitta from Thoughtworks. A lot of people just say "spec" and they mean a more detailed prompt. Does anyone remember this picture? Does anyone know what this is from? All right, that's a deep cut. There will never be a year of agents, because of semantic diffusion. Martin Fowler described this in 2006. We come up with a good term with a good definition, and then everybody gets excited, and everybody starts using it to mean 100 things to 100 different people, and it becomes useless. We had: an agent is a person, an agent is a microservice, an agent is a chatbot, an agent is a workflow. And thank you, Simon, we're back to the beginning: an agent is just tools in a loop.

This is happening to spec-driven dev. I used to have Sean's slide at the beginning of this talk, but it caused a bunch of people to focus on the wrong things. His thing was like, forget the code, it's like assembly now, and you just focus on the markdown. Very cool idea. But people say spec-driven dev is writing a better prompt, or a product requirements document. Sometimes it's using verifiable feedback loops and back pressure. Maybe it is treating the code like assembly, like Sean taught us. But a lot of people are just using a bunch of markdown files while they're coding.
Or my favorite, which I just stumbled upon last week: a spec is documentation for an open source library. So if SDD is hype slop, what do I need? It's semantically diffused. So I want to talk about four things that actually work today, the tactical and practical steps that we've found working internally and with a bunch of users. We do the research, we figure out how the system works. Remember Memento? This is the best movie on context engineering, as Peter says. A guy wakes up, he has no memory, and he has to read his own tattoos to figure out who he is and what he's up to. If you don't onboard your agents, they will make stuff up.

So if this is your team, and this is very simplified, since most of you have much bigger orgs than this, let's say you want to do some work over here. One thing you could do is put onboarding into every repo. You put in a bunch of context: here's the repo, here's how it works. This is a compression of all the context in the codebase that the agent can see ahead of time, before actually getting to work. This is challenging because sometimes it gets too long. As your codebase gets really big, you either have to make this longer or you have to leave information out. So as you're reading through this, you're going to read the context of this big five-million-line monorepo, and you're going to use up all the smart zone just to learn how it works, and you're not going to be able to do any good tool calling in the dumb zone.

So you can shard this down the stack. They were just talking about progressive disclosure. You could split this up, right? You could put a file in the root of every repo, and then at every level you have additional context: if you're working here, this is what you need to know. We don't document the files themselves, because they're the source of truth.
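
Editor's sketch: sharding context down the stack amounts to loading only the context files on the path from the repo root to the directory being worked in. The code below is one way that lookup could be written, as a minimal sketch under stated assumptions. The file name `CONTEXT.md` and the function names are hypothetical; the talk does not prescribe a file name or a mechanism.

```python
from pathlib import Path
from typing import List

CONTEXT_FILENAME = "CONTEXT.md"  # hypothetical name, not prescribed by the talk


def context_files_for(repo_root: Path, working_dir: Path,
                      filename: str = CONTEXT_FILENAME) -> List[Path]:
    """Return context files from the repo root down to working_dir, root first.

    Directories off that path are never read, so their context costs no tokens.
    Raises ValueError if working_dir is not inside repo_root.
    """
    repo_root = repo_root.resolve()
    working_dir = working_dir.resolve()
    relative = working_dir.relative_to(repo_root)

    found: List[Path] = []
    current = repo_root
    # Visit the root itself, then each directory level below it.
    for part in ("",) + relative.parts:
        if part:
            current = current / part
        candidate = current / filename
        if candidate.is_file():
            found.append(candidate)
    return found


def load_context(repo_root: Path, working_dir: Path) -> str:
    """Concatenate the applicable context files, most general first."""
    sections = []
    for path in context_files_for(repo_root, working_dir):
        sections.append(f"<!-- {path} -->\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(sections)
```

For example, with context files at the repo root and in `services/billing`, calling `load_context(root, root / "services" / "billing")` would return the root file followed by the billing file, and would skip a sibling such as `web/CONTEXT.md`.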
But then as your agent is working, you pull in the root context, and then you pull in the sub-context. We won't talk about any specifics. You could use CLAUDE.md for this, you could use hooks for this, whatever it is. But then you still have plenty of room in the smart zone, because you're only pulling in what you need to know. The problem with this is that it gets out of date. Every time you ship a new feature, you need to cache-invalidate and rebuild large parts of this internal documentation. And you could use a lot of AI and make it part of your process to update this.

I want to ask a question. Between the actual code, the function names, the comments, and the documentation, does anyone want to guess what is on the y-axis of this chart? Slop? It's actually the amount of lies you can find in any one part of your codebase. So you could make it part of your process to update this, but you probably shouldn't, because you probably won't.

What we prefer is on-demand compressed context. If I'm building a feature that relates to SCM providers and Jira and Linear, I would just give it a little bit of steering. I would say, hey, we're going over in this part of the codebase over here. And a good research prompt, or slash command, or even a skill, might launch a bunch of subagents to take these vertical slices through the codebase and then build up a research document that is just a snapshot of the actually-true, based-on-the-code-itself parts of the codebase that matter. We are compressing truth.

Planning is leverage. Planning is about compression of intent. In the plan, we're going to outline the exact steps. We take our research and our PRD or bug ticket or whatever it is, and we create a plan file. So we're compacting again. And I want to pause and talk about mental alignment. Does anyone know what code review is for? Mental alignment.
Code review is about making sure things are correct and stuff, but the most important thing is: how do we keep everybody on the team on the same page about how the codebase is changing and why? And I can read a thousand lines of Golang every week... sorry, I can't read a thousand. It's hard. I can do it. I don't want to. As our team grows, all the code still gets reviewed. We don't not read the code. But as a technical leader on the team, I can read the plans and keep up to date, and that's enough. I can catch some problems early, and I maintain an understanding of how the system is evolving.

Mitchell had this really good post about how he's been putting his Amp threads on his pull requests, so that you can see not just, hey, here's a wall of green text in GitHub, but here are the exact steps, here are the prompts, and hey, I ran the build at the end and it passed. This takes the reviewer on a journey in a way that a GitHub PR just can't. As you're shipping two to three times as much code, it's really on you to find ways to keep your team on the same page and show them: here are the steps I did, and here's how we tested it manually.

Your goal is leverage. You want high confidence that the model will actually do the right thing. I can't read this plan and know what is actually going to happen and what code changes are going to happen. So over time we've iterated toward plans that include actual code snippets of what's going to change. Your goal is leverage: you want compression of intent and you want reliable execution. I have a physics background, and we like to draw lines through the center of peaks and curves. As your plans get longer, reliability goes up and readability goes down. There's a sweet spot for you and your team and your codebase. You should try to find it.
Because when we review the research and the plans, if they're good, then we can get mental alignment. Don't outsource the thinking. I've said this before. This is not magic. There is no perfect prompt. You still have to read the plan. So we built our entire process around you, the builder, going back and forth with the agent, reading the plans as they're created. And then if you need peer review, you can send it to someone and say, hey, does this plan look right? Is this the right approach? Is this the right order to look at these things?

Jake, again, wrote a really good blog post about how the thing that makes research-plan-implement valuable is you, the human in the loop, making sure it's correct. So if you take one thing away from this talk, it should be this: a bad line of code is a bad line of code. A bad part of a plan could be 100 bad lines of code. And with a bad line of research, like a misunderstanding of how the system works and where things are, your whole thing is going to be hosed. You're going to be sending the model off in the wrong direction. So when we're working internally and with users, we're constantly trying to move human effort and focus to the highest-leverage parts of this pipeline.

Don't outsource the thinking. Watch out for tools that just spew out a bunch of markdown files to make you feel good. I'm not going to name names here. Sometimes this is overkill. The way I like to think about it is, you don't always need a full research-plan-implement. Sometimes you need more, sometimes you need less. If you're changing the color of a button, just talk to the agent and tell it what to do. If it's a small feature, do a simple plan. If you're doing medium features across multiple repos, then do one research, then build a plan. Basically, the ceiling on the hardest problem you can solve goes up the more of this context engineering and compaction you're willing to do.
So if you're in the top right corner, you're probably going to have to do more. A lot of people ask me, how do I know how much context engineering to use? It takes reps. You will get it wrong. You have to get it wrong over and over and over again. Sometimes you'll go too big, sometimes you'll go too small. Pick one tool and get some reps. I recommend against min-maxing across Claude and Codex and all these different tools.

I'm not a big acronym guy. We said spec-driven dev was broken. Research, plan, and implement, I don't think, will be the steps. The important part is compaction and context engineering and staying in the smart zone. But people are calling this RPI, and there's nothing I can do about it. So just be wary. There is no perfect prompt. There is no silver bullet. If you really want a hype word, you can call this harness engineering, which is part of context engineering. It's how you integrate with the integration points on Codex, Claude, Cursor, whatever, and how you customize your codebase.

So what's next? I think the coding agent stuff is actually going to be commoditized. People are going to learn how to do this and get better at it. The hard part is going to be: how do you adapt your team and your workflow and the SDLC to work in a world where 99% of your code is shipped by AI? And if you can't figure this out, you're hosed. Because there's kind of a rift growing, where staff engineers don't adopt AI because it doesn't make them that much faster, and junior and mid-level engineers use it a lot because it fills in skill gaps, and then it also produces some slop, and then the senior engineers hate it more and more every week because they're cleaning up slop that was shipped by Cursor the week before. This is not AI's fault. This is not the mid-level engineer's fault. Cultural change is really hard, and it needs to come from the top if it's going to work.
So if you're a technical leader at your company, pick one tool and get some reps. If you want to help, we are hiring. We're building an agentic IDE to help teams of all sizes speedrun the journey to 99% AI-generated code.