The Friction is Your Judgment — Armin Ronacher & Cristina Poncela Cubeiro, Earendil

AI Engineer

26m 28s · 3,543 words · ~18 min read
Auto-Generated

[0:15]Morning, thanks for having us. Today I want to talk with Cristina about friction a little bit. This is a social preview that came up automatically when someone submitted an issue. It's a forum post that goes with a security incident: a configuration change that was deployed accidentally caused the problem. And the social preview had the marketing tagline of that company, which said, "ship without friction." We want to encourage you to add a little bit of friction to it, and I want to tell you why. So, who are we? I've been doing software development for 20 years, most of it in the open source space. I created Flask, a Python framework, which ironically is so deep in the weights that a lot of people are learning about it now because the machines keep producing it. I left Sentry, my previous company, in April last year, which perfectly coincided with me having time and, obviously, Claude Code. So I fell deep into a hole of engineering, I started writing on my blog, and a lot of people reached out to me over the last year, all excited about this. Then in October I started, with a friend, a company called Earendil, where we are trying to make sense of all the AI things.

Yeah, and my name is Cristina, and I work with Armin at this company called Earendil. But importantly, I am what I like to call a native AI engineer, and what that basically means is that these tools have been around longer than I have been an engineer. They've been foundational in how I've become a software engineer, not just because I obviously use them to work, but because they are the means by which I've learned to do what I do. Before Earendil, I was working at Bending Spoons. So we want to share a little bit from practice, not just theory, but I will readily admit that I don't think we have all the solutions.
So we have been building with, or on, agents for a good 12 months. We had huge leverage and great disappointment. And we really keep running into two types of problems. Especially if you listened to some earlier talks at this conference, you will have heard a lot about how you should keep using your brain. For some reason, that's really, really hard. There's always just one more prompt, and we don't sleep that much. What is it that actually makes it so hard?

[3:04]And would it be that hard if the machines were actually writing perfect code and we didn't have to think quite as much? What is it? Is there something we can do to make this a little bit better?

[3:30]So I'll begin by introducing the first of these problems, the psychology problem. And what I want to talk about first is the shift. I'm sure a lot of us here who have been playing with these tools for a while experienced this at some point: we were prompting and prompting, not so good, and then suddenly it clicked, and they were really, really useful for us. It was fun in the beginning, and they gave us a lot of extra time, right? Because not everyone was using them, they were tools that made us more productive and made it more fun to do our jobs. But very quickly, because they were so useful and got us so hooked, everyone was using them. And that had the opposite effect: suddenly the baseline expectation was that everyone is using them, and you have to use them. So this fun and free time translated into pressure. Now we all have to ship faster and produce more code, and it is just not sustainable to review and to actually have time to think.

[4:51]And so this leads us to the trap, and I actually think there are two parts to it. One of them a lot of engineers have spoken about: these tools are super addictive. You never know if that next prompt is going to be the one that makes your product work and adds a new feature, or if it's going to be the last drop of slop that brings your product crashing down. So it's very addictive, and we keep doing what we're doing. That's not a great situation, but more importantly, and I don't think we realize this as much, because we produce a lot of output very fast, we are tricked into thinking that we're actually being more efficient, doing more work. And it's quite the opposite, because now we don't have as much time to actually stop, think, and design what we're doing, to ask ourselves: is this the best way to implement this, or could I be doing something better? When you're in this flow, it's very difficult to stop yourself, and it's definitely very difficult for your agent to stop, because it's running around reading files it should never even have read. So we are the ones who need to actually have the agency and be in control here.

[6:26]And one thing that, when you start scaling this from one person to an engineering team, actually took me quite a while to realize, is that it really changes the composition of the engineering team. We were supply-constrained on the creation of code, and so the balance between writing code and reviewing code in engineering teams was usually quite decent. Now every engineer has a multiple of producing power compared to their reviewing power, and so obviously we are piling up pull requests. But we are also slowly starting to extend the total number of humans in an organization who participate in the engineering process. I talked to a lot of engineers over the last year, and increasingly one of the things that came up is: now I have marketing people shipping code. I have CEOs who used to be engineers now shipping code again. But the roles those people have in their companies mean the responsibility doesn't rest with them; the responsibility still rests with the engineering team. And so the total number of entities, both humans and machines, participating in the code creation process outnumbers the ones that can carry responsibility. We're not at the point where the machine can be responsible for code changes. That has led to more and more code reviews being skipped or rubber-stamped, and to the goal of small PRs, which we want to see again so that the reviewing process works. This amplification is something that, at the very least, we need to recognize.

[8:40]And so when you get this pull request that looks really daunting and has 5,000 lines of code in it, that is actually when you should be thinking, and that's exactly when it's the most overwhelming, and increasingly we're tapping out of it.

[9:00]On the engineering side, what we're doing is creating larger pull requests. We're creating these massive changes because it is free now, right? And if you think about how the agents work, they're really optimized for producing code that runs. Their main objective is: write some code, run the tests, make some progress. The reinforcement learning bakes this in. And so the agents write the kind of code that you, as a human learning software engineering, wouldn't necessarily write. For instance, you see quite a bit of code that tries to read a config file and, if it can't, loads some defaults. As an engineer you know that's actually not great, because I might not notice that I'm running on the default config, and I might only discover that I have a massive problem two hours later, when I've already written database records with wrong data. These machines optimize toward making progress, toward shipping stuff, toward unblocking themselves. As a result, they create many more failure conditions than human-written code normally would, in part because as a human you feel bad when you write code like this; something builds up emotionally in you. The agent doesn't have that. It doesn't feel anything. And if you create these services that are hobbling along, willing to recover from every local failure, you actually create very, very brittle systems. It also means you very quickly create a codebase of a size and complexity that the agent itself can no longer dig itself out of. It starts not reading all the files that it should; it creates code in a new file that already exists somewhere else.
And so this entire machinery over time creates much more entropy in the source code than you would normally have if humans were on it. A big part of this is that humans feel bad, and agents don't have any emotions that they communicate to you.
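The config-fallback pattern described above can be sketched in a few lines of Python. Both function names and the default values here are hypothetical, a minimal illustration of the contrast rather than anything from the talk: the lenient version hobbles along on defaults, while the fail-fast version surfaces the problem immediately.

```python
import json
from pathlib import Path

# Agent-typical pattern: silently fall back to defaults when the config
# is missing or broken. The service keeps running, but possibly against
# entirely the wrong settings, and nobody notices until much later.
def load_config_lenient(path: str) -> dict:
    try:
        return json.loads(Path(path).read_text())
    except (OSError, json.JSONDecodeError):
        # Hypothetical default: may silently point at the wrong database.
        return {"db_url": "sqlite:///dev.db"}

# Fail-fast alternative: a missing or unparsable config is a loud,
# immediate error instead of a deferred data-corruption bug.
def load_config_strict(path: str) -> dict:
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"config not found: {path}")
    return json.loads(p.read_text())  # let parse errors propagate
```

The strict version turns "two hours of wrong database records" into a crash at startup, which is exactly the kind of friction the talk is arguing for.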

[12:07]But as Armin likes to say, don't worry, not all is lost. We have found some correlation between what the agents really excel at and the types of codebases we actually put them to work in. The main example here is libraries versus products. What we found is that for libraries they tend to excel a lot more, and this makes sense, because intrinsically, when you're building a library, you tend to have a very clearly defined problem that you're trying to solve. Most of the time you can even map the set of features you want to build onto the API surface, and it has very tight constraints. And because this is something you probably want other people to build on top of, it's likely to have a very simple core that you can then plug into. Products, on the other hand, and perhaps this is a bit unlucky for the rest of us, because we're probably all more into building products, are much harder. There are so many interacting concerns and components: you have your UI, your API responses, different permissions depending on feature flags, billing, and so on. There's very heavy intertwining between different components. What this means is that the agent has no way to fit all of this into its context window, no way to actually understand the entire global structure. So locally, the agent tends to be very reasonable, but at the global scale, it becomes a bit demented.

[14:24]So what we're proposing here is that, just as with any kind of system design in the past, your codebase has now become infrastructure. And as such, you have to design it so that it is also legible for the agent and it can make the most of it. This is what we're proposing: an agent-legible codebase. One of the main points, which I'm sure is very clear to all of us, is modularization, having distinct components. This makes it easy for the agent to add a feature in one spot without corrupting everything else. But importantly, this also means modularizing your code flow itself. For example, I've been working on some refactoring where I'm building somewhat of an AI assistant, and for me it was super important to understand which steps of my code are actually the main points. Say you get a user message, then I pass the message to the agent's loop, and then I have to deal with the output. Where these points were very clearly defined for me, the code was not as messy. But it happens to be that between these points, between these steps, is where the agent tends to add the most fuzz. It will be parsing between different types, adding things to state that shouldn't be in state. And you end up with behaviors that you didn't want to support, that are unexpected and can be quite dangerous. Another point is trying to follow all of the known patterns, because I think we all know by now there's no point in fighting the RL, the reinforcement learning. The more we can lean into it, the better our output is going to be, and it's also more scalable down the line. Then, as mentioned with libraries: if you have a simple core and you push the complexity to other abstraction layers, it's going to be easier for both you and the agent to read your codebase.
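The three steps above (get user message, run the agent loop, deal with the output) can be sketched as a pipeline with explicitly typed hand-offs. Everything here is invented for illustration (`UserMessage`, `run_agent_loop`, and so on are not from any real assistant); the point is only that frozen, typed boundaries give an agent nowhere to quietly smuggle extra state between steps.

```python
from dataclasses import dataclass

# Each boundary between steps is an explicit, immutable hand-off, so an
# agent editing one step cannot silently widen the seams between them.

@dataclass(frozen=True)
class UserMessage:
    text: str

@dataclass(frozen=True)
class AgentResult:
    reply: str

def receive(raw: str) -> UserMessage:
    # Step 1: normalize input at the boundary, and nothing else.
    return UserMessage(text=raw.strip())

def run_agent_loop(msg: UserMessage) -> AgentResult:
    # Step 2: stand-in for the real agent loop. It takes a UserMessage,
    # returns an AgentResult, and touches no shared state.
    return AgentResult(reply=f"echo: {msg.text}")

def deliver(result: AgentResult) -> str:
    # Step 3: the only place output handling happens.
    return result.reply

def handle(raw: str) -> str:
    return deliver(run_agent_loop(receive(raw)))
```

With frozen dataclasses, any attempt to stash extra fields on a message in flight fails immediately, which keeps the "fuzz between the steps" mechanically out.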

[17:17]And no hidden magic. For example, using React server actions, or using an ORM instead of raw SQL, hides intent from the agent. And if the agent can't see something, it truly cannot respect it. To be more precise, these are examples of the mechanical enforcement we have been using at the company, most of which we achieve with linting rules. The main example would be: no bare catch-alls. (Imagine there's an example here; the agent found a bare catch-all, went "oh no, this is bad," and edited it out.) We also try to have our SQL always go through one query interface, so that the agent doesn't have to go hunting around the codebase finding all the different places, because if it misses one, you can get breaking behavior, and again, that's dangerous. We try to have one primitives component library for the UI, and no raw input boxes, for example, so that we always have one kind of styling, very consistent, one kind of behavior. We don't have any dynamic imports. And, this may not sound as important, but we actually enforce unique function names. The reason for this is not just legibility for you and the agent; it's also token efficiency. If your agent is grepping for a specific feature or something in your codebase and it only gets one result, it's going to be much better at continuing with the loop. We've also recently started exploring something called erasable-syntax-only TypeScript mode. What this does is that your code is basically JavaScript with the type annotations on top. That means there's no transpilation indirection: there's one source of truth between your actual code and what runs. So when the agent is looking for errors, it doesn't have this confusion of "oh my God, what am I looking at?"; it is much better at finding them.
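As one way a "no bare catch-alls" rule can be enforced mechanically, here is a minimal sketch in Python, whose bare `except:` is the equivalent of a bare catch-all. The function name is made up for illustration; in practice an off-the-shelf linter rule (e.g. flake8's E722) does this job, but the sketch shows how little machinery the check needs.

```python
import ast

# Walk the parsed source and report the line number of every bare
# `except:` handler, i.e. an ExceptHandler node with no exception type.
def find_bare_excepts(source: str) -> list[int]:
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]
```

Run as a pre-commit or CI step, a check like this gives the agent the mechanical "oh no, this is bad" signal at write time instead of relying on someone noticing in review.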
And so the goal really is to get into this loop somehow: get the agent to produce as good code as it can, but you really need to find a way to feel the pain that the agent doesn't feel, and you need to be woken up when you should be looking at something. One of the things we have been doing is building a pi extension for our review needs, where we separate out the kind of input that normally would go back to the agent: mechanical bugs, places where it clearly violated AGENTS.md. But then we specifically call out the kinds of changes where the human's brain should reactivate. We don't think a database migration should ever go in without a human making a judgment call on it, because it very much depends on the locks, the size of the data in production. If there are permissioning changes, you'd better think about those yourself rather than leave it to the agent, because they can be underdocumented. These are just some examples where we learned: if we miss it, we regret it. And you will miss things, but at least these machines can help you find them, and then you see it and you actually get a little bit of a hit, like: now I have to kick into gear and do something here. This is what it looks like in pi. On the bottom, you have the human call-outs; on the top, what goes back to the agent. Basically, if you were to end this review and select "fix the issues," the agent would go back and automatically act on the first two. But this is the moment where I will now go and check: is this a dependency I actually want to have in this codebase? Do I like the maintainers? Does this work for me? We all obviously like the speed; it's addictive, it's great, we feel there's a lot of productivity. But it is so devious if you start relying on that speed where you really shouldn't.
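The routing idea described above, mechanical findings going back to the agent while judgment-heavy categories wake the human, might be sketched like this. The categories and the `Finding` type are entirely hypothetical, not pi's actual data model; the sketch only shows the split itself.

```python
from dataclasses import dataclass

# Hypothetical categories that should always require a human judgment
# call, per the examples in the talk: migrations, permissions, deps.
HUMAN_JUDGMENT = {"db_migration", "permission_change", "new_dependency"}

@dataclass(frozen=True)
class Finding:
    category: str
    message: str

def route(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    # Mechanical findings loop straight back to the agent; everything in
    # HUMAN_JUDGMENT is surfaced as a call-out that blocks on a person.
    to_agent = [f for f in findings if f.category not in HUMAN_JUDGMENT]
    to_human = [f for f in findings if f.category in HUMAN_JUDGMENT]
    return to_agent, to_human
```

The design point is that the human-judgment list is a deny-list you grow from regret: every "if we miss it, we regret it" incident adds a category.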
And so I can only encourage you to find the areas where you have the feeling that it's actually not positive. For me, a lot of the positive is reproduction cases: when a customer reports an issue, I can have the agent reproduce it perfectly, and I have a really good starting point. Exploring different product directions is great too, as long as I don't commit myself to shipping the code it generates. All of that is great. But on the other hand, system architecture, creating reliability in a system: they're just not very good at that, because we really still have to go slow there. There is so much mess that can appear in a codebase in so little time. Mario was already talking about this earlier, but we forget that we are producing months and months of technical debt in the span of weeks, sometimes days. And it becomes so much harder to actually understand what's going on in the codebase.

[24:00]When the understanding of your own code drops, it is really, really hard, and it's also psychologically hard. I've found pieces of code that didn't actually work in production, and I was kind of frustrated to learn that I was the one who committed them with the agent and just didn't really see it. It's a very disappointing experience when it happens, and then you realize that you were the one who screwed up. So it is psychologically incredibly hard to judge the state of the codebase objectively, and the only way right now is to really slow down a little bit on that front. And this friction: I know that every engineering team I've ever worked on said we need to get rid of the friction in shipping, and that is true. There's a lot of stuff that's very, very annoying and shouldn't be there. But if you have worked on large enough engineering efforts, SLOs are a great example of a system that is intentionally designed to put friction into the engineering process to make you think. Do I need this reliability? Do I need this criticality of service? Am I sufficiently staffed to run it? With the agents, we have now gotten to this idea that we should get rid of all of this, when in reality we need some of it, because the friction, in many ways, is what's necessary on a physical level to steer. Without friction, there's no steering, and steering is really necessary. So you should put a little bit more of a positive association onto this idea of friction, because this is really where the judgment is. This is where the experience is, and you should be inserting it and start feeling it. Thank you.
