[0:15]Hey there, I'm Mario. I built Pi in a world of slop, and this is a tragedy, a tragedy in three acts. Just to mention this real quick: a bunch of people on the internet gave me money for ad space on my torso, and all of that goes to charity, so, yeah, thanks, guys.

So, act one, building Pi. In the beginning, there was Claude Code, and it was good, right? We all basically got catnipped by that thing and stopped sleeping. There was a bunch of stuff before that, but Claude Code was the thing that clicked with me the most. And to preface all of this: I love the Claude Code team, they're brilliant, talented, super-high-velocity people, and they created this entire game, so major props to them. This is not a roast, this is just me, an old man, telling you why I stopped using Claude Code and built my own thing. I started using Claude Code in about April 2025, I think, thanks to Peter, because he told us the agents are working now. Back then it was simple and predictable and fit my workflow, but eventually the token madness got hold of them, I think, and the team got bigger and they started dogfooding the thing and built a lot of features, a lot of features I don't need, which is fine, I can just ignore them. But with velocity and more features come more bugs, and that's bad, because I used to work on construction sites, and if my hammer breaks every day I get really mad, and if my development tools break every day, I also get mad. So there was this, it's just a running gag now: here's Tariq telling us that Claude Code is now a game engine, and here's Mitchell from Ghostty telling us, no, it's not. Eventually they fixed the flicker, but then other stuff broke, and I think they're now on the third iteration of a TUI renderer.

Yeah, but that's just a symptom. The real problem is that my context wasn't my context. Claude Code is the thing that controls my context, and behind my back Claude Code does things to the context. You have the system prompt, which changes on every release, including the tool definitions: they would remove tools, modify tools. Not good. They would insert system reminders in the most opportune places in your context, telling the model, here's some information, it may or may not be relevant to what you're doing. It literally says it may or may not be relevant to what you're doing. And that kind of confused the model and broke my workflows.
[2:34]On top of all that, there's zero observability, because that's how the tool is constructed, and I like knowing what my agents are doing. There's zero model choice, which is obvious: it's the native Anthropic harness, so it makes sense for them to want you to use Claude, right? And there's almost zero extensibility. Some of you might have written hooks for Claude Code, but I'm telling you, the number of hooks is small and they don't go very deep. And every time a hook triggers, what actually happens is that a new process gets spawned to execute the command you specified for that hook, and I don't find that particularly efficient. So I took a step back and looked around for alternatives, and I'd especially like to call out Amp and Factory Droid, the Porsche and Lamborghini of coding agent harnesses. If you can afford them, please use them; they're out at the front here, they're really good and the teams are fantastic. There's a bunch of other options, and I have history in OSS, so naturally I gravitated towards OpenCode. Again, brilliant team, super-high execution velocity, and they don't sell you hype, they sell you tools that work, for the most part. I started looking under the hood of OpenCode with respect to context handling, because that's the most important part for me, and I found a bunch of things. For example, under certain conditions OpenCode would just prune tool outputs past a minimum number of tokens, and that basically lobotomizes the model. There's also LSP server support, which means every time your model calls the edit tool, OpenCode goes to the connected LSP server, asks whether there are any errors, and if so injects them into the edit tool result. Which is bad, because think about how you edit code. You don't write a line of code, check the errors, write the next line, check the errors; you don't do that. You finish your work and then you check the errors. This confuses the model. There's a bunch of other things, like storing individual messages of a session in JSON files: each message is a JSON file on disk. And there was this, and this happens to all of us, no blame there, but it's not great if, by default, a service spins up with CORS headers set such that any website you open in your browser can access your OpenCode server. Yeah.

And entirely unrelated to all of this, I started looking into benchmarks for coding agent harnesses and found Terminal Bench, which is a pretty good benchmark, all things considered. The funny part is its reference harness, Terminus, which is the most minimal thing you can think of. All it gives the model is a tool to send keystrokes to a tmux session and read that session's output; a rough sketch of that tool surface is below. There are no file tools, no subagents, none of that stuff, and it's one of the best-performing harnesses on the leaderboard. Here's the leaderboard from December 2025. Irrespective of model family, Terminus mostly scores higher, even higher than that model's native harness. So what does that tell us? I formed two theses. First: we are in the fuck-around-and-find-out phase of coding agents, and their current form is not their final form, right? Second: we need better ways to fuck around, and for me that means self-modifying, malleable agents. Things that the agent itself can modify, and that I can modify, depending on my workflow.
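To make that concrete, here's a minimal sketch of roughly what that tool surface amounts to, in TypeScript. The function names, the session name, and the overall shape are mine for illustration, not Terminus's actual code; the only real pieces are tmux's send-keys and capture-pane commands.

```typescript
// Illustrative sketch of a Terminus-style tool surface (not its actual code):
// one function to send keystrokes to a tmux session, one to read the pane back.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);
const SESSION = "agent"; // hypothetical tmux session name

// Send literal keystrokes to the session, followed by Enter.
export async function sendKeys(keys: string): Promise<void> {
  await run("tmux", ["send-keys", "-t", SESSION, keys, "Enter"]);
}

// Capture whatever is currently visible in the session's pane.
export async function readPane(): Promise<string> {
  const { stdout } = await run("tmux", ["capture-pane", "-p", "-t", SESSION]);
  return stdout;
}
```

That's essentially the whole harness: no file tools, no subagents; anything else the model wants to do, it does by typing shell commands into that session.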
So I stripped away all the things, built a minimal core, but made it super extensible and made it so the agent can modify itself, with some creature comforts; it's not entirely bare bones. That's Pi. It's an agent that adapts to your workflow instead of the other way around. It comes with four packages: an AI package, which is basically just an abstraction across providers and context handoff between providers; an agent core, which is just a while loop and the tool calling; a bespoke TUI framework, because I come out of game development, so I built a thing that actually doesn't flicker too much; and the coding agent itself. Here's Pi's system prompt. That's it. Eventually the industry created a new standard called skills, which is basically just markdown files, so we added that as well, and that needs to go into the system prompt. So, begrudgingly, we had to add a couple more lines. And finally, here's the magic that makes Pi able to modify itself. We ship the documentation, which was handcrafted by me and an agent, and code examples of extensions, and all we need to do for the agent to modify itself is tell it: here's the documentation, here's some code that shows you how to modify yourself by writing extensions. It comes with four tools, that's all it has: read, write, edit, bash. Here are the tool definitions. Don't read the text, just look at the size. That's it. Here's what happens when you start a new session in one of these tools. The thing is, the models are reinforcement-trained up the wazoo, so they know what a coding agent is, because a coding agent harness is basically what they're trained in when they're post-trained. You don't need 10,000 tokens to tell them they're a coding agent; they know, because they are coding agents now. Pi is also YOLO by default, because my security needs are different from yours, and I don't think a little dialog that pops up every time the agent calls bash, asking you to approve, is a smart security mechanism. So instead, I give you so much rope that you can build anything that fits your specific security needs. There's also stuff that's not built in. I'm a heathen, because this is how I do it. But if you don't like that, you just ask Pi to build you sub-agent support, or plan mode, or MCP support, whatever you need.

Extensibility comes with a bunch of table stakes, and then with the extensions themselves, and extensions in Pi are just TypeScript modules. In the simplest case it's a TypeScript file on disk; you point Pi at it and say, here's an extension, load that as part of the harness. With that, you get basically an extension API that lets you hook into everything and define stuff for the harness to expose to the model. That includes tools and slash-command shortcuts; you can listen in on any kind of event and react, and then save state in the session, which is optionally provided to the agent as well, or stored there for tools that analyze sessions as part of your organizational workflows. You can do custom compaction, custom providers, and you have full control over the TUI, so you can modify everything in Pi. And you can bundle all of that up and put it on npm or on GitHub, because I think we don't need to reinvent yet another bunch of silos called marketplaces; we already have package managers. And all of that hot-reloads.
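To make "just a TypeScript module on disk" concrete, here's a rough sketch of what such an extension could look like. The API names used here (ExtensionApi, registerTool, registerCommand, onEvent, sessionState) are illustrative guesses, not Pi's actual interface; the real surface is in the documentation and examples that Pi ships.

```typescript
// notify.ts — illustrative sketch of a Pi-style extension. The ExtensionApi shape
// below is hypothetical; it only exists to show the kinds of hooks described in the talk.
import { execFile } from "node:child_process";

export default function activate(api: ExtensionApi) {
  // Expose an extra tool to the model.
  api.registerTool({
    name: "notify",
    description: "Send a desktop notification to the user.",
    parameters: { message: { type: "string" } },
    async execute({ message }) {
      execFile("notify-send", ["Pi", message]); // Linux-only example
      return "notification sent";
    },
  });

  // Add a /summary slash-command shortcut for the user.
  api.registerCommand("summary", async () => {
    await api.sendUserMessage("Summarize what you changed in this session.");
  });

  // React to events and stash state alongside the session.
  api.onEvent("tool-call-finished", (event) => {
    api.sessionState.set("lastTool", event.toolName);
  });
}

// Hypothetical API shape, just to keep the sketch self-contained.
interface ExtensionApi {
  registerTool(tool: {
    name: string;
    description: string;
    parameters: Record<string, unknown>;
    execute(args: { message: string }): Promise<string>;
  }): void;
  registerCommand(name: string, handler: () => Promise<void>): void;
  onEvent(name: string, handler: (event: { toolName: string }) => void): void;
  sendUserMessage(text: string): Promise<void>;
  sessionState: { set(key: string, value: unknown): void };
}
```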
So if you develop an extension for Pi, you do it in the session, hot-reload the changes, and see the effects immediately, which is great. That's also a game development thing: in game development you want very fast iteration, and you get that here. So, a couple of examples. Claude, or Anthropic, ships a slash command, by the way, that lets you talk to the agent while it goes off on its main quest. I posted a little prompt for that on Twitter, jokingly, and somebody built it in five minutes, with more features, and they didn't have to fork or clone Pi; they just let the agent write the extension based on the prompt. Here's Nico, one of the most prolific extension writers. I don't know what the fuck is going on here; it's a chat room for all of his Pi agents and they talk to each other. I would never use this, but all of it is custom, including the UI. Or you can play NES games, or you can play Doom. And there's a bunch of other examples I'm not going to talk about. So how do you build a Pi extension? You don't. You tell Pi to build it for you based on your specifications, and then you just iterate with it on that and hot-reload during the session. I'm going to skip that example as well. And if you don't like building things yourself, and I hope you do like building things yourself, but if you don't, you can look on npm, or our little search interface on top of npm, to find packages for sub-agents, MCP, and so on. So does it actually work? Well, here's the Terminal Bench leaderboard from October, before Pi had compaction; I added that for Peter's Claw thingy. It scored sixth place. But none of this is actually about Pi. I basically want you to retake control of your tools and workflows. So build your own. And if you want to know more about Pi and OpenClaw, go to this talk, please. Yeah, and then eventually Peter happened: he put Pi inside of OpenClaw as its TUI core, which meant my open source project became the target of a lot of OpenClaw instances, unbeknownst to their users.

So this is act two, OSS in the age of clankers. Clankers are destroying OSS. Here's tldraw; they closed down their issue and pull request tracker. Here are OpenClaw's trackers. Here's mine. Half of that is OpenClaw instances posting garbage, so I started to rage against the clankers. If you send a pull request, it gets auto-closed with a comment asking you to please write a nice issue in your human voice, no longer than a screen's worth of text. If I see that, I write "looks good to me", your account name gets put in a file in the repository, and the next time you send a pull request it's let through. Clankers don't read that comment. They don't go back once they've posted a pull request. So that's a perfect filter. Mitchell eventually turned that into vouch. Here's a clanker. I also label them. If you've had interactions with OpenClaw, your issues get deprioritized. I also built tools where I embed issue and pull request texts into 3D space, so I can see clusters of issues. And I invented OSS vacation: I just close the tracker whenever I want, so I have my life back. So, does this work? Yes, sort of.

Which leads me to act three: slow the fuck down. Everything's broken. And then there are people who say, our product's been 100% built by agents. Yes, we know, it fucking sucks now. Congratulations.
[12:23]And I'm hearing this from my peers, and it's entirely unhealthy. So here's how we should not work with agents, and why, at least in my opinion. I wrote this on my blog a while ago, but the basic gist is this. We're running armies of agents; you're using beads and don't know that it's basically uninstallable malware; Anthropic built a C compiler, and it kind of works, but actually doesn't, and we're hoping the next generation of models will fix it; here's Cursor building a browser, and that's also super fucking broken, but the next generation will fix it. Pinky promise! And SaaS is dead, software is solved in six months, and my grandma just built herself a Spotify with her OpenClaw. Come on, people.

So agents are actually compounding booboos, which is my word for errors, with zero learning, no bottlenecks, and delayed pain. The delayed pain is for you. Here's your code base with one human, with one agent, and with ten agents. How much of the agent code can you review? Here's the same code base, but expressed as booboos per day. How many of those booboos do you think you'll find? Then you say, oh, I have a review agent. Let me introduce you to the wonderful world of the ouroboros. Doesn't work. It catches some issues. The problem is that agents are merchants of learned complexity. Where did they learn that complexity from? From the internet. What's on the internet? All our old garbage code. There are some pearls on the internet, really well-designed systems, but 90% of the code on the internet is our old garbage, and that's what the models learned from. And every decision an agent makes is local, especially if the code base is so big that it doesn't fit into its context, and if you let it go wild it adds intertwined abstractions everywhere. So that leads to abstractions and duplication, and backwards compatibility, who has seen that in the output of their agent? It's fucking annoying. Or defense in depth. So, yeah. You get enterprise-grade complexity within two weeks, with just two humans and ten agents. Congratulations. And then you say, but my detailed spec... Yes, sure, you know what we call a sufficiently detailed spec? A program. So if you leave blanks in your spec, what do you think happens? How does the model fill in the blanks, and with what? It fills them in with what it learned on the internet from our old code, which ranges from garbage to mediocre. And then you say, but humans also... Yes, humans are horrible, failing beings, but they can learn. And they are bottlenecks: there are only so many booboos they can add to your code base per day. And humans feel pain, which is a very interesting property, because humans hate pain. Once there's too much pain, a human has a bunch of options: they can quit their job, they can blame somebody else and make them fix it, or everybody bands together and starts refactoring the shit out of the garbage code base, right? Agents will happily keep shitting into your code base. And no, your AGENTS.md and super-complex memory systems will not save you. Agents don't learn the way we learn. And these are my most beloved people: "I don't even read the code anymore." Congratulations. Something is broken and your users are screaming, so who are you going to call? Not yourself, because you haven't read the code.
So you're relying on your agents, but they're now also overwhelmed, because the code base is so humongous that there's absolutely zero chance they can get all the context they need to fix the issues. And long context windows are a hack, as most of you will find out this year as everybody switches to one-million-token context windows, and agentic search is also failing. So the agent patches locally and fucks shit up globally. If you see this in your code base, you're fucked. So you can't trust your code base anymore, and not your tests either, because your agent wrote your tests. Good game.

So here's how I think we should work. There's a bunch of properties that make a good agent task. The first is scope: if you can scope the task so the agent is guaranteed to find everything it needs to do a good job, you're done. That means modularize your code base. If you can give it a function to evaluate how well it did the job, even better: hill climbing, automated research. Anything non-mission-critical, let it vibe. Boring stuff, let it vibe. Reproduction cases for user issues, which usually come with only partial information: perfect, I don't spend mornings on that anymore. Or, if you don't have a human near you, rubber-duck with it. So there are lots of tasks you can use them for and save time. At the end of that, you evaluate, you take what's reasonable, most of it isn't, and then you finalize.

My final slide, more or less. Slow the fuck down. Think about what you're building and why, and don't just build something because your agent can do it now. That's stupid. Learn to say no; this is your most valuable capability at the moment. Fewer features, but the ones that matter, and then use your agents to polish the shit out of those and delight your users, not your token-maximizing desires. Cap the amount of generated code you need to review. Non-critical code, sure, vibe away. Critical code, read every fucking line. See the keynote after me for more on that. So how do you know what's critical? Any guesses? Well, you read the fucking code. If you do anything important, write it by hand. You can use a clanker to help you with that, but don't let it make the decisions for you, because as we've learned, all the decisions it makes are learned from the internet. And that friction is the thing that builds the understanding of the system in your head, which is important. It's also where you learn new things. All of this requires discipline and agency, and all of this still requires humans. Thank you.



