
What even is an AI Agent?! (The Standup)

The PrimeTime

23m 36s · 4,535 words · ~23 min read
Auto-Generated

[0:00] Guys, have you guys checked out the hottest new site, balls.yoga? What? Yeah, I have checked it out. Adam, check it out. balls.yoga. Check it out. Oh my God, what is this? It's so hot right now, so hot. This is you guys. What in the world? I just sent in the chat the ad we made for Bolt for their hackathon this month. Oh, and it involved Prime being devastated that he was no longer able to make balls.yoga, which we just casually mentioned in the ad, but then we actually made it. You actually made it. So that there would be an Easter egg. Can you believe balls.yoga was available to buy, though? Adam, are you okay? He's watching the ad. Yeah. I, uh, no. Your face looks so disappointed. Oh, did it? I was just reading Twitter. I'm sorry, you guys sent me to Twitter, and then I found that there's some stuff on Twitter. Ooh, the drama. Drama. I just want to read it so badly now. Let's just read it on stream. I don't know if it belongs on your podcast. Like, I don't think that's necessarily the right thing. Can we give a proper intro to the episode before we start? Yeah, yeah. Yeah, yeah, yeah, yeah. Uh, anyways, I'm sorry. Go ahead, Prime. Hey, today we are having on what I would consider two experts, Adam and Dax. Adam is currently looking very disturbed. Sorry. Josh, zoom in. Zoom in, Josh. Zoom in. But they have been building out something called OpenCode, which is going to be an agent for your terminal, and the experience looks immaculate. Adam, you've done a fantastic job. Dax, I don't know what you do. But it looks very, very, very good. And of course, on the podcast, as always, is Teej. Teej, hey, he might be Steve Jobs. Uh, Telescopic Johnson. I don't know what your full name is these days. You keep getting new ones.

[2:18] That's how we started our company. That's actually why we started our company, unironically, by the way. It's the first time someone got hired because of Neovim. You got two, double, two for one. But I mean, the reality is that this type of agent is built for people who want, I assume, the more command-line kind of experience, the ability to have the faster, uh, more creamy experience that way. And so they're going to just kind of walk us through what it takes to actually build an agent. But not only that, some things about OpenCode itself, which hopefully you'll see a bunch of on the stream coming up, because this is really kind of our area that we all seem to enjoy. For whatever reason, all four people on this podcast use Neovim, by the way. Anyway, so that's been a little frustrating, but on the other hand, we're really excited about the direction things are going in. I think the genesis of this is that there have been a lot of really cool tools out there, like Cursor, that a lot of people use. But I don't want to give up Neovim. Like, I like working in the terminal. I don't want to switch my whole IDE. So we're excited about seeing, okay, what can we do for people like us? What's the best possible thing we can put together that's complementary to whatever IDE you choose to use? Nice. Okay, well, that's some good background info, because we're going to be mentioning OpenCode a bunch, probably, while we're talking about this, so that clears up where people should go when you guys are talking about that, at least. So let's get to it: what is an agent, Dax? Everyone's been asking. I got to be honest, I don't really know. Adam, do you want to take this? What's your concept of an agent? Uh, well, I think the term is very overloaded, right? So outside of programming it means a lot of things, probably. Outside of programming? Yeah, I meant within the AI bubble. So in the programming world, it's just this idea of calling an LLM, giving it a bunch of tools that are related to programming, so editing and reading files in your code base, and then just looping and letting it keep calling these tools, making changes, and then coming back with some kind of response. That's kind of the basics, the formula for an agent. If you look at Cursor, it has an agent mode, but then there are other tools in the category, like Claude Code. I think all the tools that fit into this category kind of have that backbone. And in particular the looping is really what we like, the looping and the tools, right? I think we can say agent equals LLM plus tools. Is that fine? Plus loops? Mhm. Plus some sort of system prompt that keeps it on track? I assume there's some sort of secret sauce to getting a good prompt that helps the agent understand that it needs to go through a bunch of files and kind of understand the code base. Is that what it is also? Or understand the task at hand. Yeah, and each of the models responds differently to the system prompts. So there is a bit of that, but it's not like a super sensitive secret.
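To make the "LLM plus tools plus a loop" formula concrete, here's a minimal sketch of that backbone in TypeScript, using Anthropic's official SDK. The single read_file tool, the model ID, and the system prompt are illustrative assumptions, not OpenCode's actual implementation:

```typescript
import Anthropic from '@anthropic-ai/sdk';
import { readFile } from 'node:fs/promises';

const client = new Anthropic();

// One illustrative tool; a real agent would also have edit, bash, grep, etc.
const tools: Anthropic.Messages.Tool[] = [{
  name: 'read_file',
  description: 'Read a file from the current project',
  input_schema: {
    type: 'object',
    properties: { path: { type: 'string' } },
    required: ['path'],
  },
}];

async function runAgent(task: string) {
  const messages: Anthropic.Messages.MessageParam[] = [{ role: 'user', content: task }];

  // The loop: keep calling the model until it stops asking for tools.
  while (true) {
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-latest',
      max_tokens: 4096,
      system: 'You are a coding agent. Use the tools to complete the task.',
      tools,
      messages,
    });
    messages.push({ role: 'assistant', content: response.content });

    // No tool calls in the response means the agent considers itself done.
    if (response.stop_reason !== 'tool_use') return response;

    // Execute every requested tool and feed the results back to the model.
    const results: Anthropic.Messages.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type !== 'tool_use') continue;
      const { path } = block.input as { path: string };
      results.push({
        type: 'tool_result',
        tool_use_id: block.id,
        content: await readFile(path, 'utf8'),
      });
    }
    messages.push({ role: 'user', content: results });
  }
}
```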

[5:51] I mean, you can kind of see all the system prompts out there. I don't really know that any tool can differentiate that hard on those fronts. The tools are pretty set to the model at this point. Claude expects certain tools. At least today; maybe someday all the models will be better at calling tools generically, but right now all these agents have roughly the same set of tools and roughly the same system prompts. It's all the stuff around it that I think we're focused on, like OpenCode having the share page. You can share sessions with people, and you can see the full history of the session. And just having a really great experience. We're going to build a mobile client so you can step away from your machine and continue, you know, conversations while you're on the toilet or whatever. Yeah, I think there's a lot of stuff around the actual agent that makes for a good experience in terms of programming with these things. And I think Dax and I have both done a lot of programming with these things, so we have a lot of opinions. Yeah, I think we're definitely really firmly in the product zone. It's less about pushing the LLM capabilities or finding clever ways to use the LLM itself or making it perform better. There's some of that, but most of it is just packaging it in a way that's nice to use. That's obviously very fun, because when you're working on a product that you yourself use every day, you're just like, oh, I wish this was like this. I wish this was easier. So yeah, it's nice to be able to iterate on that. And then some of the mobile stuff. So I think one differentiator with us is we're really in on this idea of the agent running on your machine with access to your stuff, because that's presumably where your dev environment works the best. Presumably you have everything set up. Some of these tools, like Codex or Devin, work remotely and run in the cloud, which can work, but then you need to recreate your perfect environment in the cloud. Some companies are disciplined and have that, but not everyone does. Nix fixes this, right? Yes, Nix does fix it, but you know, how many people use Nix? So until everyone's using Nix. So until everyone's using Nix. Yeah. Realistically, practically, most people's workable dev environment is local. So we want to do stuff like, oh, we want to have a mobile app that's just connecting to a session on your laptop. You leave your laptop running, you step away, go on a walk, and you can get notifications saying, hey, the agent's done with this, give it feedback. But it's also just running on your laptop, so you don't have to go set up this crazy cloud thing. Have you successfully been able to actually, you know, do a day of prompting while out at the grocery store doing your own thing, just hitting some prompts from your mobile phone? No, we haven't built the mobile client yet. It's something I desperately want. I don't care if anybody else uses it, to be honest. There are just so many times, at this season of life. I'm 38 years old. I have two kids. I'm out of my office a lot. Like, I step out 15 times a day. And to be able to go on a walk and then, when it's done, get a notification, you know, it's doing its loop and it needs input, to be able to continue that conversation without having to be sitting at my desk.
It's just, I want it very badly. And it turns out other people want it too. We've seen some mentions on Twitter. Cursor's even designing a mobile app. So yeah, it doesn't exist yet, but I'm very excited to have it. Okay. So then kind of walk us through what it takes to actually build an agent in some sort of real capacity, not just a weekend-warrior project. Like, why did it take you guys so long to release yours? Obviously you put a lot of work into it, so it must be more than just a weekend project. Yeah, so I think one dynamic for us is we're not coupled to a single AI provider. So we support Anthropic, Google, OpenAI. We actually support the full list of providers available, because we're built on a library that covers most of them. What about, like, Llama? Can I run it locally? Yeah, so, I haven't fully tested this out yet, but there's provider support, and one of the providers is for a local model. So one thing is just testing this stuff across everything.
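The hosts don't name the library, but the Vercel AI SDK is one such library that covers most providers; purely as an assumed example (none of these model IDs or endpoints come from the episode), provider-agnostic model selection, including a local Llama via an OpenAI-compatible server like Ollama, can look like this:

```typescript
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { google } from '@ai-sdk/google';
import { openai, createOpenAI } from '@ai-sdk/openai';

// A local model served over an OpenAI-compatible API (Ollama's default port).
const local = createOpenAI({ baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' });

// One registry, many providers; the agent loop doesn't care which is picked.
const models = {
  'anthropic/sonnet': anthropic('claude-3-5-sonnet-latest'),
  'google/gemini': google('gemini-2.5-pro'),
  'openai/gpt-4o': openai('gpt-4o'),
  'local/llama': local('llama3.1'),
};

const { text } = await generateText({
  model: models['local/llama'],
  prompt: 'Summarize what this repo does.',
});
```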

[10:13] Well, some of us aren't poor, and we use Opus, but sure. You can use Sonnet, that's fine. I use Sonnet because I'm not paying that all the time.

[10:23] He's like, I need a mobile app because I'm going to be doing this while I'm eating vegan sushi. It's literally five times as expensive, I think, so. Yeah. It's quite a difference. I like it more because it's expensive. Like, that's my thing. That is a very Adam thing. Yeah. It's like a Gucci bag or something, right? Like, I get the same code out. It doesn't matter at all. The models are all the same, but mine was made with Opus. Um, I do have a quick question, Mr. Gucci over there. I noticed that you're not subbed on my channel, and so you're talking about spending all this money, but you don't even have five a month. Hang on. Hang on. How do I do it? How do I do it? I was at one time. I haven't been on Twitch in months, I swear to you. Oh, so before, it wasn't even scheduled, because you can set it up to auto-renew, Adam. I think it was. It was at one time. Maybe it's, I got a new card, a new credit card. You know that happens, and then it just bounces. It's crazy, you guys make Adam do all the work and you make him pay you. He's getting the privilege of working with us, Dax.

[11:32] It's funny, I did see an ad a second ago. I've got Twitch chat up here. I was like, why am I seeing an ad? What is this on Twitch? That's why.

[11:42] I'm so tilted right now. I just, I'm sorry. It's so hard to focus. I've seen in Twitch chat that people are confused about the OpenCode AI versus OpenCode naming. It is super confusing. I'm so upset about this. It's so frustrating and annoying. And I want to just focus on our conversation, but it's just very annoying. Do you need a Snickers?

[12:03] I don't think that's vegan-friendly, Teej. I know that's what he needs, though.

[12:10] Um, okay, well, all right, so I mean, I want to keep going on this agent thing. So you were just talking about integration and tools and all this. I would assume that, do you have to bespokely craft every prompt to be able to use every tool, or does it just simply kind of work out of the box? You're just like, hey, you should search for this thing now. Or how does it use reasoning, in the sense that you're like, hey, what do you do? And then it's like, this is what I should do. And then you're like, okay, now follow step one. How does it start using these tools? So roughly, what the model is good at is: you give it a description of the tools and a list of tools, and you give it a task, and it's very good at going in a loop. It's like, I'm going to figure out what to do. I'm going to tell you what tools to call. Call them, give me the results. I'm going to do that. So on the models where this works, that does work really well. There is optimization, though. There always is. So the tool descriptions, like the description of each tool: how do you describe it? What's the schema for the input? Is that confusing? Does it get tripped up on things here and there? So what we do is, for, say, something like Anthropic, you know, they have their own version of this called Claude Code. We just dump everything out from Claude Code, and when you're using Anthropic, we use all the exact same tool descriptions and stuff. So on one hand, I'll say it does make a difference. If you just do a very naive approach, you're probably going to get worse results. I personally don't think it's the thing that makes something 10x better. I think it's a thing that you can kind of play with and optimize. But also, the end goal is that you don't really want it to have to be that precise, because you can bring your own tools. You can say, here's a tool to access my database, here's a tool to do X, Y, Z things. It needs to be able to be flexible. So we think it'll just kind of get better along those lines anyway. Can you give an example of how you hooked it up to, like, LSP, for example? Because it feels like that would be kind of illuminating, a particular example of how it gets access to that. And how does it even know to call that? Yep, so with LSP, we experimented with a few different approaches to this, but the one that we're sticking with for now is: whenever it makes an edit to a file, we have a tool that's like an edit-file tool. It sends us a patch, like an old string and a new string in the file, and we'll replace it. That's how it edits files. When it does that, the response to that tool will include any diagnostics that we found from LSP. So if it's a TypeScript file, we'll hit the TypeScript LSP that's running. We know that after this edit there are these three errors, and we say, here are the errors, please fix, and you'll see it instantly respond with a fix. And this really helps with hallucinations, because when it thinks there are functions that don't exist in our library or any of those things, it corrects itself right away, and it's quite good at responding to that feedback. So wait, how does it kick all this stuff off?
Because does that mean you actually run the LSPs yourselves? Okay, so that means whenever I open a project, if I have OpenCode plus Vim open, I may have two TypeScript servers. I may have two gopls instances, or whatever they're called, and all the other ones. Okay.
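A minimal sketch of the edit-tool-plus-diagnostics flow Adam describes, written against the Vercel AI SDK's tool() helper (an assumption about their stack, not OpenCode's actual code), with a hypothetical getDiagnostics standing in for the real LSP query:

```typescript
import { tool } from 'ai';
import { z } from 'zod';
import { readFile, writeFile } from 'node:fs/promises';

// Hypothetical stand-in: a real implementation would ask the running
// language server for this file's diagnostics after the edit.
async function getDiagnostics(path: string): Promise<string[]> {
  return [];
}

// An edit tool shaped like Dax describes: old string in, new string out.
const editFile = tool({
  description: 'Replace oldString with newString in the given file',
  parameters: z.object({
    path: z.string(),
    oldString: z.string(),
    newString: z.string(),
  }),
  execute: async ({ path, oldString, newString }) => {
    const source = await readFile(path, 'utf8');
    await writeFile(path, source.replace(oldString, newString));

    // Include fresh LSP diagnostics in the tool result so the model sees
    // new type errors immediately and can correct its own hallucinations.
    const errors = await getDiagnostics(path);
    return errors.length > 0
      ? `Edit applied. Diagnostics:\n${errors.join('\n')}`
      : 'Edit applied. No diagnostics.';
  },
});
```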

[15:13] Yeah, so we originally were looking to see, can we hijack any running ones? Because maybe you already have them open and you already have these LSPs configured. But you can't, because they're standard input. Teej is shaking his head: big no for me, dog. That's a big no. So we run our stuff in parallel, and there's also no configuration. We ship out-of-the-box support for a bunch of things, and everything just downloads and runs. Because you don't need absolute precision for your exact LSP configuration that you like; it's more just giving the LLM something to work with. And then, you said that, at least the way you guys have it, it's part of the edit-file thing. So it's like, okay, I know that I need to go and edit. Does it also, for, like, move a file or delete a file, does it happen on every kind of file op? Because I'm just interested to know how you put LSP into everything. Or is it just edits? That's a good point; we probably should give it information whenever it changes. All right, make an issue right now. Well, Open Source Contributor. Let's go, Teej. Yeah, um, needs to do LSP diagnostics on file add, delete. Boom, created. We do it on write. So when you create a file, we do show the diagnostics to the tool. Or, we return the diagnostics. So edit and write, we're already returning them. I guess a move, I don't know. I guess I haven't done much with it where it's running, like, a Bash move command. But yeah, we also considered, and we have this in there but disabled for now, a tool that just returns diagnostics. So it can choose, hey, I want to look at diagnostics right now, and it can query for them. And we have seen it use that a bunch, but it wasn't well tested, so we're not shipping with that right now. Yeah, the models today are still very tuned to call specific tools. We've played with a lot of tools, and you can hand it a bunch of tools it's never seen before and it just doesn't call them. There's something to the post-training process being catered to certain sets of tools. So Anthropic's are really Claude 4, and Claude 3.7 before that. Those models are the best at calling tools from a programming standpoint. They'll actually keep trying and going for it. Other models can be really smart, like Gemini 2.5, but it doesn't call tools very eagerly. So there is still this phase we're in right now where you kind of have to provide the set of tools that the model expects. I don't think that'll always be the case. But we've definitely given it a bunch of LSP tools. I've played with, you know, giving it go-to-definition and find-references, things like that, and it just doesn't use them. I mean, you can get it to use them if you ask it to. But it doesn't default to kind of thinking that way.
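Why can't a second client just attach to your editor's language servers? Because LSP servers speak JSON-RPC over the stdin/stdout of whichever process spawned them, so each client has to run its own. A minimal sketch of spawning one and framing the first message (typescript-language-server is assumed to be installed; this is not OpenCode's actual code):

```typescript
import { spawn } from 'node:child_process';

// Each client owns its own server process; there's no shared socket to hijack.
const server = spawn('typescript-language-server', ['--stdio']);

function send(message: object) {
  const body = JSON.stringify(message);
  // LSP wire format: a Content-Length header, a blank line, then the JSON body.
  server.stdin.write(`Content-Length: ${Buffer.byteLength(body)}\r\n\r\n${body}`);
}

// The mandatory handshake before any diagnostics can flow.
send({
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: {
    processId: process.pid,
    rootUri: `file://${process.cwd()}`,
    capabilities: {},
  },
});

server.stdout.on('data', (chunk) => console.log(chunk.toString()));
```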

[18:58] I think that'll change. I think more models will get better at calling tools, and it'll have other options. So, I haven't built a model, so I just have kind of follow-up questions. But here, Dax, why don't you go, because I still want to keep asking questions about this. Yeah, so one last thing on that. Um, I just totally blanked on what I was going to say. It's okay. Thank you for interrupting him, Prime. Yeah. It's okay. I forgot. I don't know if I interrupted him. We kind of, like, raced towards it. I mean, Adam is still so upset about Twitch right now. I remember, I remember. I'm gonna go. So one of the missing pieces here is you can add stuff to the system prompt and say, hey, use this tool. Are you waiting for Claude to load or something? What's going on? Yeah, it was buffering. I need my cheat sheet.

[19:45] Uh, one day. It's hard to tell right now, when you're making something better, whether you're making something else worse, because this LLM is such a black box. It's not like a deterministic system you can see the inside of. So what is missing, and it's going to be the next major thing we work on, is a set of consistent benchmarks that we can run whenever we make changes to system prompts. And these benchmarks are probably not going to be very quantitative. We're thinking that we're going to come up with a very real-world-looking code base. We'll have a bunch of features that need to be implemented, and we'll have a bunch of standard prompts to have it do that. And we have this nice way to see, for anything we do, every single thing that happened diagnostics-wise. So when we make a change, we can run it through this and see the output and evaluate it somewhat qualitatively. So it's still going to be a qualitative thing, like, did it get better or worse? But at least we'll have a consistent set of things we're running. Right now we're kind of flying in the dark. We're like, yeah, this made it better, but I don't know if some other person's having a horrible experience because of this change. So you're saying now you're doing not only vibe coding, but vibe testing, Dax. Vibe testing, vibe benchmarking, yeah, whole new world. It's really badly needed, because there are so many benchmarks for the models, which everyone has opinions on. There are no benchmarks, Adam. Well, I don't know. They're kind of worthless, I guess. But why? Why are they worthless, Adam? Say why. Why are the model, I mean, a lot of strong opinions. The model benchmarks just don't seem to correlate with actual real-world usage. I mean. Why? I don't know. Cuz they're training to hit the benchmarks. Are you guys quizzing me? What is going on here? I'm trying to get you to say that they're all benchmarking Python.
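A sketch of the kind of harness Dax describes: a fixed fixture repo, a fixed set of real-world prompts, and one captured transcript per run for qualitative side-by-side review. Everything here, runAgent, the prompts, and the paths, is hypothetical:

```typescript
import { mkdir, writeFile } from 'node:fs/promises';

// Hypothetical stand-in: run the agent against a fixture repo and return the
// full session transcript, including tool calls and LSP diagnostics.
async function runAgent(prompt: string, repo: string): Promise<object> {
  return { prompt, repo, steps: [] };
}

// A fixed, real-world-looking set of tasks; the same prompts on every run,
// so before/after comparisons are at least consistent, even if judged by eye.
const prompts = [
  'Add pagination to the /users endpoint',
  'Fix the failing date-parsing test in src/dates.test.ts',
];

const runId = new Date().toISOString();
await mkdir(`runs/${runId}`, { recursive: true });

for (const [i, prompt] of prompts.entries()) {
  const transcript = await runAgent(prompt, 'fixtures/webapp');
  // One artifact per task, diffed or eyeballed against the previous run.
  await writeFile(`runs/${runId}/${i}.json`, JSON.stringify(transcript, null, 2));
}
```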

[21:34] Oh, well, that's a thing. Yeah. You know, they're not all doing it, but the major one, the SWE-bench benchmark, is literally just Python, which blew my mind when I learned that. Yeah. There's no benchmark today for, there's like a dozen agentic coding assistants, and there's no benchmark that says, given the same prompt and the same code base, here's the one that did the best job, did it the cheapest, did it the most effectively. Part of it's qualitative; you'd have to have kind of a grading system, and there'd have to be humans in that process. And speed's another factor, too, like how fast did it even do this thing. Someone did this funny thing. Someone linked me to something yesterday that was kind of interesting. They told the agent to write a book, I can't remember exactly what it was. It was something that would definitely run in a loop for a long time, and he checked how many steps each one did. I think the highest one, which was Claude, did 187 steps, like LLM to tool call to LLM to tool call, before it finally stopped. So that's a good way to rank them, to see which one's the most persistent. A lot of these models give up very easily. They run into an issue and they just stop. Just like humans. AGI has really been achieved internally. I'm giving up on this. They're going to just ask me again later. It's fine. Yeah, all right, I got real questions here. I also agree AGI sucks. Okay. But now that we got that out of the way. With that, um, how do you start a loop, and how do you determine that a loop's done? Because I know you're going to say, here's what the user said, and you have some sort of system prompt that says, hey, you can accomplish this task via these tools, whatever it says. How do you say, okay, I need to reprompt the system again to execute step one of a several-instruction plan? Do you have to manually parse it out? What's the interaction between the model and this to start this looping process?
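For readers wondering about the mechanics behind that question: in current tool-calling APIs the loop is model-driven rather than manually parsed. After each batch of tool results the model is called again, and the loop ends when it replies with plain text instead of another tool call. A sketch, assuming the Vercel AI SDK, whose maxSteps option runs this loop for you; the prompt and tool set are illustrative, not OpenCode's:

```typescript
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const { text, steps } = await generateText({
  model: anthropic('claude-3-5-sonnet-latest'),
  tools: {
    // editFile, readFile, bash, ... (see the earlier tool sketches)
  },
  // Safety cap only; the model ends the loop earlier by answering in plain
  // text (finish reason 'stop') instead of requesting another tool call.
  maxSteps: 25,
  prompt: 'Refactor src/db.ts to use a connection pool',
});

console.log(`finished after ${steps.length} steps:\n${text}`);
```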
