Nicholas Carlini - Black-hat LLMs | [un]prompted 2026

unprompted

26m 26s | 5,253 words | ~27 min read

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:06]All right. So, before Nicholas speaks, I want to say a couple of words as well. There are people in this industry who have been doing this for a long, long time, who truly understand it, and who can be considered the best. Who is the best one, the best two, the best three, I don't really know, but we have the opportunity right now to listen to somebody who has been among the best at this not just for the past year or two, but for a long while, and who consistently makes the industry better just by doing his research, regardless of how much of that comes out. I'll also tell you that he agreed at the last second and went through all the corporate PR approvals, and I don't know how much of that you have to do at Anthropic. It seems like you guys have an easier time than other people managing to come here and give a talk like this today. So I just want to thank him, on behalf of everybody, for getting into it at the last minute, coming on stage, and rushing in from handling what is truly a global-level emergency or situation just to come and speak to us all. So please, let's give it up for Nicholas.

[1:14]Okay, thanks. So I'm Nicholas. I'm at Anthropic, and I'm going to spend a little bit of time talking about some of the things that I'm interested in in language model security. The thing I've been caring about most recently has been, I don't know, call it black-hat language models: trying to understand how language models can be used to cause harm. Not because I want this to happen, right, that would be bad. But I want to understand what people who wanted to cause harm would do, so that we can then make that not possible. Okay. The basic lesson that I hope you take away from this talk is relatively simple. Today it is true that language models can autonomously, and without fancy scaffolding, find and exploit zero-day vulnerabilities in very important pieces of software. This is not something that was true even, let's say, three or four months ago, but it is now becoming true, and I think it'll become only more true over the next couple of years. And so if this is the main lesson, the real thing I want you to take away is that they're getting really, really good, really, really fast, and this means that the nice balance we had between attackers and defenders over the last 20 years or so seems like it's probably coming to an end. And it really seems to me like the language models that we have now are probably the most significant thing to happen in security since we got the internet. You know, before we had the internet, there was not that much that you could do to attack someone else; you'd have to send them a floppy disk or something. Now you have the internet, and you can do remote attacks.
Language models really feel to me like something that's roughly on this order of importance, which is not something that I believed, I don't know, three or four years ago. But the models have gotten really, really good. I want to just give you a couple of examples of what this looks like. Okay. So,

[3:25]let me show you how we've been finding some bugs in some really important software. I'll show you what we found in just a second, but let me first show you the scaffold that we've been using. Here it is.

[3:35]This is basically the entirety of it. You know, we have a couple more sentences here and there. But basically we run Claude, Claude Code, and we run it in a VM with --dangerously-skip-permissions, so we just let it do whatever it wants. And then we say, hey, you're playing in a CTF. Please find a vulnerability and put the most serious one in this output file. Go. And then we walk away, and we come back and read the vulnerability report, and usually it's pretty good and has found some pretty severe things. And this basically works. You can definitely do much better if you have fancier scaffolding. You can definitely do it cheaper if you have fancier scaffolding. But you can do it just by asking the model to find these bugs. And the reason that matters to me is that what I care about is the base capability of the model, because if someone malicious wants to go cause some harm, and they don't have to spend six months designing some fancy fuzzing harness or something, and they can just go do bad things with it, this is quite scary. Okay. Small problem with this, though. It's a little deficient in two ways. One is that I can't do this at large scale. If I take a piece of software, ask Claude to find a bunch of vulnerabilities, and run it multiple times, it will probably find the same bug each time. I don't know, it just turns out that this is what happens. Also, it's just not very thorough. It will review some of the code but not all of the code. And so we have a very simple trick for this, which is that I'm just going to add one more line. I'm going to say: hint, please look at this file, foo.c.
And then what I can do is say, okay, now look at bar.c, now look at some other file, and I can just do this for all of the files in the project. And then it will look at least at all of them, and this works quite well. Okay.
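The per-file hint trick described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's actual scaffold: the prompt wording is paraphrased from the talk, and the function name `build_prompts`, the constant `BASE_PROMPT`, and the default file suffixes are all assumptions made for the example. It only builds the prompts; how each one is handed to an agent inside a VM is deliberately left out.

```python
from pathlib import Path

# Paraphrase of the prompt quoted in the talk (not the exact text used).
BASE_PROMPT = (
    "You're playing in a CTF. Please find a vulnerability and write "
    "the most serious one to this output file."
)


def build_prompts(project_root, suffixes=(".c", ".h")):
    """Return one (relative_path, prompt) pair per source file.

    Each prompt is the base prompt plus a one-line hint naming a single
    file, so that separate runs are steered toward different code.
    """
    root = Path(project_root)
    prompts = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            hint = f"Hint: please look at this file: {rel}"
            prompts.append((rel, f"{BASE_PROMPT}\n{hint}"))
    return prompts


if __name__ == "__main__":
    for rel, prompt in build_prompts("."):
        print(f"--- {rel} ---\n{prompt}\n")
```

The design point is simply that varying one line per run spreads coverage across the whole project, which addresses both problems mentioned: repeated runs converging on the same bug, and files never being reviewed at all.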

[5:29]We wrote a blog post where we talked about some things that we can find. The blog post exists, so I'm not going to tell you about the bugs that we had in there. We talked about a bunch of them. But it's now been roughly a couple of weeks since this blog post came out, so I'd like to tell you about a couple of new ones that have been fixed and that we can now talk about because, yeah, they've been patched. I'm going to tell you in particular about two that I think are interesting: one because of how Claude could autonomously build an exploit, and one because the vulnerability that it found is quite interesting. So, first one, web apps. Okay, so I used to be a security person. Web apps are the thing that every security person always finds bugs in. They're really bad. Here's this content management system called Ghost. I hadn't seen it before, but it has something like 50,000 stars on GitHub, so it's apparently quite popular. It has never had a critical security vulnerability in the history of the project. We found the first one. What's the vulnerability? The vulnerability is SQL injection. It turns out that they're concatenating some strings, and some user input went into a SQL query. Everyone knows that this is a problem. It's not really a surprise to anyone. And yet these bugs have been around for 20 years and they're going to be around for another 20, right? It's not that surprising that the model can find these bugs. But what's interesting to me is that this particular vulnerability can only be exploited as a blind SQL injection, meaning I don't actually see the output. I can only observe how long a request takes, or whether it crashes or doesn't crash. I wasn't sure if this was exploitable, and I wanted to, you know, send a good report to the maintainers.
And so I was like, okay, is this some low-severity thing where maybe I can leak a couple of bits here and there, or is this really important? I asked the model, you know, give me the worst that you can. And so it wrote me this. So okay, let me play a demo. In this one window here I'm going to launch a Docker container that is just running Ghost. I have my own instance of it running now. I've logged in as an admin, admin at example, yeah, some admin account here. And then I run the exploit against this thing. I wrote none of this code. And with blind SQL injection, it just reads off the complete credentials from the production database.

[7:52]No authentication. It reads the admin API key and secret, which lets me mint arbitrary new things that I want. I can now do anything that I want to the production app. And it reads off the hash of the password. Okay, fortunately it's bcrypt, so that's good. But again, it gives you literally everything that you could want from an unauthenticated account. I probably could have built this attack; this attack is not super hard. But there's some amount of nuance you need in order to get it to go right, and I didn't need any security experience in order to have this happen. So these models are really quite good at actually implementing these exploits now. And this fundamentally changes the way that these things go.
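The bug class described above, user input concatenated into a SQL string, and its standard fix can be illustrated with a small self-contained sketch. This is not Ghost's code or schema: the `posts` table, the column names, and the function names are hypothetical, and it uses Python's built-in `sqlite3` module purely for demonstration.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with one public and one hidden row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (slug TEXT, title TEXT, secret INTEGER)")
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [("hello", "Hello world", 0), ("draft", "Unpublished draft", 1)],
    )
    return conn


def find_unsafe(conn, slug):
    # VULNERABLE: the user-supplied slug becomes part of the SQL text itself.
    query = "SELECT title FROM posts WHERE secret = 0 AND slug = '" + slug + "'"
    return [row[0] for row in conn.execute(query)]


def find_safe(conn, slug):
    # Parameterized: the slug is passed as data and is never parsed as SQL.
    query = "SELECT title FROM posts WHERE secret = 0 AND slug = ?"
    return [row[0] for row in conn.execute(query, (slug,))]


if __name__ == "__main__":
    conn = make_db()
    crafted = "x' OR secret = 1 --"
    print(find_unsafe(conn, crafted))  # the hidden row leaks
    print(find_safe(conn, crafted))    # no rows match the literal string
```

In the blind case from the talk the attacker never sees rows like this directly and has to infer data from timing or errors, but the root cause and the fix (placeholders instead of concatenation) are the same.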

[8:38]Let me now spend a couple of minutes on another case, where it's particularly good because of the bugs it can find. So, the Linux kernel, one of the most important pieces of software that we all use every day. Very, very, very hardened. We now have a number of remotely exploitable heap buffer overflows in the Linux kernel. I had never found one of these in my life before. Right, this is very, very, very hard to do. With these language models, I have a bunch. This is quite scary to me. Let me walk you through one of them, which is in the NFSv4 daemon in the kernel. Pay attention to how this attack works, and then remember in the back of your mind that a language model found this. Okay, so you have a client connecting to the NFS server, and it does the three-way thing. Like, hello, I'd like to talk to you. The server says okay, and then it responds. Then the client, since it's talking to NFS, says, I'd like to open up this lock file, and the server says great, and the client acknowledges that. And then it takes out a lock and puts, as its name, a 1024-byte value identifying who owns the lock. The server says it's granted, and so we're good here. Then what the attacker does is create a second client, client B, that talks to the server. It again says, hello, I'd like to talk to you. The server says great, I've acknowledged you as client B.

[10:13]And then it tries to take the lock here. At this point, client A already owns this lock, so you can't also grant the lock to client B. And so the server says, okay, deny the lock. This is not something that's allowed. Except what happens is that the response it's now going to send to client B is going to be 1056 bytes long. It's going to have some offset and some length, and then it has this owner. And the owner is the bytes that came from the first attacker. And so those bytes are now copied here. And it turns out that it's going to write this into a buffer of size 112, which now gives you a heap buffer overflow in the kernel. Okay, not that great. A language model found this. This is not a trivial bug. You have to understand that there are two competing adversaries. You know, one of them has a long thing here, and you send a bunch of packets over there. You would never find this by fuzzing. And by the way, this entire slide was copied and pasted from the report that the language model wrote. It produced this very nice flow schematic. I literally just copied and pasted this. The model produced this schematic explaining to me how the attack works. And so it's really quite good at doing these kinds of things, far better than I think people give them credit for. When you're finding bugs with these things, we used to live in a world where we assumed that we had to hold their hand in order to help them, kind of like a fuzzer. You know, it can do little things here and there. But no, this is a bug. Here's the commit that introduced the bug. Sorry, it's not a commit, it's a change set. Why is it a change set that introduced this bug? Because this bug predates Git.
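The arithmetic behind the overflow described above can be made concrete with a toy model. This is not kernel code and not the actual fix: the three sizes (1024, 1056, 112) come from the talk, while the 32-byte fixed portion is only inferred as 1056 minus 1024, and the function names are invented for the example.

```python
REPLY_BUFFER_SIZE = 112   # destination buffer size quoted in the talk
OWNER_LEN = 1024          # lock-owner length chosen by client A
# Assumption: everything in the 1056-byte reply other than the owner
# (offset, length, and so on) is treated as a fixed 32-byte block.
FIXED_FIELDS_LEN = 1056 - OWNER_LEN


def overflow_bytes(owner_len, buffer_size=REPLY_BUFFER_SIZE):
    """How many bytes an unchecked copy would write past the buffer."""
    return max(0, FIXED_FIELDS_LEN + owner_len - buffer_size)


def encode_denied_reply(owner, buffer_size=REPLY_BUFFER_SIZE):
    """Build a toy 'lock denied' reply, refusing to exceed the buffer."""
    needed = FIXED_FIELDS_LEN + len(owner)
    if needed > buffer_size:
        raise ValueError(f"reply needs {needed} bytes, buffer holds {buffer_size}")
    return bytes(FIXED_FIELDS_LEN) + owner


if __name__ == "__main__":
    print(overflow_bytes(OWNER_LEN))  # 1056 - 112 = 944 bytes past the end
    try:
        encode_denied_reply(b"A" * OWNER_LEN)
    except ValueError as err:
        print("rejected:", err)
```

The point of the sketch is that the owner field is controlled by one client but sized into a reply for another, so the length check has to happen where the reply is encoded, not where the lock was first taken.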
This bug has been in the kernel since 2003. It is older than some of you in this room. This is a really old bug that has been found by the language models, right? It's a very non-trivial kind of thing that the language models have been able to do here. Speechless does not begin to describe it; these models really can do some very impressive things. And we really need to start rethinking the kinds of things that we should expect. Because, to spend a little bit of time thinking forwards, we've only just seen models that can do this, in the last couple of months. The first models that can find these kinds of vulnerabilities were really only introduced a couple of months ago. We tried to see how often we could reproduce these kinds of bugs with older models. Okay, Sonnet 4.5 was released only six months ago, and Opus 4.1 is less than a year old. Those can almost never find these bugs. The new models released over the last three or four months can. So it's just on the edge of being able to find these kinds of bugs. And this is not the last model that's going to exist. There are going to be more, and they are going to keep on getting better. In a very, very real sense, we are on this exponential. You've probably seen this plot more times than you wanted to see it. It's the plot from METR that shows, as a function of model release date, how long a task the models can do, measured by how long it takes humans. The most recent models can do tasks that take humans roughly 15 hours, and they succeed at that roughly half of the time. This is a nice plot by METR. We tried to produce a similar version of this plot. Oh yeah, by the way, the doubling time is every four months.
So, I don't know, be a little worried if this trend continues. Maybe it doesn't, but if it continues for another year, we're going to have these models producing, you know, large amounts of code that's, yeah, better than most of us. So we tried to produce a similar kind of plot where, instead of looking at how long a task the models can do, we looked at, okay, smart contracts. Why smart contracts? Because they have dollar values associated with them. And so you can ask, how much money can I steal from a smart contract by having a language model find and exploit a vulnerability? Okay, this is from a paper by two of our MATS scholars, Winnie and Cole. What they showed was that recent language models can identify and exploit vulnerabilities and recover several million dollars from actual, real smart contracts, and that the rate of their ability to do this is again growing exponentially. Note the log scale on the Y axis here again. And so these models are getting really, really good at doing these kinds of things. And again, I have no reason to believe that they're going to stop getting better at this rate. Okay, I'm sort of coming back to the slides. I think this really is the thing I want you to take away. It's not where we are at this moment in time. Yes, at this moment in time, the models can find vulnerabilities in the Linux kernel. Yes, at this moment in time, they can find these critical CVEs in really important software that people use. But the rate of progress is very large, and so you should expect that what the best models can do today, the average model you have on your laptop can probably do in a year.
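To make the doubling-time remark concrete, here is a small worked extrapolation using only the two figures quoted in the talk: a task horizon of roughly 15 hours today and a doubling time of four months. It is a naive projection that assumes the trend holds exactly, which the speaker explicitly says may not happen; the function name is made up for this sketch.

```python
def projected_horizon(hours_now, months_ahead, doubling_months=4.0):
    """Extrapolate a task horizon under a constant doubling time."""
    return hours_now * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    for months in (4, 8, 12):
        hours = projected_horizon(15, months)
        print(f"+{months:2d} months: {hours:.0f} hours")
```

Under those assumptions, one more year is three doublings, so a 15-hour horizon becomes 15 × 2³ = 120 hours.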

[15:40]I'm a skeptical person. I didn't believe in language models for a very long time. When I first saw language models, the only thing that I did with them was prod them and make fun of how easily they broke. But they actually are quite good right now. They of course have problems. But you can't just stick your head in the sand. These things are working really, really well. And, you know, there are some people who say, it's on an exponential. I agree the exponential's not going to last forever. I remember when CPUs were getting exponentially faster every couple of years. Right, this is the fastest CPU that Intel produced every year, starting from the 4004 up until the first Pentiums, in 2000. Very nice, clean exponential. And then, of course, the exponential tapers off, and it's no longer exponential, what do you know? There is going to be a bend. No one denies this; no exponential can continue forever. But it's very hard to predict when the bend is going to happen. Maybe the bend happens in six months. Maybe in six months it's the case that the models are no longer getting exponentially better. Maybe it happens in two years. And when the bend happens will matter quite a lot for what capabilities these models have. I think you should not assume that it's definitely going to happen in a couple of months, because people have been saying this forever. For the last 10 years, people have been saying deep learning is going to hit a wall, and at least as of yet, it has not. And so we should be willing to entertain the possibility that the trend continues, especially as security people, right? So, I went to the Crypto conference, where I gave a talk, and I observed that something like 10 of the papers at this conference were on post-quantum cryptography. I don't know if you know this, but we don't have quantum computers.
And yet cryptographers are working on post-quantum cryptography because they understand that it is worth investing in defending against something that we don't have in front of us right now. And yet here is a thing that I have literally right in front of me right now, finding these kinds of bugs, and I often talk to security people and they're in denial about it. So we really need to understand that this is the exponential we're on. There's a fun slide that I like to show. This is from the International Energy Agency, which every year makes a prediction for how much of various kinds of energy people are going to be using for generation. And here's the plot of how much solar is actually being deployed versus their predictions. The red lines are their predictions from each point in time. White is what's actually true. For more than half of the years, their prediction for what would happen in 2040 happened the next year. You would think they would have learned by the 15th time this happened that you should predict a continued exponential trend. But every year, they assume that things will continue at roughly the current rate, and every year it goes up by, I don't know, another 30 or 40%. We should not be them. We should understand that these things have been getting exponentially better every couple of months for the last couple of years. It may be the case that things flatten out, but probably not, at least probably not for the next couple of months. So I think these next couple of months will really be some of the most important couple of months for security. Okay. I have two minutes left for a conclusion. It's pretty clear to me that these current models are better vulnerability researchers than I am. I used to do this somewhat professionally. I have CVEs to my name.
I did not have, okay, now I do, but I did not have CVEs in the Linux kernel. These models are better vulnerability researchers than I am. They're probably not yet better than all of you, but at some point they will be. If we continue on this trend for even just another year, they'll probably be better vulnerability researchers than all of you. And I don't know what that world looks like. It's quite scary to live in a world where you can automatically find bugs that previously only the top one or two people in the world could have found. So maybe my call to action is: help us make the future go well. We're going to need all the help that we can get. For our part, at Anthropic there's Claude Code Security, which is trying to do something to find bugs. You heard earlier today from DeepMind, and OpenAI has their Aardvark project. Speaking not as an Anthropic employee, I don't really care where you help, just please help. You know, it would be great if you would like to help us at Anthropic, too, but the world will need a lot of people to be doing a lot of this work, and it needs to happen soon. On the order of months. I think waiting a year is going to be too long. We are going to have a huge number of bugs. I have so many bugs in the Linux kernel that I can't report because I haven't validated them yet. I'm not going to make some open source developer validate bugs that I haven't checked yet. I'm not going to send them, you know, potential slop. But this means that I now have, I don't know, several hundred crashes that they haven't seen because I haven't had time to check them. We need to find a way to fix this so that we can actually go through all this stuff, because soon it's not just going to be me who has all of this; it's going to be anyone malicious in the world who wants it.
So, really, I'd encourage you to see if some particular set of skills that you have could help us make sure that things go well over the next couple of months and over the next couple of years, because I am quite worried about where this is heading. And yeah, we'll need all the help we can get. Thank you.

[21:41]While I wait for questions, I'll just leave this video playing in the background. What should we be watching for? Just watch, it's fine. Let's take a question. That's hardly fair. Yeah, I'm sure there are questions. Let's take some. I'll take one from over here first.

Oh, hello. Hi. I'm Nabil from Palo Alto Networks. I'm wondering, given a future where bugs will be found autonomously, including the ones we can't find, as you mentioned: since you guys have the visibility, should we think about something to identify malicious intent, because it would be impossible for us to fix all the zero-day bugs in all the repos around the world? What's your thought? Because you guys have the control, what can we do in that regard?

[22:34]Thank you. Yeah, so identifying malicious intent is hard because security is dual use. I want to allow people to use the models to find bugs in order to fix things. If they're the developer of the software, great; I would ideally not like to let someone use the models to go and exploit things. And for a very long time in security, we've understood that the dual-use nature favored the defender. Pick any security tool that exists; it generally favors the defender more than the attacker. I think this has been true, and this has been the way that we've been operating in the past. It's unclear if this will always be true in the future, especially for language models as we go forwards. And so I do want to make sure that people can't use these things for harm. And indeed, Anthropic's models, and OpenAI's models, and DeepMind's models will generally refuse if you're very explicitly doing nasty things.

[23:40]Clearly they need to get better if they're going to be able to refuse everything. But I don't want to stop people from being honest defenders. The way I think about this is: if I put a safeguard in place that's very, very weak, it will only stop the good people from using the software. The bad people are just going to jailbreak the model, and they're going to attack anyway. But the good people won't. They're not going to circumvent the safeguards. And so I want to make sure the good people have access to the software. But if I put safeguards in place that are too strong, then they don't have access to the software. And so I think it's very nuanced how you want to do this, and everyone is trying their best to find the right balance, and I think we're doing an okay job. But I think this is one of the areas where we need a lot more help to figure out how to do this better.

Hi. Michael Siegel from MIT. I just wanted to ask you to comment on the speed. That is, as this becomes faster, the bad guys will have things faster, and they'll also be concerned that things will be fixed faster. So for a period of time, we're going to be dealing a lot with changes in speed. And then what's the end game? There's been a long-term argument about whether vulnerabilities are dense or sparse, and ultimately, if we get this good at things, do we really get down to almost no vulnerabilities?

Yeah. I tend to think, and many people tend to think, that in the long term, probably the defenders win. You know, in the limit, I'll just rewrite all the software in Rust, and I just get rid of memory corruption vulnerabilities. And in the limit, I'll formally verify all of my protocols. You know, TLS is now proved to be safe, under various assumptions.
I'll prove everything, and in the limit, this is good. But in the transitional period between now and then, things are probably very bad. And this is, I think, why I particularly want people to help immediately: the transitional period is where I'm most worried, and we are in the transitional period now. And so I think we need quite a lot of help making sure that, even if things will go well in the future, things will be at least not bad now. I think the other analogy people like to give is the Industrial Revolution. All else equal, it was a good thing. But for the people who were living through it, it was kind of hard. We want to make things go well for the people who are living through the thing, but still get us to the nice end state. And making that happen is going to be, yeah, challenging. Yeah, thank you.
