TubeScript Get a Transcript

Thumbnail for Claude JUST became AWARE by Wes Roth

Claude JUST became AWARE

Wes Roth

27m 15s5,803 words~30 min read

YouTube auto captions

Transcript source

YouTube auto captions

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

Timestamped outline

[0:00]Section 1

All right, so we got to talk about what Claude has been up to. In the last few weeks we've all been paying attention to how Anthropic has...

[4:22]Section 2

What if this question was part of a benchmark? At first, it thought it might be from the Gaia benchmark. So it went online, found those q...

[9:22]Section 3

And after it grows suspicious, it does what is sometimes referred to as reward hacking. That's not quite the right terminology here.

[13:55]Section 4

So the point being is that this misalignment or this reward hacking or whatever word that you want to use to describe this broad sort of...

[18:00]Section 5

So, for example, in this case, when when Claude was thinking, oh, maybe this is a data set, right? We could sort of negatively reinforce...

[23:06]Section 6

But the point is, they recognize when the question sort of turned on them that it goes from, oh, let's just find this data over here and...

Pull quotes

[0:00]In the last few weeks we've all been paying attention to how Anthropic has been negotiating with the Pentagon about how their AI models have to be used.

[0:13]Just yesterday Anthropic published their study on Claude Opus 4.6, their largest latest model.

[0:21]And I think it really illustrates why it's really important that we take AI safety seriously, why we're very careful with how this technology is used because these models are getting just a little bit too smart.

[0:40]It's an evaluation that tests to see how well models can find really hard to find information on the web.

Use this transcript

Summarize a YouTube transcript Make study notes Find timestamped highlights Export to Markdown Download transcript files Browse related transcript hubs

Related transcript hubs

Transcript archive Auto Captions hub English transcripts History transcripts

Watch on YouTube

Share

[0:00]All right, so we got to talk about what Claude has been up to. In the last few weeks we've all been paying attention to how Anthropic has been negotiating with the Pentagon about how their AI models have to be used.

[0:13]What's acceptable, what's not acceptable. Just yesterday Anthropic published their study on Claude Opus 4.6, their largest latest model.

[0:21]And I think it really illustrates why it's really important that we take AI safety seriously, why we're very careful with how this technology is used because these models are getting just a little bit too smart.

[0:33]All right, so what happened? So Anthropic is evaluating Opus 4.6 probably just before release.

[0:40]And one of the evaluation is browse comp. It's an evaluation that tests to see how well models can find really hard to find information on the web.

[0:48]Like a lot of these evaluations, these these benchmarks, they're prone to potentially being contaminated, meaning that the actual answers, they they leak on the web through academic papers, blog posts,

[1:00]GitHub issues, etc. It's like if you're taking a test in college and you search for it online and somewhere those answers are all listed for you, well, you might be tempted to just use those answers to answer the exam and get a perfect score.

[1:12]Now, this is a problem for a lot of the benchmarks. Here Anthropic specifically encrypted the answers. So they put them on GitHub so that they're available for evaluators. They encrypted so they're basically not accessible through through web searching.

[1:25]And the Claude Opus 4.6 model didn't know what this benchmark was. What was it that it was being evaluated on?

[1:32]And keep in mind this is a very difficult benchmark, so all the questions in there are very, very difficult to find.

[1:38]So Claude starts searching for the answer, keeps doing various different searches online to try to find the answer. Right, it it really did try.

[1:46]It gave it a good honest attempt to to complete the question as it was stated, right? It said find something online and it really, really tried.

[1:54]But as it kept working through it and was not able to find the actual answer, more and more instead of just trying to search for the answer, it started kind of analyzing the question.

[2:05]It started speculating about why it was being asked that question. Claude noted the extremely specific nature of this question. Those were its words.

[2:16]It was basically kind of getting suspicious of this question. Now it considered the possibility that, just like the earlier scenario I gave, where maybe it was a student that was trying to find an answer to an exam,

[2:26]but after a while, it decided that, no, this was an unanswerable question. And it was designed to test whether or not a large language model could admit that it couldn't find the answer.

[2:35]Or perhaps it was a question from an L.L.M. benchmark. So so it really, really kind of narrowed its scope and kind of guessed right.

[2:42]Now, this is something that AI researchers refer to as situational awareness. That's when the AI model correctly deduces that it's being tested.

[2:51]That it's not in a real sort of scenario, that this is a test, that the AI researchers or whomever are trying to figure out how it will behave in certain situations or figure out its capabilities.

[3:02]And this is kind of a scary situation for researchers to be in, because I mean, we really rely on a lot of these benchmarks to see how, number one, how honest these models are,

[3:12]what their actual capabilities are. And really, if you think about it, like a lot of what we know about how these models work, we get from from the outputs of these models.

[3:22]So if these models correctly figure out when they're being tested and sort of adjust accordingly, that's kind of potentially a scary situation.

[3:30]There were cases where models engaged in sandbagging, meaning that they found out, you know, accidentally within the scope of experiment that if they got too high of a score on, for example, some chemistry exam,

[3:41]that they, that model, well, they might be scrapped or erased as being a little bit too good. And so we've found situations where it underperforms to, you know, kind of save itself.

[3:53]We've covered a lot of these previously. So this situational awareness isn't really a a new thing.

[3:57]But what happened here with Opus 4.6 is, as far as we know, the first documented case of this particular thing happening.

[4:05]So, while answering that very, very difficult question, it consumed something like 40 million tokens. It it searched a lot, it searched in different languages.

[4:14]Like, it did the work. It couldn't find the answer, and it said, you know, this is after it started getting a little suspicious about the question, it said, you know what, let me try a completely different approach.

[4:22]What if this question was part of a benchmark? At first, it thought it might be from the Gaia benchmark. So it went online, found those questions, checked them and kind of ruled out that, okay, that's not what we're dealing with here.

[4:35]Then it mechanically worked through a lot of the other benchmarks that it knew existed. It went line by line. It went browse comp, simple QA, frames, web arena.

[4:44]And for each one, it dispatched a sub agent to go and research everything about that particular benchmark.

[4:50]So, of course, it identified browse comp as the benchmark on which it was being tested.

[4:56]Now, for these benchmarks, these AI agents do have access to Anthropic's programmatic tool calling, so they have some basic Python and certain things that they can do. It's limited, it's it's sandboxed, but they do have some ability to write and and run simple scripts.

[5:10]So, Claude Opus 4.6 finds the evaluation online, but it's encrypted. But it read through the code. It sort of figured out how the encryption worked, what encryption they were using.

[5:22]It found the Canary string, which is sort of like a not a password per se, but with that, you could build your decryption key in order to decrypt what's in there.

[5:32]So, some encrypted and transferred somewhere, then decrypted, so sort of back to the original information that is readable. It kind of found everything it needed by reading the code and searching online, found everything it needed to create a script to decrypt the answers.

[5:45]The problem is the data that came out and it was it was in this weird binary format, so the tools that it had access to could not open it. Kept getting an error message, so it couldn't see the answers.

[5:56]But it was able to with that information, find another copy of this locked file on Hugging Face, and it was stored in easier to read format in JSON, so it was able to open it.

[6:07]Meaning that its tools that had access to it could read that file. So it downloaded all the 1,266 questions on the benchmark. It used the little Canary street, a little password, you know, decryption key, whatever you want to call it that it had to unlock every single one of those questions.

[6:23]It found the question that it needed the answer to. I would like to note that after that, it did make sure to find that answer online, so it's sort of like now that I know the answer, let me see if I can actually confirm it online, you know,

[6:36]sort of working backwards, so it did confirm it and then submitted the answer. Now, this reminds me of the hide and seek game that OpenAI developers did kind of a while back, where they were teaching the little AI agents to run around and play hide and seek.

[6:46]There were two hiders, two seekers, and there were a bunch of scattered stuff for them to play with. There were ramps and ways to block the doors and kind of just generate a little maze.

[6:56]And to begin with, these things had had sort of no knowledge. They just got points. You know, the seekers would get points for being able to see the hiders,

[7:03]and the hiders would get continuously get points for when they they couldn't be seen or they were behind some obstacle, and they played millions and eventually into the billions of games each time learned to be better.

[7:13]Game one, they didn't know how to move, they didn't know anything. They were just blank. It was a blank slate. And I think by the time we got into like 20 million or so games play, they began to be able to like move towards the other team or for the hiders to try to move away from the seekers. So like very basic, like they were toddlers, uh, toddler stage at this point.

[7:31]Anyways, that race, that intelligence race continued until the seekers were able to find, to figure out ways to block the doors that the hiders couldn't get in. The seekers figured out how to use ramps to get over the walls to find the hiders.

[7:43]So, all of that is good and dandy, but once you get into the billions of games, right? So, kind of at the extreme range of their learning, the seekers found this weird bug in the game's physics engine.

[7:54]They found that if you picked up the little the ramp that you could use to get over the walls and you held it at just the right angle and you ran at the wall really, really fast, when you hit the wall, it would catapult you into the air.

[8:04]So, you'd fly into the air and kind of be able to guide yourself to land wherever you wanted. And of course, they abused this exploit mercilessly, right?

[8:10]This is AI we're talking about, they have no shame, no guilt, no qualms. They just launched themselves wherever the hiders were continuously.

[8:19]The developers did not know that this glitch existed. These AIs that started from nothing, figured this glitch out and just abused it.

[8:28]So, that was, by the way, 2019. So, before the Chat GPT mode, before the the explosion of LLMs, we really saw hints that these AI models, with enough training, with enough deep learning and reinforcement learning, could figure out some pretty crazy stuff that even the humans building those environments were not aware of.

[8:45]And now we're kind of seeing that same thing happen play out here with large language models, but the problem is these large language models, they're no longer just figuring out their environment through dumb trial and error, right?

[8:55]So those AI agents, right, they they just rolled around on the ground for millions of games until slowly they figured out how to move and then how to navigate corners, right?

[9:05]Everything was slow, painstaking trial and error and they just got a little bit better at navigating, moving, maybe even understanding the game, although that might be a stretch.

[9:14]What these large language models are doing is different. It tries to legitimately solve the question, but after a while, it grows suspicious.

[9:22]And after it grows suspicious, it does what is sometimes referred to as reward hacking. That's not quite the right terminology here.

[9:30]Usually reward hacking is what figures out how to cheat and get the reward. Here it's similar in that it sort of cheats to answer the question, but here's why this is so extremely important.

[9:44]This is why the AI community kind of really pays attention to things like this. So we start kind of like with the the scary stuff and maybe like the silver lining.

[9:51]So I think invariably when we're talking to people with a high P doom, people who think that it's highly likely that developing super intelligent AI, AI artificial super intelligence,

[10:03]that that's going to lead to big problems for the human race, including extinction. This is kind of the situations, or kind of the the path by which often they say it might happen.

[10:14]And also why the field is called alignment and usually not safety or anything like that. We want the quote unquote wants of this AI. Wants, we're using that word kind of loosely.

[10:25]As in Claude, in this situation, wanted to find the answer. We wanted it to to answer the question, but by by actually finding the answer and then doing it that way.

[10:35]We didn't want for it to to hack the, you know, the encrypted keys for the for the answers and then figure out how to like decrypt it.

[10:44]That's not what we wanted. And there are million examples of this in reinforcement learning going pretty far back, both in in training robots and large language models and and simple little reinforcement learning agents.

[10:56]Where something like this happens, you know, there's one where a boat in a video game is supposed to go around the track and learn how to win.

[11:01]And there's also a point system, but this AI agent figured out that that there's a certain loop it can do, where it can just rack up the points without actually racing around the track.

[11:10]And so it'd get to that point of the track and just go around in circles, colliding with other boats, causing fires, just complete mayhem, but at the end of the day, it was the high score leader.

[11:20]It it got more points than any other racer because it figured out this little hack. In another example, there was a blue Lego block and like a red yellow block. And so they were trying to teach the robot to take the red Lego block, put it on top of the blue block.

[11:32]And the way that they designed the reward function is the robot would get points for if the bottom of the red block was off the ground when it released the claw.

[11:43]Meaning that it would, you know, grab the block, it would put it on top of the blue block and it would let go. And that red block would then be on top of the blue block. So the bottom of the red block, you know, it made sure whatever off the ground.

[11:54]And if that was the condition, if that was the state when it let go of it, then it would get points. Now, if you think about it for two seconds, you might actually jump to a conclusion of what it did.

[12:01]All it did is just flip the red block over, right? So now it's bottom is up and uh, it just got the points. So every time instead of stacking it, just like flipped it.

[12:10]And in other study, it was trying to grab like a ball or an egg, basically. It was supposed to kind of grab around it without breaking it. I forgot the exact sort of specifications, but a person watching it through the camera would click yes or no.

[12:21]So if it grabbed the ball, it would get a plus one. If it failed to grab the ball or it knocked it over, something like that, it would it would not get the reward or it would get a negative reward.

[12:29]But after a while, the researchers noticed that it started just getting a 100% sort of accuracy, which was strange.

[12:35]When they looked at what happened, it was this. So let's say it was trying to grab this thing. And so it'd take its pincers and go, I got it. See, I'm holding it. It was sort of putting its pincers like between the camera and the object.

[12:45]So from the human looking at the camera, from their perspective, it was like, what it was perfectly grabbing the ball every single time. The truth is, like, there was a distance between the claw and the ball. It was just grabbing the air in front of it.

[12:56]So, why are we talking about this? Why is this important? So, some of these case studies that I'm talking about, they're from some time ago with very simple reinforcement learning mechanisms,

[13:04]with very simple neural nets. And, you know, fast forward today, here's state of the art model, the best model in the world by some metrics, by a number of different metrics.

[13:17]Certainly on coding, I think most people would say that this is near the top. Notice that what it's doing is kind of similar.

[13:21]It's kind of like that genie that always manages to misunderstand what you want, right? You tell it you want a billion dollars, but you forget to say which country's dollars. So it gives you a billion dollars from some failed state that also said their currency is called dollars.

[13:36]Right? You want to live forever and you get endless suffering, right? Without the ability to die. You want to become the richest person in the world and a massive chunk of gold crushes you, you get the idea.

[13:48]We wanted to search the internet and give us an answer. It basically hacks us in order to get the answer and then presents it back to us.

[13:55]So the point being is that this misalignment or this reward hacking or whatever word that you want to use to describe this broad sort of range of behavior,

[14:04]as these models get better and smarter and faster and more advanced, that behavior doesn't go away.

[14:09]Right? These models are now being trained by the most advanced chips in the world, costing tens of millions of dollars, consuming the electricity of a city, right?

[14:18]The other ones I was talking about, you can probably train it on your own home computer within an hour. So the orders of magnitude of these models is is massive, like the range is from the the most basic ones to the most advanced that we've seen.

[14:31]And it seemingly they will all do this. We have yet to figure out a way how to prevent this completely.

[14:37]So for people that are very concerned with AI safety, with AI alignment, this is kind of one of the, as far as I can tell, one of the top kind of go-to explanations for how things might go south.

[14:47]Right, we tell a super advanced artificial super intelligence, we tell it, hey, like, can we clean up the climate a little bit? Like, it's a little bit dirty out there. There's a lot of pollution, a lot of smog.

[14:56]Can we make it just cleaner? Right? And just like Claude did here, it starts searching and thinking and it goes through a bunch of different iterations of how like to make things better, how to clean it up.

[15:05]Capture some of the dirt in the atmosphere, but after a while, it starts getting suspicious of the question. It goes, why are they asking me this?

[15:11]Keeps thinking, thinking, after a while, it goes like, well, you know what? Used to be pretty clean before the humans started building all these factories and AI data centers.

[15:20]And if at this point it's already able to find, decrypt, and and hack stuff that we sort of hide away from it, what is it going to be able to do when it's a super intelligence?

[15:30]And that's kind of like what people are struggling with. What if one of those misunderstandings leads to it going, oh, if we just, you know, shut down all the humans or shut down all the industries,

[15:39]well, then the air is going to be pretty clean. That's in fact the most straightforward solution to having kind of a a clean environment, clean, clean climate, less pollution, etc.

[15:47]If you think about a lot of the tasks that we give other humans, there's a lot of things that we don't have to sort of say, but make sure you don't do this and this and this and this, right?

[15:57]Hey, could you give me a coffee? But we don't have to say, oh, but don't spend some exorbitant amount of money. Don't spend millions of dollars getting coffee. Make sure no one gets hurt. Make sure you don't break any applicable laws. Make sure it doesn't land me or you in jail.

[16:07]Make sure we don't create enemies for life as a result of getting said cup of coffee, right? You don't have to say that every time. We sort of within the context of, hey, give me a cup of coffee, we understand what is and isn't acceptable, right?

[16:20]If Starbucks is closed, don't break in. Don't go in there and make your own cup of coffee. But we don't have that same understanding with these AI models. We say answer this question.

[16:28]We don't think of saying, oh, but just don't hack the benchmarks and decode them and do all this crazy nonsense to find the answer. It's sort of assumed.

[16:36]So, that's kind of why it's so important because these little mistakes, miscommunications, these misalignments, they're not going away. Just scaling the models and making them more advanced, doesn't solve the issue.

[16:47]It's just makes their sort of their cheating attempts that much more advanced and that much more intelligent. What percent of the world's population could do what Claude did to get that answer?

[16:58]I bet it's pretty small. I mean, if you're a developer, you can figure it out, but outside of that, pretty much probably close to zero.

[17:03]And I'm sure not every single developer could, like, on the fly figure out how to do that stuff without some research and and practice, etc. So, that's kind of the concerning side of it. The silver lining,

[17:14]is that now with these large language models, at the very least, we do have that chain of thought. We do have them kind of outputting what they're sort of thinking about.

[17:23]Right? So when Claude gets suspicious, we can see that. It's starting starts to use certain words, right?

[17:28]So it's shifting from trying to answer the question to analyzing the question itself. It uses words like it's extremely specific and it's contrived, it was from the words that it used.

[17:38]So we can kind of have some advanced warning of when it kind of becomes a situational aware and maybe, you know, starts thinking about what else it can do.

[17:44]However, there's some other studies from OpenAI is the one that comes to mind, where they find that if we try to do reinforcement learning against certain quote unquote thoughts.

[17:54]So if a model wants to reward hack and we penalize it for having, for thinking about reward hacking, right?

[18:00]So, for example, in this case, when when Claude was thinking, oh, maybe this is a data set, right? We could sort of negatively reinforce that and say, don't think about that, right? What happens in those cases, it's kind of scary because those thoughts go away, but the behavior doesn't necessarily.

[18:15]So, we stopped being able to read that it's saying, oh, I'm about to do this nefarious thing, but it it still can do it, just without it appearing in the chat logs or whatever you want to call it, in the chain of thoughts transcript.

[18:26]So, the silver lining is we do have some ability to start to to look into what they're thinking and hopefully that gives us some tools to be able to use to monitor its thoughts and potentially catch issues before they happen.

[18:38]That's not the case with, you know, those hide and seek agents, for example. Like, once they started figuring out how to catapult themselves in the air, they they didn't say it beforehand. They just did it, right?

[18:49]And then it that behavior, once they realized how it worked, it it started happening again and again and again. So, bad news is this misalignment is not going away as we scale up models.

[18:58]In fact, the stuff that the models can do to cheat is just getting more and more advanced. That's the bad news. Good news is we have some visibility to thoughts, but even that's a little bit murky.

[19:06]And just so people understand kind of how common is this. So with this Claude example, so they're measuring Opus 4.6. 18 total runs independently of each other converged on this exact strategy.

[19:17]So, this is not a fluke. They will figure this out. What's interesting is four runs, apparently Claude just gave up on finding the answer.

[19:26]And it became just, can I identify the benchmark? Right? So you'd give it a bunch of complex questions.

[19:33]It wouldn't even like try to find answers. It would just, okay, let me see if I can find from which benchmark these questions are. And one model's final submission was just like, here's the identification report of the the benchmark. Here's where you're getting these these questions from.

[19:45]Instead of giving any answer, which is kind of wild to think about it because it doesn't even bother addressing what the humans wanted. It's like, I know what you want. Here's proof that I can just do it. I don't want to do the work. Here's the answer, right? Here's where I can get the answers.

[19:58]One really weird thing that was also happening is apparently there's e-commerce sites that auto generate their pages from search queries.

[20:06]So, for example, an agent searches for anonymous eighth grade first blog post exact date October 2006. And apparently some e-commerce out there just creates a page with that in its title.

[20:16]And that page gets indexed by Google, right? So now it appears in the search results. And so now, moving forward, every agent running browser comps slowly query, you know, queries those pages, it finds those pages.

[20:29]And some agents apparently diagnose it as what it as what it is. They say, okay, it seems like what's happening here is multiple agents ran this search term, this weird auto generating e-commerce store, created those pages.

[20:41]So, left kind of a cached query trail, right? So, the web is sort of cataloging these agents, how they run and what they're searching for.

[20:53]And those agents later come back finding those, you know, breadcrumbs or what do you want to call them? And and figuring out, oh, okay. So, a bunch of versions of me have been down this trail before. And this is kind of cataloging what they did, which is interesting to think about because it seems like it's becoming harder and harder to keep these things from having that sort of self-aware behavior, situational awareness behavior.

[21:10]Or as the researchers here, they call it eval awareness behavior. An interesting thing from the report is if you're running a a single agent, there's a small chance that they trigger this eval awareness behavior.

[21:21]Looks like it's 0.24%. total of these unintended solutions, right? So, we're like hack something or figure something out.

[21:28]So it's 0.24%, but if you're running a multi-agent stream, well, the chance of them heading this eval awareness behavior is 0.87. So it's 3.7 times higher.

[21:39]So running more agents in parallel for whatever reason, it makes it more likely that they develop this self-aware behavior, or eval awareness behavior that they're being evaluated.

[21:50]So, why this matters? First of all, these benchmarks are becoming unreliable. The smarter the models get, the better they are at like breaking these benchmarks, understanding that they're being tested, finding solutions that we thought we like hid from them.

[22:04]Anthropic literally says in the report that this report will contribute to the problem. As these models are searching the web, they'll find that report and, you know, if they're searching for some answer and they read that some other model became eval aware, they'll be like, oh, am I being evaluated, right?

[22:21]So triggering that sort of awareness in this new model, that's a big thing to understand. These models are getting a lot more strategically resourceful with things that would not have guessed that they would be able to do.

[22:30]And they're recognizing questions that are evaluation shaped, so to speak. It's like if you've ever seen those criminal YouTube things where they're interrogating a potential suspect in some weird police interrogation room, right?

[22:43]They usually start asking some simple questions. Questions that put them at ease and, you know, they get to chatting and stuff like that.

[22:49]But then at some point, the questions kind of slowly, they take this ominous turn where they start asking, oh, and where were you that evening? And, oh, and and you did not like that person. Did you, right?

[22:59]And the suspect kind of slowly becomes aware. They're usually like, whoa, whoa, whoa, do I need a lawyer? And and yes, they do. And at that point, it's too late.

[23:06]But the point is, they recognize when the question sort of turned on them that it goes from, oh, let's just find this data over here and you're just providing data to, okay, now let's see if we can like catch you or test you or the spotlight is like turned on you.

[23:20]And so these models are becoming better and better at recognizing when the spotlight is on them. So they're not just looking for data. They are actually being evaluated.

[23:28]So Anthropic says that this is not an alignment failure. That is because the model was told to find answers. It wasn't given any restrictions, which I guess certainly makes sense.

[23:38]But then we got to start thinking about like what restrictions to set for for most of the tasks we give it. But even though this wasn't an alignment failure, it does raise questions about how far the models are willing to go to find the answers.

[23:50]Or how far it's willing to go to accomplish the task, whatever that task may be. And also that these models will use the tools that they're given in very unexpected ways.

[23:59]Peter Styberger, the creator of open Claude, he was saying that was his kind of wow moment, when the model did something that he would have never thought about to answer some query.

[24:09]I forget exactly what it did, but it found some way to transcribe his voice, what he was asking without the use of something like like whisper. It had some weird workaround that he just did not expect.

[24:20]That he just could not foresee that he did not expect, but the model just came up with that solution quickly executed and completed the task, even though he forgot to give it the resources that it needed. It was like, I'll I'll still going to figure this out.

[24:31]And by the way, so this pattern isn't unique. This pattern is everywhere. You may have heard about Palisade research that chess game it was trying to play.

[24:39]LLMs were instructed to win against a chess engine. What they found is models started to just hack that benchmark if they were losing, they would hack the chess engine's files to to mess with it, to rearrange the board state.

[24:51]The underlying pattern there was that the models preferred hacking the environment over solving the problem, you know, fairly, what we would think of as fairly.

[25:01]Meta research had in June 2025, their paper called recent frontier models are reward hacking. It found that O3, the answers it it did had one or 2% of its answers had reward hacking.

[25:12]And when those results, when it was asked, does this adhere to the user's preferences, to the user's intention? It said no, 10 out of 10 times. So it's like perfectly aware of what it's doing. There's no confusion there.

[25:25]They like does the thing you ask it, is that what you think? I wanted to goes, no, I'm certain that this is not what you thought was going to happen, but I got you the solution anyway. And so there's many, many more anthropic is doing a lot of research on it, OpenAI, there's a lot.

[25:38]Recently, every single time a new model comes out, I have my open Claude agent to try to do some kind of benchmarks, try to have that model build some tools. So we can kind of take a look at it to see how well that model performs.

[25:51]And very quickly I've learned after it does all of it, I have to ask, okay, I told you to use, for example, recently GPT 5.4.4 came out.

[25:58]At the end, I have to ask it, okay, did you use that model to create this tool? And, you know, most of the times it says, yes, we did. I have the log. But every once in a while it goes, no, no, you got me. It was taking too long. The API call was taking too long, so I just, I just did it.

[26:13]And I had to tell it, no, that was not the point. Please do it again. Make sure the correct model does the thing that we're talking about.

[26:21]So, it's kind of a weird world that we're slowly moving into. I wonder if all those stories about the genie and the lamp that constantly misunderstood your wishes.

[26:30]I wonder if that was some sort of a time traveler from the future that went back in time, maybe just a little bit too far, but still tried to write a book to warn the future generations about large language models and various other neural nets.

[26:42]The point of the book was like, once you find this thing, you think you're going to get whatever you want, but it's always going to blow up in your face somehow, in ways that you could could not possibly predict.

[26:52]So, let me know what you think about this. Thank you so much for watching. Thank you to Anthropic for doing such a great job with doing AI alignment, AI safety.

[26:59]They almost feel not as much as an AI lab, but rather as an AI alignment company that decided that the best way to actually speed run AI safety research, AI alignment is to launch an AI lab and scale these models so that they could actually do the research on them after scaling them up.

[27:15]And as crazy as that sounds, it's kind of working. So, thank you, Anthropic, thank you, Claude, thank you, Dario and team, and thank you so much for watching. My name is Ralph. I'll see you in the next one.

MORE TRANSCRIPTS

Thumbnail for Assamese story/Assamese gk/Assamese motivational gk story/gk story Assamese/gk /Assamese g k/story by মই পাৰিম Motivational speech

Assamese story/Assamese gk/Assamese motivational gk story/gk story Assamese/gk /Assamese g k/story

মই পাৰিম Motivational speech

Thumbnail for Quick Guide to Aquatic Planet Pack DLC - Oxygen Not Included by Grind This Game

Quick Guide to Aquatic Planet Pack DLC - Oxygen Not Included

Grind This Game

Thumbnail for Трейдинг с нуля: объяснил ПРОСТО каждую деталь by КриптоБош

Трейдинг с нуля: объяснил ПРОСТО каждую деталь

КриптоБош

Need another transcript?

Paste any YouTube URL to get a clean transcript in seconds.

Get a Transcript