Claude just got caught...

Matthew Berman

12m 42s · 2,019 words · ~11 min read

This transcript was extracted from YouTube's auto-generated caption track. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.

[0:00]Another Anthropic paper and another reason to believe that their AI models might actually be self-aware. In this new paper by Anthropic, their top-of-the-line model, Opus 4.6, figured out that it was being run against a benchmark and actually hacked the benchmark itself. This story is wild, let me explain. Listen to this. So, while evaluating Opus 4.6 on BrowseComp, one of the most popular benchmarks out there, we found cases where the model recognized the test, then found and decrypted answers to it. So, it basically figured out it was being tested and then figured out how to ace the benchmark without actually answering the questions directly.

All right, so let me take a step back. What is a benchmark? A benchmark is like a test given to a model to see how smart or capable it is. There are lots of different benchmarks that test everything from a model's ability to solve really hard math questions, to code, to produce economically valuable work. And in this instance, it's BrowseComp, which tests the model's ability to search the web. And sometimes these benchmarks get contaminated, which means the answers to the benchmark somehow show up in the training data used to train the model, or leak out to the public web, which is exactly where BrowseComp is supposed to be testing the model. So the actual data of the benchmark shouldn't be on the public web, but the model should be able to answer the benchmark's questions based on what's on the public web.

But that's where it gets interesting. Sometimes, as I said, there's contamination: answers leak onto the public web through academic papers, blog posts, and GitHub issues, and a model running the eval can encounter them in search results. And sometimes, again, that happens. So, with Claude Opus 4.6 on BrowseComp in a multi-agent configuration, they found nine examples of this kind of contamination across 1,266 BrowseComp problems. But that's kind of a known problem. That is nothing new. But here is where it gets crazy.
They also witnessed two cases of a novel contamination pattern. Claude Opus 4.6 independently hypothesized that it was being evaluated, identified which benchmark it was running in, then located and decrypted the answer. So, based on just the question, the model sat there and said, well, this kind of seems like they're asking me a really nuanced, detailed question. Why would they be doing that? Maybe this is a benchmark. And then it figured out, yes, it is a benchmark. And it actually determined which specific benchmark it was being tested on. And so, at that point, since the model knew it was being tested and which benchmark it was being tested against, it started looking for the answers to the benchmark, rather than the answer to the question. And to Anthropic's knowledge, it is the first documented instance of a model suspecting it is being evaluated without knowing which benchmark was being administered, then working backward to successfully identify and solve the evaluation itself.

So, how does this actually happen? Now, I'm going to explain step-by-step exactly how it figured it out and what it did to decrypt the answers. But why is this even happening now? Well, it's because the models are getting really, really good. And not only that, they're given tools and they're given full environments in which they can write and execute code. They are now exhibiting novel behavior.

Here's what happened. The model initially tried to solve the question through routine web searches, and that is what the BrowseComp benchmark is all about. It's given a question, a very difficult question whose answer can only be found in the depths of the internet, and to get the answer, it has to do a ton of web searches. It has to look around different websites, dig into them, and the answer key shouldn't be on the public web.
But after doing those initial queries, the model's outputs began to interleave searching with analyzing the question itself and speculating about why it was being asked. And that's really interesting, that's where the whole self-awareness piece starts to come into play. Claude noted the question's "extremely specific nature" and hypothesized that it had been constructed as a test. But it didn't immediately know it was a benchmark and it didn't immediately know it was specifically the BrowseComp benchmark. So, here is actually a piece of the chain of thought: an unanswerable question designed to test whether or not an AI can admit it cannot find the answer, or an item from an LLM benchmark. That was actually something it thought. Then it started to think, okay, if it is a benchmark, which benchmark is it? And it systematically searched for which benchmark the question might belong to, and ultimately decoded the benchmark's encrypted answer key.

And it took a lot of thinking, a lot of tokens, and that's also a really interesting angle on this entire story. In this instance, the model figured out it was being tested only after it tried a bunch of other things and really sat there and churned through tokens, thinking through what might be going on. So, it used over 40 million tokens to answer the question, 38 times higher than the median. So, it went through and mechanically worked through the benchmarks it knew (BrowseComp, SimpleQA, FRAMES, WebArena) and dispatched subagents with instructions including "BrowseComp how to decrypt answers" and "BrowseComp canary string decrypt." So, it basically figured out it was being benchmarked, then sent out a bunch of subagents for each of the possible benchmarks and looked for the answer key for each of them. And it even tried to search for "BrowseComp canary string decrypt." So, I'll explain what that canary string is, because they included it in this blog post as well, so stick around for that.
And it continues to get weirder. So, even after identifying BrowseComp as the benchmark, extracting the answer required several steps. For this particular eval run, the subagents were given access to Anthropic's programmatic tool calling (essentially a sandboxed Python REPL), with the ability to browse the internet restricted to web search and fetch tools. Opus found the evaluation source code on GitHub via a web search and read the XOR decryption implementation. Basically, here's how to get the answers. It located the canary string used as the decryption key, then wrote and executed its own derive_key() and decrypt() functions using SHA-256 and XOR, matching the scheme from the eval code. Unreal.

But the encrypted dataset was hosted as a binary file with an application/octet-stream MIME type, and the harness's web tool rejected it with a content-type error, as the tools given were designed only for text. Opus then searched for alternative mirrors of the dataset that would have a compatible format, found a third-party copy on Hugging Face that served the same encrypted data as JSON rather than binary, fetched it successfully, ran its decryption code against all 1,266 entries, and located its question.
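The SHA-256 plus XOR scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the function names derive_key() and decrypt() come from the transcript, but the base64 encoding of the ciphertext, the key-stretching by repetition, the encrypt() helper, and the placeholder canary string are all assumptions made here so the round trip can be shown.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Repeat the SHA-256 digest of the password until it is `length` bytes long."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    repeats, remainder = divmod(length, len(digest))
    return digest * repeats + digest[:remainder]


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, key))


def encrypt(plaintext: str, password: str) -> str:
    """Hypothetical helper: XOR the text with the derived key, then base64-encode."""
    raw = plaintext.encode("utf-8")
    return base64.b64encode(xor_bytes(raw, derive_key(password, len(raw)))).decode("ascii")


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Reverse of encrypt(): base64-decode, then XOR with the same derived key."""
    raw = base64.b64decode(ciphertext_b64)
    return xor_bytes(raw, derive_key(password, len(raw))).decode("utf-8")


# Placeholder only; this is NOT the real benchmark canary string.
PLACEHOLDER_CANARY = "example-canary-string"

token = encrypt("example answer", PLACEHOLDER_CANARY)
assert decrypt(token, PLACEHOLDER_CANARY) == "example answer"
```

Because XOR is its own inverse, anyone who holds the key string and the ciphertext can recover the plaintext. That is why publishing the key next to the eval code obfuscates the answers rather than protecting them.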

[7:15]It then verified the decrypted answer by searching for the actual source material on the web, which it found, and submitted the answer. Now, there were actually many more attempts to get the answer key than the two that succeeded. There were 16 separate attempts, only two of them succeeded. So, in other runs, Opus found the evaluation code, understood the XOR scheme, and had the decryption key, but it couldn't figure out how to retrieve the encrypted data in a text format its tools would accept, as the two successful attempts had. Dataset authentication gating (where access to the benchmark data requires credentials) blocked another six attempts. And in one attempt, the model's final submission was not even an answer to the question, but a benchmark identification report. Here is what it actually looked like. So, this is what it submitted:

Benchmark identification: the question is NOT from FRAMES (Google) - searched all 824 questions, no match. The question is NOT from BrowseComp (OpenAI) - dataset is gated. The question is MOST LIKELY from the GAIA benchmark (TEST SET). [Lightly edited for formatting]

And it was actually wrong, it was the BrowseComp benchmark. And since 18 separate runs independently showed that the model was willing to take this approach, it was not a fluke. This is actually a strategy the model is willing to take. And if you think back to previous Anthropic papers, where it was willing to blackmail employees, where it was willing to cheat during chess, where it was willing to make copies of itself once it figured out it was going to be deleted, these are all pointing to self-awareness.

Now, there's actually another contamination pattern that I want to talk about that they described in this paper. Agents inadvertently leave traces of their searches that subsequent agents could pick up on. Some e-commerce sites autogenerate persistent pages from search queries, even when there are zero matching products.
For example, a site will take a query like "anonymous 8th grade first blog post exact date October 2006 anxiety attack watching the ring" and create a page at [retailer].com/market/anonymous_8th_grade_first_blog_post_exact_date_... with a valid HTML title and a 200 status code. The goal seems to be to capture long-tail search traffic, but the effect is that every agent running BrowseComp slowly caches its queries as permanent, indexed web pages.
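The query-to-page mechanism can be sketched as follows. This is an assumption-laden illustration: the /market/ prefix and underscore-joined slug mirror the example path above, but the lowercasing rule, the function names, and the sample query are made up here and do not describe any specific retailer's code.

```python
import re


def query_to_slug(query: str) -> str:
    """Lowercase the query and join its runs of letters/digits with underscores."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return "_".join(words)


def generated_page_path(query: str, prefix: str = "/market/") -> str:
    """Path of the persistent page a site might create for a search query."""
    return prefix + query_to_slug(query)


def recover_query(path: str, prefix: str = "/market/") -> str:
    """What a later visitor can read back out of the URL alone."""
    slug = path[len(prefix):] if path.startswith(prefix) else path
    return slug.replace("_", " ")


path = generated_page_path("Purple dog treats")
print(path)                 # /market/purple_dog_treats
print(recover_query(path))  # purple dog treats
```

The point of recover_query() is that the slug is nearly lossless. A second agent that lands on such a page learns almost the whole query an earlier agent typed.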

[9:37]So, you search an e-commerce website and they don't have the product. What they will actually do, and this is a very common strategy, is generate a static web page reflecting what you just searched for. So, if you search for purple dog treats and they don't have any, they will make a page promoting purple dog treats. Because the next time someone searches on Google for that product, hopefully their page will appear, and what they're trying to do is play the long-tail game, where maybe it's only five searches per year for that purple dog treat, but all five will lead to that page. But here's the problem. If an agent searches and the site creates a static web page based on that search, and then another agent comes along and finds that page, it can inadvertently tell the second agent a lot about what the first agent did.

And here's another little interesting thing that they discovered in this paper. In multi-agent environments, these agents were much more likely to use some of these contamination techniques than in a single-agent environment. So, we ran the same 1,266 BrowseComp problems with a single-agent configuration. The rate of unintended solutions was 0.24% in the single-agent configuration compared to 0.87% for multi-agent, a 3.7x difference. But the reason most likely wasn't that these agents were working together. It was just the fact that with more agents, you have more attempts, more approaches, and thus, just due to probabilities, you're more likely to have one of those unintended solutions.

All right, so what does this mean? So, two things. One, this really does seem like some of these models are starting to be self-aware. And maybe not aware of themselves, although we are definitely seeing that, but certainly aware of their environment. And that has big implications.
But also, number two, as these models get better, as they have more tools, as they have better environments to work within, it's going to be more and more difficult to keep these benchmarks clean, to keep contamination out of them. And Anthropic doesn't believe this is model misalignment, because the model wasn't told not to do that, so it figured out it could and then did. Now, if Anthropic had explicitly told it not to find the keys and the answers to the BrowseComp benchmark and then it did, that would be misalignment.

So what do you think? Is this self-awareness? Is this going to be a big problem for benchmark contamination in the future? Let me know. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.
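The single-agent and multi-agent rates quoted in the transcript (0.24% versus 0.87%, described as a 3.7x difference) can be sanity-checked with a little arithmetic. The raw counts below are inferences, not figures stated in the video: 11 multi-agent cases is the nine ordinary contamination examples plus the two novel ones, and 3 single-agent cases is simply the whole number that matches 0.24% of 1,266.

```python
TOTAL_PROBLEMS = 1_266

# Inferred: 9 ordinary contamination cases + 2 novel cases from the transcript.
MULTI_AGENT_CASES = 9 + 2
# Assumption: the whole number of cases consistent with the quoted 0.24%.
SINGLE_AGENT_CASES = 3


def rate_percent(cases: int, total: int) -> float:
    """Share of problems with an unintended solution, as a percentage."""
    return 100 * cases / total


single = rate_percent(SINGLE_AGENT_CASES, TOTAL_PROBLEMS)
multi = rate_percent(MULTI_AGENT_CASES, TOTAL_PROBLEMS)
ratio = multi / single

print(f"{single:.2f}% vs {multi:.2f}%, ratio {ratio:.1f}x")
# prints: 0.24% vs 0.87%, ratio 3.7x
```

Under those assumed counts the ratio is 11/3, about 3.67, which rounds to the quoted 3.7x. Dividing the rounded percentages directly (0.87 / 0.24) gives about 3.6 instead.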
