[0:00]One of the very confusing things about the models right now is that they are doing so well on evals, but the economic impact seems to be dramatically behind. It's very difficult to make sense of how the model can, on the one hand, do these amazing things and then, on the other hand, repeat itself twice in some situations. An example would be, let's say you use vibe coding to do something, and then you get a bug. And then you tell the model, can you please fix the bug? And the model says, oh my God, you're so right. I have a bug. Let me go fix that. And it introduces a second bug. And then you tell it, you have this second bug. And it tells you, oh my God, how could I have done it? You're so right again, and brings back the first bug. Yeah. And you can alternate between those. It does suggest that something strange is going on. I have two possible explanations. Maybe RL training makes the models a little bit too single-minded and narrowly focused. If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance.
Transcript source
YouTube auto captions
This transcript was extracted from YouTube's auto-generated caption track.