[0:00] If you follow baseball at all, you probably know that this season has seen quite a few no hitters. In fact, the record in modern baseball history is seven no hitters in a single season, and this early into the season, we've already seen six. There is headline after headline after headline pointing out this fact, and all of them are trying to find a reason. The weight of the ball, the speed of the pitches, the sophistication of baseball analytics. All of these might contribute to the seemingly high number of no hitters, but there's another possibility. It's just random. Welcome to Data Demystified. I'm Jeff Galak, and in this episode, we're going to consider the most fundamental question in all of statistics. Is something we see that is surprising actually interesting? As in, is it surprising that there have been six no hitters in this year's baseball season? You bet. But is that interesting, or is it just something we'd expect to see, given the randomness of the world we live in? This question of course doesn't just apply to baseball, but rather applies to everything around us. Is that time we bumped into a long-forgotten friend on the street a coincidence, or is it fate? Is that long streak of black coming up on roulette just randomness, or is it a sign of what number will come up next? And bringing it home to statistics, is that correlation between two variables just random noise, or is it something we should pay attention to? In all these cases, we are trying to differentiate between signal and noise, meaning and chance, intent and randomness. Humans, all of us, are fundamentally unequipped to make that type of assessment with any type of seriousness. We succumb to a whole range of cognitive biases that make it really hard to tell the difference between something that is random and something that is meaningful. And that certainly extends to the domain of baseball.
So let's dig into this example and see if we can figure out a system for separating signal from noise. Let's start with some basics. A no hitter is when a pitcher or a team of pitchers doesn't let the other team have any hits at all during an entire game. This might not sound like much, but since 1875, nearly 150 years ago, there have only been 311 no hitters. And that's over hundreds of thousands of games played. No hitters are incredibly rare. Now, baseball has changed a lot over that time period, so to make a slightly more realistic analysis of this, I'm going to only look at games played since 1961, when the total number of games played each season became relatively constant at 2,430 games per season. In that time period, there were only 169 no hitters, and that is out of just under 150,000 games played. Over that time period, only 0.11% of all games played were no hitters. In other words, they are super rare. But how rare is it for a season to have six no hitters? Well, we can look back over this time period and plot what's known as a histogram. This graph shows how frequently we saw any number of no hitters in a season. In about 8% of seasons, we saw zero no hitters; 23% of the time, we saw one no hitter; and so on. And if we look at six no hitters, we see that that only happened 5% of the time. Being a bit more generous, we might ask how often there are at least six no hitters in a season. And that happens about 12% of the time. So having six no hitters is unusual, but it happens. But does it happen because the pitchers were particularly awesome those years, or does it happen just by chance? To answer that, we turn to simulations. But before we do that, if you could take a moment to like this video, subscribe to this channel and click that little bell icon so that you don't miss out on any new content that I put out, I'd really appreciate it. With that said, let's see if six no hitters is actually something to be amazed at.
Here's how we're going to approach this. We know that since 1961, out of about 150,000 games played, only 169 were no hitters. In other words, overall, the likelihood of a no hitter in any single game is about 0.11%. Now, those no hitters could be due to all sorts of things, like pitching ability, batter lineup, or anything else. But they could also just be flukes. As in, the pitchers could have just gotten very, very lucky. In fact, if we take that 0.11% as the likelihood that a game is a no hitter, ignoring everything else, like who the pitcher was or which team they were playing against, we can build a really simple simulation to see how many no hitters we'd expect to see in any given season. The way we do that is we use a random number generator to pick a number and use that number to indicate if a no hitter happens in a game or not. For example, we can pick a random number from 1 to 100,000. Then we can say that if the number is less than or equal to 110, we assume that the game is a no hitter. If it's anything above 110, well, then it's just a regular old game. That 110 is just my way of representing 0.11%, but scaling it to 100,000. This is now a simulation of a single game, and all we're saying for that game is that the likelihood of a no hitter is 0.11%. Well, in a typical baseball season, there are 2,430 games across all the teams, so we can repeat that step that many times. We then count how many of those games are no hitters. But that only gets us one season, so we repeat that a bunch of times. In my case, I repeated it 10,000 times. In essence, I'm simulating 10,000 seasons of baseball and seeing just how often no hitters appear. If I do that, I get this graph. This is now a histogram of the simulated number of no hitters we'd observe if we know absolutely nothing about the games, other than the fact that in 0.11% of them, we'd expect a no hitter. And here, we see that we'd expect six or more no hitters about 6% of the time.
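The simulation described here can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the episode: the function names, the seed, and the use of Python's standard `random` module are all assumptions. Only the numbers (1 to 100,000, the cutoff of 110, 2,430 games, 10,000 seasons) come from the description above.

```python
import random
from collections import Counter

GAMES_PER_SEASON = 2430   # total games per season, as stated in the episode
DRAW_RANGE = 100_000      # random numbers are drawn from 1 to 100,000
NO_HITTER_CUTOFF = 110    # 110 out of 100,000 represents 0.11%


def simulate_game(rng):
    """Return True if this simulated game counts as a no hitter."""
    return rng.randint(1, DRAW_RANGE) <= NO_HITTER_CUTOFF


def simulate_season(rng, n_games=GAMES_PER_SEASON):
    """Count the no hitters in one simulated season."""
    return sum(simulate_game(rng) for _ in range(n_games))


def simulate_seasons(n_seasons=10_000, seed=0):
    """Return the no-hitter count of each simulated season (seed is arbitrary)."""
    rng = random.Random(seed)
    return [simulate_season(rng) for _ in range(n_seasons)]


def histogram(counts):
    """Share of seasons for each observed number of no hitters."""
    total = len(counts)
    return {k: n / total for k, n in sorted(Counter(counts).items())}


def share_with_at_least(counts, threshold=6):
    """Share of seasons with at least `threshold` no hitters."""
    return sum(c >= threshold for c in counts) / len(counts)
```

Calling `simulate_seasons()` and passing the result to `histogram` and `share_with_at_least` gives the simulated distribution and the share of seasons with six or more no hitters. In pure Python the full 10,000 seasons take a little while, so a smaller `n_seasons` is handy for a quick look.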
So yes, they are quite rare, but even if we know nothing at all about the games, even if we assume that the no hitters are entirely due to chance, with no skill involved at all, we'd still expect to see six or more no hitters a few times. In other words, it's unfair to immediately conclude that an extreme outcome like six no hitters in a single season must be due to some interesting explanation, like the size of a baseball or the speed of pitches. Instead, it might be due to chance. Let me say that again because it's so important to understand. In my simulation, there's nothing about pitcher quality or batting ability or any fancy analytics. It's all just random, and when it's all random, we expect to see no hitters with this frequency here. This graph could be about any rare event, not just no hitters. This could be the number of times someone is struck by lightning in a year or how many times someone wins on a scratch-off ticket in their lifetime. The point is that randomness would get us here, and it has nothing at all to do with the skill of pitchers. But if you were paying close attention, you probably noticed that in reality, about 12% of all seasons had six or more no hitters, whereas in my simulation, only about 6% of all seasons had six or more no hitters. So is the difference between those two the skill these news stories are talking about? Well, this is where things get harder. In reality, we only have 61 seasons since 1961 to look at. In the world of randomness, that's just not enough to draw any meaningful conclusions. Yes, there were more six and seven no hitter seasons in reality than in my simulation. But if I ran a bunch of simulations with only 61 seasons each, rather than the 10,000 that I ran, I'd also find simulations where more than 12% of the seasons had six or seven no hitters. When we deal with small samples like 61, randomness does all sorts of crazy things. To overcome this, we can turn to statistics.
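To get a feel for how much a sample of only 61 seasons can wobble, here is a rough sketch in Python. It takes a shortcut that is an assumption of this sketch and not something described in the episode: instead of simulating every game, it computes the chance-only probability of a season with six or more no hitters from the binomial formula, then simulates many batches of 61 seasons. The function names, the batch count, and the seed are all hypothetical.

```python
import math
import random

GAMES_PER_SEASON = 2430   # games per season, as stated in the episode
P_NO_HITTER = 0.0011      # 0.11% chance that any single game is a no hitter


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def batch_shares(n_batches=10_000, n_seasons=61, seed=0):
    """For each batch of seasons, the share that had six or more no hitters."""
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    p_big = prob_at_least(6, GAMES_PER_SEASON, P_NO_HITTER)
    shares = []
    for _ in range(n_batches):
        big_seasons = sum(rng.random() < p_big for _ in range(n_seasons))
        shares.append(big_seasons / n_seasons)
    return shares


shares = batch_shares()
# Share of 61-season batches in which at least 12% of seasons hit six or more.
print(sum(s >= 0.12 for s in shares) / len(shares))
```

Any batch that reaches the 12% mark does so purely by chance, since nothing in this sketch knows anything about pitchers or batters.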
Now, I won't get into the details of the exact statistical test I'm going to run. But basically, I'm going to test two cases. On the one hand, we might think that the shapes, the distributions represented in these two graphs come from more or less the same underlying data. On the other hand, we might conclude that they don't, and remember that one of those graphs represents randomness. We know that to be true because I created those data with my simulation. As in, we'd expect to see this exact graph if no hitters just happened at random, albeit they happen pretty infrequently.
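The exact test is not named in the episode, so the following is only one way such a comparison might be set up, not the method actually used. This sketch assumes a Monte Carlo goodness-of-fit check: it measures how far a set of per-season counts sits from the chance-only model, then asks how often purely random sets of the same size sit at least that far away. The binning (0 through 5, then 6 or more), the chi-square distance, and every function name are assumptions.

```python
import math
import random

GAMES_PER_SEASON = 2430   # games per season, as stated in the episode
P_NO_HITTER = 0.0011      # 0.11% chance that any single game is a no hitter
TOP_BIN = 6               # seasons with 6 or more no hitters share one bin


def bin_probabilities(n_games=GAMES_PER_SEASON, p=P_NO_HITTER):
    """Chance-only probability of 0, 1, ..., 5, and 6+ no hitters in a season."""
    probs = [
        math.comb(n_games, k) * p**k * (1 - p) ** (n_games - k)
        for k in range(TOP_BIN)
    ]
    probs.append(1.0 - sum(probs))
    return probs


def chi_square(binned, probs):
    """Pearson chi-square distance between binned seasons and the model."""
    n = len(binned)
    observed = [binned.count(b) for b in range(len(probs))]
    return sum((o - n * q) ** 2 / (n * q) for o, q in zip(observed, probs))


def monte_carlo_p_value(season_counts, n_sims=2000, seed=0):
    """Share of random data sets at least as far from the model as the input."""
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    probs = bin_probabilities()
    bins = list(range(len(probs)))
    n = len(season_counts)
    actual = chi_square([min(c, TOP_BIN) for c in season_counts], probs)
    as_extreme = sum(
        chi_square(rng.choices(bins, weights=probs, k=n), probs) >= actual
        for _ in range(n_sims)
    )
    return (as_extreme + 1) / (n_sims + 1)
```

To use it, pass `monte_carlo_p_value` the list of real no-hitter counts, one number per season (those counts are not reproduced in this transcript). A large result would mean the real seasons look like what randomness alone tends to produce; a very small one would suggest something beyond chance.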
[8:12] So if our statistical test says that it's likely that these two graphs actually came from the same underlying data, we can reasonably conclude that the real data, the graph of actual no hitters per season, also comes from just randomness. In other words, we'd expect to see a graph like this one, with as many six or seven no hitter seasons as we actually do, if there really is nothing to explain, if the likelihood of a no hitter is more chance than skill. And as it turns out, that's exactly what we find. This isn't to say that pitchers don't have skill. Of course they do. And it's still certainly possible that as we get more data, as more seasons of baseball are played, our conclusion will change. But for now, with the data we have, this surprising result is just that. It's surprising, but it's not interesting. We expect to see six no hitter seasons every once in a while, and right now, we're seeing one. That's it. That's the whole story. There's really nothing else to explain. But if you're a sports journalist, I totally get that writing a story with the headline "We'd expect six no hitters in a season. Please read another article." isn't really going to cut it. But that headline is probably a lot more honest than the ones we're actually seeing. This is an incredibly important topic in statistical intuition: differentiating between signal and noise. In this case, we can't make that differentiation for baseball no hitters. In other cases, we absolutely can. What's key is that whenever you see something that is surprising, you must stop and ask yourself if that surprising thing could just be randomness. In many and perhaps most cases, it is. I plan to make many more videos on this topic because it's so important and I want to help you nail this critical idea. And if you have other interesting examples of randomness explaining what seems interesting, I would love to hear from you. So put a comment below and I'll make sure to keep the conversation going.
Finally, as always, thanks so much for watching.



