Sports Simulation Games Repository

Links, Info & Tips for sports simulation games

Another good one by Colin Wyers

Posted by lukegofannon on June 15, 2009

“The one about sample size,” by Statistically Speaking’s Colin Wyers, one of the better numbers-oriented writers around, written for The Hardball Times here:

… This is why April is the cruelest month for a baseball statistician; we know a lot of things are going on that are interesting and exciting and meaningful, but we simply don’t have the tools to suss out what’s true and what’s simply noise. All we are really left to do is throw up our hands and say, “Call us in June and we’ll see what we can do.”

There are a few tools you can use, though, if you’re not particularly concerned about being correct. The biggest one is confirmation bias. In other words, a small sample of something is valid if it says what you were already thinking to begin with. This is true to the extent that you were correct to begin with; the additional “evidence” presented by a guy getting off to a hot or cold start to April doesn’t add much to your argument. (Now, of course, a good player is more likely to have a hot start and a bad player is more likely to have a cold one, but not to the extent that a hot or cold start can tell us who is a good or bad player.)

There is, of course, another issue, that of the magnitude: The hotter or colder the start, the more likely it is to be true and not noise. But—but!—there’s something we have to remember about our measurement of magnitude. Recall that standard deviation is the square root of variance. And our basic formula for a measurement:

Measurement = True + Random + Bias

And the more observations we have, the smaller the value of random should be. And as randomness increases or decreases, so does our measurement of distance between a value and the mean. To see what I mean, look at these standard deviations of home runs per plate appearances, 1993-2008, grouped by number of plate appearances:

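[Table image from the Wyers article, not reproduced here: standard deviation of home runs per plate appearance, 1993-2008, grouped by number of plate appearances]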

Note the right-hand column: The standard deviation goes down with plate appearances. (There is still some “noise” there which could be smoothed out; consider this an illustration, rather than an actual solution.) So for someone with 100 PAs, a home run rate of .08 above average (in other words, about the rate Barry Bonds hit home runs in 2001) is five standard deviations away from the mean. We should expect to see that in only one out of every 1,744,278 cases, assuming home run rates are normally distributed. But for a player with only nine plate appearances, a home run rate of .08 above average is only two standard deviations away from the mean, which we should expect to see in about one in every 22 cases.
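The "one in N" figures above can be checked with a few lines of Python. This is a minimal sketch that assumes, as the article does, a normal distribution, and that counts deviations in either direction (a two-sided tail), which is the reading that matches the quoted odds. The function name two_sided_odds is made up for this illustration.

```python
import math

def two_sided_odds(z):
    """Return N such that a value at least z standard deviations from the
    mean (in either direction) turns up about once in N cases, assuming a
    normal distribution."""
    tail_probability = math.erfc(z / math.sqrt(2))  # P(|Z| >= z)
    return 1.0 / tail_probability

for z in (2, 5):
    print(f"{z} SD from the mean: about 1 in {two_sided_odds(z):,.0f}")
```

The same .08 gap is therefore a once-in-a-career oddity at 100 PAs and an everyday occurrence at nine PAs, purely because the standard deviation it is measured against has changed.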

So for an observation to be extreme at a small sample size, it has to be more distant from the mean than it would in a larger sample size. This is especially important to bear in mind when dealing with splits data—batting in certain lineup spots, for instance, or batter versus pitcher matchups.
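To get a feel for why the yardstick shrinks, here is a rough sketch of the chance-only part of the spread. It assumes each plate appearance is an independent trial with the same home run probability (a binomial model), and it uses a made-up league rate of 0.03 HR per PA. Both are assumptions for illustration, not figures from the article, and the standard deviations in the article's table would also include real differences in talent between hitters.

```python
import math

ASSUMED_HR_RATE = 0.03  # hypothetical league-average HR per PA, illustration only

def random_sd(pa, p=ASSUMED_HR_RATE):
    """Standard deviation of an observed HR rate over `pa` plate appearances
    that comes from chance alone, under a binomial model with rate p."""
    return math.sqrt(p * (1 - p) / pa)

for pa in (9, 100, 600):
    sd = random_sd(pa)
    # How many chance-only SDs a rate .08 above average represents at this sample size
    print(f"{pa:>4} PA: random SD = {sd:.4f}; +.08 is {0.08 / sd:.1f} SDs out")
```

The spread falls with the square root of the number of plate appearances, so quadrupling the sample only halves the noise.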

And the sky full of stars

Okay, but what if we find something dramatic – something three or four standard deviations away from the mean? That doesn’t tell us anything unless we know how many cases are under observation. From 1993-2008, there have been:

* 3,487 hitters
* 2,267 pitchers
* 611,547 unique batter-pitcher match-ups
* 14,676 player seasons for hitters (10,079 excluding pitchers hitting)
* 61,673 player months for hitters (48,138 excluding pitchers hitting)

Especially once you start splitting the data extremely fine, you should expect to see a lot of things beyond three standard deviations. The more specific the split, the more extreme cases you should expect to see.
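As a rough back-of-the-envelope check on that point, the sketch below multiplies the counts quoted above by the chance of landing three or more standard deviations from the mean. It assumes a normal distribution and independent cases, neither of which holds exactly for real baseball splits, so treat the results as orders of magnitude rather than predictions. The name expected_extremes is made up for this illustration.

```python
import math

# Counts quoted in the article for 1993-2008
CASES = {
    "hitters": 3487,
    "pitchers": 2267,
    "batter-pitcher match-ups": 611547,
    "hitter seasons": 14676,
    "hitter months": 61673,
}

def expected_extremes(n_cases, z=3.0):
    """Expected number of cases at least z SDs from the mean (either side),
    assuming normally distributed, independent cases."""
    tail_probability = math.erfc(z / math.sqrt(2))
    return n_cases * tail_probability

for label, n in CASES.items():
    print(f"{label:>26}: ~{expected_extremes(n):,.0f} cases beyond 3 SD by chance")
```

With hundreds of thousands of match-ups, a "three-sigma" result is something to expect by the hundreds, not something to be surprised by.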

Regression to the mean

As our number of observations increases, the noise goes down, and observations tend to become closer to the center of the distribution. That’s called “regression to the mean.” How much regression should we expect?

That depends on how much noise we pick up with our observations. We can measure that with our correlations, either year-to-year, intraclass, or some other way of testing self against self. The higher the correlation, the less regression to the mean we need to apply.
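One common textbook way to turn a correlation into an amount of regression is to keep a fraction r of the gap between the observation and the mean and discard the rest. The sketch below shows that idea only; it is not necessarily the method Wyers has in mind, and the function name and the numbers in the example are hypothetical.

```python
def regress_to_mean(observed, mean, r):
    """Shrink an observed rate toward the population mean.

    r is a self-against-self correlation between 0 and 1 (year-to-year,
    for example). r = 1 keeps the observation as is; r = 0 falls all the
    way back to the mean.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be between 0 and 1")
    return mean + r * (observed - mean)

# Hypothetical numbers: a hitter .08 above an assumed .03 league HR rate,
# with an assumed correlation of 0.25 at this sample size.
estimate = regress_to_mean(observed=0.11, mean=0.03, r=0.25)
print(f"regressed estimate: {estimate:.3f}")
```

A low correlation means most of what was observed is noise, so most of the gap is given back to the mean.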

But the best answer is to simply use more data. Why should we regress Albert Pujols’ April stats to the mean? We have more than 5,600 PAs that tell us that Pujols is a very good hitter; we should deny ourselves the advantage of all that extra data only when we have a very, very good reason to suspect it doesn’t matter….
