Each spring, as we start the slow buildup to Spring Training, we’re somewhat inundated with preview pieces – prospect lists, trade rumors, and player projections. While the first is a largely subjective compilation (created through a combination of statistics and expectations based on a player’s tools) and the second can be based on literally anything (“The Mets need a shortstop? Why not Derek Jeter!” is probably an idea floated somewhere on the internet non-facetiously), MLB player projection models are a somewhat rigid science. The projection models adapt from year to year, simply because we have another 162 games of information to help us see how players develop.
We see the output of many of these models – ZiPS and Steamer are commonly cited examples – on players’ FanGraphs pages. But how do these systems generate these numbers? And why should we take them seriously?
Well, lucky for all of us, I’ve created my own! Presenting:
The Haefeli Projection Algorithm (Version 1.0)
What aspects of performance are we using to project?
Thus far the model doesn’t include stolen bases, HBPs, or sacrifices, so it’s still pretty rudimentary. Granted, the latter two are generally a crapshoot, especially since sacrifice flies and bunts depend on teammates already being on base ahead of you. Right now, we’re only looking at hitting. And for the most part, there are only a handful of things batters can control:
1. Their batted ball profile* (line drives, ground balls, fly balls)
2. Their plate discipline (walks, strikeouts)
That’s about it. Other aspects – power and speed – aren’t necessarily controllable, but do tend to remain consistent. We’ll look at those later. But for now, we have our basic inputs for a projection model. I’ve written somewhat ad nauseam about Ruben Tejada in the past, and it’s far from clear what we can expect from him this year, so let’s use him as a quick test case.
Here are our inputs from Tejada’s last three seasons:
Unfortunately for him, most projection models weight recent performance more heavily than past performance. I make no exception. Using a 0.45/0.35/0.2 weighting ratio, we can put the heaviest emphasis on recent performance (45%) without ignoring that the previous two seasons (55%) are a larger sample of data.
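As a rough illustration, the weighting step could be sketched in Python like this. It assumes the weights are applied to per-season rates (rather than raw counts), which the post doesn’t spell out, and the walk rates below are made-up placeholders, not Tejada’s actual numbers. The function name `weighted_rate` is also just a label for this sketch.

```python
# Season weights quoted in the post, most recent season first.
WEIGHTS = (0.45, 0.35, 0.20)


def weighted_rate(rates, weights=WEIGHTS):
    """Blend per-season rates (most recent first) into one weighted rate."""
    if len(rates) != len(weights):
        raise ValueError("need exactly one rate per weight")
    # Dividing by the weight total keeps the result a true average
    # even if the weights don't sum to exactly 1.
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)


# Hypothetical walk rates for three seasons, most recent first.
# These are placeholders for illustration only.
example_bb_rates = [0.06, 0.08, 0.10]
blended = weighted_rate(example_bb_rates)
print(f"Weighted walk rate: {blended:.3f}")
```

Because the most recent season carries the largest weight, the blended rate sits closer to the most recent season’s figure than a plain three-year average would.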
What do we do with this information?
So far we know what Tejada’s done in the batter’s box. Next, we need to know what the results have been. To do so, I’ve taken the results of his batted ball types – hits – and divided them into hit types. Below is an example, with his fly ball data:
Note that the FB column is fly ball hits, not total fly balls. Since this is a BABIP-based model, and BABIP stands for Batting Average on Balls In Play, we need to keep the home runs separate (fortunately for our examples / unfortunately for the Mets, Ruben only has one). So, of the fly balls that land for hits in the field of play, we can figure out what percentage turn into singles, into doubles, and so on.
Rinse and repeat for line drives and ground balls.
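For anyone who wants to see that breakdown as code, here is a minimal sketch. The counts are hypothetical, chosen only so that the doubles share lands near the 54% figure quoted later in the post; they are not Tejada’s real totals, and `hit_type_shares` is an illustrative name.

```python
def hit_type_shares(singles, doubles, triples):
    """Return each hit type's share of in-play hits (home runs excluded)."""
    total = singles + doubles + triples
    if total == 0:
        # No hits of this batted-ball type: nothing to split.
        return {"1B": 0.0, "2B": 0.0, "3B": 0.0}
    return {
        "1B": singles / total,
        "2B": doubles / total,
        "3B": triples / total,
    }


# Hypothetical fly ball hit counts (placeholders, not real data).
fb_shares = hit_type_shares(singles=6, doubles=7, triples=0)
for hit_type, share in fb_shares.items():
    print(f"{hit_type}: {share:.0%}")
```

The same function would be called once per batted ball type, giving one set of shares each for fly balls, line drives, and ground balls.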
How do we end up with MLB Player Projections?
So now we have this data, which we again weight relative to its age, that we can use to take a guess at what Ruben Tejada will do in 2014. For the sake of discussion, I’ve normalized the projections to 600 plate appearances – same as the Oliver projections – but that’s a stylistic choice more than anything. A full “qualified” season is 502 PAs, and a healthy starting player should accumulate somewhere between 600-700 over the course of a season.
I’ve used league BABIP data – which generally doesn’t vary much from year to year – to convert the projected number of line drives and fly balls into hits. In this system, a fly ball that stays in the park becomes a hit roughly 12.6% of the time. Over the course of 600 plate appearances, we can project that Ruben Tejada will hit roughly 154 fly balls, and that 20 of them should fall in for hits. From the chart above, we can see that 54% of his fly ball hits (again, we’re not counting the home run) have been doubles, so we can project 9 singles and 11 doubles out of those 20 hits. By using HR/FB (home runs per fly ball) and HR/LD (same, per line drive), we can separately calculate how many of each type can be expected to leave the park.
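The fly ball arithmetic can be checked with a few lines of Python. This is only a sketch using the rounded figures quoted above: the raw product of 154 and 12.6% comes out a little under 20, presumably because both inputs are rounded, so the split below starts from the 20 hits the post quotes. It also assumes no fly ball triples, since only singles and doubles are mentioned.

```python
FB_BABIP = 0.126      # league rate for fly balls that stay in the park (from the post)
PROJECTED_FB = 154    # projected fly balls over 600 PA (from the post)
DOUBLE_SHARE = 0.54   # share of fly ball hits that were doubles (from the post)


def split_hits(total_hits, double_share):
    """Split in-play hits into (singles, doubles), assuming no triples."""
    doubles = round(total_hits * double_share)
    singles = total_hits - doubles
    return singles, doubles


# Raw expected hits from the rounded inputs; the post quotes roughly 20.
raw_expected_hits = PROJECTED_FB * FB_BABIP
print(f"Raw expected fly ball hits: {raw_expected_hits:.1f}")

# Split the 20 hits quoted in the post.
singles, doubles = split_hits(20, DOUBLE_SHARE)
print(f"Singles: {singles}, doubles: {doubles}")
```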
Where this becomes tricky, though, is with ground balls. While speed doesn’t factor much into fly balls (a runner on second is just as out as a runner on first if the ball is caught), it factors significantly into grounders. Fast players can be expected to beat out more ground balls than slower players. Because of this, it seems conservative to assume a league-average ground ball BABIP. Los Angeles Angels outfielder Mike Trout, for example, had a .361 BABIP on ground balls in 2013; Ruben Tejada’s was only .188.
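Once every batted ball type has been converted into projected hits, the counts can be rolled up into a slash line using the standard definitions of AVG, OBP, and SLG. The sketch below assumes, as this version of the model does, that there are no HBPs or sacrifices, so at-bats are simply plate appearances minus walks. The counts passed in are hypothetical round numbers, not the actual Tejada projection.

```python
def slash_line(pa, walks, singles, doubles, triples, homers):
    """Return (AVG, OBP, SLG), assuming no HBPs or sacrifices."""
    at_bats = pa - walks  # with no HBP/sacrifices, every non-walk PA is an AB
    hits = singles + doubles + triples + homers
    total_bases = singles + 2 * doubles + 3 * triples + 4 * homers
    avg = hits / at_bats
    obp = (hits + walks) / pa
    slg = total_bases / at_bats
    return avg, obp, slg


# Hypothetical 600-PA season (placeholder counts for illustration only).
avg, obp, slg = slash_line(
    pa=600, walks=50, singles=100, doubles=30, triples=5, homers=15
)
print(f"{avg:.3f}/{obp:.3f}/{slg:.3f}")
```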
Putting it all together, I’ve projected that Ruben Tejada should hit .261/.311/.326 this year. Here’s how it compares to the projection models listed on FanGraphs:
Fairly well! I’m proud of myself. But of course, it’s easy to do with one player. Let’s test it out with a couple of others:
Here we have a nearly-perfect match of the Steamer projections.
It’s not as close to the others as the first two, but still firmly middle-of-the-pack.
Forgive me, the post is a bit math-ier than I was originally expecting. While there’s still a long way to go in developing the model (including creating a pitching model), hopefully there’s enough here for you to get an idea as to what some of these guys were thinking. Feel free to ask questions in the comments below, and I’ll do my best to answer them.