Judging Goalies: Should We Include PK Save Percentage?
I've seen a number of discussions lately about the best way to predict future goaltender performance. The analytical community showed long ago that because a goalie doesn't see very many PK shots per year, simple luck doesn't come anywhere near balancing out and a goalie's PK Sv% bounces around almost completely randomly from year to year.
From that, it was natural to infer that the penalty kill just adds noise to our measurement of goalies and that we should focus on even strength save percentage (ES Sv%) instead of total save percentage. This would also presumably remove any unfair advantage a goalie gets in total save percentage by playing for a team that doesn't take many penalties. And so it became a widespread belief that ES Sv% was the best measure of goalie talent.
I made that argument myself just a few days ago, saying that James Reimer's ES Sv% is a better predictor of his future results than his overall Sv% is. And yet when I went looking for an article that showed this directly, one that made the leap from theoretical to empirical, I couldn't find one.
I asked around, and most of the people I talked to thought they'd seen such an article, but nobody could quite put their finger on it. Finally, Kent Wilson of Flames Nation tipped me off to an article by Tom Awad. The article basically asked the question, "If I want to see how a guy's going to do next year, which of this year's numbers should I look at?"
The answer was that whether you are trying to predict a guy's overall performance or just his even strength performance, you are better off looking at his total numbers for this year than his even strength numbers.
This is consistent with the idea that PK Sv% and ES Sv% measure largely the same talent, and that the variability of the PK Sv% comes mostly from the small sample sizes. If that were the case, removing the penalty kill results would be kind of like removing the last five games of each year -- the goalie's performance in the last five games isn't reproducible from year to year, so the data doesn't have much value on its own, but it still helps improve the overall sample size.
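One way to make the sample-size argument concrete is a back-of-the-envelope reliability calculation. The sketch below assumes, as the paragraph above hypothesizes, that every shot a goalie faces measures one underlying talent, and it treats each shot as an independent coin flip. The league-average save percentage, the spread of talent between goalies, and the shot counts are illustrative guesses, not values taken from the data in this article.

```python
import math

def expected_correlation(n_shots, mean_sv=0.910, talent_sd=0.005):
    """Expected correlation between observed Sv% and true talent.

    Assumes observed = talent + binomial noise, with the noise variance
    approximated by mean_sv * (1 - mean_sv) / n_shots.
    mean_sv and talent_sd are hypothetical, illustrative values.
    """
    noise_var = mean_sv * (1.0 - mean_sv) / n_shots
    talent_var = talent_sd ** 2
    return math.sqrt(talent_var / (talent_var + noise_var))

# Hypothetical single-season workloads.
es_only = expected_correlation(1400)    # even strength shots only
all_shots = expected_correlation(1700)  # even strength plus special teams

print(f"ES only:   {es_only:.3f}")
print(f"All shots: {all_shots:.3f}")
```

Under these assumptions the larger sample always tracks talent more closely. The real question is how far the "same talent" assumption can be pushed, which is what the rest of the analysis tests.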
I wanted to figure out how much difference this really makes, and whether the answer to that question is dependent on how much of a sample we have to work with. To answer that, I started with the even strength and overall data for every goalie who has played since ES data became available in 1997-98. At the end of each season, I logged each goalie's career totals up to that point and his totals from that date forward. I could then look at how the career numbers predicted the future as a function of how many career starts the player had.
The blue curve represents how well we do at predicting a goalie's future overall Sv% by looking at his current career overall Sv%; it trends upwards because the more games he has played so far, the better we know his true talent and the better our predictions are. The red curve shows how we do by looking at his current career ES Sv% instead when we make our predictions, and the green curve is the difference between the two (red minus blue).
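For readers who want to try this kind of comparison themselves, here is a minimal sketch of the calculation. It is not the code behind the plots: the record layout, the field names, the 50-start bucket width, and the sample numbers are all made up for illustration, and grouping by buckets of career starts is just one reasonable way to get a curve as a function of experience.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined without variation
    return cov / math.sqrt(var_x * var_y)

def predictor_curves(snapshots, bucket_width=50):
    """Group end-of-season snapshots by career starts and correlate each
    predictor with future overall Sv%.

    Returns {bucket_start: (r_overall, r_es, r_es - r_overall)}.
    """
    buckets = {}
    for snap in snapshots:
        start = (snap["career_starts"] // bucket_width) * bucket_width
        buckets.setdefault(start, []).append(snap)

    curves = {}
    for start, group in sorted(buckets.items()):
        if len(group) < 3:
            continue  # too few points for a meaningful correlation
        future = [s["future_sv"] for s in group]
        r_overall = pearson([s["career_sv"] for s in group], future)
        r_es = pearson([s["career_es_sv"] for s in group], future)
        curves[start] = (r_overall, r_es, r_es - r_overall)
    return curves

# Made-up snapshots: career totals to date and totals from that date forward.
snapshots = [
    {"career_starts": 10, "career_sv": 0.905, "career_es_sv": 0.915, "future_sv": 0.910},
    {"career_starts": 20, "career_sv": 0.912, "career_es_sv": 0.918, "future_sv": 0.913},
    {"career_starts": 30, "career_sv": 0.898, "career_es_sv": 0.910, "future_sv": 0.902},
    {"career_starts": 40, "career_sv": 0.920, "career_es_sv": 0.925, "future_sv": 0.916},
]
curves = predictor_curves(snapshots)
for start, (r_overall, r_es, diff) in curves.items():
    print(start, round(r_overall, 3), round(r_es, 3), round(diff, 3))
```

Each tuple holds the correlation for career overall Sv%, the correlation for career ES Sv%, and the difference between them for one bucket of career starts.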
It turns out that until the goalie has about 150-175 starts, the two measures perform almost identically in predicting a goalie's future -- it doesn't matter whether you use career Sv% or career ES Sv% early in his career. Once a guy gets up towards 200 starts, ES Sv% does start to look like a better measure (the green curve rises above 0), although the fact that the gap closes again by 300 starts leaves me wondering if this is just a statistical quirk.
The above plot includes all of the goalies who played in the last 14 years, even the ones who didn't play much. We're trying to predict a guy's future save percentage, but if the guy only plays 6 more games, it won't really matter whether he has a 20- or 200-game history for us to look at; we'll probably lose to the randomness over that 6-game sample.
To see whether that kind of noise was what made ES Sv% look like the better predictor over large sample sizes, I filtered the data to only include guys who went on to face at least 2000 more shots and repeated the analysis. Here's what we see in that case:
Now we see a much steadier rise in predictive power as a function of games played. This is partly because we have reduced the noise, but also partly because we have introduced some selection bias: a goalie is only likely to eclipse 300 starts if he plays fairly well, and he is only likely to face 2000 more shots if he continues playing well, so the correlations get pretty strong.
However, that bias probably affects both inputs equally, so the difference between the correlations shouldn't be impacted much. And now we see that the noise wasn't causing the rise in the green curve at higher sample sizes; it was obscuring it. It's probably fair to say that ES Sv% does appear to be a better predictor than overall Sv% over large sample sizes (100+ games).
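The filter itself is simple to express. The sketch below uses a hypothetical list of snapshot records with a made-up `future_shots` field; only the 2000-shot threshold comes from the text above.

```python
def with_substantial_future(snapshots, min_future_shots=2000):
    """Keep only snapshots where the goalie went on to face at least
    min_future_shots more shots."""
    return [s for s in snapshots if s["future_shots"] >= min_future_shots]

# Made-up records for illustration.
sample = [
    {"goalie": "A", "career_starts": 20, "future_shots": 150},
    {"goalie": "B", "career_starts": 210, "future_shots": 5400},
    {"goalie": "C", "career_starts": 320, "future_shots": 2000},
]
kept = with_substantial_future(sample)
print([s["goalie"] for s in kept])
```

Goalie A drops out because a 150-shot future is mostly noise no matter how good the prediction was. This is also where the selection bias comes in, since the goalies who survive the filter are the ones who kept earning starts.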
So the overall picture then is that with small sample sizes you want to include all available data, but with large sample sizes you want to focus on the most relevant data. Tom Awad showed that overall save percentage will give the best outcomes if you are using only a single year to make your predictions. Up to about 100-150 games of career numbers, overall save percentage and even strength save percentage perform similarly. And in the long run, after 150+ games, even strength save percentage is the better predictor of a goalie's future success.

Statistical post-script
I've chosen to look at the overall Sv% (rather than ES Sv%) as the measure of future performance because I think that's what we're trying to maximize when we pick a goalie -- ES Sv% comes into the conversation because we think it might be a better input, not because we think it's a more important output of the prediction. However, this could conceivably inflate the predictive power of overall Sv%; if a goalie is on a team that consistently takes a lot of penalties, his overall Sv% might be consistently lower than ES Sv% would predict.
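To see how team penalty rates could feed into overall Sv%, here is a small worked example. The shot counts and situational save percentages are hypothetical, and power play shots are left out to keep the arithmetic short.

```python
def overall_sv(shots_and_sv):
    """Shot-weighted overall Sv% from (shots, sv_pct) pairs, one per situation."""
    total_shots = sum(shots for shots, _ in shots_and_sv)
    total_saves = sum(shots * sv for shots, sv in shots_and_sv)
    return total_saves / total_shots

ES_SV, PK_SV = 0.920, 0.870  # hypothetical situational save percentages

# Same goalie performance in each situation, different penalty volume.
disciplined = overall_sv([(1500, ES_SV), (200, PK_SV)])
penalized = overall_sv([(1500, ES_SV), (400, PK_SV)])
print(f"{disciplined:.4f} vs {penalized:.4f}")
```

With identical ES and PK performance, doubling the PK shot volume in this example pulls overall Sv% down from about .914 to about .909, which is the effect described above.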
As a check, I also looked at how ES Sv% and overall Sv% do when predicting future ES Sv%, and the results were almost exactly the same as when predicting future overall Sv%. So I'm not worried about this as a possible confounding factor. Here's the analogous plot to the first one above:
Comments
Wow, the difference between the filtered and unfiltered results is amazing. That’s awesome.
Lightning strikes once, Hextall strikes twice!
"I think there is virtue in pissing off idiots." - Fehr and Balanced
Even I thought about that bump at around 200 starts. My immediate thought was, “we need a significant number of goalies with 600 starts!! It’s weather vs. climate!”
Then Eric put in a clever filter and all was right with the world once again.
I also assume that including PPSV% is completely useless, due to an extreme dearth of shots, perhaps until about 300 starts.
In overall save percentage, I included both PK and PP. I doubt the PP matters much, and that way we’re evaluating the widely-available Sv% numbers.
@BSH_EricT
Writer at Broad Street Hockey
Is this the stat that shows that we don’t need to worry about Bryz sitting on the bench for the next 8 years?
Yes. It’s also the stat that shows that we could have had the same peace of mind for just a few pennies less.
Driving Play - The Blog with Three First Lines
Follow @chasew12
Regression
Another fascinating post :-)
Two questions:
1) Have you, or anyone else, ever tried regressing future save pct. on past ES, PP, PK? Some adjustments would have to be made for repeated sampling on the same goalies (clustering, probably), and some other covariates would probably help (years in the league, etc.), but I think it would be interesting to see.
2) Is there evidence of score effects on sv%, like there seem to be on corsi/fenwick?
1) I haven’t done anything more complicated than calculate the single-variable correlations given here. I tend to shy away from deeper statistical analysis because I’m not an expert and I know there are a lot of ways to screw it up.
2) The article that comes to mind is http://www.arcticicehockey.com/2011/3/14/2041124/how-does-the-defensive-shell-work which saw a slight decrease for the trailing team, from about 7.0% when tied to about 6.5% when down by one late (I presume those numbers include missed shots).
@BSH_EricT
Writer at Broad Street Hockey
They do, and you also have this link: http://www.arcticicehockey.com/2009/10/29/1105149/shooting-percentage-by-game-state, which shows beyond just the 1-goal state.
Blueshirt Banter - Where Rangers' Fans Matter
Tracking the Rangers - Numbers don't lie. They just don't agree with you.
Twitter: RangerSmurf
"Oh, that sensible and sober* Rangers fan guy who is cool, actually" - Dominik, Lighthouse Hockey
*Statement has not been verified nor regressed
by George E. Ays on Jan 25, 2012 4:28 PM EST
So the overall picture then is that with small sample sizes you want to include all available data, but with large sample sizes you want to focus on the most relevant data.
Stratify!
GMAT verbal section question, Philadelphia sports version.
In 2015, which one of the following will prove to be a better investment?
(a) Ilya Bryzgalov's contract (b) Ryan Howard's extension (c) Mike Vick's extension (d) Greek bonds from 2009 (e) Papelbon's bloat deal
Nice work.
I was about to point out that overall would have PK frequency built in and, boom you went over that in the postscript.
Driving Play - The Blog with Three First Lines
I really enjoy this guy’s articles. I’m into stats myself.
by StevenKerwood on Jan 27, 2012 1:20 PM EST via iPhone app
Late addition, courtesy of Twitter. Matt Fenwick’s comment led to Tyler Dellow taking a look back in ’08 at how an extremely high PK Sv% in one year can predict a decline of overall Sv% in the following year.
@BSH_EricT
Writer at Broad Street Hockey
It would have been a bit disconcerting if Total Save% failed to best quantify the situation. Power plays are a big component of the game, and taking them out of consideration, as was previously done, feels like, well, like leaving out fly balls or walks when quantifying a pitcher. It’s not the best analogy; there’s probably a better one with UZR, I just can’t think of one.
It’s not that PK Sv% was taken out completely, it’s that since PK Sv% fluctuates wildly from year to year, it was better to weight it less. But like anything, if you get enough of a sample size on PK Sv% to get a baseline, it’s useful.
In other words, if you looked at Ryan Miller’s Vezina-winning year, you’ll see a 0.919 PK Sv% helping him get a 0.929 Sv%. We see the 0.929, say that’s very high, spot the 0.919 and say “that’s not indicative of his skill, so don’t evaluate him on it.”
Man-crushin' on Boucher since 1999 and Matt Calvert since May 2010
Broad Street Hockey - Makin' it look mean since 1967.
SB Nation Philly - Associate Editor
by Geoff Detweiler on Jan 29, 2012 12:44 AM EST
