Edit: the equations I initially included in the article came close to dividing the data into four equal quartiles, but I repeated the calculations with the aim of getting closer to the ideal division. The article itself remains in its original form; the new equations have been added at the bottom for future use. The change is minimal and doesn't affect any of the conclusions drawn.
I recently had some time on my hands and allowed myself to start learning a bit more about these advanced stats everyone keeps harping on about. One of the concepts I've been thinking about over the last couple of nights has been making the analysis of Corsi Rel in the context of Ozone start % a little more objective. There has been work done on this before, and this article by Eric produced a stat called Balanced Corsi Rel which essentially served that function. However, the process Eric undertook is probably a little less accessible to the casual stats person who occasionally checks how certain players have been going in terms of advanced stats. Bettman's Nightmare at Arctic Ice Hockey created some equations, which would be easy for anyone to work with, to estimate shots-for and shots-against with specific zone starts. And then there's this amazing work by Driving Play which is definitely worth a look too - but once again a bit tougher to work with for anyone who is not entirely initiated into the advanced stats world.
So what I'm doing below is not entirely unprecedented, and a lot of it is me sandboxing with advanced stats. But if you're interested, take the jump and let's see what we can find!
So the plan was this: let's find out an estimated 'expected Corsi' for a specific zone start. Once again, this has been touched on before. But I wanted to make an equation that would serve this function; that people could just plug numbers into and then compare the players they were looking at to the average NHLer.
After a few false starts, the approach I took was this:
- Take the CorsiRel and O-zone start % of every player in the league from 2007-08 to 2010-11 (from the Excel documents up on Behindthenet.ca)
- Cut off anyone who played fewer than 30 games. This is an arbitrary number, but the number itself is inconsequential. The important thing is getting rid of any players whose Ozone start % or Corsi Rel is influenced by small sample size, and I feel it served that purpose.
- Cut the sample further by removing anyone who had an Ozone start % <40% or >60% - my reasoning for this is likely less sound, but I felt that a simple linear relationship (which is what I would be applying) would be less likely to work if I included the extremes. I'm not sure whether the relationships produced below can be extrapolated into the <40 or >60 range - and if so, I'm not sure how far they can go before they become inaccurate.
- Note: this left me with 2195 results to work with. Still a decent sample.
- Construct a simple median regression line to split this sample into two. The equation of this line was y = 0.55102x - 27.6612. Remembering here that y = CorsiRel and x = O-zone start %.
- Use this equation to generate expected CorsiRels for each O-zone start % in my sample. Then evaluate how many points in my sample were above the line and how many were below. The result was that 50.8% of results were above the line and 49.2% below. Therefore it's a fairly decent divider of the sample into two even halves. If a player's CorsiRel is higher than their expected CorsiRel, they are performing in the top half of the league. If a player's CorsiRel is lower than their expected CorsiRel, they are performing in the bottom half of the league. If it is equal to their expected CorsiRel, roughly half of the league are performing better than them, and roughly half are performing worse.
- I then took it a little bit further. I took the top half and performed the same analysis, then took the bottom half and did the same. For the top half, the equation I got was y = 0.5319x - 21.602. For the bottom half, the equation I got was y = 0.4804x - 29.17. Together with the median line, these split the sample into 4 quartiles. The top (4th) quartile contained 25.6% of results, the 3rd quartile contained 25.1%, the 2nd quartile contained 24.2% and the lowest (1st) quartile contained 24.8% - so they functioned fairly effectively as quartiles, albeit not perfectly. But anyway, what this means is that if a player is performing higher than the line that defines the top quartile, they can say they're performing at a level typical of the top 25% of NHLers (at least when it comes to CorsiRel). If they're below the line that defines the bottom quartile, they are performing at a level typical of the bottom 25% of NHLers.
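For anyone who'd rather script the steps above than work in a spreadsheet, here is a rough Python sketch of the two sample cuts and the above/below check. It is only an illustration under assumptions: the `PlayerSeason` record, its field names and the four sample rows are made up, and only the cut-offs (30 games, 40-60% O-zone starts) and the median line come from the article.

```python
from dataclasses import dataclass


@dataclass
class PlayerSeason:
    # Hypothetical record layout; field names are assumptions, not Behindthenet's.
    name: str
    games: int
    oz_pct: float     # O-zone start % in whole-number form, e.g. 52.3
    corsi_rel: float


def filter_sample(rows, min_games=30, oz_low=40.0, oz_high=60.0):
    """Drop small samples and O-zone start % outside the 40-60 range."""
    return [r for r in rows
            if r.games >= min_games and oz_low <= r.oz_pct <= oz_high]


def expected_corsi_rel(oz_pct, slope=0.55102, intercept=-27.6612):
    """Median line from the article: y = 0.55102x - 27.6612."""
    return slope * oz_pct + intercept


def share_above(rows, slope=0.55102, intercept=-27.6612):
    """Fraction of rows whose CorsiRel sits above the given line."""
    above = sum(1 for r in rows
                if r.corsi_rel > expected_corsi_rel(r.oz_pct, slope, intercept))
    return above / len(rows)


# Made-up rows purely to exercise the functions.
rows = [
    PlayerSeason("A", 82, 55.0, 6.0),    # kept, above the median line
    PlayerSeason("B", 70, 45.0, -8.0),   # kept, below the median line
    PlayerSeason("C", 12, 50.0, 10.0),   # dropped: fewer than 30 games
    PlayerSeason("D", 60, 65.0, 12.0),   # dropped: O-zone start % above 60
]
sample = filter_sample(rows)
print(len(sample), share_above(sample))
```

Passing the top-half or bottom-half slope and intercept to `share_above` gives the same check for the other two lines.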
If you're confused at all, the graph below basically shows what was done:
On the x-axis is O-zone start %, on the y-axis is Corsi Rel. If a player's paired coordinate on this graph lies above the green line, they're in the top quartile. If it's below the red line, they're in the bottom quartile. If it's above the orange line, they're in the top half etc. The blue spots are all of my 2195 results.
So for example, here are the 2010-11 Flyers (little orange boxes) who played 30+ games and had Ozone start % within the range of 40-60 plotted onto the chart:
So through a simple visual look you can see who was outperforming their peers in CorsiRel based on the Ozone start % they were given.
But if you recall what I was saying at the start of this piece, my original intention here was to make an objective measure and revisit Eric's Balanced CorsiRel. So what we're going to do now is just that. The table below has the 2010-11 Flyers who fit the criteria mentioned above, their Median CorsiRel, the lower and upper bounds of the interquartile range (IQR), their CorsiRel, their quartile, and their Balanced CorsiRel (CorsiRel - median CorsiRel).
Not sure whether that's going to come out blurry so link to the image in its full non-blurry size here.
So we can see an affirmation of what has been suggested before. Jeff Carter was good. Sean O'Donnell was bad. Matt Carle does not deserve your criticism. Claude Giroux is amazing. And Nikolai Zherdev did not deserve to be benched because of Jody Shelley. For the guys that didn't fit into the 40-60% range, you can compare them to their closest peers and sort of extrapolate how they would've performed. Sort of.
My next step was going to be comparing them to the Balanced CorsiRel number in Eric's article - but it appears that his numbers were taken before the season had come to completion, and therefore we'd be operating on different Ozone start %s and Corsi Rels which really wouldn't make a decent comparison at all. Maybe that's something for later.
So I guess I'll leave you with how the current Flyers are doing on this little scale of mine (well, at least those who have played over 10 games and fall within the 40-60% O-zone start range).
And your zoomed in link is here
So once again we're looking at who's been effective at driving play forward relative to how the rest of the league tends to perform with their O-zone start %. We've got a few more players performing at top quartile level. One of them is Jody Shelley, but that might be due to the fact that his most regular linemate is Sean Couturier who would, without a doubt, be right at the top of the top quartile had I extended my analysis down into the 30s. You might be interested in how Jagr is doing, with that positive CorsiRel and 62.8% Ozone start %, so if I do happen to extrapolate my line forward a little bit, I find Jagr's Balanced CorsiRel to be 2.06 and he would be placed in the 3rd quartile (IQR for 62.8% zone start would be ~1.00-11.80). But I'd be hesitant to read too much into that, to be honest.
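The Jagr extrapolation is just the three original lines evaluated at x = 62.8, and it can be checked with a few lines of Python. The variable names are mine, and the CorsiRel printed at the end is only the value implied by the 2.06 Balanced CorsiRel quoted above, not a figure read from the data.

```python
# Original equations from the article as (slope, intercept).
LINES = {
    "lower": (0.4804, -29.17),
    "median": (0.55102, -27.6612),
    "upper": (0.5319, -21.602),
}

oz_pct = 62.8  # outside the 40-60 range the lines were built on

expected = {name: slope * oz_pct + intercept
            for name, (slope, intercept) in LINES.items()}

for name, value in expected.items():
    print(f"{name}: {value:.2f}")
# lower: 1.00
# median: 6.94
# upper: 11.80

# Balanced CorsiRel = CorsiRel - median, so the implied CorsiRel is:
implied_corsi_rel = 2.06 + expected["median"]
print(f"implied CorsiRel: {implied_corsi_rel:.2f}")
# implied CorsiRel: 9.00
```

That implied value sits between the median and the upper bound, which is where the 3rd-quartile placement comes from.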
I was thinking I could take this analysis further and try to add more layers. There is still a big confounder in all this which is quality of teammates and quality of competition - Jody Shelley makes that vividly clear. But for now I'd really like to hear some feedback. What do people think? Feel free to pick out flaws. The equations themselves could probably receive some tweaking to get us closer to the magical 25.0 : 25.0 : 25.0 : 25.0. I did some very very basic math to create those lines. But as of now we're still fairly close to that and it remains fairly straightforward - something anyone can do if they have the equations written down somewhere.
The old equations (see below for the new ones) were:
For lower bound of IQR: y = 0.4804x - 29.17
For median value: y = 0.551x - 27.66
For upper bound of IQR: y = 0.5319x - 21.602
So if you want to try it yourself:
- Take a guy's OZ%, plug it in as the x-value (in whole number form: XX.X not 0.XXX) and the y-value you get is the expected CorsiRel for that line.
- To find the quartile: compare the player's actual CorsiRel to the three expected values given by the three above equations, and use that to determine where a player lies relative to the rest of the pack
- To find the Balanced CorsiRel: subtract the median CorsiRel for that x-value from the player's actual CorsiRel
And that's it!
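If you'd rather not do it by hand, the three steps fit in one small Python function. Treat it as a sketch: the function and variable names are mine, the median coefficients use the slightly more precise version quoted earlier in the article, and the handling of a CorsiRel that lands exactly on a line (it goes into the lower quartile) is an arbitrary choice of mine.

```python
# Original equations as (slope, intercept); swap in other coefficients if you like.
OLD_LINES = {
    "lower": (0.4804, -29.17),
    "median": (0.55102, -27.6612),
    "upper": (0.5319, -21.602),
}


def rate_player(oz_pct, corsi_rel, lines=OLD_LINES):
    """Return (quartile, balanced_corsi_rel) for one player.

    oz_pct is in whole-number form (52.3, not 0.523).
    Quartile 4 is the top quartile, 1 is the bottom.
    """
    lower, median, upper = (
        slope * oz_pct + intercept
        for slope, intercept in (lines["lower"], lines["median"], lines["upper"])
    )
    if corsi_rel > upper:
        quartile = 4
    elif corsi_rel > median:
        quartile = 3
    elif corsi_rel > lower:
        quartile = 2
    else:
        quartile = 1
    return quartile, corsi_rel - median


# Hypothetical player: 50.0% O-zone starts, CorsiRel of +3.0.
quartile, balanced = rate_player(50.0, 3.0)
print(quartile, round(balanced, 2))
```

The same caveat applies as for the lines themselves: outside the 40-60% range the function will still return an answer, but it's an extrapolation.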
EDIT: NEW EQUATIONS
For lower bound of IQR: y = 0.4964x - 30.0095
For median value: y = 0.5409x - 26.9957
For upper bound of IQR: y = 0.5003x - 19.8669
These equations divide the data such that the 4th quartile contains 25.1% of results, 3rd quartile contains 25.2% of results, 2nd quartile contains 24.7% and 1st quartile contains 25.0%. A total of 50.3% of results are above the median equation.
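To put a number on how small the change is, this Python sketch compares each old line with its new version across the 40-60% range. The names are mine, and checking at whole-number O-zone start % values is just a convenience; since both versions are straight lines, the largest gap is at one end of the range anyway.

```python
# (slope, intercept) pairs copied from the article.
OLD = {
    "lower": (0.4804, -29.17),
    "median": (0.55102, -27.6612),
    "upper": (0.5319, -21.602),
}
NEW = {
    "lower": (0.4964, -30.0095),
    "median": (0.5409, -26.9957),
    "upper": (0.5003, -19.8669),
}


def max_shift(old, new, oz_values):
    """Largest absolute gap between two lines over the given x-values."""
    return max(
        abs((new[0] * x + new[1]) - (old[0] * x + old[1]))
        for x in oz_values
    )


oz_range = range(40, 61)  # 40% to 60% inclusive
shifts = {name: max_shift(OLD[name], NEW[name], oz_range) for name in OLD}

for name, shift in shifts.items():
    print(f"{name}: {shift:.2f}")
```

Every line moves by less than half a CorsiRel point anywhere in the range, so a player's Balanced CorsiRel barely changes between the two sets of equations.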