Expected Strike Difference -- A simple catcher framing metric

In this post I’ll build a simple catcher framing metric I’m calling Expected Strike Difference. The approach is similar to some of my other recent posts, like this one. Using pybaseball Statcast data, I’ll train a classifier to predict whether a pitch will be called a ball or a strike based on the location where it crosses the plate, the count, and the handedness of the hitter. Then I’ll go through every pitch caught by each catcher and total the differences between predicted strikes and called strikes. This difference theoretically represents the number of extra strikes stolen (or lost) compared to an average catcher.

The Data

First the basics.

In the Statcast coordinate system, x is inside-outside in the strike zone with the first-base side positive, y is toward-away from the catcher with the field side positive, and z is up-down with up positive.

Statcast coordinate system

Here’s what the observed strike zone looks like:

Strikezone heatmap

I normalized the pitch location so that the strike zone spans from -1 to 1 in each dimension; that’s what the “norm” in the axis labels means. For x, the unit is half plate-widths, since -1 to 1 covers the full plate width. For z, the location is scaled by the top and bottom of the batter’s strike zone.

The color represents the probability of a pitch being called a strike at each location. The called zone is a bit wider than the nominal 17 inches of the plate. Most of this is due to the width of the ball itself (about 3 inches, or roughly 0.17 plate widths), since a pitch is a strike if any part of the ball crosses the zone.

Next, here are a few fun details I found and tried to account for.

Batter handedness

I noticed that the strike zone looks a bit different for left-handed and right-handed hitters. Specifically, it seems to be offset a little bit away from the hitter. I decided to correct for this by reversing the x direction of the strike zone for lefties.

Strikezone x coordinate uncorrected for handedness Strikezone x coordinate corrected for handedness

These plots show just the x coordinate of the strike zone for lefties and righties. In the left plot, the yellow band is offset to the negative-x side for lefties compared to righties. After correcting for handedness by reversing x for lefties, this offset is mostly gone in the right plot.

Two strikes, three balls

I also noticed that umpires are less likely to call a marginal pitch a strike when there are two strikes. I’ve heard this called “omission bias,” and it’s a well-documented effect in strike calling. Basically, umps are a bit averse to calling strike three or ball four and directly ending the at-bat.

Strikezone x coordinate with and without two strikes

This plot shows a skinnier yellow band when there are two strikes and fewer than three balls, indicating that the effective strike zone is smaller in that situation.

Similarly, umpires are less likely to call a ball when there are three balls and fewer than two strikes.

Pitch movement

My intuition was that the transverse velocity (the left-right and up-down velocity) of the pitch would influence the umpire’s call. For example, I expected a pitch at the bottom of the zone to have a better chance of being called a ball if it’s breaking sharply downward. I wasn’t able to observe a clear effect like this in the data, so I left it out of this model.

The features

Ultimately, the features I used to estimate strike probability were:

  • Normalized and handedness-corrected x position when crossing the plate
    • Runs from -1.0 on the inside edge of the zone to 1.0 on the outside edge
  • Normalized z position when crossing the plate
    • Runs from -1.0 on the low edge of the zone to 1.0 on the high edge
  • Whether there are two strikes and fewer than three balls
    • Set to 1.0 if there are two strikes and fewer than three balls, 0.0 otherwise
  • Whether there are three balls and fewer than two strikes
    • Set to 1.0 if there are three balls and fewer than two strikes, 0.0 otherwise

Here’s what the data processing code looks like, if you’re curious. It uses pybaseball and Polars.

import polars as pl
from pybaseball import statcast

PLATE_WIDTH_IN = 17
PLATE_HALF_WIDTH_FT = PLATE_WIDTH_IN / 2 / 12  # half the plate width, converted from inches to feet

balls_strikes = (
    pl.from_pandas(statcast(start_dt="2023-04-01", end_dt="2024-04-01"))
    .filter(
        pl.col("description").is_in(["called_strike", "ball"])
    )
    .select(
        pl.col("fielder_2").alias("catcher_id"),
        (pl.col("description") == "called_strike").cast(int).alias("strike"),
        "plate_x", # x location at plate in feet from center
        "plate_z", # z location at plate in feet from ground
        "sz_bot", # bottom of strike zone for batter in feet from ground
        "sz_top", # top of strike zone for batter in feet from ground
        (pl.col("sz_top") - pl.col("sz_bot")).alias("sz_height"),
        (pl.col("strikes") == 2).alias("two_strikes"),
        (pl.col("balls") == 3).alias("three_balls"),
        (1 - 2 * (pl.col("stand") == "L").cast(int)).alias("handedness_factor"), # -1 for lefties, 1 for righties
    )
    .with_columns(
        # plate_z from -1 at the bottom of the zone to 1 at the top
        ((pl.col("plate_z") - pl.col("sz_bot")) / pl.col("sz_height") * 2 - 1).alias("plate_z_norm"),
        # plate_x from -1 to 1 across the plate, with x reversed for lefties
        (pl.col("plate_x") / PLATE_HALF_WIDTH_FT * pl.col("handedness_factor")).alias("plate_x_norm"),
        (pl.col("two_strikes") & ~pl.col("three_balls")).alias("two_strikes_and_not_three_balls"),
        (~pl.col("two_strikes") & pl.col("three_balls")).alias("three_balls_and_not_two_strikes"),
    )
    .drop_nulls()
)

K Nearest Neighbors

I used a method called K Nearest Neighbors (KNN) to establish the expected strike probability for each pitch. For each pitch, I look up the 25 most similar pitches. The set of similar pitches is called the “neighborhood” and (in this case) it’s determined by distance in the feature space, i.e., how close the pitches sit when plotted on a graph. For example, the neighborhood might look like this for a pitch near the inside corner for a right-handed hitter.

Inside corner neighborhood

The expected strike probability is defined as the fraction of pitches in the neighborhood that were called strikes. In the data I used, the neighborhoods are much smaller than the one depicted; there are so many pitches that the 25 nearest ones are close together.

Two of the features I used (two strikes and fewer than three balls, three balls and fewer than two strikes) take values of only 0 or 1. These have an interesting effect on the neighborhood calculation. A mismatch on one of these features contributes a full unit of distance, while neighboring pitch locations typically differ by far less, so neighborhoods almost always contain only pitches with the same values for the discrete features. This is okay though.

Essentially I have three separate classifiers here: one for 0-2, 1-2, and 2-2 counts, where umpires are more inclined to call a ball; one for 3-0 and 3-1 counts, where umpires are more inclined to call a strike; and one for all other counts.
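Here’s a minimal sketch of what fitting the model might look like with scikit-learn, assuming the balls_strikes frame built above. With KNN, predict_proba is exactly the fraction of the k nearest pitches that were called strikes:

from sklearn.neighbors import KNeighborsClassifier

FEATURES = [
    "plate_x_norm",
    "plate_z_norm",
    "two_strikes_and_not_three_balls",
    "three_balls_and_not_two_strikes",
]

X = balls_strikes.select([pl.col(c).cast(pl.Float64) for c in FEATURES]).to_numpy()
y = balls_strikes["strike"].to_numpy()

knn = KNeighborsClassifier(n_neighbors=25).fit(X, y)

# fraction of the 25 nearest pitches that were called strikes
balls_strikes = balls_strikes.with_columns(
    pl.Series("expected_strike", knn.predict_proba(X)[:, 1])
)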


Results

With this KNN method it’s easy to evaluate a catcher. For each catcher, look at the pitches they received. For each pitch, find the neighborhood and calculate the fraction of its pitches that were called strikes. Sum these fractions to get the number of expected strikes, then compare that sum to the total number of strike calls the catcher actually got.
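In code, the aggregation might look something like this (assuming the expected_strike column from the sketch above):

per_catcher = (
    balls_strikes
    .group_by("catcher_id")
    .agg(
        pl.col("strike").sum().alias("called_strikes"),
        pl.col("expected_strike").sum().alias("expected_strikes"),
        pl.len().alias("pitches"),
    )
    .with_columns(
        (pl.col("called_strikes") - pl.col("expected_strikes")).alias("strike_diff"),
    )
    .with_columns(
        (pl.col("strike_diff") / pl.col("pitches") * 100).alias("strike_diff_per_100"),
    )
    .sort("strike_diff", descending=True)
)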

The resulting stat correlates well with the FanGraphs and Baseball Savant catcher framing statistics.

Correlation with other statistics

Finally, here’s how the numbers come out for 2023:

Name    Avg expected strike difference (per 100 pitches)    Total expected strike difference
Francisco Álvarez 1.52 125.72
Patrick Bailey 1.94 124.24
Austin Hedges 2.33 118.20
Jonah Heim 0.83 77.64
Jose Trevino 1.87 72.00
Víctor Caratini 1.61 68.20
Cal Raleigh 0.77 64.00
Jake Rogers 0.87 60.76
Alejandro Kirk 0.83 58.56
William Contreras 0.70 58.24
Adley Rutschman 0.70 56.96
Seby Zavala 1.22 54.20
Kyle Higashioka 0.99 53.88
Jason Delay 1.11 47.48
Cam Gallagher 1.36 46.28
Christian Vázquez 0.69 43.68
Tucker Barnhart 1.54 40.64
Yasmani Grandal 0.66 39.64
Austin Wynns 0.95 31.72
Joey Bart 1.25 25.00
Freddy Fermin 0.58 24.64
Miguel Amaya 0.97 24.28
Ben Rortvedt 1.15 23.64
Nick Fortes 0.34 23.24
Gary Sánchez 0.50 23.00
Travis d'Arnaud 0.47 22.24
Sean Murphy 0.30 21.56
Brian Serven 2.54 19.44
Blake Sabol 0.59 19.16
René Pinto 0.73 17.80
Austin Barnes 0.43 15.68
Tomás Nido 0.93 13.16
Tyler Heineman 1.18 11.92
Mike Zunino 0.38 9.92
Ali Sánchez 6.04 9.48
Austin Wells 0.51 9.24
Reese McGuire 0.17 6.56
Danny Jansen 0.12 5.68
Roberto Pérez 2.07 5.28
Omar Narváez 0.16 5.24
Alex Jackson 2.65 5.04
Dillon Dingler 3.70 4.88
Logan Porter 0.71 4.84
Will Smith 0.05 4.16
Bo Naylor 0.08 3.80
Sandy León 0.42 3.32
Anthony Bemboom 0.76 3.28
MJ Melendez 0.55 3.20
Caleb Hamilton 2.08 3.00
Andrew Knapp 8.67 2.60
Grant Koch 3.45 2.52
Payton Henry 1.72 2.20
Israel Pineda 8.17 1.88
Carlos Narvaez 7.27 1.60
Tres Barrera 2.47 1.48
Henry Davis 0.09 0.48
Austin Allen 4.36 0.48
Chris Okey 0.69 0.40
Zack Collins 0.28 0.36
Rob Brantly 1.27 0.28
Mickey Gasper 1.41 0.24
Dom Nuñez 2.67 0.16
David Bañuelos -0.00 -0.00
Hunter Goodman -0.52 -0.12
Joe Hudson -0.61 -0.20
Chuckie Robinson -0.93 -0.28
Mark Kolozsvary -0.34 -0.48
Aramis Garcia -0.60 -0.52
Jorge Alfaro -0.16 -0.64
Tyler Cropley -0.47 -0.88
Pedro Pagés -1.40 -1.12
Manny Piña -1.06 -1.32
Jhonny Pereda -9.20 -1.84
César Salazar -0.46 -2.16
Mitch Garver -0.12 -2.20
Kyle McCann -1.28 -2.36
Drew Romo -10.21 -2.96
Korey Lee -0.17 -3.28
James McCann -0.10 -4.08
Brian O'Keefe -1.11 -4.84
Michael Pérez -3.89 -5.64
Chad Wallach -0.19 -6.48
Drew Millas -0.94 -6.68
Chadwick Tromp -1.52 -6.80
Eric Haase -0.20 -7.96
Luis Torrens -5.09 -8.24
Meibrys Viloria -4.40 -8.44
Iván Herrera -0.77 -8.52
Sam Huff -3.14 -9.68
Endy Rodríguez -0.28 -9.76
Carlos Pérez -1.34 -13.24
Curt Casali -0.66 -13.88
Gabriel Moreno -0.18 -15.28
Luis Campusano -0.50 -16.40
Matt Thaiss -0.33 -16.84
Tyler Soderstrom -1.57 -17.32
Carson Kelly -0.51 -18.32
Christian Bethancourt -0.34 -22.68
Brett Sullivan -1.48 -27.72
Yainer Diaz -0.70 -28.00
Jacob Stallings -0.56 -30.04
David Fry -3.01 -30.60
Austin Nola -1.00 -31.08
Tom Murphy -1.17 -32.20
Garrett Stubbs -1.31 -33.56
Ryan Jeffers -0.63 -35.12
Salvador Pérez -0.55 -35.52
Yan Gomes -0.52 -36.60
José Herrera -1.35 -38.56
Carlos Pérez -1.82 -45.64
Francisco Mejía -1.47 -45.76
Luke Maile -1.05 -47.04
Andrew Knizner -1.01 -50.92
Willson Contreras -0.75 -52.12
Riley Adams -1.84 -57.80
Connor Wong -0.75 -60.68
Logan O'Hoppe -1.57 -65.24
Tyler Stephenson -1.13 -69.72
Elías Díaz -1.07 -96.52
Keibert Ruiz -1.06 -97.56
Shea Langeliers -1.03 -99.08
Martín Maldonado -1.41 -126.52
J.T. Realmuto -1.18 -126.96


White Mountains 4000 Footers

I’m trying to summit all of the 4000 foot peaks in the White Mountains. Here’s how far I’ve gotten:



Expected Outs Difference -- A simple team fielding metric

In this post, I’m going to outline a simple baseball team fielding metric I thought of. I’m calling it Expected Outs Difference or EOD.

Here’s the plan:

  • Obtain a data set of balls put in play
  • Train a supervised model to predict whether a given ball will become an out based on exit velocity, launch angle, and spray angle
  • Evaluate a team’s defense by running the model against each ball they faced and comparing the predicted outs with the actual ones

This method allows me to compare how well a team’s defense does to what we’d expect based on how other teams do.

The data

I started by downloading a 2023 Statcast dataset using pybaseball. This gave me a table containing every pitch from the season.

from pybaseball import statcast
from pybaseball.datahelpers.statcast_utils import add_spray_angle

OUTS_EVENTS = ["field_out", "fielders_choice_out", "force_out", "grounded_into_double_play", "sac_fly_double_play", "triple_play", "double_play"]
START_DATE = "2023-03-01"
END_DATE = "2023-10-01"

data = add_spray_angle(statcast(start_dt=START_DATE, end_dt=END_DATE))

# filter out strikes and balls
balls_in_play = data[data["type"] == "X"]

# filter out foul outs
balls_in_play = balls_in_play[balls_in_play["spray_angle"].between(-45, 45)].copy()

# label outs
balls_in_play["out"] = balls_in_play["events"].isin(OUTS_EVENTS)

# drop rows with missing data
balls_in_play = balls_in_play.dropna(subset=["launch_angle", "launch_speed", "spray_angle"])

Statcast provides a lot of information, but what I most cared about was exit velocity, launch angle, spray angle, and play outcome.

Launch angle and spray angle diagram

Here’s a 2D histogram showing the fraction of balls in play, binned by exit velocity and launch angle, that became outs. Balls in the dark areas usually fall for hits. Balls in the yellow areas are usually outs.

Exit velocity launch angle out histogram

The model

I trained a scikit-learn KNeighborsClassifier. To make a prediction, this model looks up the eight most similar balls in play. If most of those became outs, the model classifies the ball as an out. If not, it classifies it as a non-out.

from sklearn.neighbors import KNeighborsClassifier

X = balls_in_play[["launch_speed", "launch_angle", "spray_angle"]]
y = balls_in_play["out"]
clf = KNeighborsClassifier(n_neighbors=8).fit(X, y)

balls_in_play["predicted_out"] = clf.predict(X)

This isn’t perfect. When the model classifies a ball it was trained on, it will always find that ball itself among the nearest neighbors, which biases the prediction toward the observed outcome. I don’t think this is a big issue here though.
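If you want to check how much self-matching matters, out-of-fold predictions are a quick sketch of a fix, since a held-out ball can never appear in its own neighborhood:

from sklearn.model_selection import cross_val_predict

# each ball is predicted by a model fit on the other folds
oof_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=8), X, y, cv=5)

# fraction of balls where the out-of-fold prediction agrees with the original one
print((oof_pred == balls_in_play["predicted_out"]).mean())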

Calculating EOD

Finally, I used the model to evaluate Expected Outs Difference.

import numpy as np

# create a column giving the id of the fielding team
balls_in_play["fielding_team"] = np.where(balls_in_play["inning_topbot"] == "Top", balls_in_play["home_team"], balls_in_play["away_team"])

# group by fielding team and sum outs and predicted outs
teams = balls_in_play.groupby("fielding_team").agg({"out": "sum", "predicted_out": "sum", "events": "count"})

# subtracting the predicted (expected) outs from the actual outs gives Expected Outs Difference
teams["out_diff"] = teams["out"] - teams["predicted_out"]
teams.sort_values("out_diff", ascending=False)
              out   predicted_out   events   out_diff
fielding_team
MIL           2634          2561          3947          73
CHC           2622          2578          4057          44
BAL           2719          2679          4183          40
AZ            2787          2754          4309          33
TEX           2658          2626          4072          32
TOR           2702          2672          4204          30
SEA           2609          2580          3989          29
LAD           2729          2704          4144          25
CLE           2751          2730          4231          21
ATL           2581          2566          4035          15
KC            2688          2677          4272          11
SD            2565          2558          3919          7
DET           2783          2781          4310          2
CWS           2541          2539          4002          2
PIT           2766          2764          4350          2
MIN           2673          2677          4121          -4
NYM           2610          2619          4088          -9
NYY           2771          2780          4237          -9
TB            2719          2735          4146          -16
SF            2661          2681          4178          -20
WSH           2731          2754          4367          -23
MIA           2645          2669          4139          -24
HOU           2623          2649          4061          -26
OAK           2635          2667          4234          -32
STL           2922          2962          4660          -40
PHI           2772          2821          4301          -49
LAA           2620          2685          4153          -65
BOS           2633          2698          4179          -65
CIN           2636          2717          4179          -81
COL           2864          2957          4662          -93

Evaluating EOD against other fielding metrics

This approach seems to roughly match other popular fielding metrics.

EOD vs OAA EOD vs DRS EOD vs FRV

I think my method here is most conceptually similar to OAA. Oddly, that’s the metric that EOD correlates most poorly with.



Calculating Run Expectancy Tables

Below is some simple code for building a run expectancy table based on Statcast data. A run expectancy table gives the average number of runs scored after each base/out state, from that point through the end of the inning. For example, with runners on 1st and 2nd and one out, the table gives the average number of runs that score before the inning is over.

import pandas as pd
from pybaseball import statcast


def run_expectancy(start_date: str, end_date: str) -> pd.Series:
    """
    Returns a run expectancy table based on Statcast data from `start_date` to `end_date`
    """
    pitch_data: pd.DataFrame = statcast(start_dt=start_date, end_dt=end_date)

    # create columns for whether a runner is on each base
    for base in ("1b", "2b", "3b"):
        pitch_data[base] = pitch_data[f"on_{base}"].notnull()

    pitch_data["inning_final_bat_score"] = pitch_data.groupby(
        ["game_pk", "inning", "inning_topbot"]
    )["post_bat_score"].transform("max")

    # filter down to one row per at-bat
    ab_data = pitch_data[pitch_data["pitch_number"] == 1].copy()  # copy so the assignment below doesn't warn

    ab_data["runs_after_ab"] = (
        ab_data["inning_final_bat_score"] - ab_data["bat_score"]
    )

    # group by base/out state and calculate mean runs scored after that state
    return ab_data.groupby(["outs_when_up", "1b", "2b", "3b"])["runs_after_ab"].mean()

Here’s what it looks like for 2021:

print(run_expectancy("2021-04-01", "2021-12-01"))
---
outs_when_up  1b     2b     3b   
0             False  False  False    0.507303
                            True     1.393333
                     True   False    1.135049
                            True     2.107407
              True   False  False    0.916202
                            True     1.745745
                     True   False    1.523861
                            True     2.446313
1             False  False  False    0.264921
                            True     0.958691
                     True   False    0.684807
                            True     1.409165
              True   False  False    0.534543
                            True     1.126154
                     True   False    0.923244
                            True     1.680070
2             False  False  False    0.101856
                            True     0.385488
                     True   False    0.324888
                            True     0.600758
              True   False  False    0.228621
                            True     0.493186
                     True   False    0.451022
                            True     0.825928


Deriving FIP -- Fielding Independent Pitching

If you’re a baseball fan, you’ll probably have come across a stat called FIP, or Fielding Independent Pitching. As I understand it, FIP was designed by Tom Tango to track a pitcher’s performance in a way that doesn’t depend on fielders or on which balls happen to fall between them or get caught. Because it removes these factors that are outside of a pitcher’s control, FIP is also said to be more stable year-to-year and to be a better measure of a pitcher’s skill.

I believe the theory behind this comes from work by Voros McCracken showing that pitchers don’t have much control over the fraction of balls put in play that fall for hits. This conflicts with some common-sense baseball ideas like “pitching to contact,” but let’s leave that aside for now. The fraction of balls in play that become hits is sometimes called BABIP, or Batting Average on Balls In Play. By diminishing the luck-based effects of BABIP, one can build a pitching statistic that’s more indicative of a pitcher’s true ability. That’s the idea.
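For reference, the conventional definition is

\[\mathrm{BABIP} = \frac{\mathrm{H} - \mathrm{HR}}{\mathrm{AB} - \mathrm{SO} - \mathrm{HR} + \mathrm{SF}}\]

which counts hits on balls in play (home runs excluded) over at-bats that ended with the ball in play, plus sacrifice flies.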

To accomplish this goal, FIP is based on results that pitchers have the most control over: home runs, strikeouts, walks, and hit by pitch. Also, it’s constructed to have a similar scale to ERA to make it familiar to fans. Until recently, that was about all I knew.

It’s easy to find a formula online:

\[\mathrm{FIP} = \frac{13{\mathrm{HR}} + 3(\mathrm{BB} + \mathrm{HBP}) - 2\mathrm{SO}}{\mathrm{IP}} + C_{\mathrm{FIP}}\]

where \(\mathrm{HR}\), \(\mathrm{BB}\), \(\mathrm{HBP}\), \(\mathrm{SO}\), and \(\mathrm{IP}\) are the pitcher’s home runs allowed, walks, hit batters, strikeouts, and innings pitched, respectively. \(C_{\mathrm{FIP}}\) is a constant which varies year by year and is apparently usually around three. The constant brings FIP into the same range as ERA.
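Since the formula is just arithmetic, it fits in a few lines of Python. This is only a convenience sketch; the default constant below is a placeholder, since the real constant is set each season so that league-average FIP matches league-average ERA:

def fip(hr: float, bb: float, hbp: float, so: float, ip: float, c_fip: float = 3.1) -> float:
    """FIP from counting stats. c_fip is the league constant (varies by year, roughly 3)."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + c_fip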

But where do the 13, 3, -2, and the constant come from? That’s a bit harder to find.

In this post, I’ll derive a version of FIP in a way that makes sense to me. I’ll touch on this at the end, but I don’t think this is the way it’s actually done.

My method is simple: I trained a linear regression model to predict a pitcher’s ERA from \(\mathrm{HR}\), \(\mathrm{BB} + \mathrm{HBP}\), and \(\mathrm{SO}\), each per inning pitched. FIP is just a linear combination of those terms, so it will be easy to compare the trained parameters of the linear model with the real formula.

A look at the data

I used pybaseball to download full season pitching data from 2015 to 2023.

The relationships between the raw statistics are as you’d expect. For example, here’s ERA plotted against HR and SO:

Home run ERA relationship

Home runs allowed per inning pitched have a positive relationship with ERA.

Strikeout ERA relationship

Strikeouts per inning have a negative relationship with ERA. Let’s move on.

Training a linear model

Training a linear model is pretty easy:

from pybaseball import pitching_stats
from sklearn.linear_model import LinearRegression

logs = pitching_stats(2021, 2023, qual=25)

logs["UBB"] = logs["BB"] - logs["IBB"] # unintentional walks is total walks minus intentional walks
logs["UBB_HBP"] = logs["UBB"] + logs["HBP"] # sum walks and hit by pitch

X = logs[["HR", "UBB_HBP", "SO"]].div(logs["IP"], axis=0) # normalize by innings pitched
y = logs["ERA"]
weight = logs["IP"]
fip_reg = LinearRegression().fit(X, y, sample_weight=weight)

Two things to point out here:

  1. I excluded pitchers with fewer than 25 innings pitched using the qual kwarg in the pitching_stats() call.
  2. I weighted the samples based on innings pitched using sample_weight.

The relationship between ERA and FIP looks reasonable:

ERA FIP relationship

Here are the fitted parameters:

fip_reg.coef_, fip_reg.intercept_
> (array([13.28482645,  3.2804091 , -1.78295967]), 2.8140810888761045)

So my version of FIP is

\[\mathrm{FIP} = \frac{13.3{\mathrm{HR}} + 3.3(\mathrm{BB} + \mathrm{HBP}) - 1.8\mathrm{SO}}{\mathrm{IP}} + 2.8\]

vs the regular version

\[\mathrm{FIP} = \frac{13{\mathrm{HR}} + 3(\mathrm{BB} + \mathrm{HBP}) - 2\mathrm{SO}}{\mathrm{IP}} + C_{\mathrm{FIP}}, \quad C_{\mathrm{FIP}} \approx 3\]

Maybe this is silly, but I was a little shocked at how closely they align. If you round my coefficients to the nearest whole number, the equations are identical.

Does FIP work?

If FIP works as intended, it should have more year-to-year predictive power than ERA. To test this, I compared the correlation between ERA in one year and ERA in the next year with the correlation between FIP in one year and ERA in the next year. I’m comparing

\[\mathrm{corr}\left(\mathrm{ERA}_i, \mathrm{ERA}_{i+1}\right)\]

with

\[\mathrm{corr}\left(\mathrm{FIP}_i, \mathrm{ERA}_{i+1}\right)\]
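Here’s roughly how that comparison can be computed with pybaseball (a sketch; I’m assuming FanGraphs’ IDfg column to match pitchers across seasons):

import pandas as pd
from pybaseball import pitching_stats

rows = []
for year in range(2015, 2023):
    this_year = pitching_stats(year, year, qual=25)[["IDfg", "ERA", "FIP"]]
    next_year = pitching_stats(year + 1, year + 1, qual=25)[["IDfg", "ERA"]]
    merged = this_year.merge(next_year, on="IDfg", suffixes=("", "_next"))
    rows.append({
        "year": year,
        "corr(ERA, next ERA)": merged["ERA"].corr(merged["ERA_next"]),
        "corr(FIP, next ERA)": merged["FIP"].corr(merged["ERA_next"]),
    })
print(pd.DataFrame(rows))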

Based on year-to-year correlations from 2015 to 2023, FIP does seem to have better predictive power. The plot below shows a 1D scatter plot of the year-to-next-year correlations.

Correlations

xFIP

If you’ve seen FIP, you may also have come across a related stat called xFIP (the x stands for “expected”). It’s based on a similar observation: just as pitchers don’t seem to have much control over BABIP, they also don’t seem to have much control over their home run to fly ball ratio. xFIP is just like FIP except that it swaps the pitcher’s actual home run total for an expected total computed with the league-average home run to fly ball ratio. It replaces

\[\mathrm{HR} = \mathrm{FB} \left(\mathrm{HR}/\mathrm{FB}\right)\]

with

\[\mathrm{xHR} = \mathrm{FB} \left(\overline{\mathrm{HR}/\mathrm{FB}}_{\mathrm{league}} \right)\]

This has always felt a little strange to me. Now xFIP is back to being dependent on fielders and the luck of precisely where balls land.

It’s simple to implement:

league_avg_hr_fb = logs["HR"].div(logs["FB"]).mean()
logs["xHR"] = logs["FB"] * league_avg_hr_fb

X = logs[["xHR", "UBB_HBP", "SO"]].div(logs["IP"], axis=0) # normalize by innings pitched
y = logs["ERA"]
weight = logs["IP"]
xfip_reg = LinearRegression().fit(X, y, sample_weight=weight)

The relationship between (same year) xFIP and ERA seems worse than that between FIP and ERA:

ERA xFIP relationship

Does xFIP work?

I repeated the above experiments to look at xFIP’s predictive power.

Correlations

xFIP seems to have somewhat more predictive power than FIP or ERA.

How similar is this to real FIP?

At the time of writing, I haven’t found a satisfying derivation of the FIP equation. As best I can tell, the coefficients are derived from linear weights, a run-value estimate for each possible plate appearance outcome. That topic probably deserves a post of its own. However, my method found an equation that’s very similar to the real FIP equation, so maybe it doesn’t matter how it was originally calculated. If two different methods arrive at the same answer, that ought to give us extra confidence in the result.
