Baseball by the Numbers
Last fall, I dove headfirst into the world of baseball data. I started with a simple question: how often do the Braves win in the 9th? And how often do other teams do it? As a fan, there's nothing quite like a comeback win at the last moment. It would be easy enough to answer this question once I had the data, but first I had to find it.
Luckily, baseball fans are often diligent record keepers - what's the fun of watching a game if you're not filling out your own scorecard? It turns out that several enterprising baseball fans have been publishing new and historical game data for decades, through the Retrosheet project. It's a sprawling collection of datasets that spans nearly all of America's baseball history, and exactly what I needed.
But like a lot of data, it was not in a particularly easy to use format (at least, not for me.) Luckily the Retrosheet website mentions Boxball, a collection of Docker images giving me easy access to a PostgreSQL database. That led me to this monster of a query (disclaimer: I have not optimized this for performance, because I don't know how):
select
winning_team,
count(*) as wins_after_8th
from
(
select
winning_team,
home_team,
home_score_after_eight,
visiting_team,
visitor_score_after_eight,
(
case
when (
visitor_score_after_eight > home_score_after_eight
) then visiting_team
when (
home_score_after_eight > visitor_score_after_eight
) then home_team
else 'tie'
end
) as leading_team_at_eight
from
(
select
(
case
when (home_runs_score > visitor_runs_scored) then home_team
when (visitor_runs_scored > home_runs_score) then visiting_team
else 'tie'
end
) as winning_team,
home_team,
(
select
sum(
cast(
regexp_replace(regexp_replace(A, '\(|\)|x', '', 'gi'), '', '0') as INT
)
)
from
unnest(
(
regexp_split_to_array(home_line_score, '(?![^(]*\))')
) [1:8]
) A
) as home_score_after_eight,
visiting_team,
(
select
sum(
cast(
regexp_replace(regexp_replace(A, '\(|\)|x', '', 'gi'), '', '0') as INT
)
)
from
unnest(
(
regexp_split_to_array(vistor_line_score, '(?![^(]*\))')
) [1:8]
) A
) as visitor_score_after_eight
from
retrosheet_gamelog
where
date > '2022-01-01'
) t
) d
where
winning_team != leading_team_at_eight
group by
winning_team
order by
count(*) desc;
-------------------------------
winning_team | wins_after_8th
--------------+----------------
CLE | 23
SEA | 22
NYA | 20
SLN | 19
SFN | 16
MIL | 16
PHI | 15
HOU | 14
TBA | 14
MIA | 13
TEX | 12
COL | 12
NYN | 12
SDN | 12
ARI | 11
BAL | 11
CHA | 11
TOR | 11
KCA | 10
MIN | 10
LAN | 10
CIN | 10
PIT | 9
CHN | 9
DET | 9
ANA | 8
WAS | 6
BOS | 6
OAK | 6
So the Braves stood somewhere in the middle of the pack - but Cleveland had the most since 2003, when Cincinnati had 25. Of course, neither team can match the fantastic 1999 season Atlanta had - 28, the most in at least the last 60 years.
Now that I've gotten to know the data, let's answer some other fun questions, like:
Recommended by LinkedIn
What's the biggest blowout loss in Braves history?
From here on, I'll be limiting my queries to seasons since '66, when the Braves moved to Atlanta. This query will be pretty simple - just comparing total runs and sorting:
select
*
from
(
select
(
case
when (home_runs_score > visitor_runs_scored) then visiting_team
when (visitor_runs_scored > home_runs_score) then home_team
else 'tie'
end
) as loser,
(
case
when (home_runs_score > visitor_runs_scored) then home_team
when (visitor_runs_scored > home_runs_score) then visiting_team
else 'tie'
end
) as winner,
abs(home_runs_score - visitor_runs_scored) as lost_by,
date
from
retrosheet_gamelog
where
date > '1966-01-01'
and (
home_team = 'ATL'
or visiting_team = 'ATL'
)
) d
where
loser = 'ATL'
order by
lost_by desc
limit
10;
---------------------------------------
loser | winner | lost_by | date
-------+--------+---------+------------
ATL | BAL | 21 | 1999-06-13
ATL | FLO | 19 | 2003-07-01
ATL | MON | 19 | 1978-07-30
ATL | FLO | 16 | 2003-04-05
ATL | MON | 16 | 2000-07-03
ATL | CIN | 16 | 1994-06-18
ATL | CIN | 15 | 1987-09-15
ATL | CIN | 15 | 1973-08-02
ATL | SFN | 15 | 1990-06-08
ATL | SDN | 15 | 2010-04-12
Ouch! Talk about embarrassing - losing by 21 to Baltimore in the same season the Braves would go to the World Series. Of course, that's not quite so bad as the 27 point loss those same Orioles would have against the Rangers in 2007.
What's the most wins the Braves have had in a season without going to the World Series?
This one's a little tricky - there's no indication in the gamelogs whether a game is in the regular season or postseason, and nothing to indicate when the postseason starts in a given year. So, I had to play dirty and run a query using information from outside the table (a list of the years the Braves made it to the World Series):
select
year,
count(year)
from
(
select
date_part('year', date) as year
from
retrosheet_gamelog
where
date_part('year', date) not in ('1991', '1992', '1995', '1996', '1999', '2021')
and (
(
home_team = 'ATL'
and home_runs_score > visitor_runs_scored
)
or (
visiting_team = 'ATL'
and visitor_runs_scored > home_runs_score
)
)
) d
group by
year
order by
count(year) desc;
--------------
year | count
------+-------
1998 | 111
1993 | 106
1997 | 106
2003 | 103
2002 | 103
2022 | 102
2019 | 99
2004 | 98
2013 | 97
2000 | 95
Feel free to comment below with other fun queries to run, or (I may regret asking for this) ways to improve the performance of my queries!
This is amazing