Baseball by the Numbers

Last fall, I dove headfirst into the world of baseball data. I started with a simple question: how often do the Braves win in the 9th? And how often do other teams do it? As a fan, there's nothing quite like a comeback win at the last moment. It would be easy enough to answer this question once I had the data, but first I had to find it.

Luckily, baseball fans are often diligent record keepers - what's the fun of watching a game if you're not filling out your own scorecard? It turns out that several enterprising baseball fans have been publishing new and historical game data for decades, through the Retrosheet project. It's a sprawling collection of datasets that spans nearly all of America's baseball history, and exactly what I needed.

But like a lot of data, it was not in a particularly easy to use format (at least, not for me.) Luckily the Retrosheet website mentions Boxball, a collection of Docker images giving me easy access to a PostgreSQL database. That led me to this monster of a query (disclaimer: I have not optimized this for performance, because I don't know how):

select
    winning_team,
    count(*) as wins_after_8th
from
    (
        select
            winning_team,
            home_team,
            home_score_after_eight,
            visiting_team,
            visitor_score_after_eight,
            (
                case
                    when (
                        visitor_score_after_eight > home_score_after_eight
                    ) then visiting_team
                    when (
                        home_score_after_eight > visitor_score_after_eight
                    ) then home_team
                    else 'tie'
                end
            ) as leading_team_at_eight
        from
            (
                select
                    (
                        case
                            when (home_runs_score > visitor_runs_scored) then home_team
                            when (visitor_runs_scored > home_runs_score) then visiting_team
                            else 'tie'
                        end
                    ) as winning_team,
                    home_team,
                    (
                        select
                            sum(
                                cast(
                                    regexp_replace(regexp_replace(A, '\(|\)|x', '', 'gi'), '', '0') as INT
                                )
                            )
                        from
                            unnest(
                                (
                                    regexp_split_to_array(home_line_score, '(?![^(]*\))')
                                ) [1:8]
                            ) A
                    ) as home_score_after_eight,
                    visiting_team,
                    (
                        select
                            sum(
                                cast(
                                    regexp_replace(regexp_replace(A, '\(|\)|x', '', 'gi'), '', '0') as INT
                                )
                            )
                        from
                            unnest(
                                (
                                    regexp_split_to_array(vistor_line_score, '(?![^(]*\))')
                                ) [1:8]
                            ) A
                    ) as visitor_score_after_eight
                from
                    retrosheet_gamelog
                where
                    date > '2022-01-01'
            ) t
    ) d
where
    winning_team != leading_team_at_eight
group by
    winning_team
order by
    count(*) desc;

-------------------------------
 winning_team | wins_after_8th
--------------+----------------
 CLE          |             23
 SEA          |             22
 NYA          |             20
 SLN          |             19
 SFN          |             16
 MIL          |             16
 PHI          |             15
 HOU          |             14
 TBA          |             14
 MIA          |             13
 TEX          |             12
 COL          |             12
 NYN          |             12
 SDN          |             12
 ARI          |             11
 BAL          |             11
 CHA          |             11
 TOR          |             11
 KCA          |             10
 MIN          |             10
 LAN          |             10
 CIN          |             10
 PIT          |              9
 CHN          |              9
 DET          |              9
 ANA          |              8
 WAS          |              6
 BOS          |              6
 OAK          |              6        

So the Braves stood somewhere in the middle of the pack - but Cleveland had the most since 2003, when Cincinnati had 25. Of course, neither team can match the fantastic 1999 season Atlanta had - 28, the most in at least the last 60 years.

Now that I've gotten to know the data, let's answer some other fun questions, like:

What's the biggest blowout loss in Braves history?

From here on, I'll be limiting my queries to seasons since '66, when the Braves moved to Atlanta. This query will be pretty simple - just comparing total runs and sorting:

select
    *
from
    (
        select
            (
                case
                    when (home_runs_score > visitor_runs_scored) then visiting_team
                    when (visitor_runs_scored > home_runs_score) then home_team
                    else 'tie'
                end
            ) as loser,
            (
                case
                    when (home_runs_score > visitor_runs_scored) then home_team
                    when (visitor_runs_scored > home_runs_score) then visiting_team
                    else 'tie'
                end
            ) as winner,
            abs(home_runs_score - visitor_runs_scored) as lost_by,
            date
        from
            retrosheet_gamelog
        where
            date > '1966-01-01'
            and (
                home_team = 'ATL'
                or visiting_team = 'ATL'
            )
    ) d
where
    loser = 'ATL'
order by
    lost_by desc
limit
    10;
---------------------------------------
 loser | winner | lost_by |    date
-------+--------+---------+------------
 ATL   | BAL    |      21 | 1999-06-13
 ATL   | FLO    |      19 | 2003-07-01
 ATL   | MON    |      19 | 1978-07-30
 ATL   | FLO    |      16 | 2003-04-05
 ATL   | MON    |      16 | 2000-07-03
 ATL   | CIN    |      16 | 1994-06-18
 ATL   | CIN    |      15 | 1987-09-15
 ATL   | CIN    |      15 | 1973-08-02
 ATL   | SFN    |      15 | 1990-06-08
 ATL   | SDN    |      15 | 2010-04-12        

Ouch! Talk about embarrassing - losing by 21 to Baltimore in the same season the Braves would go to the World Series. Of course, that's not quite so bad as the 27 point loss those same Orioles would have against the Rangers in 2007.

What's the most wins the Braves have had in a season without going to the World Series?

This one's a little tricky - there's no indication in the gamelogs whether a game is in the regular season or postseason, and nothing to indicate when the postseason starts in a given year. So, I had to play dirty and run a query using information from outside the table (a list of the years the Braves made it to the World Series):

select
    year,
    count(year)
from
    (
        select
            date_part('year', date) as year
        from
            retrosheet_gamelog
        where
            date_part('year', date) not in ('1991', '1992', '1995', '1996', '1999', '2021')
            and (
                (
                    home_team = 'ATL'
                    and home_runs_score > visitor_runs_scored
                )
                or (
                    visiting_team = 'ATL'
                    and visitor_runs_scored > home_runs_score
                )
            )
    ) d
group by
    year
order by
    count(year) desc;
-------------- 
year  | count
------+-------
 1998 |   111
 1993 |   106
 1997 |   106
 2003 |   103
 2002 |   103
 2022 |   102
 2019 |    99
 2004 |    98
 2013 |    97
 2000 |    95        

Feel free to comment below with other fun queries to run, or (I may regret asking for this) ways to improve the performance of my queries!

To view or add a comment, sign in

More articles by Justin Ebert

  • Implementing and Evaluating Gen-AI Coding Tools

    For sixty years computer programmers have considered our labor as something beyond what other people do. We're…

  • Simulating 3D Motion With Sine Waves

    Check out this post on my blog, where all the animations are generated dynamically and there's limited interactivity I…

  • Minecraft Mapping with Lidar Data

    A few weeks ago, I had an idea. I was going to use lidar datasets to create 3D interactive maps of parks and trails…

    2 Comments

Others also viewed

Explore content categories