Copa America through a Python Pryzm

Dmitry Zinoviev

Published Jun 20, 2016

In case you didn't notice, two major soccer championships, the American Copa America and the European UEFA Euro, leave us soccer fans little or no time to rest. Every day is full of games, and every game is a surprise.

But are all surprises created equal? Are there any results that are more unexpected than others? If you are a serious fan, you probably know the answer. But if you are not (and, admit it, most of us "go soccer" only during major international events!), you can still use your formal analytical and programming skills.

Enter statistics and Python (or any other decent programming language).

Let's have a look at the group stage of the championship, where 16 teams are arranged into four groups of four (N=4), and each team must meet (and hopefully beat) the other three teams. The total number of games in each group is N(N-1)/2=6. A team gets three points for a win, one point for a draw, and zero points for a loss. The two teams with the highest total scores in each group advance to the next round, which is beyond the scope of this exercise.
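The group arithmetic can be spelled out in a few lines. This is only a sketch: the team labels T1 to T4 and the name `pairings` are made up for illustration.

```python
import itertools

N = 4                       # teams per group
WIN, DRAW, LOSS = 3, 1, 0   # points awarded for each result

# Hypothetical team labels; any four distinct names would do.
teams = [f"T{i}" for i in range(1, N + 1)]

# Every unordered pair of teams plays exactly once.
pairings = list(itertools.combinations(teams, 2))

print(len(pairings))   # 6, that is N * (N - 1) // 2
print(pairings)
```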

Let's assume that the outcome of each game is equally likely to be "win," "loss" or "draw." In the absence of any "insider" information, this is probably the most plausible assumption. Then any six-element sequence of results is equally probable, too (and there are 3^6=729 such sequences). For each sequence we can calculate the final score in the group. For example, if all six games ended in draws, then each team got 3 (1+1+1) points, and this is the only way to arrive at the 3-3-3-3 score. If one team won all games, another won two, yet another won one, and the last team won nothing, the scores are 9, 6, 3, and 0, respectively. Since any of the teams can be the overall winner, any of the remaining three can take second place, and any of the remaining two can claim third place, there are 4*3*2=24 ways of ending up with the 9-6-3-0 score. In other words, this score is 24 times more likely than 3-3-3-3, and we should see it more often. If we don't, the original assumption is wrong, and the outcomes of the games are not random: some teams are more likely to lose, draw or win than others.
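The counting argument is easy to check with the standard library. The sketch below assumes nothing beyond the rules stated above; the names `total_sequences` and `rankings` are illustrative.

```python
import itertools

N = 4
GAMES = N * (N - 1) // 2

# Each of the six games has three possible results.
total_sequences = 3 ** GAMES

# 9-6-3-0 requires a strict ranking in which every team beats all teams
# ranked below it, so each ordering of the four teams gives one sequence.
rankings = list(itertools.permutations(range(N)))

print(total_sequences)                   # 729
print(len(rankings))                     # 24
print(len(rankings) / total_sequences)   # chance of 9-6-3-0 under the model
```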

Let's calculate the probability of every possible score and find out which scores were actually reported. All we need is to count how many sequences, out of 729, contribute to each score. Surely, we won't do this by hand.

At the top of the Python program, let's import the modules we need and define some self-explanatory constants:

import itertools, numpy as np, pandas as pd

N = 4; WIN = 3; LOSS = 0; DRAW = 1

Function itertools.product() takes several collections and calculates their Cartesian product. For example, product((0,1), (0,1), (0,1)) yields (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), and so on. In our case, each inner collection is a list of possible game outcomes (encoded as two-element tuples of points earned by each team for each outcome), and there are 6 identical inner collections in the outer collection, one per game.
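A quick way to see how product() behaves is to try it on a tiny input. The sketch below also shows that passing the same collection as separate arguments, as an unpacked tuple of copies, or through the `repeat` keyword gives the same result.

```python
import itertools

bits = (0, 1)

# Three separate arguments, one per position.
explicit = list(itertools.product(bits, bits, bits))

# The same call, built by replicating a one-element tuple and unpacking it.
unpacked = list(itertools.product(*(3 * (bits,))))

# The standard-library shortcut for repeating one collection.
repeated = list(itertools.product(bits, repeat=3))

print(explicit[:4])                       # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
print(explicit == unpacked == repeated)   # True
```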

outcomes = itertools.product(*((N * (N - 1) // 2) * ([(LOSS, WIN), (DRAW, DRAW), (WIN, LOSS)], )))

The middle "multiplication" sign * is indeed in charge of numeric multiplication. The last * replicates the inner collection of outcomes six times. The first * unpacks the argument list, so that the function product() sees six arguments instead of one. (It expects each collection to be a separate argument.) A collection of outcomes may look like this:

((0, 3), (0, 3), (3, 0), (3, 0), (1, 1), (0, 3))

This simply means that Team 1 (T1) lost to T2 and T3, but won over T4; T2 won over T3 and had a draw with T4; and T3 lost to T4. Now, we need a function that converts a collection of outcomes into final group scores.
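The reading of that sequence can be reproduced mechanically. This is a minimal sketch that assumes the games are listed in the order T1-T2, T1-T3, T1-T4, T2-T3, T2-T4, T3-T4; the helper name `describe` is hypothetical.

```python
import itertools

N = 4

# Assumed game order: (T1,T2), (T1,T3), (T1,T4), (T2,T3), (T2,T4), (T3,T4).
pairings = list(itertools.combinations(range(N), 2))

def describe(o_seq):
    """Return one human-readable line per game."""
    lines = []
    for (i, j), (pts_i, pts_j) in zip(pairings, o_seq):
        if pts_i == pts_j:
            lines.append(f"T{i + 1} drew with T{j + 1}")
        elif pts_i > pts_j:
            lines.append(f"T{i + 1} beat T{j + 1}")
        else:
            lines.append(f"T{i + 1} lost to T{j + 1}")
    return lines

example = ((0, 3), (0, 3), (3, 0), (3, 0), (1, 1), (0, 3))
for line in describe(example):
    print(line)
```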

We will first use NumPy, a powerful Python numerical engine, to create an N by N (4x4) zero matrix. The outcomes o_seq get converted to a NumPy array of scores for the first and second teams in each game:

array([[0, 0, 3, 3, 1, 0], [3, 3, 0, 0, 1, 3]])

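These scores are pasted into the upper and lower triangular parts of the matrix using NumPy smart indexing, so that u[i, j] holds the points that team i earned against team j. The lower triangle is addressed by swapping the row and column indices of the upper one, which guarantees that both halves list the games in the same order (np.tril_indices() would enumerate the cells in a different order and pair some results with the wrong teams). Finally, we calculate row sums, sort them to find out who's the worst and who's the best, and convert the result to an immutable tuple:

def outcomes2points(o_seq):
    u = np.zeros([N, N], dtype=int)
    rows, cols = np.triu_indices(N, 1)  # games (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
    first, second = np.array(o_seq).T   # points of the first and second team per game
    u[rows, cols] = first               # first team's points go above the diagonal
    u[cols, rows] = second              # second team's points go to the mirrored cell
    return tuple(sorted(u.sum(axis=1).tolist()))  # row sums are the teams' totals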

The function is applied to each possible outcome collection. Some calculated point tuples are, of course, the same.

points = [outcomes2points(outcome) for outcome in outcomes]
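As a sanity check that does not depend on NumPy indexing, the same totals can be accumulated game by game. This is a sketch rather than the article's method; the name `group_points` and the game order (T1-T2, T1-T3, T1-T4, T2-T3, T2-T4, T3-T4) are assumptions that match the example sequence shown earlier.

```python
import itertools

N = 4
pairings = list(itertools.combinations(range(N), 2))

def group_points(o_seq):
    """Sum each team's points over its three games and sort the totals."""
    totals = [0] * N
    for (i, j), (pts_i, pts_j) in zip(pairings, o_seq):
        totals[i] += pts_i
        totals[j] += pts_j
    return tuple(sorted(totals))

example = ((0, 3), (0, 3), (3, 0), (3, 0), (1, 1), (0, 3))
print(group_points(example))          # (3, 3, 4, 7)
print(group_points(((1, 1),) * 6))    # (3, 3, 3, 3)
```

Any function that converts outcomes to points should agree with this one on every sequence; a mismatch usually means that some game was credited to the wrong pair of teams.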

We'll use pandas to calculate the tuple frequencies: convert the list of points to a pandas Series, count the unique rows, and sort and normalize the frequencies. Pandas can count unique rows only if they are hashable, and mutable containers such as lists and arrays are not. That's why we had to convert them to tuples above!

probs = pd.Series(points, name="p").value_counts().sort_values() / len(points)

Here are the top and the bottom of the new Series (tuples with equal probabilities may appear in a different order):

(3, 3, 3, 3) 0.001372
(0, 5, 5, 5) 0.005487
(2, 2, 2, 9) 0.005487
(1, 1, 7, 7) 0.008230
(4, 4, 4, 4) 0.008230
Name: p, dtype: float64

(1, 3, 6, 7) 0.032922
(1, 2, 5, 7) 0.032922
(2, 4, 4, 5) 0.032922
(3, 4, 4, 6) 0.049383
(1, 4, 4, 7) 0.049383
Name: p, dtype: float64

The least likely group score is 3-3-3-3 (we already knew that!); the most likely ones are 3-4-4-6 (five won games and one draw) and 1-4-4-7 (four won games and two draws), each with a probability of about 4.9%.
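The frequencies can be cross-checked without pandas. The sketch below is an independent recount under the same equal-likelihood assumption, using `collections.Counter` and exact fractions; the names `group_points`, `counts`, and `exact` are illustrative, and the game order is assumed to be T1-T2, T1-T3, T1-T4, T2-T3, T2-T4, T3-T4.

```python
import itertools
from collections import Counter
from fractions import Fraction

N = 4
RESULTS = [(0, 3), (1, 1), (3, 0)]   # (first team, second team) points
pairings = list(itertools.combinations(range(N), 2))

def group_points(o_seq):
    """Sorted point totals of the four teams for one sequence of results."""
    totals = [0] * N
    for (i, j), (pts_i, pts_j) in zip(pairings, o_seq):
        totals[i] += pts_i
        totals[j] += pts_j
    return tuple(sorted(totals))

# Count how many of the 3**6 result sequences lead to each sorted score.
counts = Counter(
    group_points(o) for o in itertools.product(RESULTS, repeat=len(pairings))
)
total = sum(counts.values())
exact = {score: Fraction(n, total) for score, n in counts.items()}

print(total, len(counts))      # 729 40
print(exact[(3, 3, 3, 3)])     # 1/729
print(exact[(0, 3, 6, 9)])     # 8/243
```

Exact fractions make ties obvious: scores with the same count share the same fraction, whereas rounded floats only suggest it.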

The actual data is taken from the official Copa America site and arranged into a self-documented data frame:

final = pd.DataFrame({"group" : list("ABCD"), "points" : [tuple(sorted(p)) for p in ((6, 6, 4, 1), (7, 5, 4, 0), (7, 7, 3, 0), (9, 6, 3, 0))]})

(We want to be sure that the final scores are sorted in the same order as the keys in the series.) And we are ready to look up the likelihoods of these scores:

actual = final.merge(probs[final["points"]].reset_index(), on="points").set_index("group").sort_values("p")

Can't wait to see the answers?

points p
group
C (0, 3, 7, 7) 0.016461
A (1, 4, 6, 6) 0.032922
B (0, 4, 5, 7) 0.032922
D (0, 3, 6, 9) 0.032922

The result in Group C is the most improbable (p~1.6%), half as likely as the result in any of the other three groups (p~3.3%). If teams Mexico, Venezuela, Uruguay, and Jamaica played randomly, they would end up with these final scores only about once in 61 tournaments.
