Copa America through a Python Prism
In case you didn't notice, two major soccer championships, the Copa America in the Americas and the UEFA European Championship in Europe, leave us soccer fans little or no time to rest. Every day is full of games, and every game is a surprise.
But are all surprises created equal? Are there any results that are more unexpected than others? If you are a serious fan, you probably know the answer. But if you are not (and, admit it, most of us "go soccer" only during major international events!), you can still use your formal analytical and programming skills.
Enter statistics and Python (or any other decent programming language).
Let's have a look at the group stage of the championship, where 16 teams are arranged into four groups of four (N=4), and each team must meet (and hopefully beat) the other three teams. The total number of games in each group is N(N-1)/2=6. A team gets three points for a win, one point for a draw, and zero points for a loss. The two teams with the highest total scores in each group advance to the next round, which is beyond the scope of this exercise.
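As a quick illustration of the group arithmetic, the fixtures of one group can be listed with itertools.combinations. This is only a minimal sketch: the labels T1 to T4 are placeholders rather than real teams, and the names `teams` and `fixtures` are made up for the example.

```python
import itertools

N = 4  # teams per group
teams = [f"T{k}" for k in range(1, N + 1)]  # placeholder labels, not real teams

# Every unordered pair of teams meets exactly once.
fixtures = list(itertools.combinations(teams, 2))

print(len(fixtures))              # equals N * (N - 1) // 2
print(fixtures[0], fixtures[-1])  # first and last pairing
```

Listing the pairs in this order (T1-T2, T1-T3, T1-T4, T2-T3, T2-T4, T3-T4) is one natural way to number the six games of a group.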
Let's assume that the outcome of each game is equally likely to be "win," "loss" or "draw," and that the games are independent of each other. In the absence of any "insider" information, this is probably the most plausible assumption. Then any six-element sequence of results is equally probable, too (and there are 3^6=729 such sequences). For each sequence we can calculate the final score in the group. For example, if all six games ended in draws, then each team got 3 (1+1+1) points, and this is the only way for all four teams to finish with 3 points each. If one team won all its games, another won two, yet another won one, and the last team won nothing, the scores are 9, 6, 3, and 0, respectively. Since any of the four teams can be the overall winner, any of the remaining three can take second place, and any of the remaining two can claim third place, there are 4*3*2=24 ways of ending up with the 9-6-3-0 score. In other words, this score is 24 times more likely than 3-3-3-3, and we should see it more often. If we don't, the original assumption is wrong, and the outcomes of the games are not random: some teams are more likely to lose, draw or win than others.
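The two counts above can be confirmed by brute force with nothing but the standard library. The sketch below is not the approach taken in the rest of this article; it assumes the six games are played by the pairs produced by itertools.combinations, and the helper name `final_score` is invented for the illustration.

```python
import itertools
from collections import Counter

N = 4
WIN, DRAW, LOSS = 3, 1, 0
games = list(itertools.combinations(range(N), 2))   # (first, second) team indices
results = [(LOSS, WIN), (DRAW, DRAW), (WIN, LOSS)]  # points for the two sides

def final_score(sequence):
    """Return the sorted point totals for one sequence of six results."""
    totals = [0] * N
    for (a, b), (pts_a, pts_b) in zip(games, sequence):
        totals[a] += pts_a
        totals[b] += pts_b
    return tuple(sorted(totals))

counts = Counter(final_score(seq)
                 for seq in itertools.product(results, repeat=len(games)))

print(sum(counts.values()))   # 729 sequences in total
print(counts[(3, 3, 3, 3)])   # 1: all six games drawn
print(counts[(0, 3, 6, 9)])   # 24: one sequence per ordering of the teams
```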
Let's calculate the probability of every possible score and find out which scores were actually reported. All we need is to count how many of the 729 sequences contribute to each score. Surely, we won't do this by hand.
At the top of the Python program, let's import all good modules and define useful and self-explanatory constants:
import itertools, numpy as np, pandas as pd
N = 4; WIN = 3; LOSS = 0; DRAW = 1
The function itertools.product() takes several collections and calculates their Cartesian product. For example, product((0,1), (0,1), (0,1)) yields (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), and so on. In our case, each inner collection is a list of possible game outcomes (encoded as two-element tuples of points earned by each team for each outcome), and there are 6 identical inner collections in the outer collection, one per game.
outcomes = itertools.product(*((N * (N - 1) // 2) * ([(LOSS, WIN), (DRAW, DRAW), (WIN, LOSS)], )))
The middle "multiplication" sign * is indeed in charge of numeric multiplication. The last * replicates the inner collection of outcomes six times. The first * unpacks the argument list, so that the function product() sees six arguments instead of one. (It expects each collection to be a separate argument.) A collection of outcomes may look like this:
((0, 3), (0, 3), (3, 0), (3, 0), (1, 1), (0, 3))
This simply means that Team 1 (T1) lost to T2 and T3, but won over T4; T2 won over T3 and had a draw with T4; and T3 lost to T4. Now, we need a function that converts a collection of outcomes into final group scores.
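To double-check both the asterisk-heavy call and the reading of the sample, here is a small standard-library sketch. It assumes the games are listed in the order T1-T2, T1-T3, T1-T4, T2-T3, T2-T4, T3-T4 (the order implied by the example above); the names `results`, `games`, `sample` and `totals` are illustrative only.

```python
import itertools

N = 4; WIN = 3; LOSS = 0; DRAW = 1
results = [(LOSS, WIN), (DRAW, DRAW), (WIN, LOSS)]
n_games = N * (N - 1) // 2

# The star-unpacked call and the repeat= form enumerate the same sequences.
unpacked = list(itertools.product(*(n_games * (results,))))
repeated = list(itertools.product(results, repeat=n_games))
print(unpacked == repeated, len(unpacked))

# Assumed game order: every pair (i, j) with i < j, in lexicographic order.
games = list(itertools.combinations(range(N), 2))
sample = ((0, 3), (0, 3), (3, 0), (3, 0), (1, 1), (0, 3))

totals = [0] * N
for (i, j), (pts_i, pts_j) in zip(games, sample):
    totals[i] += pts_i
    totals[j] += pts_j
print(totals)  # points of T1, T2, T3, T4
```

Under that assumed order, T2 collects the most points in the sample (two wins and a draw), which agrees with the verbal description.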
We will first use NumPy, a powerful Python numerical engine, to create an N-by-N (4x4) zero matrix. The outcomes o_seq get converted to a NumPy array of scores for the first and second teams in each game:
array([[0, 0, 3, 3, 1, 0], [3, 3, 0, 0, 1, 3]])
These scores are pasted in the upper and lower triangular parts of the matrix using NumPy smart indexing and Python simultaneous assignment. Finally, we add up the points of each team, sort the totals to find out who's the worst and who's the best, and convert the result to an immutable tuple:
def outcomes2points(o_seq):
    u = np.zeros([N, N], dtype=int)
    i, j = np.triu_indices(N, 1)  # game k is played by teams i[k] and j[k]
    # u[i, j] holds the points of team i against team j, u[j, i] those of team j
    u[i, j], u[j, i] = np.array(o_seq).T
    return tuple(sorted(u.sum(axis=1).tolist()))  # row sums are the team totals
The lower triangle is addressed by swapping the row and column indices of the upper triangle, so the two halves of each game land in the mirrored cells (i, j) and (j, i). Calling np.tril_indices() instead would list the lower-triangle cells in a different order and pair some results with the wrong teams.
The function is applied to each possible outcome collection. Some calculated point tuples are, of course, the same.
points = [outcomes2points(outcome) for outcome in outcomes]
We'll use pandas to calculate the tuple frequencies: convert the list of points to a pandas Series, count the unique tuples, and sort and normalize the frequencies. Pandas can count only hashable values, which is why we had to convert the points to tuples above!
probs = (pd.Series(points).value_counts().sort_values() / len(points)).rename("p")
Here are the five least likely and the two most likely scores in the new Series (scores with equal probabilities may come out in a different order):
(3, 3, 3, 3) 0.001372
(0, 5, 5, 5) 0.005487
(2, 2, 2, 9) 0.005487
(1, 1, 7, 7) 0.008230
(4, 4, 4, 4) 0.008230
Name: p, dtype: float64
(3, 4, 4, 6) 0.049383
(1, 4, 4, 7) 0.049383
Name: p, dtype: float64
The least likely group score is 3-3-3-3 (we already knew that!). The most likely scores are 3-4-4-6 (five won games and one draw) and 1-4-4-7 (four won games and two draws), each produced by 36 of the 729 sequences.
The actual data is taken from the official Copa America site and arranged into a self-documented data frame:
final = pd.DataFrame({"group": list("ABCD"), "points": [tuple(sorted(p)) for p in ((6, 6, 4, 1), (7, 5, 4, 0), (7, 7, 3, 0), (9, 6, 3, 0))]})
(We want to be sure that the final scores are sorted in the same order as the keys in the series.) And we are ready to look up the likelihoods of these scores:
actual = final.merge(probs[final["points"]].reset_index(), on="points").set_index("group").sort_values("p")
Can't wait to see the answers?
points p
group
C (0, 3, 7, 7) 0.016461
A (1, 4, 6, 6) 0.032922
B (0, 4, 5, 7) 0.032922
D (0, 3, 6, 9) 0.032922
The result in Group C is the least probable (p~1.6%), half as likely as each of the other three results (p~3.3%, which may be listed in any order because they tie). If Mexico, Venezuela, Uruguay, and Jamaica had played randomly, they would have been less likely to end up with these final scores than the teams in any other group.