What is Data Science?

Tim Youell

Published May 21, 2021

In this article I hope to show the power of data science and machine learning, and how they can be used by individuals or companies to speed up processes and make data-driven decisions. A week ago I decided to start a mini project to demonstrate what I do and develop on a day-to-day basis. Organisations are collecting and using ever larger amounts of data during their everyday operations. From predicting what customers will buy to using image recognition to locate tumours, a data scientist's job is to use data to find patterns and help solve the problems faced by businesses in innovative and imaginative ways.

For the last few years I have been competing in Fantasy Premier League (FPL) football. The goal is to select the 15 players you hope will score the most points, and to finish as high as possible among the 8.2 million people currently competing. For the last year or two I've joked with my friends about developing a machine learning algorithm to help me decide my player picks each week, and this was the year I thought I'd finally give it a go. The aim was to build a forecasting model, in Python, to predict the optimal fantasy football squad for the final matches of the season, Gameweek 38. Anyway, how hard can it be?

The basis of FPL is as follows.

You are competing to select the best players within the available budget of a theoretical £100m to score the most points in each Gameweek. Every player has a certain value which can increase or decrease over the season depending on form and how many teams have selected that specific player.
There is a total of 38 Gameweeks, spanning the length of the Premier League season.
Each team must contain 15 players (11 starters and 4 on the bench), made up of 2 goalkeepers, 5 defenders, 5 midfielders and 3 attackers. Only specific formations may be used, meaning you can't play with no defenders, for example. You can transfer players in and out, provided you have the required funds available. You get one free transfer per Gameweek; each additional transfer costs you 4 points.
Each Gameweek you pick a captain and vice-captain. The captained player will receive double points for that Gameweek. If your captain doesn't play, double points are awarded to the vice-captain instead.
You cannot just fill your team with players from one individual club; there is a maximum of 3 players from any given Premier League club.
Based on the player's position, a player can gain points by registering goals, assists, clean sheets, saves, minutes played and bonus points. Players can lose points by scoring an own goal, getting a yellow or red card, or conceding more than 1 goal.
There are various extra features known as 'chips' which can be used to boost your squad's score. These include the wildcard, which allows you to completely reset your team (you only get to do this twice a season), and the triple captain (triple points for your captain that Gameweek).
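The squad rules above lend themselves to a small checker. The following is a minimal sketch, not code from my repository: the dictionary keys (position, club, cost) and the function names are assumptions made for illustration, and chips, formations and vice-captain substitution are left out.

```python
from collections import Counter

BUDGET_M = 100.0                                   # theoretical £100m budget
SQUAD_QUOTAS = {"GK": 2, "DEF": 5, "MID": 5, "FWD": 3}
MAX_PER_CLUB = 3
EXTRA_TRANSFER_COST = 4                            # points per transfer after the free one


def squad_problems(squad):
    """Return a list of rule violations; an empty list means the squad is legal.

    Each player is a dict with the hypothetical keys: position, club, cost (£m).
    """
    problems = []
    expected_size = sum(SQUAD_QUOTAS.values())
    if len(squad) != expected_size:
        problems.append(f"squad has {len(squad)} players, expected {expected_size}")
    positions = Counter(p["position"] for p in squad)
    for position, quota in SQUAD_QUOTAS.items():
        if positions[position] != quota:
            problems.append(f"{position}: have {positions[position]}, need {quota}")
    for club, count in Counter(p["club"] for p in squad).items():
        if count > MAX_PER_CLUB:
            problems.append(f"{club}: {count} players, maximum is {MAX_PER_CLUB}")
    total_cost = sum(p["cost"] for p in squad)
    if total_cost > BUDGET_M:
        problems.append(f"cost £{total_cost:.1f}m exceeds £{BUDGET_M:.1f}m")
    return problems


def transfer_penalty(transfers_made, free_transfers=1):
    """Points deducted for transfers beyond the free allowance."""
    return max(0, transfers_made - free_transfers) * EXTRA_TRANSFER_COST


def gameweek_score(points_by_starter, captain):
    """Sum the starters' points, counting the captain's points twice."""
    return sum(points_by_starter.values()) + points_by_starter[captain]
```

For example, making three transfers in one Gameweek with a single free transfer would cost `transfer_penalty(3)`, which is 8 points.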

The target variable (the thing we want to predict) for this project is therefore the points a player will receive. I wanted to train a model on historic football points data to predict future points.

1. Dataset build through FPL API

To access the FPL data, there is an Application Programming Interface (API) which returns a JSON structure in response to each request. The response can be easily converted into readily accessible datasets using Python, which will be used as the model's training data. Python is an open-source, object-oriented programming language popular with computer scientists due to its versatility.
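As a rough illustration of that request-and-parse step, here is a minimal sketch using only the Python standard library. It is not the function from my repository. The base URL is the FPL API's; the resource name `element-summary/<id>/` and the `history` key are assumptions about the API's layout and should be checked against a live response.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://fantasy.premierleague.com/api/"


def build_url(resource):
    """Append a resource path to the base URL, normalising slashes."""
    return BASE_URL + resource.strip("/") + "/"


def parse_history(body):
    """Decode a raw JSON response body into a list of per-Gameweek rows."""
    payload = json.loads(body.decode("utf-8"))
    # Assumption: past Gameweeks sit under a "history" key in the payload.
    return payload.get("history", [])


def fetch_player_history(player_id, timeout=10):
    """Download and parse one player's history (performs a network request)."""
    # Assumption: "element-summary/<id>/" is the per-player resource.
    url = build_url(f"element-summary/{player_id}")
    with urlopen(url, timeout=timeout) as response:
        return parse_history(response.read())
```

Keeping the parsing separate from the download means the parsing logic can be exercised on a saved response without touching the network.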

The URL for the API is 'https://fantasy.premierleague.com/api/'. Combining this with various resources, data can be retrieved such as player, team and fixture information. Utilising other URL endpoints and Python loops, a dataset can be built for every player's historic data. You cannot train a model on data that has not happened yet so the inputs must have already happened.

The above is a sample code function that collates historic data for each player. I've created new columns which merge in the statistics of the player from the Gameweek before. The new columns are denoted with a '_FPGW' suffix, which stands for From Previous GameWeek. The idea was to capture each player's form.

For instance, the column player_total_point_FPGW is the total points that a specific player achieved in the previous Gameweek. I've used a snippet of Patrick Bamford (Leeds) data to demonstrate this above.
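The previous-Gameweek columns described above could be built along these lines. This is a dependency-free sketch rather than the code in my repository (a pandas group-and-shift would do the same job); the row keys `player`, `gameweek`, `total_points` and `minutes` are hypothetical names chosen for the example.

```python
def add_previous_gameweek_features(rows, stats=("total_points", "minutes")):
    """Return copies of the rows with a <stat>_FPGW value for each stat.

    The _FPGW value is taken from the same player's preceding row once the
    rows are ordered by Gameweek. A player's first row has no predecessor,
    so its _FPGW values are None.
    """
    ordered = sorted(rows, key=lambda r: (r["player"], r["gameweek"]))
    previous_row = {}  # player -> that player's most recent earlier row
    result = []
    for row in ordered:
        prev = previous_row.get(row["player"])
        new_row = dict(row)
        for stat in stats:
            new_row[f"{stat}_FPGW"] = prev[stat] if prev is not None else None
        result.append(new_row)
        previous_row[row["player"]] = row
    return result
```

Because every `_FPGW` value comes from an earlier Gameweek, the model only ever sees inputs that were known before the match it is predicting.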

2. Data Exploration

With the dataset now built it is time to explore the total points target variable. Below is a histogram of all the points achieved by players in each Gameweek. A histogram is a great tool to use to quickly observe the distribution of the data.

As you can see, the data is massively skewed towards zero total points. This could be due to players who are registered in the FPL game but have since moved clubs or have never started a Premier League fixture this season, such as Arsenal's Mesut Ozil. To fix this, the data was filtered so that a player must have played more than 90 minutes of football over the course of the season. This is necessary because such players would otherwise create unnecessary noise in the training data, which would heavily impact the model and therefore its predictions.

The new plot, restricted to players who have actually played, looks slightly better but is still not great. This shows there is potential to transform the data to give a more 'normal' distribution.
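The minutes filter described above can be sketched in a few lines. This is an illustration under assumed row keys (`player`, `minutes`, `total_points`), not the exact code behind the plots; the counting helper simply tallies how often each points total occurs, which is what a histogram displays.

```python
from collections import Counter, defaultdict


def filter_active_players(rows, min_minutes=90):
    """Keep only rows for players with more than min_minutes across the season."""
    season_minutes = defaultdict(int)
    for row in rows:
        season_minutes[row["player"]] += row["minutes"]
    return [row for row in rows if season_minutes[row["player"]] > min_minutes]


def points_histogram(rows):
    """Count how many player-Gameweek rows fall on each points total."""
    return Counter(row["total_points"] for row in rows)
```

Comparing `points_histogram(rows)` with `points_histogram(filter_active_players(rows))` shows how much of the spike at zero comes from players who barely featured.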

Because there is a set budget, we need to spend that money wisely in our player picks. If you use the stock market as an example, we want to analyse the stocks (players in our case) to achieve the greatest Return On Investment (ROI). The aim is to find the best individual players who return a great ROI. To help us locate these players who fit this narrative this season, we can use a simple scatterplot.

Observing this plot of player cost against points, we should ideally be picking players who appear as far towards the top-left as possible. These players have the greatest ROI so far, as they score more points per million of cost. The more expensive 'premium' players can be observed to the top-right. These are your Salah, Kane, Kevin De Bruyne and Mane players, who will all set you back upwards of £10m. It's important to have these players as they most often score the most points (hence the hefty price tag), but you cannot afford too many of them within the required budget. This is why picking players from the top-left of the scatterplot is vital. We want to avoid players who are overpriced in comparison to their performance. To observe this better we can use a table sorted by ROI or, in this case, points per million.

The above table shows a snippet of the top 20 players sorted by their value. It is interesting to observe that 7 out of the top 10 in this list play for newly promoted clubs (Leeds, West Brom and Fulham). This is a great insight for next season, as players from newly promoted clubs are often priced lower because they have little Premier League experience. You should already be looking at Norwich and Watford value assets for your squad next season, for great ROIs. It is also interesting to see that 3 out of the top 4 ROI players in this table are goalkeepers. This is backed up with the pivot table below.

This table shows the average points per million for each player position. It demonstrates that this season goalkeepers and defenders hold the most value and are certainly areas where you should look to save funds. Both Martinez (Aston Villa) and Meslier (Leeds) began the season costing a mere £4.5m. Compare that with the more expensive keepers Ederson (Man City, £6m) and Alisson (Liverpool, £6m): they rank only 28th and 57th on the value table respectively, while potentially costing you £3m more in total.
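Both value tables reduce to one division followed by a sort or a group-by. A minimal sketch, assuming each player is a dict with the hypothetical keys `position`, `cost` (in £m) and `total_points`:

```python
from collections import defaultdict
from statistics import mean


def rank_by_value(players):
    """Attach points_per_million to each player and sort best value first."""
    ranked = [
        dict(p, points_per_million=p["total_points"] / p["cost"])
        for p in players
        if p["cost"] > 0  # guard against division by zero on bad rows
    ]
    return sorted(ranked, key=lambda p: p["points_per_million"], reverse=True)


def mean_value_by_position(players):
    """Average points per million for each position, as in a pivot table."""
    grouped = defaultdict(list)
    for p in rank_by_value(players):
        grouped[p["position"]].append(p["points_per_million"])
    return {position: mean(values) for position, values in grouped.items()}
```

The same pattern gives points per minute: divide by minutes played instead of cost, after dropping players below a minimum number of minutes.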

The next feature to observe is points per minute played. Similar to points per million this looks at the players who, when they play, return very consistent FPL points. I've only included players who have played more than 500 minutes of football this season.

The table shows us the following:

Gareth Bale (Tottenham) has been injured for a lot of the season but his impact is huge when he does play: on average he scores an FPL point more often than once every 10 minutes of play.
In what has now been coined 'Pep Roulette', the constant changing of the Man City starting line-up has meant that players are well rested for both Champions League nights and Premier League ties. It's been a massive factor in City doing so well this season. However, it does make it very difficult to know which players are going to play in any given Gameweek. Players such as Foden, Gundogan, Mahrez and Torres all feature in the list but do not consistently start.
As you'd expect the big 'premium' options are pretty high up here (Fernandes, Kane, Salah and Son).
Special mention to Jesse Lingard who having rarely made an appearance at Man Utd over the past few seasons has gone on loan to West Ham and has been in amazing form, as proved by his points_per_minute score.

3. Model Building

In layman's terms, an algorithm is a sequence of computer-implementable instructions or rules used to solve a problem. Machine learning allows this set of rules to be determined by 'learning' from a dataset, creating a model that can be used to predict future outputs to a degree of certainty.

'Machine learning is a data analytics technique that teaches computers to do what comes naturally to humans and animals: learn from experience.'

In this project I decided to use an eXtreme Gradient Boosting (XGBoost) regressor. It ships in its own xgboost Python package and follows the estimator interface of scikit-learn (a very useful Python package). First released in 2014, XGBoost is a versatile algorithm that has been dominating applied machine learning competitions on structured data, such as those hosted on Kaggle. The algorithm is an implementation of gradient-boosted decision trees designed for speed and performance. I won't go into the details, but XGBoost is trained to predict a target by combining the estimates of many weak, simple learners: each new decision tree is fitted to correct the errors left by the trees before it, iterating towards an optimal model for the dataset.
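To make the idea of combining weak learners concrete, here is a deliberately simplified gradient boosting sketch in plain Python. It is not XGBoost and not the model used in this project: it uses one-split decision 'stumps', squared error only, and none of XGBoost's regularisation or speed optimisations. All function names are made up for the example.

```python
def fit_stump(X, residuals):
    """Find the one feature/threshold split that best fits the residuals."""
    overall = sum(residuals) / len(residuals)
    # Fallback when no split exists: predict the mean residual everywhere.
    best_sse, best_stump = float("inf"), (0, float("inf"), overall, overall)
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for row, r in zip(X, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[feature] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = sum((r - left_mean) ** 2 for r in left)
            sse += sum((r - right_mean) ** 2 for r in right)
            if sse < best_sse:
                best_sse = sse
                best_stump = (feature, threshold, left_mean, right_mean)
    return best_stump


def predict_stump(stump, row):
    """Return the left or right leaf value depending on the split."""
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value


def fit_boosted(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [target - p for target, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, row)
            for p, row in zip(predictions, X)
        ]
    return base, stumps, learning_rate


def predict_boosted(model, row):
    """Add up the base value and every stump's scaled correction."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)
```

Each round only nudges the predictions by a fraction (the learning rate) of the stump's correction, which is why many weak learners are needed and why the combined model can end up far more accurate than any single one.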

I started by dividing the data in a 70:30 train/test split and used the following regressor and parameters as the model to train with.

Predicting on the test data gives the following R2 result and plot.

R-squared, also known as the coefficient of determination, is a scoring metric that measures how well the model's predictions fit the actual values. A score of 1 means the model explains all the variability of the target around its mean, while 0 means it does no better than always predicting the mean; on unseen test data the score can even be negative if the model does worse than that. In general, the higher the R-squared score, the better the model fits your data. A score of 0.205 is therefore not amazing. I do however feel that a high score would be relatively hard to achieve due to the inherent randomness of football as a sport. I have chosen R-squared as a metric because it is easy to understand for this specific case. If I had more time I would look at more scoring metrics, such as mean absolute error and root mean squared error.
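For reference, R-squared is one minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below computes it alongside the two other metrics mentioned, using plain Python; the function names are chosen for the example, and in practice a library such as scikit-learn provides equivalents.

```python
from math import sqrt


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot. Undefined if every actual value is identical."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average size of the errors, in the same units as the target (points)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Like MAE, but large misses are penalised more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

With actual points of 1, 2 and 3, perfect predictions score 1, always predicting the mean of 2 scores 0, and predicting 3, 2 and 1 scores below zero.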

4. Team Selection

Using the trained model, the final Gameweek of data can be put through the model to predict which players are the optimal picks for selection.

This is the team the model has predicted. The output shows which players should start and which should be benched along with their cost and predicted points (pp) for Gameweek 38. The algorithm predicts a total score of 62 for this team. Considering the average Gameweek score this season is just above 50 points, the team would do very well to achieve this number. Some points to consider:

This team comes in well below the £100m budget, which immediately shows the model needs some work. I would always aim to spend up to, or close to, the full £100m. With this in mind I would strongly consider purchasing another 'premium' option. Salah and Kane are already in the team but another wouldn't go amiss. I would strongly consider swapping McNeil for Manchester United's Bruno Fernandes, who currently ranks number 1 for total FPL points scored.
I understand why these players have been picked: the algorithm is clearly skewed towards points scored in the previous Gameweek. Every one of these players has done well over the last few matches, but I would not see some of them as long-term options if this were only halfway through the season. Højbjerg, for example, plays as a central defensive midfielder, which means he rarely picks up goals and assists.
This shows that the model certainly needs more work. Because I did this in a week of evenings after work commitments, I feel I have barely scratched the surface. I feel it is a good starting point to build on and develop further.
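Turning predicted points into a legal squad is an optimisation problem in its own right. One simple way to approach it is a greedy heuristic, sketched below. This is an illustration only and not necessarily how the team above was chosen: it is not guaranteed to find the best squad, it can leave slots unfilled if the budget runs out, and the keys `position`, `club`, `cost` and `pp` are assumed names. The demo pool is synthetic, not real FPL data.

```python
from collections import Counter

QUOTAS = {"GK": 2, "DEF": 5, "MID": 5, "FWD": 3}
BUDGET = 100.0
MAX_PER_CLUB = 3


def pick_squad(players, budget=BUDGET):
    """Greedy heuristic: highest predicted points first, skipping illegal picks."""
    squad, spent = [], 0.0
    positions, clubs = Counter(), Counter()
    for player in sorted(players, key=lambda p: p["pp"], reverse=True):
        if positions[player["position"]] >= QUOTAS[player["position"]]:
            continue  # that position is already full
        if clubs[player["club"]] >= MAX_PER_CLUB:
            continue  # too many players from this club
        if spent + player["cost"] > budget:
            continue  # cannot afford this player
        squad.append(player)
        spent += player["cost"]
        positions[player["position"]] += 1
        clubs[player["club"]] += 1
    return squad, spent


# Synthetic pool: 8 candidates per position, pricier players predicted higher.
pool = [
    {"name": f"{position}{i}", "position": position, "club": f"Club{(i + k) % 8}",
     "cost": 4.0 + 0.5 * i, "pp": 2.0 + i}
    for k, position in enumerate(QUOTAS)
    for i in range(8)
]
squad, spent = pick_squad(pool)
```

An exact approach would treat this as an integer programming problem, maximising total predicted points subject to the budget, position and club constraints, which would also address the unspent budget noted above.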

5. Evaluation

There are a huge number of things that can be done to improve the algorithm and process. For instance, more data exploration and feature engineering would be required to build a better team selector. I haven't even started to look at players' expected goals or assists. I have only used one algorithm to begin researching the task at hand. I haven't yet explored any tuning, optimisation or ensembling methods, as I simply didn't have time before getting this article out ahead of the final Gameweek of the season.

6. Conclusion

All in all, I wanted to work out how difficult it would be to build an FPL team predictor using machine learning. It is fair to say it was genuinely much more difficult to build a model than I thought it would be. There is certainly a lot more work that can be done to improve it, that's for sure. However, I started this project and article to show the power that data analytics and machine learning can have, in a relatively short space of time, in solving business problems. I hope I have achieved this outcome. I currently work at Aviva, a leading insurer in the UK, and use this sort of work to help Aviva achieve its business goals and automate processes. I'm looking forward to spending some time over the summer enhancing and improving what I already have, to begin the 2021-22 season on the front foot.

My code can be found at: https://github.com/TimYouell15/fantasy_football/tree/main

If you have any tips or improvements for this project, please don't hesitate to drop me a message. I've been a data scientist for over 3 years now but I'm always looking to further my development and learn new techniques. I also do not take responsibility if you choose to captain Westwood on the back of viewing this article and he does not score well. I am personally going to be sticking the armband on Salah.

Thank you for reading.

Tim Youell | Senior Data Scientist | Aviva Insurance

