Machine Learning in FPL 2021-22: Planning, Strategy & Execution

1. Introduction

As the new Premier League season approaches, the Machine Learning team has gone through a serious revamp to challenge my team. I have tried to be as comprehensive as possible and to address as many of the Machine's drawbacks from last season as I could, so that it does better this year. Here's a sneak peek of the first set of predictions made by the Machine.



I will now go through my thought process and the various implementation steps. Before that, here is another set of predictions from the ML model in a different view:



2. Planning

The first and most important change is that the Machine will now predict points instead of the number of returns, and will do so match by match instead of over a few gameweeks together. This handles the issues with double gameweek (DGW) and blank gameweek (BGW) calculations.
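To illustrate why per-match predictions make DGWs and BGWs straightforward, here is a minimal sketch that adds up per-fixture predictions into gameweek totals. The function name `gameweek_totals` and the record layout are assumptions for illustration, not the Machine's actual code.

```python
from collections import defaultdict


def gameweek_totals(fixture_predictions, gameweeks):
    """Sum per-fixture predicted points into one total per player per gameweek.

    A double gameweek contributes two fixtures to the same total;
    a blank gameweek contributes none and stays at 0.0.
    """
    totals = defaultdict(lambda: dict.fromkeys(gameweeks, 0.0))
    for pred in fixture_predictions:
        totals[pred["player"]][pred["gameweek"]] += pred["points"]
    return dict(totals)


# Hypothetical predictions: one fixture in GW1, two in GW2, none in GW3.
predictions = [
    {"player": "Player A", "gameweek": 1, "points": 5.0},
    {"player": "Player A", "gameweek": 2, "points": 4.5},
    {"player": "Player A", "gameweek": 2, "points": 3.0},
]
totals = gameweek_totals(predictions, gameweeks=[1, 2, 3])
```

With this layout the model itself never needs to know whether a gameweek is double or blank; only the fixture list does.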


2.1 Dataset

We'll need a dataset keyed by date, team and player, because for every match we must predict points for every player in the teams involved.

We will need each team's short term form and long term quality data.

We also need the short term form and long term quality data of each player.

We also need the form and quality of the opponent they are facing. (We could also use the form and quality of the teams they recently played, but that is not on the roadmap for now.)


2.2 Data Point

A data point will likely look like this, although probably only a subset of these features will correlate with the label:

Player, Position
Player recent form: xG, xA, G, A, FPL points
Player long term quality: xG, xA, G, A, FPL points
Team
Team recent form: xG, xGC, G, GC
Team long term quality: xG, xGC, G, GC
Opponent
Opponent recent form: xG, xGC, G, GC
Opponent long term quality: xG, xGC, G, GC
Label: that date's FPL points (based on position and goals + assists + cs + bonus)
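As a rough sketch, a data point like this could be flattened into one row per player per match. The `build_row` helper and the nested input layout are hypothetical; the column prefixes follow the feature names used later in this post (for example `recent_xg`, `team_long_xg_c`, `opponent_recent_xg`).

```python
def build_row(player, team, opponent, label=None):
    """Flatten player, team and opponent stats into a single feature row."""
    row = {
        "player": player["name"],
        "position": player["position"],
        "team": team["name"],
        "opponent": opponent["name"],
    }
    # Player stats get no prefix; team and opponent stats are prefixed.
    for prefix, source in (("", player), ("team_", team), ("opponent_", opponent)):
        for window in ("recent", "long"):
            for stat, value in source[window].items():
                row[f"{prefix}{window}_{stat}"] = value
    if label is not None:
        row["points"] = label  # that date's FPL points
    return row


# Hypothetical inputs with made-up numbers.
player = {"name": "Player A", "position": "FWD",
          "recent": {"xg": 2.1, "xa": 0.4}, "long": {"xg": 7.9, "xa": 1.8}}
team = {"name": "AAA", "recent": {"xg": 6.0, "xg_c": 3.5}, "long": {"xg": 22.0, "xg_c": 15.0}}
opponent = {"name": "BBB", "recent": {"xg": 4.0, "xg_c": 5.5}, "long": {"xg": 17.0, "xg_c": 21.0}}
row = build_row(player, team, opponent, label=6)
```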


2.3 Strategy

We'll look n gameweeks ahead - for example, four. 

The initial plan was to have 4 sets of training data and 4 models - Model 1 will look at just the next game (one game away), model 2 will look at just the game that is two games away, model 3 will look at just the game that is three games away and model 4 will look at the game that is 4 games away. 

However, during experimentation I quickly realised that this did not improve performance much and did not make much sense either. At any given time we can only make decisions based on the data we have, even if the game is not the very next one.

2.4 Data Gathering

Past Datasets Available
Player, team, position match datewise xG, xA, Goals, Assists, Bonus, FPL points (scraping from Fantasy Football Scout)
Expected data: xG and xA
Fantasy data: Cost, Mins, Goals, Total Assists, CS, GC, OG, YC, RC
Keeping data: Pen saves, Saves
Bonus data: BPS
Pen Miss: Sky sports data from Fantasy Football Scout

Team, match datewise xG, xGC, goals, goals conceded - Four seasons of data from Fantasy Football Scout

Present Datasets needed
i) Player, position, team, price - FPL API package in Python
ii) Datewise fixtures - FPL API package in Python

3. Process

3.1 Data Scraping

For every game in the PL from 2017-18 to 2020-21, I scraped match data from Fantasy Football Scout: date, teams, goals, underlying stats, expected data and result. This gave a total of 3040 data points (one per team per match) from 1520 matches.



Then, for every player who played in those games, I scraped their basic info, position, price, underlying stats, expected data and all FPL-related outcomes, again from Fantasy Football Scout. For every player who missed a penalty in the last four seasons, I scraped data from Fantasy Football Scout's Sky Sports data. This gave a total of 41934 data points, each with 25 features.



3.2 Feature Calculation

3.2.1 Team Features

For every team in every game, I calculated recent (the last 4 gameweeks) and long term (the 15 gameweeks before that) data: xG, xGC, goals scored, goals conceded and clean sheets. If a team didn't have long term data in the PL, like a promoted team, I used a lookalike team that was promoted the previous season. Here are some examples:


{'team_name': 'WOL', 'date': '05-14-2018', 'short_term': 4, 'long_term': 0, 'xg': 3.17, 'xg_c': 7.62, 'goals': 4, 'goals_c': 4, 'cs': 1, 'lookback_count': 4}
{'team_name': 'CRY', 'date': '09-23-2017', 'short_term': 4, 'long_term': 15, 'xg': 1.07, 'xg_c': 1.5, 'goals': 0, 'goals_c': 3, 'cs': 0, 'lookback_count': 1}


{'team_name': 'CHE', 'date': '01-01-2021', 'short_term': 4, 'long_term': 0, 'xg': 7.51, 'xg_c': 3.67, 'goals': 6, 'goals_c': 6, 'cs': 1, 'lookback_count': 4}
{'team_name': 'MUN', 'date': '05-23-2021', 'short_term': 3, 'long_term': 12, 'xg': 17.85, 'xg_c': 10.42, 'goals': 21, 'goals_c': 9, 'cs': 5, 'lookback_count': 12}
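The windowing above can be sketched roughly as follows. This assumes the stats are summed over each window (which is what the example totals suggest) and that `lookback_count` is the number of matches actually found; `window_totals` and `team_features` are hypothetical names.

```python
from datetime import date

STATS = ("xg", "xg_c", "goals", "goals_c", "cs")


def window_totals(matches, as_of, skip, size):
    """Sum stats over up to `size` matches before `as_of`,
    after skipping the `skip` most recent ones."""
    past = sorted(
        (m for m in matches if m["date"] < as_of),
        key=lambda m: m["date"],
        reverse=True,
    )
    window = past[skip:skip + size]
    totals = {stat: round(sum(m[stat] for m in window), 2) for stat in STATS}
    totals["lookback_count"] = len(window)
    return totals


def team_features(matches, as_of, short_term=4, long_term=15):
    """Recent form is the last `short_term` matches; long term quality is
    the `long_term` matches before those."""
    return {
        "recent": window_totals(matches, as_of, 0, short_term),
        "long": window_totals(matches, as_of, short_term, long_term),
    }


# Toy history: six identical matches on consecutive days (made-up numbers).
history = [
    {"date": date(2021, 1, day), "xg": 1.5, "xg_c": 1.0, "goals": 2, "goals_c": 1, "cs": 0}
    for day in range(1, 7)
]
features = team_features(history, as_of=date(2021, 1, 10))
```

A `lookback_count` below the window size, as in some of the examples above, signals that there was not enough history and a lookalike may be needed.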

3.2.2 Player Features

For every player in every game, I calculated recent and long term data: mins, goals, assists, cs, goals conceded, own goals, yellow and red cards, xa, xg, bps, saves, pen saves, pen misses, bonus and points. If a player didn't have long term data, I used a lookalike player, which I computed based on mins, price, position and team (using the lookalike team where needed). Again, here are some examples:


{'player_name': 'Harry Kane', 'date': '01-01-2018', 'short_term': 6, 'long_term': 0, 'mins': 540, 'goals': 8, 'assists': 0, 'cs': 2, 'goals_c': 8, 'own_goals': 0, 'y_c': 1, 'r_c': 0, 'xa': 0.29000000000000004, 'xg': 6.29, 'bps': 201, 'saves': nan, 'pen_saves': nan, 'pen_miss': 0, 'bonus': 8, 'points': 51, 'lookback_count': 6}


{'player_name': 'Mohamed Salah', 'date': '05-14-2021', 'short_term': 0, 'long_term': 152, 'mins': 11870, 'goals': 94, 'assists': 39, 'cs': 61, 'goals_c': 115, 'own_goals': 0, 'y_c': 3, 'r_c': 0, 'xa': 27.590000000000007, 'xg': 81.38999999999997, 'bps': 2864, 'saves': nan, 'pen_saves': nan, 'pen_miss': 1, 'bonus': 91, 'points': 1010, 'lookback_count': 142}


{'player_name': 'David De Gea', 'date': '12-14-2017', 'short_term': 2, 'long_term': 6, 'mins': 540, 'goals': 0, 'assists': 0, 'cs': 2, 'goals_c': 5, 'own_goals': 0, 'y_c': 0, 'r_c': 0, 'xa': 0.0, 'xg': 0.0, 'bps': 137, 'saves': 32.0, 'pen_saves': 0.0, 'pen_miss': 0, 'bonus': 4, 'points': 31, 'lookback_count': 6}


{'player_name': 'Alexis Sánchez', 'date': '01-01-2021', 'short_term': 20, 'long_term': 0, 'mins': 877, 'goals': 1, 'assists': 5, 'cs': 2, 'goals_c': 10, 'own_goals': 0, 'y_c': 3, 'r_c': 0, 'xa': 2.6900000000000004, 'xg': 1.49, 'bps': 170, 'saves': nan, 'pen_saves': nan, 'pen_miss': 0, 'bonus': 1, 'points': 47, 'lookback_count': 20}

3.2.3 Label Calculation

For each player, I also added his own team's recent and long term data, and the recent and long term data of the opposing team. I then calculated the bonus and FPL points for each game and used the points as the label, giving me a more comprehensive dataset.
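As an illustration of the label, here is a simplified sketch of per-match FPL scoring by position. The point values are the standard FPL rules for that season as I understand them, and the bonus is passed in as an input instead of being derived from BPS. The function name and signature are assumptions.

```python
GOAL_POINTS = {"GK": 6, "DEF": 6, "MID": 5, "FWD": 4}
CLEAN_SHEET_POINTS = {"GK": 4, "DEF": 4, "MID": 1, "FWD": 0}


def fpl_points(position, mins, goals=0, assists=0, clean_sheet=False,
               goals_c=0, saves=0, pen_saves=0, pen_miss=0,
               own_goals=0, y_c=0, r_c=0, bonus=0):
    """Simplified FPL points for one player in one match."""
    # Appearance points: 1 for under 60 minutes, 2 for 60 or more.
    points = 0 if mins == 0 else (1 if mins < 60 else 2)
    points += GOAL_POINTS[position] * goals + 3 * assists
    # Clean sheet points need at least 60 minutes on the pitch.
    if clean_sheet and mins >= 60:
        points += CLEAN_SHEET_POINTS[position]
    # Keepers and defenders lose a point for every 2 goals conceded.
    if position in ("GK", "DEF"):
        points -= goals_c // 2
    # Keepers gain a point for every 3 saves.
    if position == "GK":
        points += saves // 3
    points += 5 * pen_saves
    points -= 2 * pen_miss + 2 * own_goals + y_c + 3 * r_c
    return points + bonus
```

For example, a forward who plays 90 minutes, scores twice, assists once and earns 3 bonus points gets 2 + 8 + 3 + 3 = 16.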

3.3 Lookalike Teams and Players

I added mock data points for the lookalike players to calculate the stats. I eventually filtered this dataset to only include real data points but these mock data points help in the calculation of long term stats for players that don't have them. At the end of this, we get 50584 data points in the dataset.
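A lookalike search along these lines could be sketched as a nearest-neighbour lookup. The distance weighting (price in millions, minutes scaled by 90) and the preference for same-team candidates are my own assumptions for illustration, as is the `find_lookalike` name.

```python
def find_lookalike(target, candidates):
    """Pick the candidate with long term data that most resembles `target`.

    Candidates must share the position; same-team candidates are preferred
    when any exist. Returns None if nobody plays that position.
    """
    pool = [c for c in candidates if c["position"] == target["position"]]
    if not pool:
        return None
    same_team = [c for c in pool if c["team"] == target["team"]]
    pool = same_team or pool

    def distance(candidate):
        # Hypothetical weighting: 1.0 price difference counts like 90 minutes.
        price_gap = abs(candidate["price"] - target["price"])
        mins_gap = abs(candidate["mins"] - target["mins"]) / 90.0
        return price_gap + mins_gap

    return min(pool, key=distance)


# Made-up players for illustration.
new_signing = {"name": "New Signing", "position": "MID", "team": "AAA", "price": 6.5, "mins": 0}
past_players = [
    {"name": "Past Mid 1", "position": "MID", "team": "BBB", "price": 6.5, "mins": 180},
    {"name": "Past Mid 2", "position": "MID", "team": "CCC", "price": 9.0, "mins": 90},
    {"name": "Past Fwd", "position": "FWD", "team": "AAA", "price": 6.5, "mins": 0},
]
match = find_lookalike(new_signing, past_players)
```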



3.4 Feature Engineering

I created four datasets, one for each of the four positions in FPL. For each position, I performed feature engineering to measure the correlation between each feature and the outcome. Then I checked the correlation between the features themselves and also used some practical FPL knowledge to finalise the list of features.
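The two correlation checks can be sketched like this: rank features by their correlation with the label, then skip any feature that is nearly a duplicate of one already kept. The thresholds and the `select_features` helper are illustrative assumptions, and the manual FPL-knowledge step is not modelled.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, label, min_label_corr=0.1, max_pair_corr=0.9):
    """Keep features correlated with the label, dropping near-duplicates."""
    ranked = sorted(
        ((abs(pearson(values, label)), name) for name, values in columns.items()),
        reverse=True,
    )
    kept = []
    for strength, name in ranked:
        if strength < min_label_corr:
            continue
        if all(abs(pearson(columns[name], columns[k])) < max_pair_corr for k in kept):
            kept.append(name)
    return kept


# Made-up columns: two near-identical features and one unrelated to the label.
columns = {
    "recent_xg": [0.1, 0.5, 0.9, 1.3],
    "recent_xg_doubled": [0.2, 1.0, 1.8, 2.6],
    "noise": [1, -1, -1, 1],
}
label = [1, 3, 5, 7]
kept = select_features(columns, label)
```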

3.4.1 Goalkeepers

For goalkeepers, the final features were "team_recent_xg_c", "team_long_xg_c", "opponent_recent_xg", "opponent_long_goals"



3.4.2 Defenders

For defenders, the final features were "recent_xa", "recent_points", "long_xg", "long_xa", "long_assists", "long_points", "team_recent_xg", "team_recent_xg_c", "team_long_xg", "team_long_xg_c", "team_long_cs", "opponent_recent_xg", "opponent_recent_xg_c", "opponent_long_xg", "opponent_long_xg_c", "opponent_long_cs"



3.4.3 Midfielders

For midfielders, the final features were "recent_xg", "recent_xa", "recent_assists", "recent_points", "recent_cs", "long_xg", "long_xa", "long_assists", "long_cs", "team_recent_xg", "team_long_xg", "team_long_xg_c", "opponent_long_goals_c", "opponent_long_xg"



3.4.4 Forwards

For forwards, the final features were "recent_xg", "recent_xa", "recent_points", "long_xa", "long_assists", "long_points", "team_recent_goals", "team_long_xg", "opponent_long_xg_c"



3.5 Model Selection

I split each dataset 80-20 into train and test sets and tried out linear regression, Bayesian ridge, SVM regression, tree regression, AdaBoost, gradient boosting, random forest, MLP regression (MLPR) and voting regression using various combinations of these models, evaluating each on the test set. I experimented with various error metrics (MSE, MAE and R-squared) and also used a custom metric that measures accuracy within an absolute deviation of 2 points. This was done for all models in each position.
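The custom metric can be sketched as the share of predictions that land within 2 points of the actual score. Whether the boundary itself counts as a hit is an assumption here (it does), and the function name is hypothetical.

```python
def within_tolerance_accuracy(y_true, y_pred, tolerance=2.0):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up actual and predicted points: errors are 1.0, 3.5, 0.5 and 2.5.
actual = [2, 6, 1, 10]
predicted = [3.0, 2.5, 1.5, 7.5]
accuracy = within_tolerance_accuracy(actual, predicted)
```

Unlike MSE, this metric does not punish one badly missed haul more than a near miss, which may suit a setting where most players score 1 to 3 points.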


In the end, a voting regression that combined linear regression, gradient boosting and MLPR proved the best for goalkeepers, forwards and midfielders. For defenders, random forest performed somewhat better, but I opted for the same voting regression combination to keep the model the same for all positions.
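A minimal sketch of such an ensemble with scikit-learn's `VotingRegressor` is shown below. The data is synthetic, and the hyperparameters and the scaling step in front of the MLP are my own assumptions; the post does not state the settings actually used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one position's feature matrix and points label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 + X @ np.array([1.5, -0.5, 0.8, 0.0]) + rng.normal(scale=0.3, size=200)

# 80-20 train-test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The voting regressor averages the predictions of its three members.
model = VotingRegressor([
    ("linear", LinearRegression()),
    ("gbr", GradientBoostingRegressor(random_state=0)),
    ("mlp", make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    )),
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Averaging a linear model with two non-linear ones tends to smooth out the individual models' errors, which is one plausible reason the combination held up across positions.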





3.6 Creating Test Dataset

I pulled the fixtures and player data for the first four gameweeks from the FPL API. I corrected player names containing accented characters so that they would be matched to the same players in the historical data.
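One common way to reconcile accented names across sources is to strip the diacritics before comparing. This is a sketch using the standard library's `unicodedata`; the `normalise_name` helper is hypothetical and not necessarily how the names were actually corrected.

```python
import unicodedata


def normalise_name(name):
    """Lower-case a name, collapse whitespace and strip combining accents."""
    # NFKD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.split()).lower()


same_player = normalise_name("Alexis Sánchez") == normalise_name("Alexis  Sanchez")
```

Letters that have no decomposed form (for example "ø") pass through unchanged, so a small manual mapping may still be needed for some names.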







For all new players and for players without long term data, a lookalike player from the past was selected algorithmically to supply the long term data.




I created datasets in the same form as the training data, but without the FPL points label.



3.7 Predictions

I used the four existing models (one per position) to predict points in each fixture for every player in the first four gameweeks.









I created a starters list: a shortlist of players who are likely to play (no bench fodder keepers, 4.0 defenders etc).




I created a list of rotation risk players like Cancelo, Mahrez, etc. They can be ignored or picked based on settings. For each player, I created a grid from the individual predictions that consists of his basic info, predictions and image.
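The starters list and the rotation risk setting could be applied to the predictions roughly as follows. The `shortlist` function, its flag and the numbers are assumptions for illustration.

```python
def shortlist(predictions, starters, rotation_risks, include_rotation_risks=False):
    """Keep likely starters, optionally dropping rotation risks,
    sorted by predicted points (highest first)."""
    kept = []
    for pred in predictions:
        name = pred["player"]
        if name not in starters:
            continue
        if name in rotation_risks and not include_rotation_risks:
            continue
        kept.append(pred)
    return sorted(kept, key=lambda p: p["points"], reverse=True)


# Made-up predicted points.
predictions = [
    {"player": "Cancelo", "points": 5.1},
    {"player": "Salah", "points": 6.8},
    {"player": "Backup Keeper", "points": 0.4},
]
starters = {"Cancelo", "Salah"}
rotation_risks = {"Cancelo", "Mahrez"}

safe_picks = shortlist(predictions, starters, rotation_risks)
all_picks = shortlist(predictions, starters, rotation_risks, include_rotation_risks=True)
```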

4. Conclusion and Next Steps

For now, there is not much left to do as far as the algorithm is concerned. I will review the lookalike players that have been created to see if they make sense and wait for all transfers to take place before running it again prior to the deadline.


If I have the time over the next 10 days, I will work on a linear programming optimiser, but I'm not sure I will be able to include all the complexities needed to cover every possible decision in time. So I'll probably continue with the greedy approach I used last season to pick the team from the projections.
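For context, a greedy pick from projections might look something like the sketch below. This is not the approach from last season, whose details are not given here; it is a generic illustration that takes the highest projected players first while respecting the standard FPL squad limits (2 GK, 5 DEF, 5 MID, 3 FWD, at most 3 per club, a 100.0 budget).

```python
SQUAD_QUOTA = {"GK": 2, "DEF": 5, "MID": 5, "FWD": 3}
MAX_PER_CLUB = 3


def greedy_squad(players, budget=100.0):
    """Add players in order of projected points while constraints allow.

    Greedy choices can leave slots unfilled or money unspent, which is
    the kind of gap a linear programming optimiser would close.
    """
    squad, spent = [], 0.0
    slots = dict(SQUAD_QUOTA)
    per_club = {}
    for p in sorted(players, key=lambda p: p["points"], reverse=True):
        if slots[p["position"]] == 0:
            continue  # position already full
        if per_club.get(p["team"], 0) >= MAX_PER_CLUB:
            continue  # club limit reached
        if spent + p["price"] > budget:
            continue  # cannot afford
        squad.append(p)
        spent += p["price"]
        slots[p["position"]] -= 1
        per_club[p["team"]] = per_club.get(p["team"], 0) + 1
    return squad, round(spent, 1)


# Made-up pool: four forwards and a defender from one club, plus two others.
pool = [
    {"name": f"Fwd {i}", "position": "FWD", "team": "AAA", "price": 5.0, "points": 10 - i}
    for i in range(1, 5)
]
pool += [
    {"name": "Pricey Mid", "position": "MID", "team": "BBB", "price": 101.0, "points": 20},
    {"name": "Def 1", "position": "DEF", "team": "AAA", "price": 4.5, "points": 5},
    {"name": "Def 2", "position": "DEF", "team": "CCC", "price": 4.5, "points": 4},
]
squad, spent = greedy_squad(pool)
```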