As the new Premier League season approaches, the Machine Learning team has gone through a serious revamp to challenge my team. I have tried to be as comprehensive as possible and to address as many of the Machine's drawbacks from last season as I could, so that it does better this year. Here's a sneak peek of the first set of predictions made by the Machine.
I will now go through my thought process and the implementation steps. Before that, here is another set of predictions from the ML model in a different view.
The first and most important change is that the Machine now predicts points instead of number of returns, and does so for each individual match instead of for a few gameweeks together. This handles the issues with double gameweek (DGW) and blank gameweek (BGW) calculations.
We'll need a dataset keyed by date, team and player: for every match, we must predict points for every player who plays for the teams involved.
We also need to get the short term form and long term quality data of each player.
We also need the form and quality of the opponent they are facing. (We could also use the form and quality of the teams they recently played, but that is not on the roadmap for now.)
A datapoint will likely look like this, although probably only a subset of these features will correlate with the label:
Player, Position, Player recent form (xG, xA, G, A, FPL points), Player long term quality (xG, xA, G, A, FPL points), Team, Team recent form (xG, xGC, G, GC), Team long term quality (xG, xGC, G, GC), Opponent, Opponent recent form (xG, xGC, G, GC), Opponent long term quality (xG, xGC, G, GC). The label will be that date's FPL points (based on position and goals + assists + cs + bonus).
3.1 Data Scraping
For every game in the PL from 2017-18 to 2020-21, I scraped match data from Fantasy Football Scout: date, teams, goals, underlying stats, expected data and result. That is a total of 3040 data points (one per team per match) from 1520 matches.
Then, for every player who played in those games, I scraped their basic info, position, price, underlying stats, expected data and all FPL-related outcomes, again from Ffscout data. For every player who missed a penalty in the last four seasons, I scraped data from Ffscout's Sky data. This came to a total of 41934 data points, each with 25 features.
3.2 Feature Calculation
3.2.1 Team Features
For every team in every game, I calculated recent (last 4 gameweeks) and long term (the 15 gameweeks before that) data: xG, xGC, goals scored, goals conceded and clean sheets. If a team didn't have long term data in the PL, like a promoted team, I used a lookalike team that was promoted the previous season. Here are some examples:
{'team_name': 'WOL', 'date': '05-14-2018', 'short_term': 4, 'long_term': 0, 'xg': 3.17, 'xg_c': 7.62, 'goals': 4, 'goals_c': 4, 'cs': 1, 'lookback_count': 4}
{'team_name': 'CRY', 'date': '09-23-2017', 'short_term': 4, 'long_term': 15, 'xg': 1.07, 'xg_c': 1.5, 'goals': 0, 'goals_c': 3, 'cs': 0, 'lookback_count': 1}
{'team_name': 'CHE', 'date': '01-01-2021', 'short_term': 4, 'long_term': 0, 'xg': 7.51, 'xg_c': 3.67, 'goals': 6, 'goals_c': 6, 'cs': 1, 'lookback_count': 4}
{'team_name': 'MUN', 'date': '05-23-2021', 'short_term': 3, 'long_term': 12, 'xg': 17.85, 'xg_c': 10.42, 'goals': 21, 'goals_c': 9, 'cs': 5, 'lookback_count': 12}
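A minimal sketch of how such recent and long term windows could be computed for one team. The function name `window_totals`, the input layout (a list of per-match dicts, oldest first) and the sample values are assumptions made for illustration, not the code used for the numbers above.

```python
from typing import Dict, List

STATS = ("xg", "xg_c", "goals", "goals_c", "cs")

def window_totals(history: List[Dict[str, float]], recent: int = 4, long_term: int = 15):
    """Sum each stat over the last `recent` matches and over the
    `long_term` matches that came before those."""
    recent_part = history[-recent:]
    long_part = history[:-recent][-long_term:]

    def total(rows):
        summary = {key: round(sum(row[key] for row in rows), 2) for key in STATS}
        summary["lookback_count"] = len(rows)  # how many matches were actually available
        return summary

    return {"recent": total(recent_part), "long": total(long_part)}

# Hypothetical history: 25 identical matches, oldest first.
history = [{"xg": 1.0, "xg_c": 0.5, "goals": 1, "goals_c": 0, "cs": 1} for _ in range(25)]
features = window_totals(history)

# A team with only three matches has no long term window at all,
# which is the case where a lookalike team would be needed.
short_features = window_totals(history[:3])
```

The `lookback_count` field mirrors the one in the examples above: it records how many matches the window really contained, so sparse windows can be spotted later.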
3.2.2 Player Features
For every player in every game, I calculated recent and long term data: mins, goals, assists, cs, goals conceded, own goals, yellow and red cards, xa, xg, bps, saves, pen saves, pen misses, bonus and points. If a player didn't have long term data, I used a lookalike player, which I computed based on mins, price, position and team (using the lookalike team where needed). Again, here are some examples:
{'player_name': 'Harry Kane', 'date': '01-01-2018', 'short_term': 6, 'long_term': 0, 'mins': 540, 'goals': 8, 'assists': 0, 'cs': 2, 'goals_c': 8, 'own_goals': 0, 'y_c': 1, 'r_c': 0, 'xa': 0.29000000000000004, 'xg': 6.29, 'bps': 201, 'saves': nan, 'pen_saves': nan, 'pen_miss': 0, 'bonus': 8, 'points': 51, 'lookback_count': 6}
{'player_name': 'Mohamed Salah', 'date': '05-14-2021', 'short_term': 0, 'long_term': 152, 'mins': 11870, 'goals': 94, 'assists': 39, 'cs': 61, 'goals_c': 115, 'own_goals': 0, 'y_c': 3, 'r_c': 0, 'xa': 27.590000000000007, 'xg': 81.38999999999997, 'bps': 2864, 'saves': nan, 'pen_saves': nan, 'pen_miss': 1, 'bonus': 91, 'points': 1010, 'lookback_count': 142}
{'player_name': 'David De Gea', 'date': '12-14-2017', 'short_term': 2, 'long_term': 6, 'mins': 540, 'goals': 0, 'assists': 0, 'cs': 2, 'goals_c': 5, 'own_goals': 0, 'y_c': 0, 'r_c': 0, 'xa': 0.0, 'xg': 0.0, 'bps': 137, 'saves': 32.0, 'pen_saves': 0.0, 'pen_miss': 0, 'bonus': 4, 'points': 31, 'lookback_count': 6}
{'player_name': 'Alexis Sánchez', 'date': '01-01-2021', 'short_term': 20, 'long_term': 0, 'mins': 877, 'goals': 1, 'assists': 5, 'cs': 2, 'goals_c': 10, 'own_goals': 0, 'y_c': 3, 'r_c': 0, 'xa': 2.6900000000000004, 'xg': 1.49, 'bps': 170, 'saves': nan, 'pen_saves': nan, 'pen_miss': 0, 'bonus': 1, 'points': 47, 'lookback_count': 20}
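One way a lookalike player could be chosen is a nearest-neighbour search over mins, price, position and team. The sketch below is an assumption about how that might look: the `PlayerProfile` class, the `team_rank` proxy for team strength and the scaling constants are all hypothetical, and the real matching may weigh things differently.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PlayerProfile:
    name: str
    position: str   # "GK", "DEF", "MID" or "FWD"
    price: float    # FPL price in millions
    mins: int       # minutes played over the comparison window
    team_rank: int  # hypothetical proxy for team strength (1 = strongest)

def find_lookalike(target: PlayerProfile, candidates: List[PlayerProfile]) -> Optional[PlayerProfile]:
    """Return the closest candidate in the same position, or None if there is none."""
    same_position = [c for c in candidates
                     if c.position == target.position and c.name != target.name]
    if not same_position:
        return None

    def distance(c: PlayerProfile) -> float:
        # Each term is divided by an arbitrary scale so no single feature dominates.
        return (abs(c.price - target.price) / 1.0
                + abs(c.mins - target.mins) / 900.0
                + abs(c.team_rank - target.team_rank) / 5.0)

    return min(same_position, key=distance)

# Hypothetical players, for illustration only.
new_signing = PlayerProfile("New Forward", "FWD", 7.5, 2000, 8)
pool = [
    PlayerProfile("Mid-price Forward", "FWD", 7.0, 2100, 9),
    PlayerProfile("Premium Forward", "FWD", 11.0, 3000, 2),
    PlayerProfile("Same-price Midfielder", "MID", 7.5, 2000, 8),
]
match = find_lookalike(new_signing, pool)
```

Restricting candidates to the same position first keeps a midfielder from standing in for a forward, even when price and minutes match exactly.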
3.2.3 Label Calculation
For each player, I also added his own team's recent and long term data and the recent and long term data of the opponent he is facing. I then calculated the bonus and FPL points for each game and used that as the label, giving me a more comprehensive dataset.
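For reference, here is a sketch of how a per-match FPL points label can be derived from the raw stats. It uses the standard FPL scoring rules as they applied in these seasons. The function name and argument names are assumptions chosen to match the feature names above, and bonus is taken as an input rather than recomputed from bps.

```python
GOAL_POINTS = {"GK": 6, "DEF": 6, "MID": 5, "FWD": 4}
CLEAN_SHEET_POINTS = {"GK": 4, "DEF": 4, "MID": 1, "FWD": 0}

def fpl_points(position, mins, goals=0, assists=0, clean_sheet=False,
               goals_c=0, own_goals=0, y_c=0, r_c=0, saves=0,
               pen_saves=0, pen_miss=0, bonus=0):
    """FPL points for one player in one match."""
    if mins <= 0:
        return 0  # did not play
    points = 2 if mins >= 60 else 1           # appearance points
    points += GOAL_POINTS[position] * goals   # goals are worth more for deeper positions
    points += 3 * assists
    if clean_sheet and mins >= 60:            # clean sheets need 60+ minutes
        points += CLEAN_SHEET_POINTS[position]
    if position in ("GK", "DEF"):
        points -= goals_c // 2                # -1 for every 2 goals conceded
    if position == "GK":
        points += saves // 3                  # +1 for every 3 saves
    points += 5 * pen_saves
    points -= 2 * pen_miss + 2 * own_goals + y_c + 3 * r_c
    return points + bonus

# Midfielder: 90 mins, 1 goal, 1 assist, clean sheet, 3 bonus.
midfielder_label = fpl_points("MID", 90, goals=1, assists=1, clean_sheet=True, bonus=3)
```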
3.3 Lookalike Teams and Players
I added mock data points for the lookalike players to calculate the stats. I eventually filtered this dataset to only include real data points, but the mock data points help in calculating long term stats for players who don't have them. At the end of this, we get 50584 data points in the dataset.
3.4 Feature Engineering
I created four datasets for the four positions in FPL. For each position, I first checked the correlation between each feature and the outcome. Then I checked the correlation between the features themselves and used some practical FPL knowledge to finalise the list of features.
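The two correlation checks can be sketched roughly as follows: rank features by their correlation with the label, then skip any feature that is almost a copy of one already kept. The thresholds, the function names and the toy data are assumptions for illustration. The final lists below also involved manual FPL judgement that no script captures.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def select_features(columns, label, min_label_corr=0.1, max_pair_corr=0.9):
    """Keep features that correlate with the label but not too strongly with each other."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], label)), reverse=True)
    kept = []
    for name in ranked:
        if abs(pearson(columns[name], label)) < min_label_corr:
            continue  # too weakly related to the outcome
        if all(abs(pearson(columns[name], columns[k])) < max_pair_corr for k in kept):
            kept.append(name)  # not redundant with anything already kept
    return kept

# Toy data, for illustration only.
points = [2, 4, 6, 8, 10]
columns = {
    "recent_xg": [1, 2, 3, 4, 5],
    "recent_points": [2, 4, 6, 8, 11],   # nearly duplicates recent_xg
    "team_long_xg_c": [5, 3, 4, 1, 2],   # negatively correlated with points
    "noise": [3, 1, 3, 1, 3],            # unrelated to points
}
selected = select_features(columns, points)
```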
3.4.1 Goalkeepers
For goalkeepers, the final features were "team_recent_xg_c", "team_long_xg_c", "opponent_recent_xg", "opponent_long_goals".
3.4.2 Defenders
For defenders, the final features were "recent_xa", "recent_points", "long_xg", "long_xa", "long_assists", "long_points", "team_recent_xg", "team_recent_xg_c", "team_long_xg", "team_long_xg_c", "team_long_cs", "opponent_recent_xg", "opponent_recent_xg_c", "opponent_long_xg", "opponent_long_xg_c", "opponent_long_cs".
3.4.3 Midfielders
For midfielders, the final features were "recent_xg", "recent_xa", "recent_assists", "recent_points", "recent_cs", "long_xg", "long_xa", "long_assists", "long_cs", "team_recent_xg", "team_long_xg", "team_long_xg_c", "opponent_long_goals_c", "opponent_long_xg".
3.4.4 Forwards
For forwards, the final features were "recent_xg", "recent_xa", "recent_points", "long_xa", "long_assists", "long_points", "team_recent_goals", "team_long_xg", "opponent_long_xg_c".
3.5 Model Selection
I split each dataset 80-20 into train and test sets and tried out linear regression, Bayesian ridge, SVM regression, tree regression, AdaBoost, gradient boosting, random forest, MLP regression (MLPR) and voting regression over various combinations of these models, evaluating each on the test set. I experimented with various error metrics (MSE, MAE and R²) and also used a custom metric that measures accuracy within an absolute deviation of 2 points. This was done for all models in each position.
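The custom metric can be read as the share of predictions that land within 2 points of the actual score. A minimal sketch under that reading follows. The function name is made up, and counting a deviation of exactly 2 as a hit is an assumption.

```python
def within_tolerance_accuracy(actual, predicted, tolerance=2.0):
    """Fraction of predictions whose absolute error is at most `tolerance` points."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        return 0.0
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)

# Absolute errors here are 1.1, 3.5, 0.0 and 2.0, so three of four count as hits.
score = within_tolerance_accuracy([2, 6, 1, 9], [3.1, 2.5, 1.0, 7.0])
```

Unlike MSE, this metric does not punish a single badly missed haul more than a near miss, which suits a game where most scores are low and a few are very high.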
In the end, a voting regression that combined linear regression, gradient boosting and MLPR proved the best for goalkeepers, forwards and midfielders. For defenders, random forest did somewhat better, but I opted for the same voting regression combination to keep the model the same for all positions.
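A voting regression of this kind can be put together with scikit-learn roughly as below. This is a sketch on synthetic data, not the trained model: the hyperparameters, the scaling step in front of the MLP and the random data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one position's dataset: 200 rows, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 80-20 train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The voting regressor averages the predictions of its three members.
model = VotingRegressor([
    ("linear", LinearRegression()),
    ("gbr", GradientBoostingRegressor(random_state=0)),
    ("mlp", make_pipeline(
        StandardScaler(),  # MLPs tend to train better on scaled inputs
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    )),
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Averaging a linear model with two non-linear ones tends to smooth out the individual models' worst misses, which may be why the combination held up across positions.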
3.6 Creating Test Dataset
I pulled the fixtures and player data for the first four gameweeks from the FPL API. I corrected player names that contained accented characters so that they would be identified as the same players.
For all new players and those players that did not have long term data, a lookalike player from the past was selected algorithmically to mimic long term data.
I created datasets in the same form as the training data, without the FPL points label.
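The name correction step above can be sketched with the standard library: decompose each character, drop the combining accent marks and compare the lower-cased result. The helper name `normalise_name` is an assumption, and a few letters (such as "ø") have no decomposition and would still need a manual mapping.

```python
import unicodedata

def normalise_name(name: str) -> str:
    """Strip accents, lower-case and collapse whitespace so names from
    different sources compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())

key = normalise_name("Alexis Sánchez")
```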
3.7 Predictions
I used the four trained models to predict points in each fixture for every player in the first four gameweeks.
I created a starters list: a shortlist of players who are likely to play (no bench fodder keepers, 4.0 defenders etc).
I created a list of rotation risk players like Cancelo, Mahrez, etc. They can be ignored or picked based on settings. For each player, I created a grid from the individual predictions that consists of basic info, predictions and his image.
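Because predictions are made per fixture, gameweek totals fall out of a simple sum: a player with two fixtures in a gameweek gets both added, and a player with none gets nothing. A small sketch of that aggregation, filtered to a starters list, follows. The row layout, the player names and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-fixture predictions: one row per player per match.
fixture_predictions = [
    {"player": "Player A", "gameweek": 1, "points": 4.0},
    {"player": "Player A", "gameweek": 2, "points": 3.5},  # double gameweek:
    {"player": "Player A", "gameweek": 2, "points": 4.0},  # two fixtures in GW2
    {"player": "Player B", "gameweek": 1, "points": 5.0},  # blank in GW2: no row
    {"player": "Bench Keeper", "gameweek": 1, "points": 1.0},
]
starters = {"Player A", "Player B"}

def gameweek_totals(rows, allowed_players):
    """Sum per-fixture predictions into (player, gameweek) totals for allowed players."""
    totals = defaultdict(float)
    for row in rows:
        if row["player"] in allowed_players:
            totals[(row["player"], row["gameweek"])] += row["points"]
    return dict(totals)

totals = gameweek_totals(fixture_predictions, starters)
```

A rotation risk list could be handled the same way, by removing those names from `allowed_players` when the setting says to ignore them.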
For now, there is not much left to do as far as the algorithm is concerned. I will review the lookalike players that have been created to see if they make sense and wait for all transfers to take place before running it again prior to the deadline.
If I have the time over the next 10 days, I will work on a linear programming optimiser, but I'm not sure I will be able to include all the complexities necessary to take all possible decisions in time. So I'll probably continue with the greedy approach I used last season to pick the team from projections.
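For illustration, a greedy pick from projections could look something like the sketch below. This is an assumed version, not last season's code: it walks down the projections from best to worst and takes a player whenever the position quota, the three-per-club limit and the budget allow, holding back a reserve (assumed here as 4.0 per open slot) so the remaining slots can still be filled.

```python
QUOTAS = {"GK": 2, "DEF": 5, "MID": 5, "FWD": 3}  # FPL squad composition

def greedy_squad(players, budget=100.0, max_per_team=3, reserve_per_slot=4.0):
    """Pick a 15-player squad greedily by projected points."""
    squad, spent = [], 0.0
    open_slots = dict(QUOTAS)
    per_team = {}
    for p in sorted(players, key=lambda p: p["projected"], reverse=True):
        if open_slots[p["position"]] == 0:
            continue  # position already full
        if per_team.get(p["team"], 0) >= max_per_team:
            continue  # club limit reached
        slots_left_after = sum(open_slots.values()) - 1
        if spent + p["price"] + reserve_per_slot * slots_left_after > budget:
            continue  # would leave too little for the remaining slots
        squad.append(p)
        spent += p["price"]
        open_slots[p["position"]] -= 1
        per_team[p["team"]] = per_team.get(p["team"], 0) + 1
    return squad, spent

# Hypothetical pool: per position, six 4.0 fodder players and four pricier ones.
pool = []
for pos_index, position in enumerate(QUOTAS):
    for i in range(10):
        price = 4.0 if i < 6 else 4.0 + 2.0 * (i - 5)
        pool.append({
            "name": f"{position}{i}",
            "position": position,
            "team": f"T{(4 * i + pos_index) % 20}",
            "price": price,
            "projected": price / 2 + i * 0.01,
        })

squad, spent = greedy_squad(pool)
```

The weakness is visible even in this toy pool: the greedy pass spends heavily on the first few premium players and fills the rest with fodder, whereas a linear programming optimiser could trade one premium for several mid-priced picks when that yields more total points.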