This is the team drafted by the first run of the machine learning model. Let's first take a look at the model's results and how this team was picked, then at what exactly the model is, and finally at potential improvements and changes we can make before the GW 1 deadline.
Results of the Model
Here's a glimpse of the output I obtained from the model for GW 1-4. On the left is a list of all players in descending order of expected attacking returns; on the right, a list of defenders and goalkeepers in descending order of expected attacking + defensive returns.
The coefficients for the various features of the attacking and defensive models are shown below. The top half is used to predict the number of goals + assists in the next 4 games, while the bottom half is for the number of clean sheets in the next 4 games. The higher the coefficient value, the more likely a player is to get an attacking return or a clean sheet based on that feature. Based on these values, we can tell that long term xGI is the best predictor of a player's attacking performance in the next 4 games.
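To read a coefficient table like this, the fitted weights just need to be paired with their feature names. Here's a minimal numpy sketch with invented feature names and toy data (the real model has far more features, and the post doesn't name its library):

```python
import numpy as np

# Toy data: returns = 2*long_term_xGI + 0.5*short_term_xGI - 1*opponent_xGC
# (feature names are illustrative, not the model's exact columns)
features = ["long_term_xGI", "short_term_xGI", "opponent_xGC"]
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 2 * X[:, 0] + 0.5 * X[:, 1] - 1 * X[:, 2]

# Ordinary least squares with an intercept column, as linear regression fits it
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

for name, c in zip(features, coef[1:]):
    print(f"{name}: {c:+.3f}")
```

On clean data like this the recovered weights match the generating ones, which is exactly the comparison we want to make between features.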
If a coefficient is extremely low or doesn't make logical sense, we can remove that feature as we play around with the model. Ideally, we want each feature to correlate strongly (positively or negatively) with the label, and not much correlation between the features themselves.
The team above was picked with goals+assists, xGI and opponent xGC for the attacking model and xGC, clean sheets and opponent goals scored for the defensive model.
As a sanity check, I took a look at the 5 players least likely to score in the first 4 weeks. Since they were all goalkeepers, it is a reasonable result to expect.
The current error metric is root mean squared error (more on that below), but what better way to evaluate a machine learning team than to compare it with its human counterpart? I've entered it into a mini-league against my own team. And, probably like all parents, I hope that this child of mine is much more successful than me (and hopefully not because I have a disastrous season).
Dataset Completion
At the end of Part 4, we left off with the next 4 opponents' statistics needing to be included as features in the dataset. We accomplish this by using the short term and long term functions we defined earlier to go through the next opponents' previous games and include those statistics as part of each player's features.
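As a sketch of that step, a helper can walk the next few opponents and fold their recent form into the player's feature row. The function, parameter names, and lookup structure here are hypothetical, not the actual code:

```python
def opponent_features(team, gw, fixtures, team_history, horizon=4, lookback=6):
    """Average each upcoming opponent's stats over their last `lookback` gameweeks.

    fixtures maps (team, gameweek) -> opponent name;
    team_history maps (team, gameweek) -> dict of that team's stats for that game.
    """
    feats = {}
    for i in range(1, horizon + 1):
        opp = fixtures[(team, gw + i)]
        past = [team_history[(opp, g)] for g in range(max(1, gw - lookback), gw)]
        # Fall back to 0.0 when the opponent has no history yet (e.g. very early GWs)
        feats[f"opp{i}_xGC"] = (sum(p["xGC"] for p in past) / len(past)) if past else 0.0
    return feats
```

The same loop would be repeated per stat (goals conceded, xGC, and so on) and per short/long window in the real pipeline.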
Below is a sample datapoint in the dataset before it was cleaned into attacking and defensive datasets with the relevant features for the model. The example below is for Mo Salah in the 2018 season for GW 18. Features 0-2 are the details of the player and the gameweek. Features 3-35 are various features of the player, team and upcoming opposing teams in both the short term and long term. The final three columns are the labels: goals, assists and clean sheets.
The dataset now has 28705 entries over 3 seasons, each entry having 38 features (which also just happens to be the number of gameweeks in a season). The next step was to extract only the features we think are relevant for the attacking and defensive predictions. As we decided in Part 2, the plan was to include the following features for each of the models. Since we have 38 features for each entry, this is something that can easily be changed and played around with.
Once we extract the relevant features, the attacking and defensive datasets look as shown below. The last column is the number of returns. Note that the defensive dataset does not contain only defenders and goalkeepers - it has all players. All values were converted to per-fixture, per-gameweek figures so that the coefficients obtained could be compared with one another.
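The extraction itself is just a column selection plus building the label. A pandas sketch with invented column names and values (the real dataset's 38 columns differ):

```python
import pandas as pd

# Miniature stand-in for the full dataset
df = pd.DataFrame({
    "name": ["Salah", "Kane"],
    "goals": [1.25, 1.0],
    "assists": [0.5, 0.25],
    "xGI_long": [0.9, 0.8],
    "opp_xGC": [1.1, 1.3],
})

# Keep only the features chosen for the attacking model, plus the label
ATTACK_FEATURES = ["xGI_long", "opp_xGC"]
attack_df = df[["name"] + ATTACK_FEATURES].copy()
attack_df["returns"] = df["goals"] + df["assists"]  # attacking returns label
```

Swapping the feature list for the defensive one (xGC, clean sheets, opponent goals scored) produces the second dataset the same way.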
Running the Model
This data was then divided into 80% training data and 20% test data. X is the feature set, Y is the label and info contains the player name, gameweek and season should we want to access that.
Finally, the linear regression model was fit on the training data and used to predict the test data. Any changes to the model can now be compared based on the error we get on the test data.
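The split-fit-score loop described above can be sketched in a few lines. This assumes scikit-learn, which matches the workflow described even though the post doesn't name the library, and uses toy noise-free data in place of the real datasets:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy stand-in for the attacking dataset: returns = 2*xGI + 0.5*opp_xGC
rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = 2 * X[:, 0] + 0.5 * X[:, 1]

# 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))  # the error metric
```

Because any model change is scored against the same held-out test set, the RMSE gives a consistent yardstick for comparing tweaks.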
The entire dataset formation was repeated for the player data using the new fixtures of the 2020-21 season. This results in a similar dataset, but without the label as we don't know the next 4 games' results yet. And finally, we got the list of attackers and defenders in descending order of expected number of returns in the next 4 gameweeks by using the same model, the results of which are at the top of this post.
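That final ranking step is just scoring the unlabelled rows and sorting in descending order. A numpy sketch with made-up players and stand-in coefficients:

```python
import numpy as np

names = np.array(["Salah", "Kane", "Son"])
X_new = np.array([[0.9, 1.2],   # hypothetical 2020-21 feature rows
                  [0.7, 1.0],
                  [0.8, 1.3]])
coef = np.array([2.0, 0.5])     # stands in for the trained model's weights

scores = X_new @ coef           # predicted returns over the next 4 GWs
order = np.argsort(-scores)     # indices sorted by score, highest first
ranked = list(names[order])
```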
Areas for Improvement
1. Players may not have played all the previous games in short term and long term history. How do we tackle this?
We could try dividing the features by the number of starts + subs so that missed games don't matter, although this may defeat the purpose of long term data when a player has played very little.
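That normalisation could be sketched as a small helper (hypothetical, not the actual code), converting cumulative totals into per-appearance figures:

```python
def per_appearance(totals, starts, subs):
    """Divide cumulative stat totals by appearances so partial histories compare fairly."""
    apps = starts + subs
    if apps == 0:
        # No minutes at all yet: fall back to zeros rather than divide by zero
        return {k: 0.0 for k in totals}
    return {k: v / apps for k, v in totals.items()}
```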
2. Long term features for new players and promoted teams don't exist.
Dividing long term features by number of starts + subs could help here too so that by GW 5, the players will have comparable data. At the same time, the model has examples in previous seasons of new players and promoted teams so it should be able to learn that exceptional performance in the short term is valuable. An alternative is to include their performance from their previous league but that would be too tedious and unnecessary. For now, we have to accept that the model will probably not pick a Werner or a Leeds player for the first couple of weeks. Let's hope it doesn't set us back too much.
3. Old players with new clubs will need to have correct data.
Willian and Doherty still have their Chelsea and Wolves statistics as part of their GW 1 features. So this will have to be updated for the various transfers within the Premier League.
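One way to sketch that fix is a remapping pass over the rows before the GW 1 features are built. The mapping below only covers the two examples named above (Willian and Doherty's summer 2020 moves):

```python
# Hypothetical remapping step for transfers within the Premier League
TRANSFERS = {"Willian": "Arsenal", "Doherty": "Spurs"}

def update_club(row):
    """Overwrite a player's club if they moved during the transfer window."""
    row = dict(row)  # avoid mutating the caller's data
    if row["name"] in TRANSFERS:
        row["team"] = TRANSFERS[row["name"]]
    return row
```

The player's own statistics carry over unchanged; only the team (and hence fixture and team-level) features need to come from the new club.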
4. Positions of players can be added so that only defenders and goalkeepers show up in the clean sheet list.
5. How to optimize the team after getting the predictions
Currently, picking the team from the predictions is a manual process, and the result may not be the optimal or highest-scoring team possible from the list.
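A first automated step could be a greedy pass over the ranked predictions under a budget cap. This sketch is entirely hypothetical and ignores FPL's position quotas and 3-per-club rule, so it's only a starting point:

```python
def greedy_squad(players, budget, size):
    """Greedy pick: highest predicted returns first, skipping unaffordable players.

    players: list of (name, predicted_returns, price) tuples.
    """
    squad, spent = [], 0.0
    for name, predicted, price in sorted(players, key=lambda p: -p[1]):
        if len(squad) < size and spent + price <= budget:
            squad.append(name)
            spent += price
    return squad
```

A proper solution would treat this as a constrained optimization problem, but even a greedy pass removes the manual step.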
6. Some players' club names in the training data are still wrong.
This is the same player transfer issue that was mentioned in Part 4.
7. Captaincy pick
A version of the same model that predicts returns for a single gameweek, as opposed to the next 4, is needed for this. The reason I chose 4 gameweeks for this model was to reduce the randomness.
8. Error metric
As mentioned above, root mean squared error is the metric currently being used. However, a more custom error function could work well. For example, if the model predicts that Salah will return 7 times in the next 4 games and he returns 10 times, that's not as bad as it predicting 3 returns for someone who ends up with 0.
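That asymmetry could be captured with a weighted RMSE that penalises over-prediction (promised returns that never came) more heavily than under-prediction. A hypothetical sketch, with the weight as a tunable parameter:

```python
import numpy as np

def asymmetric_rmse(y_true, y_pred, over_weight=2.0):
    """RMSE that weights over-predictions more than under-predictions."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    w = np.where(err > 0, over_weight, 1.0)  # err > 0 means the model over-predicted
    return float(np.sqrt(np.mean(w * err ** 2)))
```

With the default weight, predicting 7 for a player who returns 10 scores better than predicting 3 for a player who returns 0, matching the intuition above.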