After scraping the data from Fantasy Football Scout in Part 3, we have three kinds of files with the following structures.
Player Data
The first file holds individual player data for each gameweek. For example, ind-2018-4 refers to the 2018 season and gameweek 4. This is a CSV file in which each row has the player name, team, goals scored, starts, substitute appearances, assists, xA and xG. The 2019-20 tables in the members area show the price even if the custom table doesn't ask for it, so the scraper needed a slight modification.
Team Data
The second file holds team data for each gameweek. For example, team-2018-17 refers to the 2018 season and gameweek 17. This CSV file has the team name, goals scored, goals conceded, xG conceded and xG of the team.
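As a small illustration of the naming scheme, the season and gameweek can be recovered from a file name such as ind-2018-4 or team-2018-17. The function name parse_stats_filename is hypothetical and not part of the original code; this is only a sketch that assumes the names always follow the kind-season-gameweek pattern.

```python
def parse_stats_filename(name):
    """Split a name like 'ind-2018-4' into (kind, season, gameweek)."""
    # Assumes exactly three dash-separated parts and no file extension.
    kind, season, gameweek = name.split("-")
    return kind, int(season), int(gameweek)


print(parse_stats_filename("ind-2018-4"))    # ('ind', 2018, 4)
print(parse_stats_filename("team-2018-17"))  # ('team', 2018, 17)
```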
Fixtures Data
The last file is fixtures.csv, which lists all the fixtures from the 2018-20 seasons in the following format: season, gameweek, home team and away team. This data was obtained from https://members.fantasyfootballscout.co.uk/matches/ on Fantasy Football Scout.
Reading Fixtures Data
We now have to read this data into our code. The first step is to be able to get the opponent of a team given the season, team and gameweek. The code below reads the fixtures.csv file to create a dictionary from which we can look up the opponent of a team by season, team and gameweek. If it's a double gameweek, it will return both opponents. You can see the results for Man United 2018 GW 30 and Arsenal 2020 in DGW 30 below.
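A minimal sketch of such a lookup is given here, assuming the rows are ordered season, gameweek, home team, away team with no header row. The function name read_fixtures and the sample rows are made up for illustration; the real file and the original code may differ.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only: season, gameweek, home team, away team.
SAMPLE_FIXTURES = """\
2018,1,Team A,Team B
2018,2,Team B,Team C
2018,2,Team D,Team B
"""


def read_fixtures(csv_file):
    """Map (season, team, gameweek) to the list of opponents faced that week."""
    opponents = defaultdict(list)
    for season, gameweek, home, away in csv.reader(csv_file):
        season, gameweek = int(season), int(gameweek)
        # Record the fixture from both teams' points of view.
        opponents[(season, home, gameweek)].append(away)
        opponents[(season, away, gameweek)].append(home)
    return dict(opponents)


fixtures = read_fixtures(io.StringIO(SAMPLE_FIXTURES))
print(fixtures[(2018, "Team A", 1)])  # ['Team B']
print(fixtures[(2018, "Team B", 2)])  # ['Team C', 'Team D'] (double gameweek)
```

Storing a list per key is what lets a double gameweek return both opponents without any special casing.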
Reading Team Data
The next step is to read the team stats and get the particular stats that represent the team for that gameweek. The features that are relevant here are the goals scored, goals conceded, xG and xG conceded for a particular team in a particular gameweek. Below is the result obtained for Man United in GW 27. We get a list with the relevant features returned.
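The idea can be sketched as follows, assuming the column order described for the team file (team, goals scored, goals conceded, xG conceded, xG) and no header row. The name read_team_stats and the numbers are invented for the example and are not taken from the scraped data.

```python
import csv
import io

# Made-up numbers; columns: team, goals scored, goals conceded, xG conceded, xG.
SAMPLE_TEAM_FILE = """\
Team A,2,1,0.8,1.9
Team B,0,3,2.4,0.6
"""


def read_team_stats(csv_file):
    """Return {team: [goals scored, goals conceded, xG, xG conceded]}."""
    stats = {}
    for team, scored, conceded, xg_conceded, xg in csv.reader(csv_file):
        # Reorder so that xG comes before xG conceded in the feature list.
        stats[team] = [int(scored), int(conceded), float(xg), float(xg_conceded)]
    return stats


team_stats = read_team_stats(io.StringIO(SAMPLE_TEAM_FILE))
print(team_stats["Team A"])  # [2, 1, 1.9, 0.8]
```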
Combining Player and Team Data
The final step of reading data is to form the feature vector of the player, which includes his own stats, his team's stats and the opposing team's stats for that week. This gives us a data structure of the form below that takes in the season, player name and gameweek and returns the player's goals, starts, subs, assists, xG and xA, along with the team's goals, xG, goals conceded and xGC.
The example of 2019, Aguero and GW 32 is given below.
Following this loading of data into our code, we end up with a player stats data structure that looks like this. Every season has a list of players and every player has a list of gameweeks with data about what he did that gameweek, how his team did and how the opposing team did.
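One way to picture this nesting is a dictionary keyed by season, then player, then gameweek. This is an assumption made for illustration: the helper add_gameweek, the key names and the values are all hypothetical, and the original code may well use lists instead of dictionaries.

```python
from collections import defaultdict

# season -> player name -> gameweek -> feature groups
player_stats = defaultdict(lambda: defaultdict(dict))


def add_gameweek(season, player, gameweek, own, team, opposition):
    """Store one gameweek's stats for a player alongside both teams' stats."""
    player_stats[season][player][gameweek] = {
        "player": own,             # goals, starts, subs, assists, xG, xA
        "team": team,              # goals, xG, conceded, xGC
        "opposition": opposition,  # the same four stats for the opponent
    }


# Made-up values for a hypothetical player.
add_gameweek(2019, "example player", 32,
             own=[1, 1, 0, 0, 0.7, 0.1],
             team=[2, 1.8, 0, 0.4],
             opposition=[0, 0.4, 2, 1.8])

print(player_stats[2019]["example player"][32]["team"])  # [2, 1.8, 0, 0.4]
```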
Building the Dataset
The following are two helper functions that help get features from a particular window. getFeaturesFrom allows us to get features from one gameweek to another. Again, the example of Aguero from GW 3 in 2018 to GW 10 in 2019 is given below; we get a combined feature list of how he did in those weeks, how Man City did and what the opposing teams' statistics were in those games.
The termRange function is used to get the start and end gameweek from a given number of weeks as range.
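The exact behaviour of termRange is not spelled out here, so the following is only a guess at the idea: given a target gameweek and a number of weeks, return the first and last gameweek of the window that ends just before the target, stepping back into the previous season when needed. The name term_range, its signature and the 38-gameweek season length are assumptions.

```python
GAMEWEEKS_PER_SEASON = 38  # assumption: a full 38-gameweek season


def term_range(season, gameweek, weeks):
    """Return ((start_season, start_gw), (end_season, end_gw)) covering the
    `weeks` gameweeks that end just before (season, gameweek)."""
    # Number gameweeks consecutively across seasons so subtraction is simple.
    index = season * GAMEWEEKS_PER_SEASON + (gameweek - 1)

    def to_pair(i):
        return i // GAMEWEEKS_PER_SEASON, i % GAMEWEEKS_PER_SEASON + 1

    return to_pair(index - weeks), to_pair(index - 1)


print(term_range(2019, 10, 4))  # ((2019, 6), (2019, 9))
print(term_range(2019, 3, 4))   # ((2018, 37), (2019, 2))
```

The second call shows the window wrapping across a season boundary: the four weeks before GW 3 of 2019 are GW 37 and 38 of 2018 followed by GW 1 and 2 of 2019.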
This is where we finally construct our dataset (with one placeholder). Using a short-term range of 4 weeks and a long-term range of the 15 weeks before those 4 weeks, we construct a feature set, which has been shuffled and is shown below.
The placeholder here is the team stats feature, which is currently the cumulative team stats of all the previous oppositions the player has faced in those games. This will be changed to the next opposition's previous games using the range and fixtures data structure.
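To make the two windows concrete, here is a rough sketch that assumes a player's history is a chronological list of per-gameweek feature lists and that features are combined by summing over each window. The names window_features and sum_features are hypothetical, the numbers are invented, and summing is an assumption about how the features are aggregated.

```python
SHORT_WEEKS = 4   # short-term range, from the text above
LONG_WEEKS = 15   # long-term range: the 15 weeks before the short-term range


def sum_features(rows):
    """Add up each feature column over a list of gameweek rows."""
    return [sum(column) for column in zip(*rows)]


def window_features(history, target):
    """Build short-term + long-term features for the gameweek at `target`.

    `history` is a chronological list of per-gameweek feature lists.
    Returns None when there are not enough earlier gameweeks.
    """
    short_start = target - SHORT_WEEKS
    long_start = short_start - LONG_WEEKS
    if long_start < 0:
        return None
    short_part = sum_features(history[short_start:target])
    long_part = sum_features(history[long_start:short_start])
    return short_part + long_part


# Made-up history: 19 identical gameweeks of [goals, xG], then the target week.
history = [[1, 0.5]] * 19 + [[9, 9.0]]
print(window_features(history, 19))  # [4, 2.0, 15, 7.5]
print(window_features(history, 18))  # None
```

The target gameweek itself is excluded from both windows, so the features only describe what happened before the week being predicted.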
In the next post, we will replace the placeholder team stats with the correct team stats and we will finally train this dataset and perform some predictive tasks.
There is an issue in the Fantasy Football Scout data: if a player moves from one FPL club to another in the middle of a season, his club is shown as the final one even for his time before the move. For example, Alexis Sanchez moved to Manchester United from Arsenal in Jan 2018, but the data shows Sanchez's club as Man United even at the start of the 2017-18 season. I have corrected this for a number of players that I identified, but I still need to verify whether I've missed anyone. The following are the players that I have identified and corrected so far.
2018 - sanchez, mkhitaryan, lennon, giroud, barkley, nkodou
2019 - fosu mensah, niasse
hello! Thank you so much for documenting your work. I'd love to do something similar next season, so I'm reviewing your work as a reference. I do have a couple questions behind your rationale. I'm pretty new to this stuff, so apologies if they're very basic.
1) What is the logic behind calculating short term/long term features across seasons?
2) Why 4 GWs for short/15 GWs for long?
3) Is there an argument for using a dataset of GW feature vectors, rather than vectors with aggregated short term/long term form?
4) For long term, do you have any concerns about calculating form across season?