This is the team drafted by the first run of the machine learning model. Let's first take a look at the model's results and how this team was picked, then at what exactly the model is, and finally at potential improvements and changes we can make before the GW 1 deadline.
Results of the Model
Here's a glimpse of the output I obtained from the model for GW 1-4. On the left is a list of all players in descending order of expected attacking returns; on the right, a list of defenders and goalkeepers in descending order of expected attacking + defensive returns.
The coefficients for the various features of the attacking and defensive models are shown below. The top half is used to predict the number of goals + assists in the next 4 games, while the bottom half is for the number of clean sheets in the next 4 games. The higher the coefficient value, the more likely a player is to get an attacking return or a clean sheet based on that feature. Based on these values, we can tell that long term xGI is the best predictor of a player's attacking performance in the next 4 games.
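To read a coefficient table like this, the fitted weights just need to be paired with their feature names. Here's a minimal numpy sketch with invented feature names and toy data (the real model has far more features, and the post doesn't name its library):

```python
import numpy as np

# Toy data: returns = 2*long_term_xGI + 0.5*short_term_xGI - 1*opponent_xGC
# (feature names are illustrative, not the model's exact columns)
features = ["long_term_xGI", "short_term_xGI", "opponent_xGC"]
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 2 * X[:, 0] + 0.5 * X[:, 1] - 1 * X[:, 2]

# Ordinary least squares with an intercept column, as linear regression fits it
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

for name, c in zip(features, coef[1:]):
    print(f"{name}: {c:+.3f}")
```

On clean data like this the recovered weights match the generating ones, which is exactly the comparison we want to make between features.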
If a coefficient is extremely low or doesn't make logical sense, we can remove that feature as we play around with the model. Ideally, we want each feature to correlate strongly (positively or negatively) with the label, and not much correlation between the features themselves.
The team above was picked with goals+assists, xGI and opponent xGC for the attacking model and xGC, clean sheets and opponent goals scored for the defensive model.
As a sanity check, I took a look at the 5 players least likely to score in the first 4 weeks. Since they were all goalkeepers, it is a reasonable result to expect.
The current error metric is root mean squared error (more on that below), but what better way to evaluate a machine learning team than to compare it with its human counterpart? I've entered it into a mini-league against my own team. And, probably like all parents, I hope that this child of mine is much more successful than me (and hopefully not because I have a disastrous season).
Dataset Completion
At the end of Part 4, we left off with the next 4 opponents' statistics needing to be included as features in the dataset. We accomplish this by using the short term and long term functions we defined earlier to go through the next opponents' previous games and include those statistics as part of each player's features.
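As a sketch of that step, a helper can walk the next few opponents and fold their recent form into the player's feature row. The function, parameter names, and lookup structure here are hypothetical, not the actual code:

```python
def opponent_features(team, gw, fixtures, team_history, horizon=4, lookback=6):
    """Average each upcoming opponent's stats over their last `lookback` gameweeks.

    fixtures maps (team, gameweek) -> opponent name;
    team_history maps (team, gameweek) -> dict of that team's stats for that game.
    """
    feats = {}
    for i in range(1, horizon + 1):
        opp = fixtures[(team, gw + i)]
        past = [team_history[(opp, g)] for g in range(max(1, gw - lookback), gw)]
        # Fall back to 0.0 when the opponent has no history yet (e.g. very early GWs)
        feats[f"opp{i}_xGC"] = (sum(p["xGC"] for p in past) / len(past)) if past else 0.0
    return feats
```

The same loop would be repeated per stat (goals conceded, xGC, and so on) and per short/long window in the real pipeline.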
Below is a sample datapoint in the dataset before it was cleaned into attacking and defensive datasets with the relevant features for the model. The example below is for Mo Salah in the 2018 season for GW 18. Features 0-2 are the details of the player and the gameweek. Features 3-35 are various features of the player, team and upcoming opposing teams in both the short term and long term. The final three columns are the labels: goals, assists and clean sheets.
The dataset now has 28705 entries over 3 seasons, each entry having 38 features (which also just happens to be the number of gameweeks in a season). The next step was to extract only the features we think are relevant for the attacking and defensive predictions. As we decided in Part 2, the plan was to include the following features for each of the models. Since we have 38 features for each entry, this is something that can easily be changed and played around with.
Once we extract the relevant features, the attacking and defensive datasets look as shown below. The last column is the number of returns. Note that the defensive dataset does not contain only defenders and goalkeepers - it has all players. All values were converted to per-fixture, per-gameweek figures so that the coefficients obtained could be compared with one another.
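The extraction itself is just a column selection plus building the label. A pandas sketch with invented column names and values (the real dataset's 38 columns differ):

```python
import pandas as pd

# Miniature stand-in for the full dataset
df = pd.DataFrame({
    "name": ["Salah", "Kane"],
    "goals": [1.25, 1.0],
    "assists": [0.5, 0.25],
    "xGI_long": [0.9, 0.8],
    "opp_xGC": [1.1, 1.3],
})

# Keep only the features chosen for the attacking model, plus the label
ATTACK_FEATURES = ["xGI_long", "opp_xGC"]
attack_df = df[["name"] + ATTACK_FEATURES].copy()
attack_df["returns"] = df["goals"] + df["assists"]  # attacking returns label
```

Swapping the feature list for the defensive one (xGC, clean sheets, opponent goals scored) produces the second dataset the same way.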
Running the Model
This data was then divided into 80% training data and 20% test data. X is the feature set, Y is the label and info contains the player name, gameweek and season should we want to access that.
Finally, the linear regression model was fit on the training data and used to predict the test data. Any changes to the model can now be compared based on the error we get on the test data.
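The split-fit-score loop described above can be sketched in a few lines. This assumes scikit-learn, which matches the workflow described even though the post doesn't name the library, and uses toy noise-free data in place of the real datasets:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy stand-in for the attacking dataset: returns = 2*xGI + 0.5*opp_xGC
rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = 2 * X[:, 0] + 0.5 * X[:, 1]

# 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))  # the error metric
```

Because any model change is scored against the same held-out test set, the RMSE gives a consistent yardstick for comparing tweaks.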
The entire dataset formation was repeated for the player data using the new fixtures of the 2020-21 season. This results in a similar dataset, but without the label as we don't know the next 4 games' results yet. And finally, we got the list of attackers and defenders in descending order of expected number of returns in the next 4 gameweeks by using the same model, the results of which are at the top of this post.
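That final ranking step is just scoring the unlabelled rows and sorting in descending order. A numpy sketch with made-up players and stand-in coefficients:

```python
import numpy as np

names = np.array(["Salah", "Kane", "Son"])
X_new = np.array([[0.9, 1.2],   # hypothetical 2020-21 feature rows
                  [0.7, 1.0],
                  [0.8, 1.3]])
coef = np.array([2.0, 0.5])     # stands in for the trained model's weights

scores = X_new @ coef           # predicted returns over the next 4 GWs
order = np.argsort(-scores)     # indices sorted by score, highest first
ranked = list(names[order])
```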
Areas for Improvement
1. Players may not have played all the previous games in short term and long term history. How do we tackle this?
We could try dividing the features by the number of starts + subs so that missed games don't matter, although this may defeat the purpose of long term data when a player has played very little.
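That normalisation could be sketched as a small helper (hypothetical, not the actual code), converting cumulative totals into per-appearance figures:

```python
def per_appearance(totals, starts, subs):
    """Divide cumulative stat totals by appearances so partial histories compare fairly."""
    apps = starts + subs
    if apps == 0:
        # No minutes at all yet: fall back to zeros rather than divide by zero
        return {k: 0.0 for k in totals}
    return {k: v / apps for k, v in totals.items()}
```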
2. Long term features for new players and promoted teams don't exist.
Dividing long term features by number of starts + subs could help here too so that by GW 5, the players will have comparable data. At the same time, the model has examples in previous seasons of new players and promoted teams so it should be able to learn that exceptional performance in the short term is valuable. An alternative is to include their performance from their previous league but that would be too tedious and unnecessary. For now, we have to accept that the model will probably not pick a Werner or a Leeds player for the first couple of weeks. Let's hope it doesn't set us back too much.
3. Old players with new clubs will need to have correct data.
Willian and Doherty still have their Chelsea and Wolves statistics as part of their GW 1 features. So this will have to be updated for the various transfers within the Premier League.
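One way to sketch that fix is a remapping pass over the rows before the GW 1 features are built. The mapping below only covers the two examples named above (Willian and Doherty's summer 2020 moves):

```python
# Hypothetical remapping step for transfers within the Premier League
TRANSFERS = {"Willian": "Arsenal", "Doherty": "Spurs"}

def update_club(row):
    """Overwrite a player's club if they moved during the transfer window."""
    row = dict(row)  # avoid mutating the caller's data
    if row["name"] in TRANSFERS:
        row["team"] = TRANSFERS[row["name"]]
    return row
```

The player's own statistics carry over unchanged; only the team (and hence fixture and team-level) features need to come from the new club.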
4. Positions of players can be added so that only defenders and goalkeepers show up in the clean sheet list.
5. How to optimize the team after getting the predictions
Currently, picking the team from the predictions is a manual process, and the result may not be the optimal or highest-scoring team possible from the list.
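A first automated step could be a greedy pass over the ranked predictions under a budget cap. This sketch is entirely hypothetical and ignores FPL's position quotas and 3-per-club rule, so it's only a starting point:

```python
def greedy_squad(players, budget, size):
    """Greedy pick: highest predicted returns first, skipping unaffordable players.

    players: list of (name, predicted_returns, price) tuples.
    """
    squad, spent = [], 0.0
    for name, predicted, price in sorted(players, key=lambda p: -p[1]):
        if len(squad) < size and spent + price <= budget:
            squad.append(name)
            spent += price
    return squad
```

A proper solution would treat this as a constrained optimization problem, but even a greedy pass removes the manual step.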
6. Some players' club names in the training data are still wrong.
This is the same player transfer issue that was mentioned in Part 4.
7. Captaincy pick
A version of the same model that predicts returns for a single gameweek, as opposed to the next 4, is needed for this. The reason I chose 4 gameweeks for this model was to reduce the randomness.
8. Error metric
As mentioned above, root mean squared error is the metric currently being used. However, a more custom error function could work well. For example, if the model predicts that Salah will return 7 times in the next 4 games and he returns 10 times, that's not as bad as it predicting 3 returns for someone who ends up with 0.
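That asymmetry could be captured with a weighted RMSE that penalises over-prediction (promised returns that never came) more heavily than under-prediction. A hypothetical sketch, with the weight as a tunable parameter:

```python
import numpy as np

def asymmetric_rmse(y_true, y_pred, over_weight=2.0):
    """RMSE that weights over-predictions more than under-predictions."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    w = np.where(err > 0, over_weight, 1.0)  # err > 0 means the model over-predicted
    return float(np.sqrt(np.mean(w * err ** 2)))
```

With the default weight, predicting 7 for a player who returns 10 scores better than predicting 3 for a player who returns 0, matching the intuition above.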