An introduction to the basic tools of a football Data Scientist: where to find free data and how to start.

Mikhail Borodastov
12 min read · Nov 19, 2022

This article shows how to use Python to get data from fbref.com and then visualize football metrics with StatsBomb radars.

All analytics begins with data. Football metrics are calculated from collected data and then presented in a form that is easy to read and analyze further. In the football industry, the specialists known as Football Data Scientists search for patterns in data, calculate all kinds of metrics and visualize the results.

Football analytics is evolving rapidly all over the world. Analytics departments have a growing demand for people who can work with the main data-handling tools. One of the most popular of these tools is the Python programming language.

Python is considered one of the easiest programming languages to get started with, compared with the alternatives. There are plenty of examples of people learning and using it successfully without any formal technical background. Without advanced knowledge of math and algorithms some areas of application may remain closed to you, but there are still many tasks you can learn to solve with Python without a tech background.

This article demonstrates how to use Python to get free football data and use it to draw player radars. Radars are a common way to visualize football metrics in the industry, and they make it possible to evaluate both players and teams across many metrics at once.

Part 1 — Free provider’s data

There are many sources of free data on the internet, and you can find both event data and tracking data. Data suppliers often publish some of their own data for introductory and marketing purposes, but such data is not easy to work with.

The first issue is that the provided data is usually not very recent and is limited in volume. Below are examples of the tournaments and most recent seasons for which free event data from StatsBomb is available. The longest history of records is provided for Barcelona. StatsBomb positions this dataset as the most complete source of data on Lionel Messi and calls it The Lionel Messi Data Biography.

Among the top European leagues, apart from La Liga there is only the 2003–2004 EPL season, and even then only the games of Arsenal, who finished first that season.

I think the table clearly shows that you are very unlikely to get up-to-date, interesting insights from this data. However, it can be very useful as a general introduction and for developing your own skills in handling football data.

Another issue is that data vendors do not provide ready-made aggregated stats by player. Free access to this kind of information is closed, at least at StatsBomb. Why is this a challenge?

If you are just starting to dive into football analysis, you will want to move step by step from simple to complex. Calculating even seemingly simple metrics can be difficult for a newcomer, and some statistics you probably will not be able to calculate without clarifications from the provider.

Thus, if you do not buy a subscription for complete access to vendor data, which can be an expensive option for individuals, you will be very limited in what you can do.

Part 2 — Free data from internet

We have another option: internet resources with a huge number of calculated metrics based on event data. Fbref is the most popular such website that I know of.

For all top leagues there are standard metrics as well as aggregated metrics that Fbref calculates itself from raw StatsBomb event data. This data is available from the 2017–2018 season onward.

This source is a pretty good dataset for getting a quick overview of basic football metrics and their descriptive statistics. Descriptive statistics usually means calculating the average, maximum, minimum, percentiles and other values that give a basic picture of the data being explored. Later we will see how this kind of statistic is used in football radars and how it can help us.
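To make the idea of descriptive statistics concrete, here is a minimal sketch that uses only the Python standard library. The list of npxG values is made up for illustration and the `describe` helper is a hypothetical name, not something taken from fbref or from the project behind this article.

```python
import statistics

# Hypothetical npxG-per-90 values for a handful of strikers (illustrative only).
npxg_per90 = [0.12, 0.21, 0.28, 0.33, 0.35, 0.41, 0.47, 0.58, 0.66]


def describe(values):
    """Return the basic descriptive statistics of a list of numbers."""
    # quantiles(n=4) returns three cut points: the 25th, 50th and 75th percentiles.
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": round(statistics.mean(values), 3),
        "min": min(values),
        "p25": round(q25, 3),
        "median": round(q50, 3),
        "p75": round(q75, 3),
        "max": max(values),
    }


summary = describe(npxg_per90)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

With a real table loaded into pandas, `DataFrame.describe()` gives a similar summary in one call.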

Part 3 — Parsing data from fbref.com

Above we showed that fbref data is good enough for a quick introduction to football analytics. The source is free and contains metrics calculated from StatsBomb data. However, there is one restriction: fbref does not provide tools for downloading this data programmatically (in programming such a tool is called an API).

The table with columns that you see in your web browser on fbref.com is stored as HTML code. Using Python you can download this code and then pull out the information that is useful to you (in our case, the data about players and teams). This process is known as data parsing.
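As a rough illustration of what parsing means, the sketch below extracts rows from a small HTML table using only the standard library. The HTML snippet is invented for the example and is not real fbref markup, and `TableParser` is a hypothetical helper. In practice the page would first be downloaded, and many people use `pandas.read_html` or BeautifulSoup instead of writing a parser by hand.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside of a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# A made-up snippet shaped like a stats table (not real fbref markup).
html = """
<table>
  <tr><th>Player</th><th>Pos</th><th>npxG/90</th></tr>
  <tr><td><a href="#">Player A</a></td><td>FW</td><td>0.58</td></tr>
  <tr><td><a href="#">Player B</a></td><td>FW,MF</td><td>0.08</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# The first row is the header, the remaining rows are players.
header, *body = parser.rows
players = [dict(zip(header, row)) for row in body]
print(players)
```

Each player ends up as a dictionary keyed by column name, which is already close to the ready-to-work table described later.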

If you see amateur football visualizations on Twitter or elsewhere, in most cases the data behind them was obtained by parsing.

It is important to note a special notion in the programming world: open source. It is a paradigm in which people all over the world write software for different purposes and then share it on the internet. Football is no exception.

Many programs for parsing data, calculating complex metrics, and drawing passes and shots on a football pitch template have already been written and published on the internet (in most cases in Python). All you need to do is take them, correct or modify them where necessary, and apply them to your tasks.

Let’s come back to the initial task. Suppose I need to get data from fbref. I can’t say that I know how to parse websites well: I know some superficial details and have never done it professionally. But I know Python, and I make the bold assumption that someone has probably already solved this task. So what do I do? I google, find several links to ready-made solutions and choose the most appropriate one.

Not everything worked after a straightforward copy. I had to add some code, but a huge amount of the work had already been done by others. As a result, we get a ready-to-use table.

Part 4 — Original StatsBomb radar

Below is last year’s StatsBomb radar for Kylian Mbappe. It was found on the official StatsBomb Twitter account, where such radars are usually posted. Normally, visualizations of this kind are available only to StatsBomb IQ subscribers.

If you don’t have a subscription but would like to compare other players or teams using this kind of visualization, you can use fbref data and a ready-made library to build almost identical radars. We already have the data. The next step is to google the right keywords and find a suitable solution for visualizing football metrics on radar templates.

First of all, to build a radar chart you need to create a template that marks the boundaries and intermediate values for each metric. These values can be taken from the history of previous seasons. To do that, you need to plot the distribution of values for each metric (shown on the right side of the StatsBomb radar).

Part 5 — Values distributions for metrics

Why are these distributions needed and what are they?

To give the simplest example: you are reading a newly published analytical report in which some player is assigned an xG value of, say, 0.35 per game on average after half of the season. Can you tell whether that is high or low? What would you compare it against? Even if you compare several players with each other, how do they compare with the rest of the players who are outside the scope of the report?

This is exactly why plotting the distributions of values is so useful when you first get to know the data: it gives you the general picture. The following basic statistics are calculated: average, maximum, minimum and others.

Below is the original distribution of the non-penalty xG (npxG) metric based on StatsBomb data, and a similar one plotted from fbref data.

A few notes on the preprocessing used to plot the distributions (done in November 2021):

  • only strikers from the Top 5 leagues over the last 4 seasons, from 2017–2018 to 2020–2021 (StatsBomb uses 5 seasons)
  • players with less than half a season of games (fewer than 19) were removed (the original StatsBomb radar gives no clear guidance on the threshold; some sources mention 900 minutes, about 10 games)
  • a histogram was built for the selected players (it shows how frequently each value of the metric occurs)
  • one more graph was drawn on top of the histogram: the probability density function for npxG. There is no need to go deep into this term right now. You can think of it as a smoothed version of the histogram, which can be called the “distribution of the metric’s values”.
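The filtering and histogram steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the rows, the column names and the bin width of 0.10 are invented for the example, and the text histogram stands in for a real plot.

```python
from collections import Counter

# Hypothetical rows shaped like a player-season table (values are illustrative).
rows = [
    {"player": "A", "pos": "FW", "games": 30, "npxg_per90": 0.58},
    {"player": "B", "pos": "FW", "games": 12, "npxg_per90": 0.91},  # too few games
    {"player": "C", "pos": "MF", "games": 34, "npxg_per90": 0.22},  # not a striker
    {"player": "D", "pos": "FW", "games": 25, "npxg_per90": 0.31},
    {"player": "E", "pos": "FW", "games": 19, "npxg_per90": 0.36},
    {"player": "F", "pos": "FW", "games": 28, "npxg_per90": 0.12},
]

MIN_GAMES = 19  # half-season threshold used in the article

# Keep strikers who played at least MIN_GAMES games.
sample = [r["npxg_per90"] for r in rows
          if r["pos"] == "FW" and r["games"] >= MIN_GAMES]

# Histogram with an assumed bin width of 0.10. Working in hundredths avoids
# floating-point surprises for values that sit exactly on a bin boundary.
histogram = Counter(round(value * 100) // 10 for value in sample)

for k in sorted(histogram):
    low, high = k / 10, (k + 1) / 10
    print(f"{low:.1f}-{high:.1f} | {'#' * histogram[k]}")
```

The smooth curve drawn over a histogram is usually a kernel density estimate, which plotting libraries such as seaborn can compute for you.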

On the plotted distribution we see:

  • minimum npxG per 90 minutes for strikers from the Top 5 leagues over the last 4 seasons is 0.08 (Adama Traore at Wolverhampton in the 2020–2021 season. Until that season his position on Fbref was tagged as FW, but in 2020–2021 it was changed to the dual “FW,MF”, which is probably closer to the truth)
  • maximum: 1.01 (the danger created is on average equivalent to 1 goal per game; this is Robert Lewandowski at Bayern in the 2017–2018 season)
  • 5th percentile: the lower bound of the radar, equal to 0.18 (only 5% of all strikers in the observed history have lower values). These players are located to the left of the first boundary.
  • 95th percentile: the upper bound, equal to 0.61 (this is Ronaldo’s value at Juventus in 2018–2019, at 33 years old!). 95% of strikers showed lower values over the 4 previous seasons.
  • 50th percentile, or median: 0.35 npxG per game. This value divides the whole population of strikers into two halves: one half has higher values, the other lower.
  • Mbappe shows 0.58 npxG and almost makes it into the top 5% of players in our statistics. His value is calculated for the 2021–2022 season (as of November 2021).
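Reading a value against a distribution, as in the list above, comes down to asking what share of the sample lies below it. Here is a minimal sketch with a made-up sample of 20 values. It is not the article’s real data, so the resulting percentage is illustrative only, and `percentile_rank` is a hypothetical helper name.

```python
def percentile_rank(sample, value):
    """Percentage of the sample that is strictly below `value`."""
    below = sum(1 for x in sample if x < value)
    return 100.0 * below / len(sample)


# Hypothetical striker sample (npxG per 90), not the article's real data.
sample = [0.08, 0.15, 0.18, 0.22, 0.27, 0.30, 0.33, 0.35, 0.35, 0.38,
          0.41, 0.44, 0.47, 0.50, 0.53, 0.56, 0.59, 0.61, 0.75, 1.01]

rank = percentile_rank(sample, 0.58)
print(f"0.58 npxG/90 is above {rank:.0f}% of this sample")
```

A player enters the top 5% once this rank reaches 95.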

Comparison with the original radar (why there are some differences)

The most important thing I managed to find out is that fbref and StatsBomb count playing time differently. If you take an up-to-date StatsBomb radar and look at the fbref table, you will see a difference in 90s (the number of 90-minute intervals played). StatsBomb counts effective playing time, including added minutes, while Fbref does not count added time. For players with a lot of playing time, this difference usually amounts to an extra 1 to 1.5 90s over a season, that is, one game or even more.

It turns out that StatsBomb uses more minutes when calculating its per-90 metrics, which lowers the resulting values compared with the same metrics from Fbref.
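A small worked example shows the size of this effect. The numbers are hypothetical: a season total of 7.0 npxG, 12 full games, and an assumed 7.5 added minutes per game. The sketch also assumes both providers record the same total, which may not hold in practice.

```python
def per_90(total, minutes):
    """Convert a season total into a per-90-minutes rate."""
    return total / (minutes / 90)


total_npxg = 7.0           # hypothetical season total
minutes_excl_added = 1080  # 12 full games, added time not counted
minutes_incl_added = 1170  # same games with 7.5 added minutes each (assumed)

rate_excl = per_90(total_npxg, minutes_excl_added)  # 7.0 / 12 "90s"
rate_incl = per_90(total_npxg, minutes_incl_added)  # 7.0 / 13 "90s"

print(f"added time excluded: {rate_excl:.3f} npxG per 90")
print(f"added time included: {rate_incl:.3f} npxG per 90")
```

The same output divided by more minutes gives a lower per-90 value, which matches the direction of the difference described above.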

It is worth noting that the position assigned to some players may differ between StatsBomb and Fbref. I used the FW tag to plot the distribution, but for some players the same tag appears in combination with others, for example ‘FW,MF’, ‘DF,FW’, ‘MF,FW’, ‘FW,DF’. Because of this, the population of strikers in my calculations probably differs somewhat from the population of strikers at StatsBomb (over the seasons considered).
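How much the choice of position filter changes the population can be seen in a tiny sketch. The players and the two helper functions are hypothetical. The point is only the difference between matching the tag exactly and matching any combined tag that contains FW.

```python
# Hypothetical players with fbref-style position tags.
players = [
    {"player": "A", "pos": "FW"},
    {"player": "B", "pos": "FW,MF"},
    {"player": "C", "pos": "MF,FW"},
    {"player": "D", "pos": "DF"},
]


def is_pure_forward(pos):
    """True only when the tag is exactly 'FW'."""
    return pos == "FW"


def has_forward_tag(pos):
    """True when 'FW' is one of the comma-separated tags."""
    return "FW" in pos.split(",")


strict = [p["player"] for p in players if is_pure_forward(p["pos"])]
broad = [p["player"] for p in players if has_forward_tag(p["pos"])]

print("exact FW:", strict)
print("any FW:  ", broad)
```

Whichever rule is chosen, the percentiles are computed over a different set of players, so the radar boundaries shift with it.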

The sample in my distribution covers 4 seasons, while StatsBomb uses 5. This inevitably leads to some discrepancies in the values.

The number of games also differed at the time the metrics were calculated and the radars plotted: Fbref had statistics after 12 games, while the StatsBomb radar was based on 10 games.

Looking ahead, it should be pointed out that some metrics based on event data are calculated differently by StatsBomb and Fbref. People from StatsBomb confirmed this and indicated that fbref calculates Interceptions in a different way. I also found differences for some other metrics, for example Pressure Regains and the quite simple Touches in Box.

All of the above means that we can’t get values identical to StatsBomb’s, but we can still get a very similar picture.

You don’t need to peer at the numbers; additional details are provided in the report that I will add at the end of the article. Focus instead on the shape of the graphs and the approximate position of Mbappe’s values. Overall, the distributions are clearly very similar. The exceptions are several metrics whose maximum values differ between the sources, which narrows or widens the boundaries of the observed distribution.

To plot the radars we will need the 5th and 95th percentiles and Mbappe’s values for each metric.
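Here is a minimal sketch of how those two percentiles can be computed and how a player’s value can be placed on a radar axis. Both function names are hypothetical. Clipping values outside the bounds to the edge of the axis is an assumption about how such radars are usually drawn, not a documented rule of any particular library.

```python
import statistics


def radar_bounds(values):
    """Return the 5th and 95th percentiles of a sample."""
    # n=20 splits the data into 20 equal groups, so the 19 cut points
    # are the 5th, 10th, ..., 95th percentiles.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[-1]


def scale_to_axis(value, low, high):
    """Map a value to [0, 1] between the radar bounds, clipping outliers."""
    position = (value - low) / (high - low)
    return min(1.0, max(0.0, position))


# Bounds of a toy sample: the integers 0..100.
toy_low, toy_high = radar_bounds(range(101))
print(toy_low, toy_high)

# npxG figures quoted in the article: bounds 0.18 and 0.61, Mbappe 0.58.
LOW, HIGH = 0.18, 0.61
mbappe = scale_to_axis(0.58, LOW, HIGH)
print(f"Mbappe sits at {mbappe:.0%} of the npxG axis")
```

Repeating this for every metric on the template gives the set of points that a radar library then connects into a polygon.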

Part 6 — Radar building

I used a ready-made library found on the internet. My task was to prepare the data correctly and to use the already written functions in the appropriate way.

We couldn’t get absolutely identical radars, for the reasons described earlier. Overall, though, the general picture on our radar is not distorted compared with the original.

Now you can easily plot up-to-date radars for your favorite players from the Top 5 leagues. You can also shape your own template using only specific metrics.

Let’s have a look at the players from the Top 5 leagues who made it into the top 5% of strikers by npxG over the last 4 years (as of November 2021). We filter the fbref data by the current season (2021) and keep only the strikers whose npxg_per90 is greater than or equal to 0.61. I also set the minimum number of games played to 5.
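The filter described above might look like the sketch below. The rows and player names are invented, and the column names other than `npxg_per90` are assumptions about how the parsed table is laid out.

```python
THRESHOLD = 0.61  # 95th percentile of npxG per 90 from the article
MIN_GAMES = 5     # minimum number of games played
SEASON = 2021     # season label used for the filter

# Hypothetical rows; the real table would come from the parsed fbref data.
rows = [
    {"player": "Striker A", "season": 2021, "pos": "FW", "games": 12, "npxg_per90": 0.74},
    {"player": "Striker B", "season": 2021, "pos": "FW", "games": 3, "npxg_per90": 0.95},
    {"player": "Striker C", "season": 2021, "pos": "FW", "games": 11, "npxg_per90": 0.61},
    {"player": "Striker D", "season": 2020, "pos": "FW", "games": 30, "npxg_per90": 0.80},
    {"player": "Midfielder E", "season": 2021, "pos": "MF", "games": 12, "npxg_per90": 0.65},
    {"player": "Striker F", "season": 2021, "pos": "FW", "games": 10, "npxg_per90": 0.40},
]


def is_top_striker(row):
    """Apply the season, position, games and npxG conditions together."""
    return (
        row["season"] == SEASON
        and row["pos"] == "FW"
        and row["games"] >= MIN_GAMES
        and row["npxg_per90"] >= THRESHOLD
    )


# Sort the qualifying players from highest to lowest npxG per 90.
top = sorted(filter(is_top_striker, rows),
             key=lambda r: r["npxg_per90"], reverse=True)
names = [r["player"] for r in top]
print(names)
```

The low games threshold keeps the list sensitive to small samples, so a striker with only a few strong matches can appear in it early in the season.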

Now a radar can be plotted for each of them in just a few clicks. (Ideally, this process could be automated: get radars for the best players at the end of each round of matches, after the fbref data is updated.)

Part 7 — Summary

In this article I have tried to describe in general terms how to use Python to get free, up-to-date football metrics based on event data, and gave an example of studying these metrics by looking at their distributions and then visualizing them as radars.

You can use the fbref data to practice the basic skills of handling tabular data, searching for correlations and plotting different visualizations in Python.

Of course, Python programming skill alone is not enough to get into the industry, and a Data Scientist’s responsibilities are not limited to plotting visualizations. Everything is a little bit harder.

The required skills and expected job responsibilities may differ from club to club or from one analytics company to another. For example, here is a job description for a Data Scientist role in Sarah Rudd’s team. (Back in 2011 she proposed a metric based on Markov chains that became the prototype of the very well-known xT metric.)

In general, one of the essential parts of a Data Scientist’s job is building machine learning models. They are built to obtain new, more effective metrics (such as VAEP or OBV) or, for example, to analyze set pieces. I intentionally left this topic out of the article, but it is important to understand that this process is inextricably linked with the profession of Data Scientist. Before starting to create an ML model you first need to learn some basics of Python and some introductory statistics.

If you want to know which models football data scientists build and how they are used, you can watch the video below. It shows an ML model being built to calculate the VAEP metric on free Wyscout event data for the 2017–2018 season.

Studying Python, the main analysis libraries and the basics of statistics can be considered the first steps on the way into football analytics and Football Data Science.

Here you can find the project on GitHub, which you can use to get free data and plot StatsBomb radars.

Written by Mikhail Borodastov

ML Product Manager 🚀 | ex- Data Scientist 📊 | Football Analytics Enthusiast ⚽
