Plot Shots & Goals Maps with Python & MplSoccer. Ranking Players by xG, Shots, and Goals per 90.

Mikhail Borodastov
9 min read · Mar 18, 2024


1. Overview of the Visualization

Below is the ranking of the top players in the Top 5 European leagues by goal-scoring frequency, accurate as of March 18, 2024.

Key Details of the Visualization:

  1. The primary metric used to rank the players is non-penalty goals per 90 minutes. In the visualization, this metric is represented as the first statistic on each map, highlighted in green.
  2. Areas from which the player has taken shots during the current season are colored in blue on each map. The brighter the color, the more shots have been taken from that area.
  3. Red dots represent all the goals scored, excluding penalties.
  4. The semi-circle indicates the average distance from which the player takes shots.
  5. Additionally, the statistic SPA (Shot from Penalty Area) is provided, indicating the percentage of shots from the penalty area.
  6. For this map, only players with more than 14 full 90s played (the equivalent of 14 complete matches) are included, which keeps those who have spent more than 50% of the playtime on the field (the threshold can be adjusted if desired).
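
The per-90 metric and the playing-time filter from the list above can be sketched as follows. This is only an illustration with made-up numbers: the function name `per_90` and the sample player are assumptions, and in the notebook these columns come ready-made from FBref.

```python
def per_90(total: float, minutes_played: float) -> float:
    """Scale a season total to a rate per 90 minutes played."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total / (minutes_played / 90)


# Hypothetical player: 20 non-penalty goals in 1,800 minutes (20 full 90s)
minutes = 1800
nineties = minutes / 90
np_goals_per_90 = per_90(20, minutes)

# The ranking keeps only players above the chosen number of 90s played
MIN_90S = 14
is_eligible = nineties > MIN_90S

print(nineties, np_goals_per_90, is_eligible)
```

Raising or lowering `MIN_90S` trades sample size against reliability: a low threshold lets in players with a handful of hot matches.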

Observations:

  • Only two players in Europe average more than 1 goal per 90 minutes of playtime: Kane and Guirassy.
  • Among all players, only Haaland has a negative difference between npG and npxG, which is highly unusual for top scorers. It shows that the Norwegian is scoring noticeably fewer goals this season than the quality of his chances would suggest.
  • Additionally, Haaland has the lowest average shooting distance — 10.7 meters, the highest percentage of shots from the penalty area — 92%, and the highest danger per shot — 0.21 npxG/S.
  • Bellingham makes it into the top 9 by taking significantly fewer shots compared to other players in the ranking. Averaging only 2.5 shots per game, he and Morata share the top spot for shooting accuracy — 52%.

A similar ranking could be constructed by sorting players according to another metric, for example, by the frequency of npxG per 90 minutes.

Observations:

  • Kane remains in first place, averaging 0.91 npxG per game and outperforming his expected metrics.
  • Haaland, being 9th in actual efficiency, moves up to 2nd place in expected efficiency with 0.83 npxG per game.
  • Guirassy, on the other hand, drops to 7th place in the xG ranking, which highlights how significantly he outperforms the model estimates (1.08 goals against 0.68 xG).
  • Nicolas Jackson achieves as many expected goals with his shots as Mbappé, thanks to a shorter average distance and higher accuracy. As a result, Jackson has the highest danger per shot in this ranking, 0.24 npxG/S, meaning that roughly one in four of his shots would be expected to result in a goal.
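
To make the per-shot figures above concrete, here is a small arithmetic sketch. The numbers are the ones quoted in the observations; the helper name `shots_per_expected_goal` is made up for illustration.

```python
def shots_per_expected_goal(npxg_per_shot: float) -> float:
    """How many shots of this average quality add up to one expected goal."""
    return 1 / npxg_per_shot


jackson = shots_per_expected_goal(0.24)   # a little over 4 shots
haaland = shots_per_expected_goal(0.21)   # a little under 5 shots

# Guirassy's overperformance per 90: actual goals minus expected goals
guirassy_diff = round(1.08 - 0.68, 2)

print(round(jackson, 2), round(haaland, 2), guirassy_diff)
```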

I found this type of visualization quite informative. I first saw it on @sonofacorner’s Twitter, who in turn was inspired by @jonollington.

I decided to modify the template slightly and create my own version, which you saw at the beginning. Below, I will provide a high-level step-by-step breakdown of how to construct such visualizations, along with a Jupyter notebook containing the code.

Let’s get started.

2. Downloading Data

To create the visualization in question, you will need two types of data:

  1. Event data for all matches of the current season in the Top 5 European leagues (specifically, only shots are needed).
  2. Calculated statistics for all football players in Europe (xG, shots, goals, etc.)

The event data can be obtained by parsing whoscored.com. This data should be read into a dataframe.

%%time
import zipfile

import pandas as pd

path_to_zip_file = '../data/event_data_top5_leagues_18032024_(shots).csv.zip'
file_name = 'event_data_top5_leagues_18032024_(shots).csv'

# Read the CSV straight from the archive, without extracting it to disk
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    with zip_ref.open(file_name) as file:
        DF = pd.read_csv(file)

The prepared statistics come from FBref.com. To collect them, you can use a Jupyter notebook from my repository, available at the link provided. To update the data for the current season, run the cells under Section 3.2. (It’s unnecessary to run Section 3.1, “Collect teams and players statistics for TOP5 European leagues for the last 5 years,” and its associated cells.)

You can also skip this step, since the relevant data (as of March 18) is already uploaded to the repository. In that case, you only need to clone the repository and load the corresponding statistics:

path2 = '../../Scraping_fbref_static_data/data/current_season'

date = '2024-03-18'

# Top 5 European leagues, current season (up to {date})
df_statistics = pd.read_csv('/'.join((path2, date, 'top5_leagues_outfields_2023_2024.csv')), index_col=0).reset_index(drop=True)

3. Data Processing

3.1 Data Filtering

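First, we filter the event data, keeping only shots.

shot_types = ['SavedShot', 'MissedShots', 'Goal', 'ShotOnPost']
mask = DF['type'].isin(shot_types)

# .copy() gives an independent dataframe, so the assignment below
# does not trigger pandas' SettingWithCopyWarning
df_events = DF[mask].copy()
df_events['playerId'] = df_events['playerId'].astype(int)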

It’s important to note that shots constitute less than 2% of the total volume of actions performed with the ball. This underscores the fact that xG metrics alone are insufficient for a comprehensive assessment of player and team performance in attack. A significant portion of actions remains outside our field of view. Therefore, it’s crucial to start using metrics such as xT, OBV, PV, etc.

Next comes an important stage of data preparation: merging the two data sources by matching team names and player names. To avoid collisions and information loss during the merge, we have to account for the fact that teams and players may be named differently on FBref and whoscored, and that several different players can share the same name.

3.2 Mapping Teams

whoscored_teams = set(df_events['teamName'])
fbref_teams = set(df_statistics['team'])


unmatched_names_list = list(whoscored_teams.difference(fbref_teams))

print(len(unmatched_names_list))

We find that almost 38% of team names do not match.
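
The share of unmatched names is just a set difference divided by the number of source names. A toy sketch with four made-up entries per source (the real sets come from `df_events['teamName']` and `df_statistics['team']`):

```python
# Toy name sets; the spellings mimic the two sources but are only examples
whoscored_names = {"Man_City", "Arsenal", "PSG", "Liverpool"}
fbref_names = {"Manchester City", "Arsenal", "Paris S-G", "Liverpool"}

# Names that have no exact counterpart on the FBref side
unmatched = sorted(whoscored_names - fbref_names)
unmatched_share = len(unmatched) / len(whoscored_names)

print(unmatched, f"{unmatched_share:.0%}")
```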

ChatGPT suggested using the fuzzywuzzy library for the mapping task. It takes a string as input and finds the most similar string in a set of candidates, based on Levenshtein distance.

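from fuzzywuzzy import process

names_list1 = list(unmatched_names_list)
names_list2 = list(fbref_teams)

cutoff = 80

# For each unmatched whoscored name, find the closest FBref name.
# extractOne returns (match, score), or None if no score reaches the cutoff
fuzzy_matches = {name: process.extractOne(name, names_list2, score_cutoff=cutoff) for name in names_list1}

# sort_key is not shown in the article; it is presumably defined in the notebook
sorted_fuzzy_matches = dict(sorted(fuzzy_matches.items(), key=sort_key, reverse=True))

sorted_fuzzy_matches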
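
For readers who want to try the idea without installing anything, here is a rough stand-in built on the standard library's `difflib` instead of fuzzywuzzy. It is a different similarity measure, so the scores will not match fuzzywuzzy's. The helper `best_match` and this particular `sort_key` are assumptions for illustration, not the notebook's code.

```python
from difflib import SequenceMatcher


def best_match(name, choices, score_cutoff=80):
    """Return (choice, score) for the most similar choice, or None below the cutoff."""
    cleaned = name.replace("_", " ").lower()
    scored = [
        (choice, round(100 * SequenceMatcher(None, cleaned, choice.lower()).ratio()))
        for choice in choices
    ]
    choice, score = max(scored, key=lambda pair: pair[1])
    return (choice, score) if score >= score_cutoff else None


def sort_key(item):
    """Order by match score; names without a match (None) count as 0."""
    match = item[1]
    return match[1] if match is not None else 0


fbref_teams = ["Manchester City", "Real Madrid", "Bayern Munich", "Paris S-G"]
whoscored_teams = ["PSG", "Real_Madrid", "Bayern"]

matches = {name: best_match(name, fbref_teams) for name in whoscored_teams}
sorted_matches = dict(sorted(matches.items(), key=sort_key, reverse=True))

for name, match in sorted_matches.items():
    print(name, "->", match)
```

Abbreviations such as PSG share too few characters with the full name to clear the cutoff, which is why some names end up without a match.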

Most of the teams could be successfully mapped.

However, some required manual mapping. In fact, if additional preprocessing is applied, a higher degree of automatic tagging can be achieved. But teams like PSG and RBL will still need to be manually tagged.

fuzzy_matches['Man_City'] = ('Manchester City',100)
fuzzy_matches['Eintracht_Frankfurt'] = ('Eint Frankfurt',100)
fuzzy_matches['FC_Koln'] = ('Köln',100)
fuzzy_matches['Atletico'] = ('Atlético Madrid',100)
fuzzy_matches['Man_Utd'] = ('Manchester Utd',100)
fuzzy_matches['Deportivo_Alaves'] = ('Alavés',100)
fuzzy_matches['Sheff_Utd'] = ('Sheffield Utd',100)
fuzzy_matches['PSG'] = ('Paris S-G',100)
fuzzy_matches['RBL'] = ('RB Leipzig',100)
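
Once automatic and manual matches are in place, they can be collapsed into a plain name-to-name dictionary and applied to the event data. A minimal sketch with toy values; `manual_overrides`, `team_mapping` and `to_fbref_name` are hypothetical names, and the article itself stores manual fixes directly in `fuzzy_matches`.

```python
# Toy output of the fuzzy step: (match, score) tuples, or None when nothing cleared the cutoff
fuzzy_matches = {
    "Real_Madrid": ("Real Madrid", 100),
    "PSG": None,
    "RBL": None,
    "Unknown_FC": None,
}

# Hand-written fixes for names the fuzzy step could not resolve
manual_overrides = {"PSG": "Paris S-G", "RBL": "RB Leipzig"}

# Keep only the matched name from each tuple, then let manual fixes win
team_mapping = {name: match[0] for name, match in fuzzy_matches.items() if match is not None}
team_mapping.update(manual_overrides)

# Anything listed here still needs attention before the merge
still_unmapped = [name for name in fuzzy_matches if name not in team_mapping]


def to_fbref_name(whoscored_name: str) -> str:
    """Translate a whoscored team name; names that already match pass through unchanged."""
    return team_mapping.get(whoscored_name, whoscored_name)


print(to_fbref_name("PSG"), to_fbref_name("Arsenal"), still_unmapped)
```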

3.3 Mapping Players

The situation with players is more complex. Within a single source, the same player can have different names, as in the case of Araujo. Here, the difference lies in the use of diacritics.

However, differences can also arise from other aspects — for names consisting of more than two parts, different combinations may be used at different times. Abbreviations may also be used.

Below is another example where the same name corresponds to 3 different teams. One Vitinha moved from Marseille to Genoa, another plays for PSG. Obviously, if we were to map players by name only, collisions would occur during the join.

My Jupyter notebook shows in detail how to process the data so that the two sources can be merged accurately and without errors.
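
The two problems described above (diacritics and identical names) can be illustrated with a small sketch. This is not the notebook's procedure, just one common approach under the assumption that team names have already been aligned; `normalize_name` and `player_key` are hypothetical helpers.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())


def player_key(name: str, team: str) -> tuple:
    """Join key that separates players who share a name but play for different teams."""
    return (normalize_name(name), team)


# The same player written with and without diacritics maps to one key
print(normalize_name("Araújo") == normalize_name("Araujo"))

# Two different players called Vitinha stay apart because the team is part of the key
print(player_key("Vitinha", "Paris S-G") != player_key("Vitinha", "Genoa"))
```

A player who changed clubs mid-season still produces two keys, so transfers need separate handling (for example, matching on a player ID where one is available).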

3.4 Calculating Additional Statistics

We calculate the proportion of shots taken from inside the penalty area.

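# Flag each shot as inside/outside the penalty area.
# is_inside_box is a helper from the accompanying code; the 'y' and 'x' columns are passed in this order, as in the original
df_events['is_in_box'] = [is_inside_box(x, y) for x, y in zip(df_events['y'], df_events['x'])]

# Count shots per player, split by the is_in_box flag
df_events_gr = df_events.groupby(['league_name', 'playerId', 'playerName', 'teamName', 'is_in_box'])['minute'].count().reset_index()

# One row per player, with one column per flag value (False, then True)
df_events_gr = df_events_gr.pivot(columns='is_in_box', index=['league_name', 'playerId', 'playerName', 'teamName'], values='minute').reset_index()
df_events_gr.columns = ['league_name', 'playerId', 'playerName', 'teamName', 'outside', 'inside']
df_events_gr.fillna(0, inplace=True)

# share = fraction of a player's shots taken from inside the penalty area
df_events_gr['total'] = df_events_gr['inside'] + df_events_gr['outside']
df_events_gr['share'] = df_events_gr['inside'] / df_events_gr['total']
df_events_gr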
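
The article does not show `is_inside_box`, so here is one possible version, offered purely as a sketch. It assumes shot coordinates in meters on a 105 x 68 pitch, with x running along the pitch toward the attacked goal; the notebook's own coordinate system and helper may differ. The name `is_inside_box_m` is made up to avoid confusion with the real function.

```python
PITCH_LENGTH = 105.0   # meters, assumed pitch size
PITCH_WIDTH = 68.0
BOX_DEPTH = 16.5       # penalty area depth from the goal line
BOX_WIDTH = 40.32      # 16.5 m on each side of a 7.32 m goal


def is_inside_box_m(x: float, y: float) -> bool:
    """True if a shot location lies inside the attacked penalty area."""
    in_depth = PITCH_LENGTH - BOX_DEPTH <= x <= PITCH_LENGTH
    half_width = BOX_WIDTH / 2
    in_width = PITCH_WIDTH / 2 - half_width <= y <= PITCH_WIDTH / 2 + half_width
    return in_depth and in_width


# Four made-up shot locations (x, y)
shots = [(94.0, 34.0), (80.0, 34.0), (100.0, 10.0), (90.0, 50.0)]

inside = sum(is_inside_box_m(x, y) for x, y in shots)
share = inside / len(shots)
print(inside, share)
```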

4. Creating the Visualization

4.1 Data Preparation

The notebook supports three types of rankings:

rating_types = {'shots_per90': 'Shots',
                'goals_pens_per90': 'np Goals',
                'npxg_per90': 'npxG'}
  • shots_per90 — sorting players by shot frequency
  • goals_pens_per90 — sorting players by goal frequency (first image in the article)
  • npxg_per90 — sorting players by expected goals frequency (second image in the article)

You can choose all Top 5 leagues or just one of them.

leagues = ['Premier-League', 'Bundesliga', 'La-Liga', 'Serie-A', 'Ligue-1']

Next, we form the final list of football players, filtered by the chosen metric.

  • rating_type — the metric used for the ranking
  • min_minutes_90s — the minimum number of played 90-minute intervals
  • rating_size — the number of players in the ranking (if you change from 9 to another number, you can get a deeper dataframe with sorted players, but the visualization will need additional tuning)
min_minutes_90s = 14
rating_size = 9

rating_type = 'shots_per90'

type_ = rating_types[rating_type]

# leagues = ['Premier-League']

columns = ['player', 'team', 'xg', 'npxg', 'goals', 'goals_pens', 'shots', 'goals_per_shot', 'npxg_per_shot',
           'shots_on_target', 'shots_on_target_pct', 'average_shot_distance', 'games', 'minutes_90s', 'shots_per90',
           'goals_pens_per90', 'npxg_per90']

# Keep players from the selected leagues with enough playing time
mask1 = df_statistics['league_name'].isin(leagues)
mask2 = df_statistics['minutes_90s'] > min_minutes_90s

# Sort by the chosen metric (ties broken by non-penalty goals per 90) and keep the top N
df_statistics_filtered = (
    df_statistics.loc[mask1 & mask2, columns]
    .sort_values([rating_type, 'goals_pens_per90'], ascending=[False, False])
    .head(rating_size)
    .reset_index(drop=True)
)
df_statistics_filtered
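
The effect of the three parameters is easy to see on a toy example. The sketch below uses plain Python instead of pandas and four invented players; `top_players` is a hypothetical helper, not part of the notebook.

```python
# Invented players; only the columns needed for the ranking are included
players = [
    {"player": "A", "minutes_90s": 20.0, "shots_per90": 4.0, "goals_pens_per90": 0.5},
    {"player": "B", "minutes_90s": 10.0, "shots_per90": 5.0, "goals_pens_per90": 0.9},
    {"player": "C", "minutes_90s": 25.0, "shots_per90": 4.0, "goals_pens_per90": 0.7},
    {"player": "D", "minutes_90s": 18.0, "shots_per90": 3.1, "goals_pens_per90": 0.4},
]


def top_players(rows, rating_type, min_minutes_90s, rating_size):
    """Filter by playing time, sort by the metric (ties: np goals per 90), keep the top N."""
    eligible = [row for row in rows if row["minutes_90s"] > min_minutes_90s]
    ranked = sorted(
        eligible,
        key=lambda row: (row[rating_type], row["goals_pens_per90"]),
        reverse=True,
    )
    return ranked[:rating_size]


ranking = top_players(players, "shots_per90", min_minutes_90s=14, rating_size=2)
print([row["player"] for row in ranking])
```

Player B shoots most often but is dropped by the playing-time filter, and the tie between A and C on shots is resolved by non-penalty goals per 90.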

Add the previously calculated share of shots taken from inside the penalty area.

print(df_events_gr.shape)
print(df_statistics_filtered.shape)

DF_merged = pd.merge(df_statistics_filtered, df_events_gr,
                     left_on=['player', 'team'],
                     right_on=['playerName', 'teamName'],
                     how='left')

print(DF_merged.shape)
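
Comparing shapes shows whether rows were duplicated, but not whether every player actually found a partner. One way to check both, sketched here on invented data, is to use the `validate` and `indicator` options of `pd.merge`. The frames and numbers below are made up for illustration.

```python
import pandas as pd

# Invented statistics and event aggregates; the Genoa player has no event rows
stats = pd.DataFrame({
    "player": ["Vitinha", "Vitinha", "Harry Kane"],
    "team": ["Paris S-G", "Genoa", "Bayern Munich"],
    "shots": [40, 12, 110],
})
events = pd.DataFrame({
    "playerName": ["Vitinha", "Harry Kane"],
    "teamName": ["Paris S-G", "Bayern Munich"],
    "share": [0.45, 0.85],
})

merged = pd.merge(
    stats, events,
    left_on=["player", "team"],
    right_on=["playerName", "teamName"],
    how="left",
    validate="one_to_one",  # raises if a key is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Players from the statistics table that found no match in the event data
unmatched = merged.loc[merged["_merge"] == "left_only", ["player", "team"]]
print(len(merged), unmatched["team"].tolist())
```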

Add club logos to our final dataframe, using the fuzzywuzzy library as well.

import os

from fuzzywuzzy import process

# os.listdir('../logos')

path_ = '../logos'

# League name -> subfolder with that league's club logos
dict_logo = {'Premier_League': 'GB1',
             'Serie_A': 'IT1',
             'LaLiga': 'ES1',
             'Bundesliga': 'L1',
             'Ligue_1': 'FR1'}

for n, row in DF_merged.iterrows():
    league = row['league_name']
    team_name = row['teamName_plot'].replace('_', ' ')

    path_to_league_logo = os.path.join(path_, dict_logo[league])
    choices = os.listdir(path_to_league_logo)

    # Pick the logo file whose name is closest to the team name
    logo_name_matched = process.extractOne(team_name, choices, score_cutoff=80)

    # Rows without a confident match get no path_logo value
    if logo_name_matched is not None:
        path_return = os.path.join(path_to_league_logo, logo_name_matched[0])
        DF_merged.loc[n, 'path_logo'] = path_return

Just to be sure, we check that we’ve correctly mapped the logo to the club name.

4.2 Drawing the Map

We execute the following cell, which calls a series of functions from the utils.py module imported at the beginning of the notebook. The maps themselves are built with the MplSoccer library, which offers a wide range of functionality for creating visualizations from football data.

import os

import cmasher as cmr
import matplotlib.pyplot as plt
from mycolorpy import colorlist as mcp

# set_plot_environment, draw_pitch, preprocessing, plot_statistics, add_image,
# plot_semicircle, add_annotations, add_statistics and add_title come from utils.py

# Sub-range of the 'Blues' colormap (skips the palest shades) and a discrete palette for text
cmap = cmr.get_sub_cmap('Blues', 0.3, 1)
colorlist = mcp.gen_color(cmap='Blues', n=7)
zoom_factor = 0.15

set_plot_environment()
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(10, 16), dpi=300)
axs = axs.flatten()  # Flatten the axis array for easy iteration

league = 'Europe' if len(leagues) == 5 else leagues[0]

# One pitch per player, in ranking order
for index, ax in enumerate(axs):
    pitch = draw_pitch(ax)

    dict_dfs = preprocessing(DF_merged, df_events, index)

    df_all_events = dict_dfs['df_all_events']
    df_goals = dict_dfs['df_goals']
    df_statistic = dict_dfs['df_statistic']

    plot_statistics(ax, pitch, df_all_events, df_goals, cmap)
    add_image(ax, df_statistic['path_logo'].values[0], (68, 115), zoom_factor)
    plot_semicircle(ax, df_all_events, 105, 34)

    player_info = {
        'playerName': df_statistic['playerName'].values[0],
        'team': df_statistic['teamName'].values[0],
        'games': int(df_statistic['games'].values[0]),
        'minutes_90s': round(df_statistic['minutes_90s'].values[0], 1),
        'shots': int(df_statistic['shots'].values[0]),
        'npxg_all': round(df_statistic['npxg'].values[0], 2),
        'npgoals': int(df_statistic['goals_pens'].values[0]),
        'share': df_statistic['share'].values[0],

        'S': round(df_statistic['shots_per90'].values[0], 1),
        'SoT%': int(df_statistic['shots_on_target_pct'].values[0]),
        'npG': round(df_statistic['goals_pens_per90'].values[0], 2),
        'npxG': round(df_statistic['npxg_per90'].values[0], 2),
        'npxG/S': df_statistic['npxg_per_shot'].values[0]
    }

    add_annotations(ax, df_all_events, player_info, colorlist)
    add_statistics(ax, player_info, type_=type_)
    add_title(fig, ax, player_info, league, date, min_minutes_90s, colorlist, type_=type_)

plt.subplots_adjust(wspace=0, hspace=-0.6)

# Create ../img/{date}/{league}/ if it does not exist yet
os.makedirs(f'../img/{date}/{league}/', exist_ok=True)

fig.savefig(f'../img/{date}/{league}/shot_map_top{rating_size}_in_{league}_by_{type_}.jpeg', bbox_inches='tight', dpi=300)

Finally, the visualization is saved to f'../img/{date}/{league}/shot_map_top{rating_size}_in_{league}_by_{type_}.jpeg'. We now have a ranking of the players who shoot most often in Europe (sorted by shots_per90).
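
The semi-circle on each map marks the player's average shot distance. The internals of plot_semicircle are not shown in the article, so the following is only a guess at the underlying calculation: it assumes coordinates in meters and that the arguments 105 and 34 are the center of the goal. `average_shot_distance` is a hypothetical helper.

```python
import math

# Assumed goal center on a 105 x 68 pitch (matches the 105, 34 passed to plot_semicircle)
GOAL_X, GOAL_Y = 105.0, 34.0


def average_shot_distance(shots):
    """Mean straight-line distance from each shot location to the goal center."""
    if not shots:
        raise ValueError("at least one shot is required")
    distances = [math.hypot(GOAL_X - x, GOAL_Y - y) for x, y in shots]
    return sum(distances) / len(distances)


# Three invented shots, 11 m, 5 m and 13 m from goal
shots = [(94.0, 34.0), (102.0, 38.0), (93.0, 29.0)]
radius = average_shot_distance(shots)
print(round(radius, 2))
```

The resulting value would be the radius of a semi-circle centered on the goal.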

To obtain a similar visualization for Serie A, for example:

  • set leagues = ['Serie-A']
  • set rating_type = 'goals_pens_per90'

Rerun the cells, and you will get the corresponding visualization.


Mikhail Borodastov

ML Product Manager 🚀 | ex- Data Scientist 📊 | Football Analytics Enthusiast ⚽