This document breaks down my journey and the insights I gleaned from analysing the football dataset using RStudio.
📊 EDA of Global Football Players
Cleaned, transformed, and visualized using R (tidyverse, plotly, janitor, sf).
Includes demographic analysis, skill profiling, and statistical testing.
The football dataset contains details about football players worldwide. Although the data appears outdated (players are currently 8 years older than in the dataset), it still provides a valuable opportunity to showcase my R skills in unlocking insights. I plan to update the analysis when newer data becomes available.
📥 Get the dataset from here
tidyverse, janitor, ggridges, plotly, rnaturalearth, rnaturalearthdata, rnaturalearthhires, sf
- Standardized column names using `janitor`.
- Investigated and corrected data types. Many numeric columns were incorrectly stored as characters.
- Identified messy values like `"72+5"` and outliers >100, cleaned them, and replaced extreme values with `NA`.

    # Data cleaning
    cleaned_df <- cleaned_df %>%
      # 1. Trim leading and trailing whitespace from all string columns
      mutate(across(where(is.character), str_trim)) %>%
      # 2. Detect values in the form "72+5" and keep the digits before the "+"
      mutate(
        across(all_of(cols_to_convert),
               ~ ifelse(str_detect(.x, "\\+"), str_extract(.x, "^\\d+"), .x))
      ) %>%
      # 3. Convert the columns to integers and replace values greater than 100 with NA
      mutate(
        across(all_of(cols_to_convert), ~ {
          val <- as.integer(.x)
          if_else(val > 100, NA_integer_, val)
        })
      )
- Checked for duplicates and removed them.
- Handled missing values via median imputation.

    # Check for missing values in the cleaned data
    colSums(is.na(cleaned_df))

    # Fill the missing values with their respective medians
    cleaned_df <- cleaned_df %>%
      mutate(across(all_of(cols_to_convert),
                    ~ replace_na(.x, median(.x, na.rm = TRUE))))
- Created a new feature `position` by mapping the first listed preferred position to one of: Goalkeeper, Defender, Midfielder, Forward.
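The `position` feature can be sketched as follows. This is a minimal illustration on toy data, not the exact code from the project script: the column name `preferred_positions`, the space-separated format, and the grouping of position codes are all assumptions.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the cleaned data; `preferred_positions` is an assumed column name
players <- tibble(
  name = c("A", "B", "C", "D"),
  preferred_positions = c("ST LW", "CDM CM", "GK", "LB LWB")
)

players <- players |>
  mutate(
    # Keep only the first listed position code
    first_position = word(preferred_positions, 1),
    # Map the code to a broad group (the code lists below are illustrative)
    position = case_when(
      first_position == "GK" ~ "Goalkeeper",
      first_position %in% c("CB", "LB", "RB", "LWB", "RWB") ~ "Defender",
      first_position %in% c("CDM", "CM", "CAM", "LM", "RM") ~ "Midfielder",
      first_position %in% c("ST", "CF", "LW", "RW") ~ "Forward",
      TRUE ~ NA_character_
    )
  )

players
```

Codes that match none of the groups become `NA`, which makes unmapped positions easy to spot with `count(players, position)`.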
Age Distribution
Most players are young, averaging 25 years. The distribution is similar across positions.
Nationality Representation
England, Germany, and Spain lead in player counts. Top football nations are concentrated in Europe and South America.
Position Breakdown
Midfielders and defenders dominate the dataset.
Average overall ratings are similar across positions, with goalkeepers having a slightly lower average rating.
Which attributes are associated with a high overall rating in each position?
To answer this, I explored correlations between overall rating and specialist attributes, including goalkeeper-specific metrics. `GGally::ggpairs` was used to visualise the correlations between the various attributes and the overall ratings.
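As a numeric complement to the pairs plot, the correlation of each attribute with `overall` can be tabulated per position. The sketch below uses synthetic data, and the attribute columns are placeholders for whichever attributes are of interest.

```r
library(dplyr)
library(tidyr)

set.seed(1)

# Synthetic data: overall tracks standing_tackle for defenders and finishing for forwards
toy <- tibble(
  position = rep(c("Defender", "Forward"), each = 50),
  standing_tackle = round(runif(100, 30, 90)),
  finishing = round(runif(100, 30, 90))
) |>
  mutate(
    overall = if_else(position == "Defender", standing_tackle, finishing) +
      round(rnorm(100, 0, 3))
  )

# Pearson correlation of each attribute with overall, computed within each position
attr_cor <- toy |>
  summarise(
    across(c(standing_tackle, finishing), ~ cor(.x, overall, use = "complete.obs")),
    .by = position
  ) |>
  pivot_longer(-position, names_to = "attribute", values_to = "correlation") |>
  arrange(position, desc(correlation))

attr_cor

# The matching pairs plot for one position would be along the lines of:
# GGally::ggpairs(filter(toy, position == "Defender"),
#                 columns = c("standing_tackle", "finishing", "overall"))
```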
Goalkeepers
With the exception of kicking, all the goalkeeper attributes had a strong positive correlation with overall rating.
Defenders
Attributes such as standing tackle, sliding tackle, marking and interceptions, among others, were strongly correlated with the overall rating of a defender.
Midfielders
Attributes such as vision, ball control, composure, reactions and short passing correlated well with the overall rating of a midfielder.
Forwards
Several attributes had strong positive correlations with the overall rating of a forward, including reactions, positioning, finishing, composure and ball control. Interestingly, aggression, penalties, acceleration and heading accuracy had rather weak correlations with the overall rating of forwards.
Reactions was the only attribute that had a strong positive correlation with overall rating across all positions.
Identifying players with exceptional skills in specific attributes:
- Best dribblers

    cleaned_df |>
      select(name, nationality, position, dribbling) |>
      filter(dribbling >= 90) |>
      slice_max(dribbling, n = 5, with_ties = TRUE)

  Lionel Messi of Argentina is the best dribbler, followed closely by Neymar.

- Best free kick takers

    cleaned_df |>
      select(name, nationality, position, free_kick_accuracy) |>
      slice_max(free_kick_accuracy, n = 5)

  Çalhanoğlu and A. Pirlo are the best free kick takers.

- Best finishers - Forwards

    cleaned_df |>
      select(name, nationality, position, finishing) |>
      filter(position == "Forward") |>
      slice_max(finishing, n = 10)

  The best finishers among forwards are L. Messi, C. Ronaldo and L. Suarez.

- Most powerful shots

    cleaned_df |>
      select(name, nationality, position, shot_power) |>
      # filter(position == "Forward") |>
      slice_max(shot_power, n = 5)

  Cristiano Ronaldo has the most powerful shot.
Players improve from under 20, peak in late 20s to early 30s, then gradually decline. Goalkeepers follow a similar trend but with slightly lower ratings.
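One way to check this trend is to average overall rating by age band and position. This is a minimal sketch on toy data: the `age`, `position` and `overall` columns are assumed to match the cleaned dataset, and the band edges are illustrative.

```r
library(dplyr)

# Toy data standing in for the cleaned dataset
toy <- tibble(
  age = c(18, 19, 27, 28, 36, 27),
  position = c("Forward", "Forward", "Forward", "Forward", "Forward", "Goalkeeper"),
  overall = c(60, 62, 80, 82, 70, 76)
)

age_trend <- toy |>
  mutate(
    # Right-closed bands: (-Inf, 20], (20, 25], (25, 30], (30, 35], (35, Inf)
    age_band = cut(
      age,
      breaks = c(-Inf, 20, 25, 30, 35, Inf),
      labels = c("20 and under", "21-25", "26-30", "31-35", "36+")
    )
  ) |>
  summarise(
    mean_overall = mean(overall, na.rm = TRUE),
    num_of_players = n(),
    .by = c(age_band, position)
  ) |>
  arrange(position, age_band)

age_trend
```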
- Compared average attribute profiles across nationalities.

    # Top 6 countries in terms of number of footballers produced
    top_countries <- cleaned_df |>
      summarise(
        num_of_players = n(),
        average_overall = mean(overall, na.rm = TRUE),
        .by = nationality
      ) |>
      slice_max(num_of_players, n = 6)

| Country   | Number of players | Average overall |
|-----------|-------------------|-----------------|
| England   | 1629              | 63.1            |
| Germany   | 1135              | 65.8            |
| Spain     | 1009              | 69.9            |
| France    | 973               | 67.2            |
| Argentina | 961               | 67.7            |
| Brazil    | 809               | 70.9            |
Among these six countries, Brazil had the highest average rating in every position.
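The per-position comparison can be sketched by restricting the data to the most represented countries and then averaging `overall` by nationality and position. The toy data and the choice of `n = 2` are only for illustration.

```r
library(dplyr)

# Toy data standing in for the cleaned dataset
toy <- tibble(
  nationality = c("Brazil", "Brazil", "Brazil", "England", "England", "England", "Ghana"),
  position = c("Forward", "Defender", "Forward", "Forward", "Defender", "Defender", "Forward"),
  overall = c(80, 75, 78, 70, 72, 68, 90)
)

# Most represented countries (n = 2 here; the analysis above uses 6)
top_countries <- toy |>
  count(nationality, name = "num_of_players") |>
  slice_max(num_of_players, n = 2)

# Average overall rating per country and position, top countries only
by_position <- toy |>
  semi_join(top_countries, by = "nationality") |>
  summarise(
    average_overall = mean(overall, na.rm = TRUE),
    .by = c(nationality, position)
  )

# Country with the highest average rating in each position
leaders <- by_position |>
  group_by(position) |>
  slice_max(average_overall, n = 1) |>
  ungroup()

leaders
```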
- Compared goalkeeper-specific attributes vs outfielders.
- Goalkeepers excel in diving, reflexes, positioning.
- Outfielders slightly outperform in jumping.
Goalkeepers had lower ratings in attributes outside their specialist set.
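A comparison like this can be produced by labelling each player as goalkeeper or outfielder and averaging each attribute per group. The sketch below uses made-up values, and the column names (`gk_diving`, `jumping`, `finishing`) are assumptions.

```r
library(dplyr)
library(tidyr)

# Toy data; attribute column names are assumed
toy <- tibble(
  position = c("Goalkeeper", "Goalkeeper", "Defender", "Forward"),
  gk_diving = c(80, 84, 10, 12),
  jumping = c(60, 64, 70, 72),
  finishing = c(12, 14, 40, 80)
)

profile <- toy |>
  mutate(player_group = if_else(position == "Goalkeeper", "Goalkeeper", "Outfielder")) |>
  summarise(
    across(c(gk_diving, jumping, finishing), ~ mean(.x, na.rm = TRUE)),
    .by = player_group
  ) |>
  # One row per attribute, one column per group
  pivot_longer(-player_group, names_to = "attribute", values_to = "mean_rating") |>
  pivot_wider(names_from = player_group, values_from = mean_rating) |>
  # Positive gap: goalkeepers rate higher; negative gap: outfielders rate higher
  mutate(gap = Goalkeeper - Outfielder)

profile
```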
Hypothesis Testing
- Strength: Goalkeepers vs Outfielders
There was a statistically significant difference in the strength of goalkeepers and outfield players.
- Overall Rating: Goalkeepers vs Outfielders
There was a statistically significant difference in the overall rating of goalkeepers and outfield players.
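This README does not state which test was used. One plausible choice for comparing two group means is a Welch two-sample t-test, sketched here on simulated data; the group sizes, means and spreads are invented for illustration.

```r
set.seed(42)

# Simulated strength ratings; all numbers here are made up
toy <- data.frame(
  player_group = rep(c("Goalkeeper", "Outfielder"), times = c(40, 200)),
  strength = c(rnorm(40, mean = 62, sd = 8), rnorm(200, mean = 66, sd = 12))
)

# t.test() with a formula runs Welch's test by default (var.equal = FALSE)
strength_test <- t.test(strength ~ player_group, data = toy)

strength_test$method
strength_test$p.value
```

The same call with `overall` in place of `strength` would cover the second comparison. If the ratings are clearly non-normal, `wilcox.test()` accepts the same formula interface.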
- Players are generally young; midfielders and defenders are most common.
- Europe and South America dominate football talent.
- Goalkeepers have lower average ratings but excel in specialized attributes.
- Messi and Ronaldo top various skill categories.
- Finishing correlates strongly with overall rating for forwards.
- Brazil leads in average ratings across all positions.
- Statistical tests confirm meaningful differences between goalkeepers and outfielders.
📄 Refer to the attached R script file for full reproducibility.
- Cluster positions using PCA + k-means.
- Explore attribute correlations via heatmaps.
- Perform dimensionality reduction.
- Cluster players to uncover latent role groupings.
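As a starting point for the PCA + k-means idea, the following base R sketch reduces a small synthetic attribute matrix to two principal components and clusters the scores. The attribute names, the number of components and the number of clusters are all placeholder choices.

```r
set.seed(123)

# Two synthetic groups of 20 players each, with three attributes
toy <- rbind(
  matrix(rnorm(60, mean = 40, sd = 5), ncol = 3),
  matrix(rnorm(60, mean = 75, sd = 5), ncol = 3)
)
colnames(toy) <- c("finishing", "standing_tackle", "vision")

# PCA on centred, scaled attributes
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Keep the first two principal components as clustering input
scores <- pca$x[, 1:2]

# k-means on the component scores; nstart reruns from several random starts
clusters <- kmeans(scores, centers = 2, nstart = 25)

table(clusters$cluster)
```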
📬 Contact
If you'd like to connect, collaborate, or discuss this project further:
📧 Email: mathiasofosu2@gmail.com
💼 LinkedIn: Mathias Ofosu
🧠 GitHub Profile: Mathias Ofosu
Twitter/X: Mathias Ofosu
Feel free to reach out — I’m always open to data-driven conversations.















