A comprehensive ETL (Extract, Transform, Load) pipeline for collecting and storing Premier League standings data from the 2020-21 season through 2024-25 season. This project extracts standings data from the API-Sports.io football API, transforms it into a structured format, and loads it into a MySQL database.
- Overview
- Features
- Project Structure
- Prerequisites
- Installation
- Configuration
- Database Schema
- Code Architecture
- Data Transformation Logic
- Usage
- ETL Pipeline Steps
- API Reference
- Troubleshooting
- Additional Documentation
- Contributing
- License
This project implements a data pipeline that:
- Extracts Premier League standings data from the API-Sports.io REST API
- Transforms the nested JSON response into a flat, structured DataFrame
- Loads the transformed data into a MySQL database using UPSERT operations
The pipeline is designed to handle multiple seasons (2020-2025) and can be easily extended to include additional seasons or leagues.
- ✅ Automated data extraction from API-Sports.io
- ✅ Robust data transformation with error handling
- ✅ UPSERT operations to prevent duplicate records
- ✅ Support for multiple seasons (2020-2025)
- ✅ Structured data storage in MySQL
- ✅ Environment-based configuration management
- ✅ Comprehensive logging and error handling
```
PL-Standings-2020-25/
│
├── main.py            # Main pipeline script (to be implemented)
├── test.ipynb         # Jupyter notebook for testing and development
├── requirements.txt   # Python dependencies
├── .env               # Environment variables (not in repo)
├── .gitignore         # Git ignore file
├── README.md          # Project documentation
└── assets/            # Additional assets (if any)
```
Before you begin, ensure you have the following installed:
- Python 3.9+ (Python 3.11 recommended)
- MySQL Server 5.7+ or MySQL 8.0+
- API-Sports.io API key (register at API-Sports.io to get one)
- Git (for cloning the repository)
```bash
git clone <repository-url>
cd PL-Standings-2020-25
```

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

```bash
pip install -r requirements.txt
pip install jupyter jupyterlab
```

Create a `.env` file in the project root directory with the following variables:
```env
# API Configuration
API_KEY=your_api_sports_io_key_here
API_HOST=v3.football.api-sports.io

# MySQL Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=your_mysql_username
MYSQL_PASSWORD=your_mysql_password
MYSQL_DATABASE=your_database_name
```

**Important:** Never commit your `.env` file to version control. It is already included in `.gitignore`.
- Create the database:

  ```sql
  CREATE DATABASE your_database_name CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  ```

- Create the standings table:

  ```sql
  CREATE TABLE standings (
      season INT NOT NULL,
      position INT NOT NULL,
      team_id INT NOT NULL,
      team VARCHAR(100) NOT NULL,
      played INT NOT NULL,
      won INT NOT NULL,
      draw INT NOT NULL,
      lost INT NOT NULL,
      goals_for INT NOT NULL,
      goals_against INT NOT NULL,
      goals_diff INT NOT NULL,
      points INT NOT NULL,
      form VARCHAR(10),
      PRIMARY KEY (season, team_id),
      INDEX idx_season (season),
      INDEX idx_position (season, position),
      INDEX idx_team_id (team_id)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
  ```

Key points:

- The composite primary key on `(season, team_id)` ensures one row per team per season
- Indexes on `season`, `(season, position)`, and `team_id` optimize common queries
- The `utf8mb4` character set supports emojis and special characters in team names
The `env_vars` dictionary is a central data structure that contains all configuration values loaded from the `.env` file. It is created by the `load_environment_variables()` function and passed throughout the application.

```python
env_vars = {
    "api_key": "your_api_key_here",           # API-Sports.io authentication key
    "api_host": "v3.football.api-sports.io",  # API host URL
    "mysql_host": "localhost",                # MySQL server hostname
    "mysql_port": "3306",                     # MySQL server port
    "mysql_user": "your_username",            # MySQL username
    "mysql_password": "your_password",        # MySQL password
    "mysql_database": "your_database_name"    # MySQL database name
}
```

- Created in the `main()` function:

  ```python
  env_vars = load_environment_variables()
  ```

- Passed to functions that need configuration:

  ```python
  # For API calls
  response = fetch_standings(env_vars["api_key"], LEAGUE_ID, SEASON)

  # For database operations
  load_standings_to_database(df, env_vars)
  ```

- Accessed in the `load_standings_to_database()` function:

  ```python
  db_connection = connector.connect(
      host=env_vars["mysql_host"],
      port=env_vars["mysql_port"],
      user=env_vars["mysql_user"],
      password=env_vars["mysql_password"],
      database=env_vars["mysql_database"]
  )
  ```

Benefits:

- Keeps sensitive credentials out of code
- Centralizes configuration management
- Makes functions more testable (mock dictionaries can be passed in)
- Allows easy switching between environments (dev, staging, prod)
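Since `main.py` is still to be implemented, here is a minimal sketch of what `load_environment_variables()` could look like. The key names match the dictionary above, but the validation details are an assumption, not the project's actual code:

```python
import os

# python-dotenv is optional here: if installed, it reads .env into the
# process environment; otherwise we rely on variables already exported.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass


def load_environment_variables():
    """Build the env_vars dictionary from environment variables."""
    required = {
        "api_key": "API_KEY",
        "api_host": "API_HOST",
        "mysql_host": "MYSQL_HOST",
        "mysql_port": "MYSQL_PORT",
        "mysql_user": "MYSQL_USER",
        "mysql_password": "MYSQL_PASSWORD",
        "mysql_database": "MYSQL_DATABASE",
    }
    env_vars = {key: os.getenv(var) for key, var in required.items()}
    # Fail fast with a clear message if any variable is missing
    missing = [var for key, var in required.items() if env_vars[key] is None]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env_vars
```

Failing fast here means a typo in `.env` surfaces at startup rather than as a confusing connection error later in the pipeline.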
The `table_name` parameter specifies which MySQL table to insert/update data in. In the original notebook this was a variable called `sql_table`, but in `main.py` it has been refactored into a function parameter for better flexibility.

```python
def load_standings_to_database(df, env_vars, table_name="standings"):
    # table_name defaults to "standings" if not specified
    # and can be overridden when calling the function
```

- Table verification:

  ```python
  verify_table_exists(cur, table_name)  # checks the table exists before inserting
  ```

- UPSERT SQL statement:

  ```python
  upsert_sql = f"""
      INSERT INTO {table_name} (season, position, team_id, ...)
      ...
  """
  ```

Note: in the original `test.ipynb` notebook, this was a hardcoded variable:

```python
sql_table = "standings"  # original notebook approach
```

The refactored version in `main.py` uses a parameter with a default value, making the function more reusable and testable.
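The `verify_table_exists()` helper is referenced but not shown. Here is a minimal sketch, assuming it receives an open cursor and uses the `SHOW TABLES LIKE` query mentioned later in this README:

```python
def verify_table_exists(cur, table_name):
    """Raise early if the target table is missing, so the pipeline
    fails with a clear message instead of a mid-insert error."""
    cur.execute("SHOW TABLES LIKE %s", (table_name,))
    if cur.fetchone() is None:
        raise RuntimeError(f"Table '{table_name}' is NOT found")
```

Because the cursor is passed in, the helper can be unit-tested with a fake cursor object, without a live MySQL connection.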
The API returns a nested JSON structure:

```json
{
  "response": [
    {
      "league": {
        "id": 39,
        "name": "Premier League",
        "season": 2023,
        "standings": [
          [
            {
              "rank": 1,
              "team": {
                "id": 50,
                "name": "Manchester City"
              },
              "points": 91,
              "goalsDiff": 62,
              "form": "WWWWW",
              "all": {
                "played": 38,
                "win": 28,
                "draw": 7,
                "lose": 3,
                "goals": {
                  "for": 96,
                  "against": 34
                }
              }
            }
          ]
        ]
      }
    }
  ]
}
```

The transformation logic extracts and flattens the nested structure:

```python
standings_list = response["response"][0]["league"]["standings"][0]
```

For each team in the standings, the following mapping occurs:
| API Field (Nested) | Transformed Field | Data Type | Notes |
|---|---|---|---|
| `season` (from params) | `season` | INT | Season year (e.g., 2023) |
| `rank` | `position` | INT | League position (1-20) |
| `team.id` | `team_id` | INT | Unique team identifier |
| `team.name` | `team` | VARCHAR(100) | Team name |
| `all.played` | `played` | INT | Total matches played |
| `all.win` | `won` | INT | Matches won |
| `all.draw` | `draw` | INT | Matches drawn |
| `all.lose` | `lost` | INT | Matches lost |
| `all.goals.for` | `goals_for` | INT | Goals scored |
| `all.goals.against` | `goals_against` | INT | Goals conceded |
| `goalsDiff` | `goals_diff` | INT | Goal difference |
| `points` | `points` | INT | Total points |
| `form` | `form` | VARCHAR(10) | Recent form (e.g., "WWWWW") |
```python
rows = []
column_names = ['season', 'position', 'team_id', 'team', 'played', 'won',
                'draw', 'lost', 'goals_for', 'goals_against', 'goals_diff',
                'points', 'form']

for club in standings_list:
    season = SEASON
    position = club['rank']
    team_id = club['team']['id']
    team = club['team']['name']
    played = club['all']['played']
    won = club['all']['win']
    draw = club['all']['draw']
    lost = club['all']['lose']
    goals_for = club['all']['goals']['for']
    goals_against = club['all']['goals']['against']
    goals_diff = club['goalsDiff']
    points = club['points']
    form = club['form']
    rows.append((season, position, team_id, team, played, won, draw,
                 lost, goals_for, goals_against, goals_diff, points, form))

df = pd.DataFrame(rows, columns=column_names)
```

The transformation includes implicit validation:
- Type Safety: All numeric fields are extracted as integers
- Completeness: All 20 teams are expected per season
- Data Integrity: Primary key constraint prevents duplicate entries
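The extraction loop and the completeness check can be combined into a single function. This is a sketch, not the project's actual code: the name `transform_standings` and the explicit `ValueError` are assumptions.

```python
def transform_standings(response, season):
    """Flatten the nested standings payload into flat row tuples,
    enforcing the 20-team completeness check described above."""
    standings_list = response["response"][0]["league"]["standings"][0]
    rows = []
    for club in standings_list:
        rows.append((
            season,
            int(club["rank"]),
            int(club["team"]["id"]),
            club["team"]["name"],
            int(club["all"]["played"]),
            int(club["all"]["win"]),
            int(club["all"]["draw"]),
            int(club["all"]["lose"]),
            int(club["all"]["goals"]["for"]),
            int(club["all"]["goals"]["against"]),
            int(club["goalsDiff"]),
            int(club["points"]),
            club["form"],
        ))
    if len(rows) != 20:  # a Premier League season always has 20 teams
        raise ValueError(f"Expected 20 teams, got {len(rows)}")
    return rows
```

Casting each numeric field with `int()` makes the type-safety assumption explicit instead of relying on the API always returning integers.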
- Start Jupyter:

  ```bash
  jupyter notebook
  ```

- Open `test.ipynb` and run the cells sequentially

- Modify the `SEASON` variable to test different seasons:

  ```python
  SEASON = 2023  # Change to 2020, 2021, 2022, 2023, 2024, or 2025
  ```

To process multiple seasons (2020-2025), you can modify the code to loop through seasons:
```python
seasons = [2020, 2021, 2022, 2023, 2024, 2025]

for season in seasons:
    # Set season
    SEASON = season
    params = {
        "league": LEAGUE_ID,
        "season": SEASON
    }

    # Extract
    request = requests.get(url=url, headers=headers, params=params)
    response = request.json()

    # Transform (same logic as above)
    # ...

    # Load (same logic as above)
    # ...
```
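A reusable version of that loop can take the three stages as callables and pause between requests to respect the API rate limit. This is a sketch: `run_pipeline` and its parameters are hypothetical names, not part of the project.

```python
import time


def run_pipeline(seasons, fetch, transform, load, delay_seconds=1.0):
    """Run extract -> transform -> load for each season in order.

    fetch(season) returns the raw API response, transform(response, season)
    flattens it, and load(rows) writes the result to MySQL; all three are
    stand-ins for the steps shown above.
    """
    for i, season in enumerate(seasons):
        if i > 0:
            time.sleep(delay_seconds)  # simple rate limiting between requests
        response = fetch(season)
        rows = transform(response, season)
        load(rows)
```

Injecting the stages as parameters keeps the driver loop testable: in a unit test the network and database calls can be replaced with in-memory stubs.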
- Load environment variables:

  ```python
  # Creates the env_vars dictionary with all configuration
  env_vars = load_environment_variables()
  # Returns: {"api_key": "...", "mysql_host": "...", ...}
  ```

- Configure API parameters:

  ```python
  # Access the API key from the env_vars dictionary
  url = "https://v3.football.api-sports.io/standings"
  headers = {"x-apisports-key": env_vars["api_key"]}
  params = {"league": 39, "season": 2023}
  ```

- Make the API request:

  ```python
  request = requests.get(url=url, headers=headers, params=params)
  response = request.json()
  ```

- Validate the response:

  - Check for API errors: `response.get("errors", [])`
  - Verify results: `response.get("results", 0) > 0`
- Extract the standings data:

  ```python
  standings_list = response["response"][0]["league"]["standings"][0]
  ```

- Flatten the nested structure:

  - Iterate through each team in the standings
  - Extract the required fields from the nested JSON
  - Map them to a flat structure

- Create the DataFrame:

  ```python
  df = pd.DataFrame(rows, columns=column_names)
  ```

- Run data quality checks:

  - Verify 20 teams per season
  - Check for null values
  - Validate data types
- Establish the database connection:

  ```python
  # Uses the env_vars dictionary for connection parameters
  db_connection = connector.connect(
      host=env_vars["mysql_host"],
      port=env_vars["mysql_port"],
      user=env_vars["mysql_user"],
      password=env_vars["mysql_password"],
      database=env_vars["mysql_database"]
  )
  ```

- Verify the table exists:

  ```python
  # table_name parameter (defaults to "standings")
  # In the original notebook, this was: sql_table = "standings"
  verify_table_exists(cur, table_name)
  # or: cur.execute("SHOW TABLES LIKE %s", (table_name,))
  ```

- Prepare the data for insertion:

  ```python
  standings_records_tuples = standings_df.itertuples(index=False, name=None)
  list_of_standings_records_tuples = list(standings_records_tuples)
  ```

- Execute the UPSERT operation:

  ```python
  # Uses the table_name parameter (or sql_table in the original notebook)
  UPSERT_SQL = f"""
      INSERT INTO {table_name}
          (season, position, team_id, team, played, won, draw, lost,
           goals_for, goals_against, goals_diff, points, form)
      VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) AS src
      ON DUPLICATE KEY UPDATE
          position = src.position,
          team = src.team,
          played = src.played,
          won = src.won,
          draw = src.draw,
          lost = src.lost,
          goals_for = src.goals_for,
          goals_against = src.goals_against,
          goals_diff = src.goals_diff,
          points = src.points,
          form = src.form;
  """
  cur.executemany(UPSERT_SQL, list_of_standings_records_tuples)
  db_connection.commit()
  ```

  Note: the `VALUES ... AS src` row alias requires MySQL 8.0.19 or later; on older servers, use the `VALUES(column)` function in the `ON DUPLICATE KEY UPDATE` clause instead.

- Handle errors and clean up:

  ```python
  try:
      cur.executemany(UPSERT_SQL, list_of_standings_records_tuples)
      db_connection.commit()
      print(f"[SUCCESS] - Upsert attempted for {len(list_of_standings_records_tuples)} rows!")
  except Exception as e:
      db_connection.rollback()
      print(f"[ERROR] - Rolled back due to: {e}")
  finally:
      cur.close()
      db_connection.close()
  ```
- Base URL: `https://v3.football.api-sports.io`
- Endpoint: `/standings`
- Method: `GET`
- Authentication: header `x-apisports-key`
| Parameter | Type | Required | Description |
|---|---|---|---|
| `league` | Integer | Yes | League ID (39 for the Premier League) |
| `season` | Integer | Yes | Season year (e.g., 2023) |
- Success: returns JSON with standings data
- Error: returns JSON with error messages in the `errors` array
- Rate limits: check the API documentation for current limits
- League ID: `39`
- League name: Premier League
- Country: England
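As a quick sanity check, the request described above can be prepared without actually sending it, so the URL, headers, and query string are easy to inspect. `build_standings_request` is a hypothetical helper, not part of the project:

```python
import requests


def build_standings_request(api_key, league_id=39, season=2023):
    """Prepare (but do not send) a standings request for inspection."""
    return requests.Request(
        "GET",
        "https://v3.football.api-sports.io/standings",
        headers={"x-apisports-key": api_key},
        params={"league": league_id, "season": season},
    ).prepare()
```

Sending the prepared request is one extra step (`requests.Session().send(req)`), which keeps network access out of tests that only need to verify request construction.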
Error: Invalid API key
Solution: Verify your API key in the .env file and ensure it's active.
Error: Can't connect to MySQL server
Solution:
- Verify MySQL server is running
- Check the connection parameters in `.env`
- Ensure the firewall allows connections on port 3306
Error: Table 'standings' is NOT found
Solution: Create the table using the SQL script provided in the Database Schema section.
Error: Duplicate entry for key 'PRIMARY'
Solution: This shouldn't occur with UPSERT, but if it does, check that the primary key constraint is correctly defined.
Error: Too many requests
Solution:
- Implement rate limiting between requests
- Use `time.sleep()` to add delays
- Consider upgrading the API plan
- Enable verbose logging:

  ```python
  import logging
  logging.basicConfig(level=logging.DEBUG)
  ```

- Test the API connection:

  ```python
  print(f"API Key: {API_KEY[:10]}...")  # print only the first 10 characters
  print(f"Response Status: {request.status_code}")
  ```

- Validate the DataFrame:

  ```python
  print(df.info())
  print(df.head())
  print(df.isnull().sum())
  ```
- Never commit `.env` files - already covered by `.gitignore`
- Use environment variables for all sensitive data
- Limit database user permissions - Use read/write user, not root
- Implement API key rotation - Regularly update API keys
- Use connection pooling for production deployments
- Enable SSL/TLS for MySQL connections in production
- Implement the `main.py` script for production use
- Add support for other leagues (Championship, La Liga, etc.)
- Implement data validation and quality checks
- Add automated scheduling (cron jobs, Airflow, etc.)
- Create data visualization dashboards
- Add unit tests and integration tests
- Implement logging to files
- Add data export functionality (CSV, JSON)
- Create API endpoints for data access
- Add data archiving for historical seasons
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- API-Sports.io for providing the football data API
- Premier League for the official standings data
This project includes several documentation files:
- README.md - Main project documentation (this file)
- QUICK_START.md - Quick setup guide for getting started
- DATA_DICTIONARY.md - Detailed field descriptions and data dictionary
- database_schema.sql - Complete database schema with indexes and sample queries
For questions or support, please open an issue in the repository.
Note: This project is for educational and personal use. Ensure compliance with API-Sports.io terms of service and rate limits.