Skip to content

ghfix/PL-Standings-2020-25

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

PL-Standings-2020-25

A comprehensive ETL (Extract, Transform, Load) pipeline for collecting and storing Premier League standings data from the 2020-21 season through 2024-25 season. This project extracts standings data from the API-Sports.io football API, transforms it into a structured format, and loads it into a MySQL database.

πŸ“‹ Table of Contents

🎯 Overview

This project implements a data pipeline that:

  1. Extracts Premier League standings data from the API-Sports.io REST API
  2. Transforms the nested JSON response into a flat, structured DataFrame
  3. Loads the transformed data into a MySQL database using UPSERT operations

The pipeline is designed to handle multiple seasons (2020-2025) and can be easily extended to include additional seasons or leagues.

✨ Features

  • βœ… Automated data extraction from API-Sports.io
  • βœ… Robust data transformation with error handling
  • βœ… UPSERT operations to prevent duplicate records
  • βœ… Support for multiple seasons (2020-2025)
  • βœ… Structured data storage in MySQL
  • βœ… Environment-based configuration management
  • βœ… Comprehensive logging and error handling

πŸ“ Project Structure

PL-Standings-2020-25/
β”‚
β”œβ”€β”€ main.py                 # Main pipeline script (to be implemented)
β”œβ”€β”€ test.ipynb             # Jupyter notebook for testing and development
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ .env                    # Environment variables (not in repo)
β”œβ”€β”€ .gitignore             # Git ignore file
β”œβ”€β”€ README.md              # Project documentation
└── assets/                # Additional assets (if any)

πŸ”§ Prerequisites

Before you begin, ensure you have the following installed:

  • Python 3.9+ (Python 3.11 recommended)
  • MySQL Server 5.7+ or MySQL 8.0+
  • API-Sports.io API Key (Get one here)
  • Git (for cloning the repository)

πŸ“¦ Installation

1. Clone the Repository

git clone <repository-url>
cd PL-Standings-2020-25

2. Create a Virtual Environment

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Install Jupyter (Optional, for notebook usage)

pip install jupyter jupyterlab

βš™οΈ Configuration

Environment Variables

Create a .env file in the project root directory with the following variables:

# API Configuration
API_KEY=your_api_sports_io_key_here
API_HOST=v3.football.api-sports.io

# MySQL Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=your_mysql_username
MYSQL_PASSWORD=your_mysql_password
MYSQL_DATABASE=your_database_name

Important: Never commit your .env file to version control. It's already included in .gitignore.

Database Setup

  1. Create the Database:
CREATE DATABASE your_database_name CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  1. Create the Standings Table:
CREATE TABLE standings (
    season INT NOT NULL,
    position INT NOT NULL,
    team_id INT NOT NULL,
    team VARCHAR(100) NOT NULL,
    played INT NOT NULL,
    won INT NOT NULL,
    draw INT NOT NULL,
    lost INT NOT NULL,
    goals_for INT NOT NULL,
    goals_against INT NOT NULL,
    goals_diff INT NOT NULL,
    points INT NOT NULL,
    form VARCHAR(10),
    PRIMARY KEY (season, team_id),
    INDEX idx_season (season),
    INDEX idx_position (season, position),
    INDEX idx_team_id (team_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Key Points:

  • Composite primary key on (season, team_id) ensures uniqueness per season
  • Indexes on season, (season, position), and team_id for query optimization
  • UTF8MB4 character set supports emojis and special characters in team names

πŸ—οΈ Code Architecture

Environment Variables Dictionary (env_vars)

The env_vars dictionary is a central data structure that contains all configuration values loaded from the .env file. It's created by the load_environment_variables() function and passed throughout the application.

Structure

env_vars = {
    "api_key": "your_api_key_here",           # API-Sports.io authentication key
    "api_host": "v3.football.api-sports.io",   # API host URL
    "mysql_host": "localhost",                # MySQL server hostname
    "mysql_port": "3306",                      # MySQL server port
    "mysql_user": "your_username",             # MySQL username
    "mysql_password": "your_password",         # MySQL password
    "mysql_database": "your_database_name"     # MySQL database name
}

How It's Used

  1. Created in main() function:

    env_vars = load_environment_variables()
  2. Passed to functions that need configuration:

    # For API calls
    response = fetch_standings(env_vars["api_key"], LEAGUE_ID, SEASON)
    
    # For database operations
    load_standings_to_database(df, env_vars)
  3. Accessed in load_standings_to_database() function:

    db_connection = connector.connect(
        host=env_vars["mysql_host"],
        port=env_vars["mysql_port"],
        user=env_vars["mysql_user"],
        password=env_vars["mysql_password"],
        database=env_vars["mysql_database"]
    )

Benefits:

  • Keeps sensitive credentials out of code
  • Centralizes configuration management
  • Makes functions more testable (can pass mock dictionaries)
  • Allows easy switching between environments (dev, staging, prod)

Table Name Parameter (table_name)

The table_name parameter specifies which MySQL table to insert/update data in. In the original notebook, this was a variable called sql_table, but in main.py it's been refactored as a function parameter for better flexibility.

Usage

def load_standings_to_database(df, env_vars, table_name="standings"):
    # table_name defaults to "standings" if not specified
    # Can be overridden when calling the function

Where It's Used

  1. Table Verification:

    verify_table_exists(cur, table_name)
    # Checks if the specified table exists before attempting to insert data
  2. UPSERT SQL Statement:

    upsert_sql = f"""
    INSERT INTO {table_name}
    (season, position, team_id, ...)
    ...
    """

Note: In the original test.ipynb notebook, this was a hardcoded variable:

sql_table = "standings"  # Original notebook approach

The refactored version in main.py uses a parameter with a default value, making the function more reusable and testable.

πŸ”„ Data Transformation Logic

1. API Response Structure

The API returns a nested JSON structure:

{
  "response": [
    {
      "league": {
        "id": 39,
        "name": "Premier League",
        "season": 2023,
        "standings": [
          [
            {
              "rank": 1,
              "team": {
                "id": 50,
                "name": "Manchester City"
              },
              "points": 91,
              "goalsDiff": 62,
              "form": "WWWWW",
              "all": {
                "played": 38,
                "win": 28,
                "draw": 7,
                "lose": 3,
                "goals": {
                  "for": 96,
                  "against": 34
                }
              }
            }
          ]
        ]
      }
    }
  ]
}

2. Transformation Process

The transformation logic extracts and flattens the nested structure:

Step 1: Extract Standings List

standings_list = response["response"][0]["league"]["standings"][0]

Step 2: Transform Each Team Record

For each team in the standings, the following mapping occurs:

API Field (Nested) Transformed Field Data Type Notes
season (from params) season INT Season year (e.g., 2023)
rank position INT League position (1-20)
team.id team_id INT Unique team identifier
team.name team VARCHAR(100) Team name
all.played played INT Total matches played
all.win won INT Matches won
all.draw draw INT Matches drawn
all.lose lost INT Matches lost
all.goals.for goals_for INT Goals scored
all.goals.against goals_against INT Goals conceded
goalsDiff goals_diff INT Goal difference
points points INT Total points
form form VARCHAR(10) Recent form (e.g., "WWWWW")

Step 3: Create DataFrame

rows = []
column_names = ['season', 'position', 'team_id', 'team', 'played', 'won', 
                'draw', 'lost', 'goals_for', 'goals_against', 'goals_diff', 
                'points', 'form']

for club in standings_list:
    season = SEASON
    position = club['rank']
    team_id = club['team']['id']
    team = club['team']['name']
    played = club['all']['played']
    won = club['all']['win']
    draw = club['all']['draw']
    lost = club['all']['lose']
    goals_for = club['all']['goals']['for']
    goals_against = club['all']['goals']['against']
    goals_diff = club['goalsDiff']
    points = club['points']
    form = club['form']
    
    rows.append((season, position, team_id, team, played, won, draw, 
                 lost, goals_for, goals_against, goals_diff, points, form))

df = pd.DataFrame(rows, columns=column_names)

3. Data Validation

The transformation includes implicit validation:

  • Type Safety: All numeric fields are extracted as integers
  • Completeness: All 20 teams are expected per season
  • Data Integrity: Primary key constraint prevents duplicate entries

πŸš€ Usage

Running the Notebook (Development/Testing)

  1. Start Jupyter:
jupyter notebook
  1. Open test.ipynb and run cells sequentially

  2. Modify the SEASON variable to test different seasons:

SEASON = 2023  # Change to 2020, 2021, 2022, 2023, 2024, or 2025

Running for Multiple Seasons

To process multiple seasons (2020-2025), you can modify the code to loop through seasons:

seasons = [2020, 2021, 2022, 2023, 2024, 2025]

for season in seasons:
    # Set season
    SEASON = season
    params = {
        "league": LEAGUE_ID,
        "season": SEASON
    }
    
    # Extract
    request = requests.get(url=url, headers=headers, params=params)
    response = request.json()
    
    # Transform (same logic as above)
    # ...
    
    # Load (same logic as above)
    # ...

πŸ“Š ETL Pipeline Steps

Phase 1: EXTRACT

  1. Load Environment Variables

    # Creates env_vars dictionary with all configuration
    env_vars = load_environment_variables()
    # Returns: {"api_key": "...", "mysql_host": "...", etc.}
  2. Configure API Parameters

    # Access API key from env_vars dictionary
    url = "https://v3.football.api-sports.io/standings"
    headers = {"x-apisports-key": env_vars["api_key"]}
    params = {"league": 39, "season": 2023}
  3. Make API Request

    request = requests.get(url=url, headers=headers, params=params)
    response = request.json()
  4. Validate Response

    • Check for API errors: response.get("errors", [])
    • Verify results: response.get("results", 0) > 0

Phase 2: TRANSFORM

  1. Extract Standings Data

    standings_list = response["response"][0]["league"]["standings"][0]
  2. Flatten Nested Structure

    • Iterate through each team in standings
    • Extract required fields from nested JSON
    • Map to flat structure
  3. Create DataFrame

    df = pd.DataFrame(rows, columns=column_names)
  4. Data Quality Checks

    • Verify 20 teams per season
    • Check for null values
    • Validate data types

Phase 3: LOAD

  1. Establish Database Connection

    # Uses env_vars dictionary for connection parameters
    db_connection = connector.connect(
        host=env_vars["mysql_host"],
        port=env_vars["mysql_port"],
        user=env_vars["mysql_user"],
        password=env_vars["mysql_password"],
        database=env_vars["mysql_database"]
    )
  2. Verify Table Existence

    # table_name parameter (defaults to "standings")
    # In original notebook, this was: sql_table = "standings"
    verify_table_exists(cur, table_name)
    # or: cur.execute("SHOW TABLES LIKE %s", (table_name,))
  3. Prepare Data for Insertion

    standings_records_tuples = standings_df.itertuples(index=False, name=None)
    list_of_standings_records_tuples = list(standings_records_tuples)
  4. Execute UPSERT Operation

    # Uses table_name parameter (or sql_table in original notebook)
    UPSERT_SQL = f"""
    INSERT INTO {table_name}
    (season, position, team_id, team, played, won, draw, lost, 
     goals_for, goals_against, goals_diff, points, form)
    VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) as src
    ON DUPLICATE KEY UPDATE
        position = src.position,
        team = src.team,
        played = src.played,
        won = src.won,
        draw = src.draw,
        lost = src.lost,
        goals_for = src.goals_for,
        goals_against = src.goals_against,
        goals_diff = src.goals_diff,
        points = src.points,
        form = src.form;
    """
    
    cur.executemany(UPSERT_SQL, list_of_standings_records_tuples)
    db_connection.commit()
  5. Error Handling

    try:
        cur.executemany(UPSERT_SQL, list_of_standings_records_tuples)
        db_connection.commit()
        print(f"[SUCCESS] - Upsert attempted for {len(records)} rows!")
    except Exception as e:
        db_connection.rollback()
        print(f"[ERROR] - Rolled back due to: {e}")
    finally:
        cur.close()
        db_connection.close()

πŸ”Œ API Reference

API-Sports.io Football API

  • Base URL: https://v3.football.api-sports.io
  • Endpoint: /standings
  • Method: GET
  • Authentication: Header x-apisports-key

Request Parameters

Parameter Type Required Description
league Integer Yes League ID (39 for Premier League)
season Integer Yes Season year (e.g., 2023)

Response Structure

  • Success: Returns JSON with standings data
  • Error: Returns JSON with error messages in errors array
  • Rate Limits: Check API documentation for current limits

Premier League League ID

  • League ID: 39
  • League Name: Premier League
  • Country: England

πŸ› Troubleshooting

Common Issues

1. API Authentication Error

Error: Invalid API key

Solution: Verify your API key in the .env file and ensure it's active.

2. Database Connection Error

Error: Can't connect to MySQL server

Solution:

  • Verify MySQL server is running
  • Check connection parameters in .env
  • Ensure firewall allows connections on port 3306

3. Table Not Found

Error: Table 'standings' is NOT found

Solution: Create the table using the SQL script provided in the Database Schema section.

4. Duplicate Key Error

Error: Duplicate entry for key 'PRIMARY'

Solution: This shouldn't occur with UPSERT, but if it does, check that the primary key constraint is correctly defined.

5. API Rate Limit

Error: Too many requests

Solution:

  • Implement rate limiting between requests
  • Use time.sleep() to add delays
  • Consider upgrading API plan

Debugging Tips

  1. Enable Verbose Logging

    import logging
    logging.basicConfig(level=logging.DEBUG)
  2. Test API Connection

    print(f"API Key: {API_KEY[:10]}...")  # Print first 10 chars
    print(f"Response Status: {request.status_code}")
  3. Validate DataFrame

    print(df.info())
    print(df.head())
    print(df.isnull().sum())

πŸ” Security Best Practices

  1. Never commit .env files - Already in .gitignore
  2. Use environment variables for all sensitive data
  3. Limit database user permissions - Use read/write user, not root
  4. Implement API key rotation - Regularly update API keys
  5. Use connection pooling for production deployments
  6. Enable SSL/TLS for MySQL connections in production

πŸ“ˆ Future Enhancements

  • Implement main.py script for production use
  • Add support for other leagues (Championship, La Liga, etc.)
  • Implement data validation and quality checks
  • Add automated scheduling (cron jobs, Airflow, etc.)
  • Create data visualization dashboards
  • Add unit tests and integration tests
  • Implement logging to files
  • Add data export functionality (CSV, JSON)
  • Create API endpoints for data access
  • Add data archiving for historical seasons

🀝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • API-Sports.io for providing the football data API
  • Premier League for the official standings data

πŸ“š Additional Documentation

This project includes several documentation files:

πŸ“§ Contact

For questions or support, please open an issue in the repository.


Note: This project is for educational and personal use. Ensure compliance with API-Sports.io terms of service and rate limits.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages