Bmowville/data-engineering-lab

Data Engineering Lab

Practical, reproducible data engineering exercises: ingest → clean → load → query.

What this repo is

A small collection of pipeline projects built in Python + SQL with clear run steps and repeatable outputs.

Each pipeline starts from an external or raw source, lands data in SQLite, and writes a report that can be inspected without extra services.
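The ingest → clean → load → query flow described above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual code: the inline CSV, the `passengers` table, its columns, and every function name here are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a downloaded CSV file.
RAW_CSV = """name,pclass,survived
Alice,1,1
Bob,3,0
 Carol ,3,1
Dave,,0
"""


def ingest(text):
    """Parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Trim whitespace and drop rows with a missing passenger class."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["pclass"] == "":
            continue
        cleaned.append((row["name"], int(row["pclass"]), int(row["survived"])))
    return cleaned


def load(conn, rows):
    """Recreate the table on every run so reruns give the same result."""
    conn.execute("DROP TABLE IF EXISTS passengers")
    conn.execute(
        "CREATE TABLE passengers (name TEXT, pclass INTEGER, survived INTEGER)"
    )
    conn.executemany("INSERT INTO passengers VALUES (?, ?, ?)", rows)
    conn.commit()


def query(conn):
    """Summarize passenger and survivor counts per class."""
    return conn.execute(
        "SELECT pclass, COUNT(*), SUM(survived) FROM passengers "
        "GROUP BY pclass ORDER BY pclass"
    ).fetchall()


def write_report(summary, out):
    """Write the summary rows as CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["pclass", "passengers", "survivors"])
    writer.writerows(summary)


conn = sqlite3.connect(":memory:")
load(conn, clean(ingest(RAW_CSV)))
summary = query(conn)
```

Dropping and recreating the table in `load` is one simple way to keep reruns repeatable; an append-style pipeline would skip that step.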

What you'll find

  • pipelines/: ingestion and cleaning scripts
  • sql/: analytics and validation queries
  • scripts/: generated-output validation checks
  • data/: local databases and downloaded datasets
  • reports/: generated outputs (CSV summaries)

Quick start

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python pipelines/01_ingest_to_sqlite.py
python scripts/generate_data_quality_report.py
python scripts/validate_outputs.py

macOS/Linux:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python pipelines/01_ingest_to_sqlite.py
python scripts/generate_data_quality_report.py
python scripts/validate_outputs.py

After the first run, inspect:

  • data/titanic.db
  • reports/titanic_summary.csv
  • reports/data_quality_report.md
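One quick way to inspect a generated SQLite file is to list its tables with their row counts. The helper below is a sketch; `describe_database` is a hypothetical name, and the demo uses an in-memory database with made-up tables so it runs without the pipeline outputs.

```python
import sqlite3


def describe_database(conn):
    """Return {table_name: row_count} for every user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' "
            "AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    counts = {}
    for table in tables:
        # Names come from sqlite_master, so quoting them is sufficient here.
        counts[table] = conn.execute(
            f'SELECT COUNT(*) FROM "{table}"'
        ).fetchone()[0]
    return counts


# Demo against an in-memory database (illustrative tables only).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE passengers (name TEXT)")
demo.execute("CREATE TABLE ports (code TEXT)")
demo.executemany("INSERT INTO passengers VALUES (?)", [("Alice",), ("Bob",)])
overview = describe_database(demo)
```

To look at the real output, pass `sqlite3.connect("data/titanic.db")` instead of the in-memory connection after running the pipeline.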

Technical review path

  1. Run the Titanic pipeline to verify ingest, load, and reporting from a clean checkout.
  2. Run python scripts/validate_outputs.py to verify the SQLite table, SQL files, and summary report.
  3. Review docs/pipeline-contracts.md for the expected inputs, storage targets, and output checks.
  4. Review sql/ for the analytics queries behind the reports.
  5. Run the weather pipeline to see an append-style API ingestion example.
  6. Compare generated CSV reports with the preview screenshots below.

Technical scope

  • Python pipeline structure with explicit data and report paths
  • CSV ingestion, API ingestion, SQLite loading, and SQL-based summaries
  • Reproducible local outputs that do not require cloud credentials
  • Data contract validation for generated tables, report schemas, and SQL query execution
  • CI smoke test for the CSV pipeline
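A report schema check of the kind listed above could look like the following sketch. The expected column names are placeholders, not the actual header of reports/titanic_summary.csv, and `check_report_schema` is a hypothetical helper rather than a function from scripts/validate_outputs.py.

```python
import csv
import io

# Assumed column names for illustration only.
EXPECTED_COLUMNS = ["pclass", "passengers", "survivors"]


def check_report_schema(report_file, expected_columns):
    """Return a list of problems; an empty list means the report passes."""
    reader = csv.reader(report_file)
    header = next(reader, None)
    if header is None:
        return ["report is empty"]
    problems = []
    if header != list(expected_columns):
        problems.append(
            f"expected columns {list(expected_columns)}, found {header}"
        )
    rows = list(reader)
    if not rows:
        problems.append("report has a header but no data rows")
    for row_number, row in enumerate(rows, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number}: expected {len(header)} fields, "
                f"found {len(row)}"
            )
    return problems


good_report = io.StringIO("pclass,passengers,survivors\n1,1,1\n3,2,1\n")
problems = check_report_schema(good_report, EXPECTED_COLUMNS)
```

Returning a list of problems instead of raising on the first one lets a validation script print every contract violation in a single run.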

Pipelines

| Pipeline | Source | Storage | Output | CI |
| --- | --- | --- | --- | --- |
| Titanic CSV | Public CSV download | data/titanic.db | reports/titanic_summary.csv | Yes |
| Weather API | Open-Meteo current weather API | data/weather.db | reports/weather_summary.csv | Manual, live API |

Validation

The Titanic pipeline has a local validation script and CI coverage:

python pipelines/01_ingest_to_sqlite.py
python scripts/generate_data_quality_report.py
python scripts/validate_outputs.py

The validation step checks the generated SQLite table, executes the SQL files in sql/, verifies the report schema, confirms the grouped passenger counts reconcile to the source table, and checks the generated data quality report.
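Two of these checks, executing the SQL files and reconciling grouped counts against the source table, can be sketched as below. Both helpers are hypothetical and assume each .sql file holds a single SELECT statement; the real script may be organized differently. The demo writes a throwaway query file to a temporary directory so it runs without the repository.

```python
import sqlite3
import tempfile
from pathlib import Path


def run_sql_files(conn, sql_dir):
    """Execute every .sql file in sql_dir; return {file_name: row_count}."""
    results = {}
    for path in sorted(Path(sql_dir).glob("*.sql")):
        # Assumes one SELECT per file; conn.execute rejects multiple statements.
        rows = conn.execute(path.read_text(encoding="utf-8")).fetchall()
        results[path.name] = len(rows)
    return results


def counts_reconcile(conn, table, grouped_counts):
    """True when the grouped counts from a report sum to the table's row count."""
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return sum(grouped_counts) == total


# Illustrative source table with three rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (pclass INTEGER, survived INTEGER)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)", [(1, 1), (3, 0), (3, 1)]
)

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "by_class.sql").write_text(
        "SELECT pclass, COUNT(*) FROM passengers GROUP BY pclass;",
        encoding="utf-8",
    )
    sql_results = run_sql_files(conn, tmp)

# Per-class counts as they might appear in a summary report.
reconciled = counts_reconcile(conn, "passengers", [1, 2])
```

A failed reconciliation usually means rows were dropped or duplicated somewhere between the load step and the summary query.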

See docs/pipeline-contracts.md for the current pipeline contracts.

1) Titanic CSV → SQLite → report

Creates:

  • data/titanic.db
  • reports/titanic_summary.csv
  • reports/data_quality_report.md

Run:

python pipelines/01_ingest_to_sqlite.py
python scripts/generate_data_quality_report.py
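The contents of reports/data_quality_report.md are not described here, so the following is only a guess at what a simple data quality report might include: per-column NULL counts rendered as a Markdown table. The table, columns, and function names are all assumptions for illustration.

```python
import sqlite3


def null_counts(conn, table):
    """Return {column: number_of_NULLs} for every column in the table."""
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    return {
        column: conn.execute(
            f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
        ).fetchone()[0]
        for column in columns
    }


def to_markdown(table, counts, total_rows):
    """Render the NULL counts as a small Markdown document."""
    lines = [
        f"# Data quality: {table}",
        "",
        f"Rows: {total_rows}",
        "",
        "| Column | Nulls |",
        "| --- | --- |",
    ]
    lines += [f"| {column} | {nulls} |" for column, nulls in counts.items()]
    return "\n".join(lines) + "\n"


# Illustrative data: two of three passengers have no recorded age.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (name TEXT, age REAL)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)",
    [("Alice", 29.0), ("Bob", None), ("Carol", None)],
)
counts = null_counts(conn, "passengers")
report = to_markdown("passengers", counts, total_rows=3)
```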

2) Weather API → SQLite → report

Appends current weather snapshots for a few cities.

Creates:

  • data/weather.db

Updates:

  • reports/weather_summary.csv

Run:

python pipelines/02_weather_api_to_sqlite.py
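Append-style ingestion means each run adds new rows instead of replacing the table. The sketch below assumes the Open-Meteo forecast endpoint with `current_weather=true` and a response containing a `current_weather` object with `time`, `temperature`, and `windspeed` fields; the table name, columns, and function names are hypothetical and may not match 02_weather_api_to_sqlite.py. The demo uses a hand-written payload so it runs without network access.

```python
import json
import sqlite3
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; check the Open-Meteo documentation before relying on it.
API_URL = "https://api.open-meteo.com/v1/forecast"


def fetch_current_weather(latitude, longitude):
    """Fetch current weather for one location (performs a live HTTP call)."""
    query = urlencode(
        {"latitude": latitude, "longitude": longitude, "current_weather": "true"}
    )
    with urlopen(f"{API_URL}?{query}", timeout=10) as response:
        return json.load(response)["current_weather"]


def append_snapshot(conn, city, weather):
    """Insert one snapshot row; existing rows are left untouched."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather_snapshots ("
        "city TEXT, observed_at TEXT, temperature REAL, windspeed REAL)"
    )
    conn.execute(
        "INSERT INTO weather_snapshots VALUES (?, ?, ?, ?)",
        (city, weather["time"], weather["temperature"], weather["windspeed"]),
    )
    conn.commit()


# Offline demo with made-up payloads shaped like the assumed API response.
conn = sqlite3.connect(":memory:")
append_snapshot(
    conn, "Oslo", {"time": "2024-01-01T12:00", "temperature": -3.5, "windspeed": 12.0}
)
append_snapshot(
    conn, "Oslo", {"time": "2024-01-01T13:00", "temperature": -3.1, "windspeed": 10.4}
)
snapshot_count = conn.execute(
    "SELECT COUNT(*) FROM weather_snapshots"
).fetchone()[0]
```

With a live connection, `append_snapshot(conn, "Oslo", fetch_current_weather(59.91, 10.75))` would add one row per run, which is why this pipeline's output differs between runs and is run manually.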

Titanic summary preview

[Screenshot: preview of reports/titanic_summary.csv]

Weather summary preview

[Screenshot: preview of reports/weather_summary.csv]
