Practical, reproducible data engineering exercises: ingest → clean → load → query.
A small collection of pipeline projects built in Python + SQL with clear run steps and repeatable outputs.
Each pipeline starts from an external or raw source, lands data in SQLite, and writes a report that can be inspected without extra services.
- `pipelines/`: ingestion + cleaning scripts
- `sql/`: analytics and validation queries
- `scripts/`: generated-output validation checks
- `data/`: local databases + downloaded datasets
- `reports/`: generated outputs (CSV summaries)

Windows PowerShell:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python pipelines/01_ingest_to_sqlite.py
python scripts/generate_data_quality_report.py
python scripts/validate_outputs.py

macOS/Linux:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python pipelines/01_ingest_to_sqlite.py
python scripts/generate_data_quality_report.py
python scripts/validate_outputs.py

After the first run, inspect:
- `data/titanic.db`
- `reports/titanic_summary.csv`
- `reports/data_quality_report.md`

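For a quick look at the generated files without opening a SQLite client, a small helper along these lines may be enough. It is a minimal sketch using only the standard library; the function names `list_tables` and `preview_csv` are hypothetical and not part of this repository.

```python
import csv
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return (table_name, row_count) pairs for every table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return [
            (name, conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0])
            for name in names
        ]


def preview_csv(csv_path, limit=5):
    """Return the header row plus up to `limit` data rows from a CSV file."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # zip stops after limit + 1 rows (header included).
        return [row for _, row in zip(range(limit + 1), reader)]
```

After a successful run, calling `list_tables("data/titanic.db")` and `preview_csv("reports/titanic_summary.csv")` from a Python shell should show which tables were loaded and what the summary looks like.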
- Run the Titanic pipeline to verify ingest, load, and reporting from a clean checkout.
- Run `python scripts/validate_outputs.py` to verify the SQLite table, SQL files, and summary report.
- Review `docs/pipeline-contracts.md` for the expected inputs, storage targets, and output checks.
- Review `sql/` for the analytics queries behind the reports.
- Run the weather pipeline to see an append-style API ingestion example.
- Compare generated CSV reports with the preview screenshots below.
- Python pipeline structure with explicit data and report paths
- CSV ingestion, API ingestion, SQLite loading, and SQL-based summaries
- Reproducible local outputs that do not require cloud credentials
- Data contract validation for generated tables, report schemas, and SQL query execution
- CI smoke test for the CSV pipeline
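
The CSV ingestion, SQLite loading, and SQL-based summary steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `pipelines/`: the function names, the all-`TEXT` column typing, and the replace-on-reload behavior are choices made for the example.

```python
import csv
import sqlite3
from contextlib import closing


def load_csv(csv_path, db_path, table):
    """Replace `table` in the SQLite file with the rows of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Skip rows whose width does not match the header.
        rows = [row for row in reader if len(row) == len(header)]

    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commit on success, roll back on error
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            conn.execute(f'CREATE TABLE "{table}" ({columns})')
            conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


def write_group_counts(db_path, table, group_column, report_path):
    """Write one report row per group with its row count; return the groups."""
    query = (
        f'SELECT "{group_column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{group_column}" ORDER BY "{group_column}"'
    )
    with closing(sqlite3.connect(db_path)) as conn:
        groups = conn.execute(query).fetchall()
    with open(report_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([group_column, "row_count"])
        writer.writerows(groups)
    return groups
```

Dropping and recreating the table keeps reruns repeatable: running the load twice leaves the same rows rather than duplicating them.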
| Pipeline | Source | Storage | Output | CI |
|---|---|---|---|---|
| Titanic CSV | Public CSV download | `data/titanic.db` | `reports/titanic_summary.csv` | Yes |
| Weather API | Open-Meteo current weather API | `data/weather.db` | `reports/weather_summary.csv` | Manual, live API |

The Titanic pipeline has a local validation script and CI coverage:
python pipelines/01_ingest_to_sqlite.py
python scripts/generate_data_quality_report.py
python scripts/validate_outputs.py

The validation step checks the generated SQLite table, executes the SQL files in `sql/`, verifies the report schema, confirms that the grouped passenger counts reconcile to the source table, and checks the generated data quality report.
See docs/pipeline-contracts.md for the current pipeline contracts.
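
To make those checks concrete, here is a hedged sketch of how such a validator could be structured. It is not the contents of `scripts/validate_outputs.py`; the function name, its parameters, and the assumption that each `.sql` file holds a single statement are all made up for the example, and the data quality report check is left out.

```python
import csv
import sqlite3
from contextlib import closing
from pathlib import Path


def validate_outputs(db_path, table, sql_dir, report_path, expected_columns, count_column):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    with closing(sqlite3.connect(db_path)) as conn:
        # Check 1: the table exists and is not empty.
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
        ).fetchone()
        if not exists:
            return [f"missing table: {table}"]
        total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if total == 0:
            failures.append(f"table {table} is empty")

        # Check 2: every SQL file runs (assumes one statement per file).
        for sql_file in sorted(Path(sql_dir).glob("*.sql")):
            try:
                conn.execute(sql_file.read_text(encoding="utf-8")).fetchall()
            except (sqlite3.Error, sqlite3.Warning) as exc:
                failures.append(f"{sql_file.name}: {exc}")

    with open(report_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Check 3: the report header matches the expected schema.
        if reader.fieldnames != list(expected_columns):
            failures.append(f"unexpected report columns: {reader.fieldnames}")
        else:
            # Check 4: grouped counts add up to the source table row count.
            reported = sum(int(row[count_column]) for row in reader)
            if reported != total:
                failures.append(f"report counts {reported} != table rows {total}")
    return failures
```

Returning messages instead of raising on the first problem lets a single run report every broken contract at once, which tends to be more useful in CI logs.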
The Titanic pipeline creates:
- `data/titanic.db`
- `reports/titanic_summary.csv`
- `reports/data_quality_report.md`

Run:
python pipelines/01_ingest_to_sqlite.py
python scripts/generate_data_quality_report.py

The weather pipeline appends current weather snapshots for a few cities.
Creates:
data/weather.db
Updates:
reports/weather_summary.csv
Run:
python pipelines/02_weather_api_to_sqlite.py
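
The append-style pattern could look something like the sketch below. It is an assumption-laden illustration, not the contents of `pipelines/02_weather_api_to_sqlite.py`: the table name `weather_snapshots`, its columns, and both function names are hypothetical, and the request relies on Open-Meteo's `current_weather=true` parameter returning `time`, `temperature`, and `windspeed` fields.

```python
import json
import sqlite3
from contextlib import closing
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.open-meteo.com/v1/forecast"


def fetch_current_weather(latitude, longitude):
    """Request the current weather block for one location (needs network access)."""
    query = urlencode(
        {"latitude": latitude, "longitude": longitude, "current_weather": "true"}
    )
    with urlopen(f"{API_URL}?{query}", timeout=10) as response:
        return json.load(response)["current_weather"]


def append_snapshot(db_path, city, current):
    """Insert one snapshot row and return the new total row count.

    Existing rows are never updated or deleted, so each run adds history.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS weather_snapshots ("
                "city TEXT, observed_at TEXT, temperature_c REAL, windspeed_kmh REAL)"
            )
            conn.execute(
                "INSERT INTO weather_snapshots VALUES (?, ?, ?, ?)",
                (city, current["time"], current["temperature"], current["windspeed"]),
            )
            return conn.execute("SELECT COUNT(*) FROM weather_snapshots").fetchone()[0]
```

Unlike a drop-and-reload CSV pipeline, `CREATE TABLE IF NOT EXISTS` plus a plain `INSERT` means the table grows with every run, which is why this pipeline is run manually against the live API rather than in CI.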
