This project implements an end-to-end Databricks Lakehouse pipeline for high-volume e-commerce event data.
The goal is to demonstrate modern data engineering practices including medallion architecture (Bronze/Silver/Gold), Delta Lake, Spark-based ingestion and transformation, and workflow orchestration using Databricks serverless compute.
The pipeline ingests synthetic clickstream-style events, validates and cleans them, and produces analytics-ready datasets suitable for BI tools and downstream consumers.
High-level flow:
- Synthetic e-commerce events are generated using Apache Spark
- Raw events are ingested into Bronze Delta tables
- Data is cleaned, validated, and deduplicated into Silver Delta tables
- Business-ready facts and aggregates are produced in Gold Delta tables
- The entire pipeline is orchestrated using Databricks Workflows on serverless compute
The dataset is synthetically generated to simulate large-scale e-commerce traffic.
This approach allows:
- Full control over volume and schema
- Public sharing of the repository (no licensing or NDA constraints)
- Testing of partitioning, deduplication, and incremental processing patterns
Event types include:
- `page_view`
- `product_view`
- `add_to_cart`
- `checkout_started`
- `purchase`
Each event contains user, session, product, device, platform, and marketing attribution fields.
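The actual generator lives in `notebooks/01_generate_bronze.py`; a minimal PySpark sketch of the idea, with illustrative field names and distributions (not the exact generator logic), might look like:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

rows = 1_000_000  # driven by the job's `rows` parameter in the real pipeline

events = (
    spark.range(rows)  # one row per synthetic event
    .withColumn("event_id", F.expr("uuid()"))
    # spread events over the last 24 hours
    .withColumn(
        "event_ts",
        F.to_timestamp(F.from_unixtime(F.unix_timestamp() - (F.col("id") % 86400))),
    )
    .withColumn(
        "event_name",
        F.expr(
            "element_at(array('page_view','product_view','add_to_cart',"
            "'checkout_started','purchase'), cast(id % 5 + 1 as int))"
        ),
    )
    .withColumn("user_id", F.concat(F.lit("u_"), (F.col("id") % 100000).cast("string")))
    .withColumn(
        "device",
        F.expr("element_at(array('mobile','desktop','tablet'), cast(id % 3 + 1 as int))"),
    )
    .withColumn("price", F.round(F.rand(seed=7) * 200, 2))
    .withColumn("quantity", (F.col("id") % 3 + 1).cast("int"))
    .drop("id")
)
```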
Table: ecomm_lakehouse.bronze_events_raw
- Stores raw JSON event payloads
- Append-only
- Includes ingestion metadata (`ingest_ts`, `ingest_batch_id`, `source`)
- No transformations beyond basic capture
Purpose: immutability and traceability
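A sketch of the append-only Bronze write, assuming an `events` DataFrame like the one in the generation sketch above (the `source` tag value is illustrative):

```python
import uuid
import pyspark.sql.functions as F

batch_id = str(uuid.uuid4())

bronze = (
    events  # synthetic DataFrame from the generation sketch
    .select(F.to_json(F.struct(*events.columns)).alias("raw_payload"))
    .withColumn("ingest_ts", F.current_timestamp())
    .withColumn("ingest_batch_id", F.lit(batch_id))
    .withColumn("source", F.lit("synthetic_generator"))  # illustrative source tag
)

(bronze.write.format("delta")
 .mode("append")  # append-only: Bronze rows are never updated in place
 .saveAsTable("ecomm_lakehouse.bronze_events_raw"))
```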
Tables:
- `ecomm_lakehouse.silver_events_clean`
- `ecomm_lakehouse.silver_events_quarantine`
Silver Clean
- Parsed and strongly typed columns
- Deduplicated by `event_id`
- Partitioned by `event_date`
- Enforced basic validation rules (required fields, non-negative values)
Silver Quarantine
- Records that fail validation
- Raw JSON preserved with a failure reason
Purpose: data quality and correctness
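A sketch of the parse-and-deduplicate path (the schema here is a hypothetical subset of the full one, and validation, shown in the data quality section below, would run before the clean write):

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType, IntegerType)

spark = SparkSession.builder.getOrCreate()

# Hypothetical subset of the full event schema
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("event_name", StringType()),
    StructField("user_id", StringType()),
    StructField("price", DoubleType()),
    StructField("quantity", IntegerType()),
])

bronze = spark.table("ecomm_lakehouse.bronze_events_raw")

parsed = (
    bronze
    .withColumn("event", F.from_json("raw_payload", event_schema))
    .select("event.*", "raw_payload", "ingest_ts")
    .withColumn("event_date", F.to_date("event_ts"))
)

# Keep the most recently ingested copy of each event_id
w = Window.partitionBy("event_id").orderBy(F.col("ingest_ts").desc())
deduped = (
    parsed
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

(deduped
 .drop("raw_payload")
 .write.format("delta")
 .mode("append")
 .partitionBy("event_date")
 .saveAsTable("ecomm_lakehouse.silver_events_clean"))
```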
Tables:
- `ecomm_lakehouse.gold_fact_purchases`
- `ecomm_lakehouse.gold_mart_funnel_daily`
Gold tables contain analytics-ready datasets:
- Purchase facts with revenue metrics
- Daily funnel aggregates (page → product → cart → checkout → purchase)
Purpose: business analytics and reporting
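A sketch of the daily funnel aggregate; `gold_fact_purchases` follows the same pattern, filtering to purchase events and deriving revenue (for example, `price * quantity`):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
silver = spark.table("ecomm_lakehouse.silver_events_clean")

# count(when(...)) counts only rows where the condition is non-null, i.e. matches
funnel_daily = (
    silver.groupBy("event_date")
    .agg(
        F.count(F.when(F.col("event_name") == "page_view", 1)).alias("page_views"),
        F.count(F.when(F.col("event_name") == "product_view", 1)).alias("product_views"),
        F.count(F.when(F.col("event_name") == "add_to_cart", 1)).alias("add_to_carts"),
        F.count(F.when(F.col("event_name") == "checkout_started", 1)).alias("checkouts_started"),
        F.count(F.when(F.col("event_name") == "purchase", 1)).alias("purchases"),
    )
)

(funnel_daily.write.format("delta")
 .mode("overwrite")
 .saveAsTable("ecomm_lakehouse.gold_mart_funnel_daily"))
```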
The pipeline is orchestrated using Databricks Workflows with three dependent tasks:
- `01_generate_bronze`: generates synthetic e-commerce events and writes them to the Bronze table (parameterized with `rows` to control data volume)
- `02_bronze_to_silver`: parses raw JSON, validates records, deduplicates events, and writes to the Silver tables
- `03_silver_to_gold`: builds analytics-ready Gold tables and aggregates
All tasks run on Databricks serverless compute, which abstracts infrastructure management while still executing Spark workloads.
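Abbreviated, the dependency chain in `jobs/ecomm_lakehouse_job.json` looks roughly like this (notebook paths and the default `rows` value are illustrative):

```json
{
  "name": "ecomm_lakehouse_job",
  "tasks": [
    {
      "task_key": "01_generate_bronze",
      "notebook_task": {
        "notebook_path": "/Workspace/notebooks/01_generate_bronze",
        "base_parameters": {"rows": "1000000"}
      }
    },
    {
      "task_key": "02_bronze_to_silver",
      "depends_on": [{"task_key": "01_generate_bronze"}],
      "notebook_task": {"notebook_path": "/Workspace/notebooks/02_bronze_to_silver"}
    },
    {
      "task_key": "03_silver_to_gold",
      "depends_on": [{"task_key": "02_bronze_to_silver"}],
      "notebook_task": {"notebook_path": "/Workspace/notebooks/03_silver_to_gold"}
    }
  ]
}
```

With serverless compute there is no cluster block to manage; Databricks provisions compute for each task run.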
The pipeline enforces multiple quality checks:
- Required field validation (`event_id`, `event_ts`, `event_name`, `user_id`)
- Enum validation for event types
- Non-negative price checks
- Quantity validation for purchase-related events
- Deduplication using window functions
Invalid records are quarantined rather than dropped, enabling auditability and debugging.
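One common way to express such rules is a single `failure_reason` column whose value routes each record; a sketch continuing from the `parsed` DataFrame in the Silver sketch above (rule names are illustrative):

```python
import pyspark.sql.functions as F

VALID_EVENTS = ["page_view", "product_view", "add_to_cart",
                "checkout_started", "purchase"]

# First matching rule wins; NULL means the record passed every check
failure_reason = (
    F.when(F.col("event_id").isNull(), F.lit("missing_event_id"))
    .when(F.col("event_ts").isNull(), F.lit("missing_event_ts"))
    .when(F.col("event_name").isNull(), F.lit("missing_event_name"))
    .when(F.col("user_id").isNull(), F.lit("missing_user_id"))
    .when(~F.col("event_name").isin(VALID_EVENTS), F.lit("unknown_event_name"))
    .when(F.col("price") < 0, F.lit("negative_price"))
    .when((F.col("event_name") == "purchase") & (F.col("quantity") <= 0),
          F.lit("invalid_quantity"))
)

validated = parsed.withColumn("failure_reason", failure_reason)

clean = validated.filter("failure_reason IS NULL").drop("failure_reason")
quarantine = (
    validated.filter("failure_reason IS NOT NULL")
    .select("raw_payload", "failure_reason", "ingest_ts")  # raw JSON preserved
)
```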
Key design choices:
- Delta Lake for ACID transactions and schema enforcement
- Partitioning by `event_date` for efficient reads
- Parameterized ingestion to simulate large volumes (millions of rows per run)
- Idempotent transformations suitable for retries and backfills
The same design can scale to cloud object storage (S3 / ADLS) in a production environment.
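As one concrete example of idempotency, Delta's `replaceWhere` lets a retry or backfill overwrite exactly the partitions it recomputes; a sketch with a hypothetical `silver_updates` DataFrame and an illustrative date:

```python
# Hypothetical recomputation of a single day: re-running this block is safe
# because `replaceWhere` atomically swaps only the matching partition.
(silver_updates  # DataFrame holding the recomputed day (hypothetical name)
 .write.format("delta")
 .mode("overwrite")
 .option("replaceWhere", "event_date = '2024-01-15'")
 .saveAsTable("ecomm_lakehouse.silver_events_clean"))
```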
```
databricks-ecomm-lakehouse/
├── notebooks/
│   ├── 01_generate_bronze.py
│   ├── 02_bronze_to_silver.py
│   └── 03_silver_to_gold.py
├── jobs/
│   └── ecomm_lakehouse_job.json
├── diagrams/
│   └── architecture.png
├── screenshots/
│   ├── job_graph.png
│   └── gold_funnel.png
└── README.md
```
- Create a Databricks workspace (Free / Trial)
- Import notebooks into the workspace
- Create a Databricks Workflow with the three notebooks
- Pass the `rows` parameter to the ingestion task
- Run the job and inspect the Bronze, Silver, and Gold tables
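Inside the ingestion notebook, the parameter is typically read through a widget; a minimal sketch (the default value is illustrative):

```python
# In 01_generate_bronze.py (Databricks notebook context, where dbutils exists);
# the widget name must match the parameter key passed by the Workflow task
dbutils.widgets.text("rows", "1000000")
rows = int(dbutils.widgets.get("rows"))
```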
- Add dbt models and tests on top of Gold tables
- Implement incremental Silver processing using batch identifiers
- Add data freshness and row-count quality gates
- Extend dataset with product and customer dimension tables
- Add BI dashboards (Databricks SQL / external tools)
This project demonstrates:
- End-to-end data pipeline design
- Spark + Delta Lake usage
- Medallion architecture best practices
- Orchestration with Databricks Workflows
- Practical data quality handling at scale
