This project implements an end-to-end Databricks Lakehouse pipeline for high-volume e-commerce event data.
The goal is to demonstrate modern data engineering practices including medallion architecture (Bronze/Silver/Gold), Delta Lake, Spark-based ingestion and transformation, and workflow orchestration using Databricks serverless compute.
The pipeline ingests synthetic clickstream-style events, validates and cleans them, and produces analytics-ready datasets suitable for BI tools and downstream consumers.
High-level flow:
- Synthetic e-commerce events are generated using Apache Spark
- Raw events are ingested into Bronze Delta tables
- Data is cleaned, validated, and deduplicated into Silver Delta tables
- Business-ready facts and aggregates are produced in Gold Delta tables
- The entire pipeline is orchestrated using Databricks Workflows on serverless compute
The dataset is synthetically generated to simulate large-scale e-commerce traffic.
This approach allows:
- Full control over volume and schema
- Public sharing of the repository (no licensing or NDA constraints)
- Testing of partitioning, deduplication, and incremental processing patterns
Event types include:
- `page_view`
- `product_view`
- `add_to_cart`
- `checkout_started`
- `purchase`
Each event contains user, session, product, device, platform, and marketing attribution fields.
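The actual generator lives in `notebooks/01_generate_bronze.py`; a minimal PySpark sketch of the idea, with illustrative field names and distributions (not the exact generator logic), might look like:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

rows = 1_000_000  # driven by the job's `rows` parameter in the real pipeline

events = (
    spark.range(rows)  # one row per synthetic event
    .withColumn("event_id", F.expr("uuid()"))
    # spread events over the last 24 hours
    .withColumn(
        "event_ts",
        F.to_timestamp(F.from_unixtime(F.unix_timestamp() - (F.col("id") % 86400))),
    )
    .withColumn(
        "event_name",
        F.expr(
            "element_at(array('page_view','product_view','add_to_cart',"
            "'checkout_started','purchase'), cast(id % 5 + 1 as int))"
        ),
    )
    .withColumn("user_id", F.concat(F.lit("u_"), (F.col("id") % 100000).cast("string")))
    .withColumn(
        "device",
        F.expr("element_at(array('mobile','desktop','tablet'), cast(id % 3 + 1 as int))"),
    )
    .withColumn("price", F.round(F.rand(seed=7) * 200, 2))
    .withColumn("quantity", (F.col("id") % 3 + 1).cast("int"))
    .drop("id")
)
```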
Table: ecomm_lakehouse.bronze_events_raw
- Stores raw JSON event payloads
- Append-only
- Includes ingestion metadata (`ingest_ts`, `ingest_batch_id`, `source`)
- No transformations beyond basic capture
Purpose: immutability and traceability
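A sketch of the append-only Bronze write, assuming an `events` DataFrame like the one in the generation sketch above (the `source` tag value is illustrative):

```python
import uuid
import pyspark.sql.functions as F

batch_id = str(uuid.uuid4())

bronze = (
    events  # synthetic DataFrame from the generation sketch
    .select(F.to_json(F.struct(*events.columns)).alias("raw_payload"))
    .withColumn("ingest_ts", F.current_timestamp())
    .withColumn("ingest_batch_id", F.lit(batch_id))
    .withColumn("source", F.lit("synthetic_generator"))  # illustrative source tag
)

(bronze.write.format("delta")
 .mode("append")  # append-only: Bronze rows are never updated in place
 .saveAsTable("ecomm_lakehouse.bronze_events_raw"))
```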
Tables:
- `ecomm_lakehouse.silver_events_clean`
- `ecomm_lakehouse.silver_events_quarantine`
Silver Clean
- Parsed and strongly typed columns
- Deduplicated by `event_id`
- Partitioned by `event_date`
- Enforced basic validation rules (required fields, non-negative values)
Silver Quarantine
- Records that fail validation
- Raw JSON preserved with a failure reason
Purpose: data quality and correctness
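A sketch of the parse-and-deduplicate path (the schema here is a hypothetical subset of the full one, and validation, shown in the data quality section below, would run before the clean write):

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType, IntegerType)

spark = SparkSession.builder.getOrCreate()

# Hypothetical subset of the full event schema
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("event_name", StringType()),
    StructField("user_id", StringType()),
    StructField("price", DoubleType()),
    StructField("quantity", IntegerType()),
])

bronze = spark.table("ecomm_lakehouse.bronze_events_raw")

parsed = (
    bronze
    .withColumn("event", F.from_json("raw_payload", event_schema))
    .select("event.*", "raw_payload", "ingest_ts")
    .withColumn("event_date", F.to_date("event_ts"))
)

# Keep the most recently ingested copy of each event_id
w = Window.partitionBy("event_id").orderBy(F.col("ingest_ts").desc())
deduped = (
    parsed
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

(deduped
 .drop("raw_payload")
 .write.format("delta")
 .mode("append")
 .partitionBy("event_date")
 .saveAsTable("ecomm_lakehouse.silver_events_clean"))
```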
Tables:
- `ecomm_lakehouse.gold_fact_purchases`
- `ecomm_lakehouse.gold_mart_funnel_daily`
Gold tables contain analytics-ready datasets:
- Purchase facts with revenue metrics
- Daily funnel aggregates (page → product → cart → checkout → purchase)
Purpose: business analytics and reporting
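A sketch of the daily funnel aggregate; `gold_fact_purchases` follows the same pattern, filtering to purchase events and deriving revenue (for example, `price * quantity`):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
silver = spark.table("ecomm_lakehouse.silver_events_clean")

# count(when(...)) counts only rows where the condition is non-null, i.e. matches
funnel_daily = (
    silver.groupBy("event_date")
    .agg(
        F.count(F.when(F.col("event_name") == "page_view", 1)).alias("page_views"),
        F.count(F.when(F.col("event_name") == "product_view", 1)).alias("product_views"),
        F.count(F.when(F.col("event_name") == "add_to_cart", 1)).alias("add_to_carts"),
        F.count(F.when(F.col("event_name") == "checkout_started", 1)).alias("checkouts_started"),
        F.count(F.when(F.col("event_name") == "purchase", 1)).alias("purchases"),
    )
)

(funnel_daily.write.format("delta")
 .mode("overwrite")
 .saveAsTable("ecomm_lakehouse.gold_mart_funnel_daily"))
```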
The pipeline is orchestrated using Databricks Workflows with three dependent tasks:
- `01_generate_bronze`: generates synthetic e-commerce events and writes them to the Bronze table (parameterized with `rows` to control data volume)
- `02_bronze_to_silver`: parses raw JSON, validates records, deduplicates events, and writes to the Silver tables
- `03_silver_to_gold`: builds analytics-ready Gold tables and aggregates
All tasks run on Databricks serverless compute, which abstracts infrastructure management while still executing Spark workloads.
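Abbreviated, the dependency chain in `jobs/ecomm_lakehouse_job.json` looks roughly like this (notebook paths and the default `rows` value are illustrative):

```json
{
  "name": "ecomm_lakehouse_job",
  "tasks": [
    {
      "task_key": "01_generate_bronze",
      "notebook_task": {
        "notebook_path": "/Workspace/notebooks/01_generate_bronze",
        "base_parameters": {"rows": "1000000"}
      }
    },
    {
      "task_key": "02_bronze_to_silver",
      "depends_on": [{"task_key": "01_generate_bronze"}],
      "notebook_task": {"notebook_path": "/Workspace/notebooks/02_bronze_to_silver"}
    },
    {
      "task_key": "03_silver_to_gold",
      "depends_on": [{"task_key": "02_bronze_to_silver"}],
      "notebook_task": {"notebook_path": "/Workspace/notebooks/03_silver_to_gold"}
    }
  ]
}
```

With serverless compute there is no cluster block to manage; Databricks provisions compute for each task run.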
The pipeline enforces multiple quality checks:
- Required field validation (`event_id`, `event_ts`, `event_name`, `user_id`)
- Enum validation for event types
- Non-negative price checks
- Quantity validation for purchase-related events
- Deduplication using window functions
Invalid records are quarantined rather than dropped, enabling auditability and debugging.
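One common way to express such rules is a single `failure_reason` column whose value routes each record; a sketch continuing from the `parsed` DataFrame in the Silver sketch above (rule names are illustrative):

```python
import pyspark.sql.functions as F

VALID_EVENTS = ["page_view", "product_view", "add_to_cart",
                "checkout_started", "purchase"]

# First matching rule wins; NULL means the record passed every check
failure_reason = (
    F.when(F.col("event_id").isNull(), F.lit("missing_event_id"))
    .when(F.col("event_ts").isNull(), F.lit("missing_event_ts"))
    .when(F.col("event_name").isNull(), F.lit("missing_event_name"))
    .when(F.col("user_id").isNull(), F.lit("missing_user_id"))
    .when(~F.col("event_name").isin(VALID_EVENTS), F.lit("unknown_event_name"))
    .when(F.col("price") < 0, F.lit("negative_price"))
    .when((F.col("event_name") == "purchase") & (F.col("quantity") <= 0),
          F.lit("invalid_quantity"))
)

validated = parsed.withColumn("failure_reason", failure_reason)

clean = validated.filter("failure_reason IS NULL").drop("failure_reason")
quarantine = (
    validated.filter("failure_reason IS NOT NULL")
    .select("raw_payload", "failure_reason", "ingest_ts")  # raw JSON preserved
)
```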
Key design choices:
- Delta Lake for ACID transactions and schema enforcement
- Partitioning by `event_date` for efficient reads
- Parameterized ingestion to simulate large volumes (millions of rows per run)
- Idempotent transformations suitable for retries and backfills
The same design can scale to cloud object storage (S3 / ADLS) in a production environment.
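As one concrete example of idempotency, Delta's `replaceWhere` lets a retry or backfill overwrite exactly the partitions it recomputes; a sketch with a hypothetical `silver_updates` DataFrame and an illustrative date:

```python
# Hypothetical recomputation of a single day: re-running this block is safe
# because `replaceWhere` atomically swaps only the matching partition.
(silver_updates  # DataFrame holding the recomputed day (hypothetical name)
 .write.format("delta")
 .mode("overwrite")
 .option("replaceWhere", "event_date = '2024-01-15'")
 .saveAsTable("ecomm_lakehouse.silver_events_clean"))
```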
```
databricks-ecomm-lakehouse/
├── notebooks/
│   ├── 01_generate_bronze.py
│   ├── 02_bronze_to_silver.py
│   └── 03_silver_to_gold.py
├── jobs/
│   └── ecomm_lakehouse_job.json
├── diagrams/
│   └── architecture.png
├── screenshots/
│   ├── job_graph.png
│   └── gold_funnel.png
└── README.md
```
- Create a Databricks workspace (Free / Trial)
- Import notebooks into the workspace
- Create a Databricks Workflow with the three notebooks
- Pass the `rows` parameter to the ingestion task
- Run the job and inspect the Bronze, Silver, and Gold tables
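Inside the ingestion notebook, the parameter is typically read through a widget; a minimal sketch (the default value is illustrative):

```python
# In 01_generate_bronze.py (Databricks notebook context, where dbutils exists);
# the widget name must match the parameter key passed by the Workflow task
dbutils.widgets.text("rows", "1000000")
rows = int(dbutils.widgets.get("rows"))
```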
- Add dbt models and tests on top of Gold tables
- Implement incremental Silver processing using batch identifiers
- Add data freshness and row-count quality gates
- Extend dataset with product and customer dimension tables
- Add BI dashboards (Databricks SQL / external tools)
This project demonstrates:
- End-to-end data pipeline design
- Spark + Delta Lake usage
- Medallion architecture best practices
- Orchestration with Databricks Workflows
- Practical data quality handling at scale
