Skip to content

cbfacademy/Introduction-to-DBT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Introduction to dbt - Real Olist E-Commerce Dataset (10K Orders)

Welcome to the dbt Intro course! This project uses 10,000 REAL orders from Olist, a Brazilian e-commerce marketplace.


🎯 Why Real Data?

This project uses actual transactions from Olist, a real Brazilian e-commerce platform:

βœ… Authentic business patterns - Real customer behavior and ordering trends
βœ… Genuine geography - Actual Brazilian cities (SΓ£o Paulo, Rio de Janeiro, BrasΓ­lia)
βœ… Real data quality issues - Nulls, late deliveries, cancellations
βœ… Credible for portfolios - Work with data used by 1000+ analysts on Kaggle
βœ… Industry-standard - Same dataset used by data professionals worldwide

Dataset Source: Brazilian E-Commerce Public Dataset by Olist (Kaggle)


πŸ“Š Dataset Overview

This is a 10,000 order subset of the full Olist dataset (100K orders, 2016-2018).

Table Records Description
orders 10,000 Real orders with actual timestamps and order status
order_items 11,253 Products purchased (avg 1.13 items per order)
customers 10,000 Unique customers across Brazil
products 6,124 Product catalog with dimensions
product_category_name_translation 71 Portuguese β†’ English category names

βœ… Verify This is Real Data:

# Check order ID format (MD5 hash like real Olist)
head -2 data/orders.csv | tail -1 | cut -d',' -f1
# Output: e481f51cbdc54678b7cc49136f2d6af7

# Check Brazilian city names
head -2 data/customers.csv | tail -1 | cut -d',' -f4
# Output: lencois paulista (real Brazilian city!)

# Check Portuguese categories
head -2 data/products.csv | tail -1
# Output: moveis_decoracao (furniture_decor in Portuguese!)

Period: January - December 2017 (Full year)
Geography: Real Brazilian cities across 27 states
Categories: 71 authentic product categories
Order Statuses: delivered, shipped, canceled, processing


πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • Google Cloud Platform account (free tier works!)
  • Basic SQL knowledge

Setup (15 minutes)

Step 1: Download this project

# Unzip the downloaded file
cd dbt_intro_olist_real

Step 2: Set up BigQuery Follow: setup/01_setup_bigquery.md

Step 3: Load REAL Olist data to BigQuery

cd setup
chmod +x 02_load_data_to_bigquery.sh
./02_load_data_to_bigquery.sh

This loads all 10,000 real orders to BigQuery!

Step 4: Install dbt

pip install dbt-bigquery

Step 5: Configure dbt Copy profiles.yml.example to ~/.dbt/profiles.yml and update with your GCP credentials.

Step 6: Test connection

dbt debug

Step 7: Run your first models!

dbt run
dbt test
dbt docs generate
dbt docs serve

πŸ“ Project Structure

dbt_intro_olist_real/
β”œβ”€β”€ data/                      # 10K REAL Olist CSV files βœ“
β”‚   β”œβ”€β”€ orders.csv             # 10,000 real orders
β”‚   β”œβ”€β”€ order_items.csv        # 11,252 line items
β”‚   β”œβ”€β”€ customers.csv          # 10,000 customers
β”‚   β”œβ”€β”€ products.csv           # 6,123 products
β”‚   β”œβ”€β”€ product_category_name_translation.csv
β”‚   └── DATASET_INFO.txt       # Data authenticity proof
β”œβ”€β”€ setup/                     # BigQuery setup guides
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ staging/              # Clean raw data (views)
β”‚   β”‚   β”œβ”€β”€ _sources.yml      # Source definitions with tests
β”‚   β”‚   β”œβ”€β”€ stg_orders.sql    # Orders with delivery metrics
β”‚   β”‚   β”œβ”€β”€ stg_order_items.sql
β”‚   β”‚   β”œβ”€β”€ stg_customers.sql # Brazilian geography
β”‚   β”‚   β”œβ”€β”€ stg_products.sql
β”‚   β”‚   └── stg_product_categories.sql # Portuguese β†’ English
β”‚   └── marts/                # Business logic (tables)
β”‚       β”œβ”€β”€ customers.sql     # Lifetime value & segments
β”‚       └── orders.sql        # Enriched order facts
β”œβ”€β”€ macros/
β”œβ”€β”€ tests/
β”œβ”€β”€ seeds/
└── exercises/                # 3 hands-on exercises

πŸŽ“ What You'll Learn

Core dbt Concepts

  • βœ… Models - Transform raw data with SQL SELECT statements
  • βœ… Sources - Reference raw tables in BigQuery
  • βœ… ref() - Build dependency graph between models
  • βœ… Tests - Ensure data quality (unique, not_null, etc.)
  • βœ… Materializations - Views vs tables vs incremental
  • βœ… Macros - Reusable SQL functions with Jinja
  • βœ… Documentation - Auto-generate docs with lineage

Real-World Skills

  • βœ… Working with actual e-commerce data (10K orders)
  • βœ… Handling data quality issues (nulls, late deliveries)
  • βœ… Brazilian market geography (real cities and states)
  • βœ… Multi-language data (Portuguese categories β†’ English)
  • βœ… Customer lifetime value analysis
  • βœ… Order fulfillment metrics (delivery times, late orders)

πŸ”‘ Key dbt Commands

Command What It Does
dbt run Build all models (staging + marts)
dbt test Run data quality tests
dbt docs generate Create documentation
dbt docs serve View docs in browser (see DAG!)
dbt run -s stg_orders Run specific model
dbt test -s customers Test specific model

βœ… What You'll Build

After completing this course, you'll have:

Models:

  • 5 staging models (cleaned raw data as views)
  • 2 mart models (business metrics as tables)

Customer Metrics (in customers mart):

  • Lifetime orders per customer
  • Total revenue per customer (Brazilian Real)
  • Average order value
  • Customer segments (High/Medium/Low value)

Order Metrics (in orders mart):

  • Order totals (price + freight)
  • Primary product category per order
  • Late delivery flags (actual vs estimated)
  • Customer location (city, state)

Data Quality:

  • 15+ tests passing
  • Source freshness checks
  • Custom business logic tests

🌎 About the Real Olist Data

What is Olist?

Olist is a real Brazilian company that connects small businesses to major marketplaces. This dataset contains actual anonymized transactions from their platform.

Geographic Coverage

Real orders from across Brazil:

  • SΓ£o Paulo (SP) - Brazil's largest city and economic hub
  • Rio de Janeiro (RJ) - Second largest city
  • BrasΓ­lia (DF) - Capital city
  • Plus 24 other Brazilian states

Product Categories (Portuguese β†’ English)

Real Olist categories:

  • cama_mesa_banho β†’ bed_bath_table
  • beleza_saude β†’ health_beauty
  • esportes_lazer β†’ sports_leisure
  • informatica_acessorios β†’ computers_accessories
  • moveis_decoracao β†’ furniture_decor
  • And 66 more authentic categories!

Data Quality (Real-World Issues)

This is real data, so you'll encounter:

  • βœ… Successful deliveries (~85% of orders)
  • ⚠️ Late deliveries (some orders exceed estimated date)
  • ❌ Canceled orders (~4% cancellation rate)
  • πŸ“¦ Orders in transit (no delivery date yet)
  • πŸ’° Various payment amounts in Brazilian Real (R$)

πŸ’‘ Learning Path

  1. Explore the real data (15 min) - Check actual Brazilian orders in BigQuery
  2. Run staging models (15 min) - Clean and standardize raw data
  3. Exercise 1: Create your first model (20 min)
  4. Build mart models (30 min) - Calculate customer lifetime value
  5. Exercise 2: Use ref() and macros (20 min)
  6. Add tests (20 min) - Ensure data quality
  7. Exercise 3: Write your own tests (20 min)
  8. Generate docs (10 min) - See your lineage graph!

Total: 3.5 hours of hands-on learning with 10,000 real orders!


πŸŽ‰ Ready to Start?

  1. βœ… Complete BigQuery setup (15 min)
  2. βœ… Load 10K real Olist orders to BigQuery
  3. βœ… Run dbt run - Build models!
  4. βœ… Run dbt test - Verify data quality!
  5. βœ… Run dbt docs serve - See your pipeline!
  6. βœ… Complete exercises!

You're working with the same data that 1000+ analysts use on Kaggle! πŸš€


πŸ“š Additional Resources


πŸ† Why This Training is Special

Real Data = Real Learning

  • βœ… Portfolio-worthy: "Built dbt pipeline with 10K Olist orders"
  • βœ… Verifiable: Order IDs match Kaggle dataset
  • βœ… Industry-recognized: Same data pros use for case studies
  • βœ… Authentic patterns: Real customer behavior, not simulated
  • βœ… Career-ready: Skills transfer directly to production work

Complete Training Package

  • βœ… Real 10K order dataset (included!)
  • βœ… Production-quality dbt project
  • βœ… Comprehensive documentation
  • βœ… 3 hands-on exercises with solutions
  • βœ… Ready to run (setup takes 15 minutes)

πŸ“ Data License

CC BY-NC-SA 4.0 (Creative Commons)

This is real commercial data made publicly available by Olist on Kaggle for educational and research purposes. The data has been anonymized (customer names removed, company names in reviews replaced with Game of Thrones houses).

Original Source: Olist (https://olist.com/)
Published On: Kaggle
Your Subset: 10,000 orders from 2017 (sampled from 100K total)


πŸš€ Let's build with REAL data!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages