Skip to content

cbfacademy/Advance-DBT

Repository files navigation

Advanced dbt - Real Olist E-Commerce Dataset

dbt BigQuery License

Production-grade dbt project for the Advanced dbt training module. Built with 10,000 real orders from Olist Brazilian E-commerce.


🎯 Learning Objectives

After completing this module, you will be able to:

  • βœ… Build robust tests with dbt_expectations and custom generic tests
  • βœ… Implement data contracts (dbt 1.5+) and unit tests (dbt 1.8+)
  • βœ… Create reusable macros and leverage dbt packages
  • βœ… Set up CI/CD pipelines with GitHub Actions
  • βœ… Implement Slim CI for faster PR validation
  • βœ… Monitor pipeline performance and troubleshoot issues
  • βœ… Use dbt artifacts for custom tooling

πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • Google Cloud Platform account
  • dbt 1.7+ (pip install dbt-bigquery)
  • Basic dbt knowledge (models, ref, source)

Setup (15 minutes)

# 1. Clone the repository
git clone <repository-url>
cd dbt_advanced_olist

# 2. Run setup script (creates venv, installs dbt)
chmod +x scripts/setup.sh
./scripts/setup.sh

# 3. Configure profiles.yml
cp profiles.yml.example ~/.dbt/profiles.yml
# Edit ~/.dbt/profiles.yml with your GCP project ID and keyfile path

# 4. Load data to BigQuery (REQUIRED!)
export GCP_PROJECT_ID="your-project-id"
chmod +x scripts/load_data_to_bigquery.sh
./scripts/load_data_to_bigquery.sh

# 5. Test connection
source venv/bin/activate
dbt debug

# 6. Build the project
dbt build

Manual Data Loading (Alternative)

If the script doesn't work, load CSVs manually via BigQuery Console:

  1. Go to BigQuery Console β†’ Create Dataset named raw_data
  2. For each CSV in data/ folder: Create Table β†’ Upload β†’ Select CSV
  3. Use schema auto-detection or refer to scripts/load_data_to_bigquery.sh for column types

πŸ“ Project Structure

dbt_advanced_olist/
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ staging/           # Source cleaning (views)
β”‚   β”‚   β”œβ”€β”€ _sources.yml   # Source definitions + freshness
β”‚   β”‚   β”œβ”€β”€ _staging.yml   # Schema tests
β”‚   β”‚   └── stg_*.sql      # Staging models
β”‚   β”œβ”€β”€ intermediate/      # Business logic joins
β”‚   β”‚   └── int_*.sql      
β”‚   └── marts/             # BI-ready tables
β”‚       β”œβ”€β”€ _marts.yml     # Data contracts + unit tests
β”‚       β”œβ”€β”€ dim_customers.sql
β”‚       └── fct_orders.sql
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ generic/           # Reusable test macros
β”‚   β”‚   β”œβ”€β”€ test_is_positive.sql
β”‚   β”‚   β”œβ”€β”€ test_is_not_negative.sql
β”‚   β”‚   β”œβ”€β”€ test_is_recent.sql
β”‚   β”‚   └── test_row_count_delta.sql
β”‚   └── singular/          # Custom SQL tests
β”‚       β”œβ”€β”€ assert_orders_have_items.sql
β”‚       └── assert_delivery_dates_logical.sql
β”œβ”€β”€ macros/
β”‚   β”œβ”€β”€ limit_in_dev.sql
β”‚   β”œβ”€β”€ generate_date_parts.sql
β”‚   β”œβ”€β”€ safe_divide.sql
β”‚   └── logging_hooks.sql
β”œβ”€β”€ labs/                  # πŸ†• Hands-on exercises (6 labs)
β”‚   β”œβ”€β”€ lab-01/            # Custom Generic Tests
β”‚   β”œβ”€β”€ lab-02/            # Data Contracts & Unit Tests
β”‚   β”œβ”€β”€ lab-03/            # Creating DRY Macros
β”‚   β”œβ”€β”€ lab-04/            # Packages & Audit Helper
β”‚   β”œβ”€β”€ lab-05/            # Setting Up Slim CI
β”‚   └── lab-06/            # Analyzing Pipeline Performance
β”œβ”€β”€ seeds/                 # Reference data
β”œβ”€β”€ snapshots/             # SCD Type 2
β”œβ”€β”€ analyses/              # Ad-hoc queries
└── .github/workflows/     # CI/CD pipelines
    β”œβ”€β”€ ci.yml             # PR validation
    └── deploy.yml         # Production deployment

πŸ§ͺ Labs Overview

This module includes 6 hands-on labs (2 per session) with complete step-by-step instructions.

Session 1: Building Robust Tests

Lab Topic Duration Description
Lab 01 Custom Generic Tests 25 min Create reusable is_recent test macro
Lab 02 Data Contracts & Unit Tests 30 min Enforce schema + test business logic

Session 2: Macros & Packages

Lab Topic Duration Description
Lab 03 Creating DRY Macros 30 min Build generate_date_columns macro
Lab 04 Packages & Audit Helper 30 min Use dbt_utils and compare model versions

Session 3: CI/CD & Production

Lab Topic Duration Description
Lab 05 Setting Up Slim CI 30 min Configure GitHub Actions + state selection
Lab 06 Analyzing Pipeline Performance 25 min Parse artifacts + identify bottlenecks

Lab Structure

Each lab folder contains:

lab-XX/
β”œβ”€β”€ README.md        # πŸ“– Full instructions, background, step-by-step guide
β”œβ”€β”€ starter/         # πŸ’‘ Template files to complete
β”‚   └── *.sql/py/yml
└── solution/        # βœ… Complete working solutions
    └── *.sql/py/yml

πŸ§ͺ Testing Features

Schema Tests (dbt_expectations)

columns:
  - name: order_total
    tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 0
          max_value: 50000

Custom Generic Tests

{% test is_positive(model, column_name) %}
select * from {{ model }}
where {{ column_name }} <= 0
{% endtest %}

Data Contracts (dbt 1.5+)

models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: string
        constraints:
          - type: not_null
          - type: primary_key

Unit Tests (dbt 1.8+)

unit_tests:
  - name: test_vip_segment
    model: dim_customers
    given:
      - input: ref('stg_orders')
        rows:
          - {customer_id: 'c1', order_total: 600}
    expect:
      rows:
        - {customer_id: 'c1', customer_segment: 'VIP'}

πŸ”§ Key Commands

# Build everything
dbt build

# Run only modified + downstream (Slim CI)
dbt build --select state:modified+ --state prod-manifest/

# Test specific model
dbt test --select dim_customers

# Run unit tests only
dbt test --select test_type:unit

# Run generic tests only
dbt test --select test_type:generic

# Generate documentation
dbt docs generate && dbt docs serve

# Check source freshness
dbt source freshness

# Compile to see generated SQL
dbt compile --select model_name

πŸ“¦ Packages Used

Package Version Purpose
dbt_utils 1.1.1 Utilities (surrogate_key, date_spine)
dbt_expectations 0.10.3 Great Expectations-style tests
dbt_audit_helper 0.9.0 Compare model outputs
codegen 0.12.1 Generate schema YAML

Install packages:

dbt deps

🌐 CI/CD Pipeline

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Branch    │────▢│  Pull Request│────▢│    Main     β”‚
β”‚  (develop)  β”‚     β”‚   (CI test)  β”‚     β”‚  (deploy)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β”‚                    β”‚                    β”‚
      β–Ό                    β–Ό                    β–Ό
   dbt build          Slim CI:              Full build
   (local)         state:modified+         + artifacts

πŸ“Š Lab Completion Tracking

Use this checklist to track your progress:

  • Lab 01: Custom Generic Tests - is_recent test created and running
  • Lab 02: Data Contracts - Contract enforced on dim_customers, unit tests passing
  • Lab 03: DRY Macros - generate_date_columns macro compiles correctly
  • Lab 04: Audit Helper - Model comparison executed successfully
  • Lab 05: Slim CI - GitHub Actions workflow configured and tested
  • Lab 06: Performance Analysis - Bottlenecks identified and recommendations generated

πŸŽ“ Learning Outcomes

By completing all labs, you will demonstrate the ability to:

Outcome Verified In
Build a pipeline of models All labs
Apply testing to the models Labs 01, 02
Pick out problems when alerts fire Lab 06
Review a pipeline and locate bottlenecks Lab 06
Create reusable, maintainable code Labs 03, 04
Set up production-grade CI/CD Lab 05

πŸ“š Additional Resources


πŸ“ License

Dataset: CC BY-NC-SA 4.0 (Olist via Kaggle)


Built for Coding Black Females - Advanced dbt Module πŸš€

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages