---
title: CommercePipeline
emoji: "📊"
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 8501
pinned: true
short_description: E-commerce data pipeline to a live BI dashboard
---

# CommercePipeline

A batteries-included e-commerce analytics pipeline that turns raw operational data into trustworthy business marts — with data-quality gates that fail the build when the numbers can't be trusted.

## The problem

Analytics dashboards are only as good as the data behind them. In most teams the path from raw operational tables to a "revenue by day" chart is a pile of ad-hoc notebooks and untested SQL: no lineage, no tests, and no gate that stops a broken extract from silently poisoning every downstream metric. The result is dashboards nobody trusts and incidents nobody catches until a number looks wrong in a meeting.

CommercePipeline is a compact, end-to-end reference for doing it properly — ingestion, a warehouse, layered SQL models, automated data-quality gates, orchestration, and a dashboard — that runs locally in about a second with zero paid services.

## What it does

It generates a realistic synthetic e-commerce dataset, loads it into a DuckDB warehouse, builds staged SQL models (staging → marts), enforces data-quality gates that **halt the pipeline on bad data**, and serves the marts through a Streamlit dashboard.

```mermaid
flowchart LR
    subgraph ingest["1 · Ingest"]
        gen["Seeded generator<br/>(numpy)"] --> raw[("Raw files<br/>Parquet + CSV<br/>customers · products<br/>orders · order_items · events")]
    end

    subgraph load["2 · Load"]
        raw --> rawdb[("DuckDB<br/>schema: raw")]
    end

    subgraph transform["3 · Transform (SQL)"]
        rawdb --> stg["staging<br/>stg_customers · stg_products<br/>stg_orders · stg_order_items · stg_events"]
        stg --> intm["int_order_revenue"]
        intm --> marts[("marts<br/>daily_revenue · top_products<br/>customer_cohort_retention · funnel_conversion")]
    end

    subgraph quality["4 · Quality gate"]
        marts --> checks{"not-null · unique<br/>ranges · accepted values<br/>referential integrity"}
        checks -->|pass| ok([pipeline succeeds])
        checks -->|fail| stop([exit 1 — build fails])
    end

    marts --> dash["Streamlit dashboard<br/>revenue · products · funnel · cohorts"]
```

The four stages are composed by a dependency-free flow (`pipeline.flow`), exposed on a CLI (`python -m pipeline run`), wired as a Makefile DAG, and — optionally — as a Prefect flow (`pipeline.orchestrate`).

## Results / impact

A full run on the default seed (`make pipeline`) produces, in **~0.7s** end to end:

| Metric | Value |
| --- | --- |
| Raw rows generated | **103,839** across 5 tables (2,000 customers · 120 products · 12,000 orders · 30,120 order items · ~59,600 events) |
| Marts produced | **5** — `daily_revenue`, `top_products`, `customer_cohort_retention`, `funnel_conversion` (+ `int_order_revenue`) |
| Data-quality gates | **16 / 16 passing** (not-null, uniqueness, accepted ranges, accepted values, referential integrity, mart sanity) |
| Modelled revenue | ~$2.46M across 352 active days, ~35% view→purchase funnel conversion |
| Tests | **19 passing** in ~1.1s (`pytest -q`) |

Correctness is enforced, not assumed: tests assert exact aggregates on a known fixture, that the quality gate catches injected bad rows, and that two business invariants reconcile end to end — **funnel purchases == completed orders**, and **mart revenue == sum of completed line items**.

## Quickstart

```bash
# 1. install (lean: duckdb, pandas, pyarrow, streamlit, pytest)
pip install -r requirements.txt

# 2. run the whole pipeline: ingest -> load -> transform -> quality gate
make pipeline          # or: python -m pipeline run

# 3. run the tests
make test              # or: pytest -q

# 4. launch the dashboard (http://localhost:8501)
make dashboard         # or: streamlit run dashboard/app.py

# ...or do the pipeline + dashboard in one shot
make demo
```

Individual stages are addressable too: `python -m pipeline {ingest,load,transform,quality}`. Dataset size and seed are configurable via env vars, e.g. `CP_N_ORDERS=50000 CP_SEED=7 make pipeline`.

## Tech stack

- **Python 3.9+** — typed, dependency-light standard-library code
- **DuckDB** — in-process analytical warehouse (no server to run)
- **SQL** — layered staging → intermediate → mart models in plain `.sql` files
- **pandas / pyarrow** — synthetic data generation and Parquet I/O
- **Streamlit + Altair** — the analytics dashboard
- **pytest** — fixture-based transform tests + data-quality gate tests
- **Make** — the orchestration DAG (optional **Prefect** flow included)
- **Docker / docker-compose** and **GitHub Actions** — reproducible build + CI

## Deploy

**Container (one command):**

```bash
docker compose up --build      # builds the warehouse, serves the dashboard on :8501
```

The image runs the full pipeline at build and start time, then serves the Streamlit app with a health check on `/_stcore/health`.

**Streamlit Community Cloud:** push this repo to GitHub, create an app pointing at `dashboard/app.py`, and add `requirements.txt` as the dependency file. The app builds the warehouse on first load if one isn't present, so no external storage is required. (For a heavier dataset, run the pipeline in a build step and commit the `data/warehouse/commerce.duckdb` artifact.)

**CI:** `.github/workflows/ci.yml` runs the pipeline and the test suite on Python 3.9 / 3.11 / 3.12 on every push and PR, and uploads the generated warehouse as a build artifact.

## Dashboard

The Streamlit app reads the marts the quality gate signed off on and presents
them as a polished, data-forward BI product — a branded header, a bento-style
KPI grid with **pipeline-health proof cards** (rows processed, marts built,
data-quality gates passing) surfaced alongside the business KPIs, monospace
numerals for every figure, styled Altair charts in one cohesive teal palette,
and a lineage + per-check quality-gate view of the ingest → load → transform →
quality-gate flow. The theme (a confident teal accent on slate neutrals) lives
in `.streamlit/config.toml`.

```bash
make dashboard          # or: streamlit run dashboard/app.py  →  http://localhost:8501
```

<!-- Screenshots: revenue trend · top products · conversion funnel · cohort retention heatmap -->
_Add dashboard screenshots here (revenue trend, top products, conversion funnel, cohort retention heatmap)._