---
title: CommercePipeline
emoji: "📊"
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 8501
pinned: true
short_description: E-commerce data pipeline to a live BI dashboard
---
# CommercePipeline
A batteries-included e-commerce analytics pipeline that turns raw operational data into trustworthy business marts — with data-quality gates that fail the build when the numbers can't be trusted.
## The problem
Analytics dashboards are only as good as the data behind them. In most teams the path from raw operational tables to a "revenue by day" chart is a pile of ad-hoc notebooks and untested SQL: no lineage, no tests, and no gate that stops a broken extract from silently poisoning every downstream metric. The result is dashboards nobody trusts and incidents nobody catches until a number looks wrong in a meeting.
CommercePipeline is a compact, end-to-end reference for doing it properly — ingestion, a warehouse, layered SQL models, automated data-quality gates, orchestration, and a dashboard — that runs locally in about a second with zero paid services.
## What it does
It generates a realistic synthetic e-commerce dataset, loads it into a DuckDB warehouse, builds layered SQL models (staging → intermediate → marts), enforces data-quality gates that **halt the pipeline on bad data**, and serves the marts through a Streamlit dashboard.
```mermaid
flowchart LR
    subgraph ingest["1 · Ingest"]
        gen["Seeded generator<br/>(numpy)"] --> raw[("Raw files<br/>Parquet + CSV<br/>customers · products<br/>orders · order_items · events")]
    end
    subgraph load["2 · Load"]
        raw --> rawdb[("DuckDB<br/>schema: raw")]
    end
    subgraph transform["3 · Transform (SQL)"]
        rawdb --> stg["staging<br/>stg_customers · stg_products<br/>stg_orders · stg_order_items · stg_events"]
        stg --> intm["int_order_revenue"]
        intm --> marts[("marts<br/>daily_revenue · top_products<br/>customer_cohort_retention · funnel_conversion")]
    end
    subgraph quality["4 · Quality gate"]
        marts --> checks{"not-null · unique<br/>ranges · accepted values<br/>referential integrity"}
        checks -->|pass| ok([pipeline succeeds])
        checks -->|fail| stop(["exit 1: build fails"])
    end
    marts --> dash["Streamlit dashboard<br/>revenue · products · funnel · cohorts"]
```
The four stages are composed by a dependency-free flow (`pipeline.flow`), exposed on a CLI (`python -m pipeline run`), wired as a Makefile DAG, and — optionally — as a Prefect flow (`pipeline.orchestrate`).
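As a rough illustration of how four stages can be chained without an orchestration library, here is a minimal sketch. It is not the contents of `pipeline.flow`: the `run_flow` and `make_stage` helpers and the "each stage returns `True` on success" convention are assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical convention: a stage is a zero-argument callable that
# returns True on success and False on failure.
Stage = Callable[[], bool]

STAGE_NAMES = ("ingest", "load", "transform", "quality")


def run_flow(stages: List[Tuple[str, Stage]]) -> int:
    """Run stages in order and return a process-style exit code."""
    for name, stage in stages:
        if not stage():
            print(f"{name}: FAILED, stopping")
            return 1  # later stages never run on top of a failed one
        print(f"{name}: ok")
    return 0


def make_stage(name: str, log: List[str], ok: bool = True) -> Stage:
    """Build a stand-in stage that records that it ran."""
    def stage() -> bool:
        log.append(name)
        return ok
    return stage


# A healthy run executes all four stages and exits 0.
healthy_log: List[str] = []
healthy_exit = run_flow([(n, make_stage(n, healthy_log)) for n in STAGE_NAMES])

# A failing load stage stops the flow before transform and quality.
broken_log: List[str] = []
broken_exit = run_flow([
    ("ingest", make_stage("ingest", broken_log)),
    ("load", make_stage("load", broken_log, ok=False)),
    ("transform", make_stage("transform", broken_log)),
    ("quality", make_stage("quality", broken_log)),
])
```

A CLI entry point could hand the returned code to `sys.exit`, which is one way to get the "exit 1" behaviour shown in the diagram.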
## Results / impact
A full run on the default seed (`make pipeline`) produces, in **~0.7s** end to end:
| Metric | Value |
| --- | --- |
| Raw rows generated | **103,839** across 5 tables (2,000 customers · 120 products · 12,000 orders · 30,120 order items · ~59,600 events) |
| Marts produced | **5** models: 4 marts (`daily_revenue`, `top_products`, `customer_cohort_retention`, `funnel_conversion`) + 1 intermediate (`int_order_revenue`) |
| Data-quality gates | **16 / 16 passing** (not-null, uniqueness, accepted ranges, accepted values, referential integrity, mart sanity) |
| Modelled revenue | ~$2.46M across 352 active days, ~35% view→purchase funnel conversion |
| Tests | **19 passing** in ~1.1s (`pytest -q`) |
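The gate categories named in the table (not-null, uniqueness, accepted ranges, accepted values, referential integrity) can be sketched in plain Python over lists of row dictionaries. This is an illustration only: the project's own gate runs against the DuckDB warehouse, and the column names, accepted statuses, and amount range below are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def check_not_null(rows: List[Row], column: str) -> bool:
    return all(r.get(column) is not None for r in rows)


def check_unique(rows: List[Row], column: str) -> bool:
    values = [r[column] for r in rows]
    return len(values) == len(set(values))


def check_accepted_values(rows: List[Row], column: str, accepted: Iterable[Any]) -> bool:
    allowed = set(accepted)
    return all(r[column] in allowed for r in rows)


def check_range(rows: List[Row], column: str, low: float, high: float) -> bool:
    return all(low <= r[column] <= high for r in rows)


def check_references(children: List[Row], fk: str, parents: List[Row], pk: str) -> bool:
    parent_keys = {r[pk] for r in parents}
    return all(r[fk] in parent_keys for r in children)


def run_gate(orders: List[Row], customers: List[Row]) -> List[str]:
    """Return the names of failed checks (an empty list means the gate passes)."""
    checks = {
        "order_id not null": check_not_null(orders, "order_id"),
        "order_id unique": check_unique(orders, "order_id"),
        "status accepted": check_accepted_values(orders, "status", {"completed", "cancelled"}),
        "amount in range": check_range(orders, "amount", 0.0, 100_000.0),
        "customer exists": check_references(orders, "customer_id", customers, "customer_id"),
    }
    return [name for name, passed in checks.items() if not passed]


# Hypothetical sample data.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "completed", "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "status": "cancelled", "amount": 40.0},
]

clean_failures = run_gate(orders, customers)

# Inject a bad row: duplicate id, unknown customer, negative amount.
bad_row = {"order_id": 10, "customer_id": 99, "status": "completed", "amount": -5.0}
bad_failures = run_gate(orders + [bad_row], customers)
```

A non-empty failure list is what a caller would translate into a non-zero exit code.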
Correctness is enforced, not assumed: tests assert exact aggregates on a known fixture, that the quality gate catches injected bad rows, and that two business invariants reconcile end to end — **funnel purchases == completed orders**, and **mart revenue == sum of completed line items**.
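The two reconciliation invariants can be pictured with a toy example. The sketch below uses pure Python instead of DuckDB SQL, and the column names, integer-cent prices, and the `build_daily_revenue` helper are all assumptions for illustration rather than the project's models or tests.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical fixture: two completed orders, one cancelled.
orders: List[Row] = [
    {"order_id": 1, "order_date": "2024-01-01", "status": "completed"},
    {"order_id": 2, "order_date": "2024-01-01", "status": "cancelled"},
    {"order_id": 3, "order_date": "2024-01-02", "status": "completed"},
]
order_items: List[Row] = [
    {"order_id": 1, "quantity": 2, "unit_price_cents": 1500},
    {"order_id": 1, "quantity": 1, "unit_price_cents": 500},
    {"order_id": 2, "quantity": 4, "unit_price_cents": 1000},
    {"order_id": 3, "quantity": 3, "unit_price_cents": 700},
]
events: List[Row] = [
    {"event_type": "view", "order_id": None},
    {"event_type": "purchase", "order_id": 1},
    {"event_type": "purchase", "order_id": 3},
]


def build_daily_revenue(orders: List[Row], items: List[Row]) -> Dict[str, int]:
    """Toy stand-in for a daily revenue mart: completed revenue per day, in cents."""
    day_by_order = {o["order_id"]: o["order_date"] for o in orders if o["status"] == "completed"}
    revenue: Dict[str, int] = defaultdict(int)
    for item in items:
        day = day_by_order.get(item["order_id"])
        if day is not None:
            revenue[day] += item["quantity"] * item["unit_price_cents"]
    return dict(revenue)


daily_revenue = build_daily_revenue(orders, order_items)

# Recompute the same totals by an independent route and reconcile.
completed_ids = {o["order_id"] for o in orders if o["status"] == "completed"}
line_item_total = sum(
    i["quantity"] * i["unit_price_cents"] for i in order_items if i["order_id"] in completed_ids
)
funnel_purchases = sum(1 for e in events if e["event_type"] == "purchase")

revenue_reconciles = sum(daily_revenue.values()) == line_item_total
funnel_reconciles = funnel_purchases == len(completed_ids)
```

Integer cents are used here so the equality check is exact; a test against a real warehouse would need to pick a comparable exact type or a tolerance.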
## Quickstart
```bash
# 1. install (lean: duckdb, pandas, pyarrow, streamlit, pytest)
pip install -r requirements.txt
# 2. run the whole pipeline: ingest -> load -> transform -> quality gate
make pipeline # or: python -m pipeline run
# 3. run the tests
make test # or: pytest -q
# 4. launch the dashboard (http://localhost:8501)
make dashboard # or: streamlit run dashboard/app.py
# ...or do the pipeline + dashboard in one shot
make demo
```
Individual stages are addressable too: `python -m pipeline {ingest,load,transform,quality}`. Dataset size and seed are configurable via env vars, e.g. `CP_N_ORDERS=50000 CP_SEED=7 make pipeline`.
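One common way to read settings like these is a small frozen dataclass populated from the environment. The sketch below is an assumption about how that could look, not the project's config module: the `Settings` class and `load_settings` function are hypothetical, the `12000` default mirrors the order count reported for the default run, and the seed default of `42` is purely a placeholder.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    n_orders: int
    seed: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read CP_N_ORDERS and CP_SEED, falling back to assumed defaults."""
    return Settings(
        n_orders=int(env.get("CP_N_ORDERS", "12000")),  # assumed default
        seed=int(env.get("CP_SEED", "42")),  # placeholder default
    )


# Passing a plain dict keeps the example independent of the real environment.
defaults = load_settings({})
overridden = load_settings({"CP_N_ORDERS": "50000", "CP_SEED": "7"})
```

Accepting the environment as a parameter makes the function easy to test without mutating `os.environ`.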
## Tech stack
- **Python 3.9+** — typed, dependency-light standard-library code
- **DuckDB** — in-process analytical warehouse (no server to run)
- **SQL** — layered staging → intermediate → mart models in plain `.sql` files
- **pandas / pyarrow** — synthetic data generation and Parquet I/O
- **Streamlit + Altair** — the analytics dashboard
- **pytest** — fixture-based transform tests + data-quality gate tests
- **Make** — the orchestration DAG (optional **Prefect** flow included)
- **Docker / docker-compose** and **GitHub Actions** — reproducible build + CI
## Deploy
**Container (one command):**
```bash
docker compose up --build # builds the warehouse, serves the dashboard on :8501
```
The image runs the full pipeline at build and start time, then serves the Streamlit app with a health check on `/_stcore/health`.
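To check the same endpoint from a script, for example in a smoke test after `docker compose up`, a standard-library probe along these lines may be enough. The helper names are hypothetical, and the `fetch` parameter exists only so the logic can be exercised without a running server.

```python
import urllib.error
import urllib.request
from typing import Callable

HEALTH_PATH = "/_stcore/health"


def health_url(base_url: str) -> str:
    """Join a base URL such as http://localhost:8501 with the health path."""
    return base_url.rstrip("/") + HEALTH_PATH


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def is_healthy(base_url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """True when the health endpoint answers with HTTP 200."""
    return fetch(health_url(base_url)) == 200
```

With the container running, `is_healthy("http://localhost:8501")` should return `True` once Streamlit has finished starting.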
**Streamlit Community Cloud:** push this repo to GitHub, create an app pointing at `dashboard/app.py`, and add `requirements.txt` as the dependency file. The app builds the warehouse on first load if one isn't present, so no external storage is required. (For a heavier dataset, run the pipeline in a build step and commit the `data/warehouse/commerce.duckdb` artifact.)
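The "build on first load" behaviour can be sketched as a guard around the build step. This is an assumption about the shape of that logic, not the code in `dashboard/app.py`: `ensure_warehouse` and `fake_build` are hypothetical names, and the build callable stands in for running the real pipeline.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Relative path taken from this README.
WAREHOUSE_PATH = Path("data/warehouse/commerce.duckdb")


def ensure_warehouse(path: Path, build: Callable[[Path], None]) -> bool:
    """Build the warehouse only if it is missing; return True if a build ran."""
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    build(path)
    return True


def fake_build(path: Path) -> None:
    """Stand-in for the real pipeline: just creates an empty file."""
    path.write_bytes(b"")


# Demonstrate inside a temporary directory so nothing touches the repo.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / WAREHOUSE_PATH
    first_call_built = ensure_warehouse(target, fake_build)   # file missing: builds
    second_call_built = ensure_warehouse(target, fake_build)  # file present: skips
```

Committing a prebuilt warehouse, as suggested for heavier datasets, means the guard takes the "skip" path on first load.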
**CI:** `.github/workflows/ci.yml` runs the pipeline and the test suite on Python 3.9 / 3.11 / 3.12 on every push and PR, and uploads the generated warehouse as a build artifact.
## Dashboard
The Streamlit app reads the marts the quality gate signed off on and presents
them as a polished, data-forward BI product. The layout pairs a branded header
with a bento-style KPI grid, where **pipeline-health proof cards** (rows
processed, marts built, data-quality gates passing) sit alongside the business
KPIs. Every figure uses monospace numerals, and the styled Altair charts share
one cohesive teal palette. A lineage view with per-check quality-gate results
traces the ingest → load → transform → quality-gate flow. The theme (a
confident teal accent on slate neutrals) lives in `.streamlit/config.toml`.
```bash
make dashboard # or: streamlit run dashboard/app.py → http://localhost:8501
```
_Add dashboard screenshots here (revenue trend, top products, conversion funnel, cohort retention heatmap)._