title: CommercePipeline
emoji: 📊
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 8501
pinned: true
short_description: E-commerce data pipeline to a live BI dashboard
CommercePipeline
A batteries-included e-commerce analytics pipeline that turns raw operational data into trustworthy business marts — with data-quality gates that fail the build when the numbers can't be trusted.
The problem
Analytics dashboards are only as good as the data behind them. In most teams the path from raw operational tables to a "revenue by day" chart is a pile of ad-hoc notebooks and untested SQL: no lineage, no tests, and no gate that stops a broken extract from silently poisoning every downstream metric. The result is dashboards nobody trusts and incidents nobody catches until a number looks wrong in a meeting.
CommercePipeline is a compact, end-to-end reference for doing it properly — ingestion, a warehouse, layered SQL models, automated data-quality gates, orchestration, and a dashboard — that runs locally in about a second with zero paid services.
What it does
It generates a realistic synthetic e-commerce dataset, loads it into a DuckDB warehouse, builds staged SQL models (staging → marts), enforces data-quality gates that halt the pipeline on bad data, and serves the marts through a Streamlit dashboard.
```mermaid
flowchart LR
    subgraph ingest["1 · Ingest"]
        gen["Seeded generator<br/>(numpy)"] --> raw[("Raw files<br/>Parquet + CSV<br/>customers · products<br/>orders · order_items · events")]
    end
    subgraph load["2 · Load"]
        raw --> rawdb[("DuckDB<br/>schema: raw")]
    end
    subgraph transform["3 · Transform (SQL)"]
        rawdb --> stg["staging<br/>stg_customers · stg_products<br/>stg_orders · stg_order_items · stg_events"]
        stg --> intm["int_order_revenue"]
        intm --> marts[("marts<br/>daily_revenue · top_products<br/>customer_cohort_retention · funnel_conversion")]
    end
    subgraph quality["4 · Quality gate"]
        marts --> checks{"not-null · unique<br/>ranges · accepted values<br/>referential integrity"}
        checks -->|pass| ok([pipeline succeeds])
        checks -->|fail| stop([exit 1 — build fails])
    end
    marts --> dash["Streamlit dashboard<br/>revenue · products · funnel · cohorts"]
```
The four stages are composed by a dependency-free flow (pipeline.flow), exposed on a CLI (python -m pipeline run), wired as a Makefile DAG, and — optionally — as a Prefect flow (pipeline.orchestrate).
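
As a rough illustration of what a dependency-free flow can look like, the sketch below chains stage callables in order and stops at the first failure, returning a process-style exit code. The names `run_flow` and `Stage` and the stub stages are hypothetical; they are not taken from pipeline.flow.

```python
from typing import Callable, List, Tuple

# A stage is a label plus a zero-argument callable that raises on failure.
Stage = Tuple[str, Callable[[], None]]


def run_flow(stages: List[Stage]) -> int:
    """Run stages in order; return 0 on success, 1 at the first failure."""
    for name, fn in stages:
        try:
            fn()
        except Exception as exc:  # any stage error halts the whole flow
            print(f"[{name}] failed: {exc}")
            return 1
        print(f"[{name}] ok")
    return 0


# Hypothetical stand-ins for the real ingest/load/transform/quality stages.
def ingest() -> None: ...
def load() -> None: ...
def transform() -> None: ...


def quality() -> None:
    raise ValueError("null order_id found")  # simulate a failing gate


if __name__ == "__main__":
    code = run_flow([("ingest", ingest), ("load", load),
                     ("transform", transform), ("quality", quality)])
    print("exit code:", code)
```

Because a failing stage short-circuits the loop, nothing downstream of a failed quality gate runs, which mirrors the "exit 1" branch in the diagram.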
Results / impact
A full run on the default seed (make pipeline) produces, in ~0.7s end to end:
| Metric | Value |
|---|---|
| Raw rows generated | 103,839 across 5 tables (2,000 customers · 120 products · 12,000 orders · 30,120 order items · ~59,600 events) |
| Marts produced | 5 models: 4 marts (daily_revenue, top_products, customer_cohort_retention, funnel_conversion) plus the intermediate int_order_revenue |
| Data-quality gates | 16 / 16 passing (not-null, uniqueness, accepted ranges, accepted values, referential integrity, mart sanity) |
| Modelled revenue | ~$2.46M across 352 active days, ~35% view→purchase funnel conversion |
| Tests | 19 passing in ~1.1s (pytest -q) |
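
The check families named in the data-quality gates row can be sketched as small predicates over rows. This is a minimal illustration, not the project's gate: the function names, the column names, the 0 to 100,000 range, and the status values are all assumptions, and plain Python dicts stand in for warehouse tables.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def not_null(rows: List[Row], col: str) -> bool:
    """Every row has a non-null value in `col`."""
    return all(r.get(col) is not None for r in rows)


def unique(rows: List[Row], col: str) -> bool:
    """No value in `col` appears more than once."""
    values = [r.get(col) for r in rows]
    return len(values) == len(set(values))


def in_range(rows: List[Row], col: str, lo: float, hi: float) -> bool:
    """Every non-null value in `col` lies within [lo, hi]."""
    return all(lo <= r[col] <= hi for r in rows if r.get(col) is not None)


def accepted_values(rows: List[Row], col: str, allowed: Set[Any]) -> bool:
    """Every value in `col` belongs to the allowed set."""
    return all(r.get(col) in allowed for r in rows)


def references(child: List[Row], fk: str, parent: List[Row], pk: str) -> bool:
    """Every foreign key in `child` exists as a key in `parent`."""
    parent_keys = {r[pk] for r in parent}
    return all(r.get(fk) in parent_keys for r in child)


def build_checks(customers: List[Row], orders: List[Row]) -> Dict[str, bool]:
    # Hypothetical check list; the real gate runs 16 checks per the table.
    return {
        "orders.order_id not null": not_null(orders, "order_id"),
        "orders.order_id unique": unique(orders, "order_id"),
        "orders.total in [0, 100000]": in_range(orders, "total", 0, 100_000),
        "orders.status accepted": accepted_values(
            orders, "status", {"completed", "cancelled"}),
        "orders.customer_id references customers": references(
            orders, "customer_id", customers, "customer_id"),
    }


def run_gate(checks: Dict[str, bool]) -> int:
    """Print one line per check; return 1 if any check failed, else 0."""
    for name, passed in checks.items():
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
    return 0 if all(checks.values()) else 1


# Tiny made-up fixture, for illustration only.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "completed", "total": 42.0},
    {"order_id": 11, "customer_id": 2, "status": "cancelled", "total": 0.0},
]

if __name__ == "__main__":
    raise SystemExit(run_gate(build_checks(customers, orders)))
```

Returning an exit code rather than raising keeps the per-check report intact while still letting a caller such as Make or CI fail the build.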
Correctness is enforced, not assumed: tests assert exact aggregates on a known fixture, that the quality gate catches injected bad rows, and that two business invariants reconcile end to end — funnel purchases == completed orders, and mart revenue == sum of completed line items.
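
The two reconciliation invariants can be expressed as a pair of SQL comparisons. The sketch below is an assumption-laden illustration: it uses the standard-library sqlite3 module as a stand-in for DuckDB so it runs without extra installs, and the column names (`status`, `quantity`, `unit_price`, `revenue`, `step`, `n`) and fixture rows are invented for the example.

```python
import math
import sqlite3
from typing import Dict

# Made-up fixture: two completed orders, one cancelled order.
SETUP = """
CREATE TABLE orders (order_id INTEGER, status TEXT);
CREATE TABLE order_items (order_id INTEGER, quantity INTEGER, unit_price REAL);
CREATE TABLE daily_revenue (day TEXT, revenue REAL);
CREATE TABLE funnel_conversion (step TEXT, n INTEGER);
INSERT INTO orders VALUES (1, 'completed'), (2, 'completed'), (3, 'cancelled');
INSERT INTO order_items VALUES (1, 2, 10.0), (1, 1, 5.5), (2, 3, 4.0), (3, 1, 99.0);
INSERT INTO daily_revenue VALUES ('2024-01-01', 25.5), ('2024-01-02', 12.0);
INSERT INTO funnel_conversion VALUES ('view', 6), ('purchase', 2);
"""


def make_fixture() -> sqlite3.Connection:
    con = sqlite3.connect(":memory:")
    con.executescript(SETUP)
    return con


def scalar(con: sqlite3.Connection, sql: str) -> float:
    """Run a query that returns a single value."""
    return con.execute(sql).fetchone()[0]


def reconcile(con: sqlite3.Connection) -> Dict[str, bool]:
    purchases = scalar(
        con, "SELECT n FROM funnel_conversion WHERE step = 'purchase'")
    completed = scalar(
        con, "SELECT COUNT(*) FROM orders WHERE status = 'completed'")
    mart_revenue = scalar(con, "SELECT SUM(revenue) FROM daily_revenue")
    item_revenue = scalar(con, """
        SELECT SUM(i.quantity * i.unit_price)
        FROM order_items AS i
        JOIN orders AS o ON o.order_id = i.order_id
        WHERE o.status = 'completed'
    """)
    return {
        "funnel purchases == completed orders": purchases == completed,
        # Compare floats with a tolerance of one cent rather than exactly.
        "mart revenue == completed line items": math.isclose(
            mart_revenue, item_revenue, abs_tol=0.01),
    }


if __name__ == "__main__":
    for name, ok in reconcile(make_fixture()).items():
        print("PASS" if ok else "FAIL", name)
```

Checking the same quantity by two independent routes (mart versus source rows) is what lets a test catch a join or filter bug that a single aggregate assertion would miss.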
Quickstart
```bash
# 1. install (lean: duckdb, pandas, pyarrow, streamlit, pytest)
pip install -r requirements.txt

# 2. run the whole pipeline: ingest -> load -> transform -> quality gate
make pipeline     # or: python -m pipeline run

# 3. run the tests
make test         # or: pytest -q

# 4. launch the dashboard (http://localhost:8501)
make dashboard    # or: streamlit run dashboard/app.py

# ...or do the pipeline + dashboard in one shot
make demo
```
Individual stages are addressable too: python -m pipeline {ingest,load,transform,quality}. Dataset size and seed are configurable via env vars, e.g. CP_N_ORDERS=50000 CP_SEED=7 make pipeline.
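
One plausible way to wire the two environment variables into a seeded generator is sketched below. `Config`, `load_config`, and `sample_order_totals` are hypothetical names; the default of 12,000 orders is inferred from the results table, the default seed of 42 is a pure assumption, and the standard-library `random` module stands in for the project's generator.

```python
import os
import random
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Config:
    n_orders: int
    seed: int


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    """Read CP_N_ORDERS and CP_SEED, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return Config(
        n_orders=int(env.get("CP_N_ORDERS", "12000")),  # assumed default
        seed=int(env.get("CP_SEED", "42")),             # assumed default
    )


def sample_order_totals(cfg: Config, k: int = 3) -> List[float]:
    """Draw up to k order totals from a generator seeded by cfg.seed."""
    rng = random.Random(cfg.seed)  # private RNG: no global state touched
    return [round(rng.uniform(5, 500), 2) for _ in range(min(k, cfg.n_orders))]


if __name__ == "__main__":
    cfg = load_config()
    print(cfg, sample_order_totals(cfg))
```

Seeding a private generator from the config means two runs with the same CP_SEED produce the same data, which is what makes exact-aggregate tests possible.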
Tech stack
- Python 3.9+ — typed, dependency-light standard-library code
- DuckDB — in-process analytical warehouse (no server to run)
- SQL — layered staging → intermediate → mart models in plain .sql files
- pandas / pyarrow — synthetic data generation and Parquet I/O
- Streamlit + Altair — the analytics dashboard
- pytest — fixture-based transform tests + data-quality gate tests
- Make — the orchestration DAG (optional Prefect flow included)
- Docker / docker-compose and GitHub Actions — reproducible build + CI
Deploy
Container (one command):
```bash
docker compose up --build   # builds the warehouse, serves the dashboard on :8501
```
The image runs the full pipeline at build and start time, then serves the Streamlit app with a health check on /_stcore/health.
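
To probe the same health endpoint from outside the container, a small standard-library check along these lines should work. The function name `is_healthy` and the default base URL are assumptions for illustration; only the /_stcore/health path comes from the text above.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(base_url: str = "http://localhost:8501",
               timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/_stcore/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/_stcore/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    print("dashboard healthy:", is_healthy())
```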
Streamlit Community Cloud: push this repo to GitHub, create an app pointing at dashboard/app.py, and add requirements.txt as the dependency file. The app builds the warehouse on first load if one isn't present, so no external storage is required. (For a heavier dataset, run the pipeline in a build step and commit the data/warehouse/commerce.duckdb artifact.)
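
The build-on-first-load behaviour amounts to a check-then-build guard. The sketch below shows the general pattern under stated assumptions: `ensure_warehouse` and the `build` callback are hypothetical names, and the demo writes a placeholder file instead of running the real pipeline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_warehouse(path: Path, build: Callable[[Path], None]) -> bool:
    """Build the warehouse only if the file is missing.

    Returns True if a build was triggered, False if the file already existed.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    build(path)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "data" / "warehouse" / "commerce.duckdb"

        def fake_build(p: Path) -> None:
            p.write_bytes(b"placeholder")  # stand-in for the real pipeline

        print(ensure_warehouse(target, fake_build))  # first load: builds
        print(ensure_warehouse(target, fake_build))  # later loads: skips
```

In a Streamlit app a guard like this would typically run once at startup, before any chart queries the marts.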
CI: .github/workflows/ci.yml runs the pipeline and the test suite on Python 3.9 / 3.11 / 3.12 on every push and PR, and uploads the generated warehouse as a build artifact.
Dashboard
The Streamlit app reads the marts the quality gate signed off on and presents
them as a polished, data-forward BI product. A branded header sits above a
bento-style KPI grid, where pipeline-health proof cards (rows processed, marts
built, data-quality gates passing) appear alongside the business KPIs. Every
figure uses monospace numerals, and the Altair charts share one cohesive teal
palette. A lineage view traces the ingest → load → transform → quality-gate
flow and shows the result of each quality check. The theme (a teal accent on
slate neutrals) lives in .streamlit/config.toml.
```bash
make dashboard   # or: streamlit run dashboard/app.py → http://localhost:8501
```
Add dashboard screenshots here (revenue trend, top products, conversion funnel, cohort retention heatmap).