metadata

title: CommercePipeline
emoji: 📊
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 8501
pinned: true
short_description: E-commerce data pipeline to a live BI dashboard

CommercePipeline

A batteries-included e-commerce analytics pipeline that turns raw operational data into trustworthy business marts — with data-quality gates that fail the build when the numbers can't be trusted.

The problem

Analytics dashboards are only as good as the data behind them. In most teams the path from raw operational tables to a "revenue by day" chart is a pile of ad-hoc notebooks and untested SQL: no lineage, no tests, and no gate that stops a broken extract from silently poisoning every downstream metric. The result is dashboards nobody trusts and incidents nobody catches until a number looks wrong in a meeting.

CommercePipeline is a compact, end-to-end reference for doing it properly — ingestion, a warehouse, layered SQL models, automated data-quality gates, orchestration, and a dashboard — that runs locally in about a second with zero paid services.

What it does

It generates a realistic synthetic e-commerce dataset, loads it into a DuckDB warehouse, builds staged SQL models (staging → marts), enforces data-quality gates that halt the pipeline on bad data, and serves the marts through a Streamlit dashboard.

flowchart LR
    subgraph ingest["1 · Ingest"]
        gen["Seeded generator<br/>(numpy)"] --> raw[("Raw files<br/>Parquet + CSV<br/>customers · products<br/>orders · order_items · events")]
    end

    subgraph load["2 · Load"]
        raw --> rawdb[("DuckDB<br/>schema: raw")]
    end

    subgraph transform["3 · Transform (SQL)"]
        rawdb --> stg["staging<br/>stg_customers · stg_products<br/>stg_orders · stg_order_items · stg_events"]
        stg --> intm["int_order_revenue"]
        intm --> marts[("marts<br/>daily_revenue · top_products<br/>customer_cohort_retention · funnel_conversion")]
    end

    subgraph quality["4 · Quality gate"]
        marts --> checks{"not-null · unique<br/>ranges · accepted values<br/>referential integrity"}
        checks -->|pass| ok([pipeline succeeds])
        checks -->|fail| stop([exit 1 — build fails])
    end

    marts --> dash["Streamlit dashboard<br/>revenue · products · funnel · cohorts"]

The four stages are composed by a dependency-free flow (pipeline.flow), exposed on a CLI (python -m pipeline run), wired as a Makefile DAG, and — optionally — as a Prefect flow (pipeline.orchestrate).
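
As a rough illustration of how a dependency-free flow and its CLI might fit together, the sketch below runs four stage functions in order and stops at the first failure. The stage bodies, the `STAGES` mapping, and the `run_flow` and `main` helpers are hypothetical; the actual `pipeline.flow` module may be organised differently.

```python
import argparse
from typing import Callable, Dict, List


# Hypothetical stage functions: each returns True on success.
def ingest() -> bool:
    return True


def load() -> bool:
    return True


def transform() -> bool:
    return True


def quality() -> bool:
    return True


# Stage name -> callable. Dicts preserve insertion order, so this is the run order.
STAGES: Dict[str, Callable[[], bool]] = {
    "ingest": ingest,
    "load": load,
    "transform": transform,
    "quality": quality,
}


def run_flow(names: List[str]) -> int:
    """Run the named stages in order; return an exit code (0 ok, 1 failed)."""
    for name in names:
        if not STAGES[name]():
            print(f"stage {name!r} failed")
            return 1
        print(f"stage {name!r} ok")
    return 0


def main(argv: List[str]) -> int:
    parser = argparse.ArgumentParser(prog="pipeline")
    # "run" executes every stage; a stage name executes just that stage.
    parser.add_argument("command", choices=["run", *STAGES])
    args = parser.parse_args(argv)
    names = list(STAGES) if args.command == "run" else [args.command]
    return run_flow(names)
```

A module entry point would typically end with `sys.exit(main(sys.argv[1:]))`, so a failed stage surfaces as a non-zero exit code to Make or CI.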

Results / impact

A full run on the default seed (make pipeline) produces, in ~0.7s end to end:

| Metric | Value |
| --- | --- |
| Raw rows generated | 103,839 across 5 tables (2,000 customers · 120 products · 12,000 orders · 30,120 order items · ~59,600 events) |
| Models produced | 5: four marts (`daily_revenue`, `top_products`, `customer_cohort_retention`, `funnel_conversion`) plus the intermediate `int_order_revenue` |
| Data-quality gates | 16 / 16 passing (not-null, uniqueness, accepted ranges, accepted values, referential integrity, mart sanity) |
| Modelled revenue | ~$2.46M across 352 active days, ~35% view→purchase funnel conversion |
| Tests | 19 passing in ~1.1s (`pytest -q`) |
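
To make the gate categories in the table concrete, here is a minimal sketch of five check types evaluated over in-memory rows. The row shapes, column names, status values, range bounds, and the `run_gate` helper are illustrative assumptions; the project's own checks run against the DuckDB warehouse and may look quite different.

```python
from typing import Any, Dict, List, Set, Tuple

Row = Dict[str, Any]


def not_null(rows: List[Row], col: str) -> bool:
    return all(r[col] is not None for r in rows)


def unique(rows: List[Row], col: str) -> bool:
    values = [r[col] for r in rows]
    return len(values) == len(set(values))


def in_range(rows: List[Row], col: str, lo: float, hi: float) -> bool:
    return all(lo <= r[col] <= hi for r in rows)


def accepted_values(rows: List[Row], col: str, allowed: Set[str]) -> bool:
    return all(r[col] in allowed for r in rows)


def references(child: List[Row], fk: str, parent: List[Row], pk: str) -> bool:
    # Referential integrity: every foreign key must exist in the parent table.
    parent_keys = {r[pk] for r in parent}
    return all(r[fk] in parent_keys for r in child)


def run_gate(customers: List[Row], orders: List[Row]) -> Tuple[int, List[str]]:
    """Evaluate every check; return (exit_code, names of failed checks)."""
    checks = {
        "orders.order_id not null": not_null(orders, "order_id"),
        "orders.order_id unique": unique(orders, "order_id"),
        "orders.amount in [0, 10000]": in_range(orders, "amount", 0, 10_000),
        "orders.status accepted": accepted_values(
            orders, "status", {"completed", "cancelled"}
        ),
        "orders.customer_id -> customers": references(
            orders, "customer_id", customers, "customer_id"
        ),
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (1 if failed else 0), failed
```

Passing the returned code to `sys.exit` is one way to get the "exit 1, build fails" behaviour shown in the flowchart.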

Correctness is enforced, not assumed: tests assert exact aggregates on a known fixture, that the quality gate catches injected bad rows, and that two business invariants reconcile end to end — funnel purchases == completed orders, and mart revenue == sum of completed line items.
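
The two reconciliation invariants can be sketched on a tiny hand-built fixture. Everything below (row shapes, column names, dates, and the `reconcile` helper) is a hypothetical stand-in for the project's real tests, which read the marts from the warehouse.

```python
import math
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def completed_order_ids(orders: List[Row]) -> Set[int]:
    return {o["order_id"] for o in orders if o["status"] == "completed"}


def completed_line_item_revenue(orders: List[Row], items: List[Row]) -> float:
    done = completed_order_ids(orders)
    return sum(i["quantity"] * i["unit_price"] for i in items if i["order_id"] in done)


def reconcile(
    orders: List[Row], items: List[Row], daily_revenue: List[Row], funnel: Dict[str, int]
) -> Dict[str, bool]:
    """Compare mart outputs against values recomputed from the source rows."""
    mart_revenue = sum(r["revenue"] for r in daily_revenue)
    expected = completed_line_item_revenue(orders, items)
    return {
        "funnel purchases == completed orders":
            funnel["purchase"] == len(completed_order_ids(orders)),
        "mart revenue == completed line items":
            math.isclose(mart_revenue, expected, abs_tol=0.005),
    }


# Source rows: two completed orders and one cancelled order.
orders = [
    {"order_id": 1, "status": "completed"},
    {"order_id": 2, "status": "completed"},
    {"order_id": 3, "status": "cancelled"},
]
items = [
    {"order_id": 1, "quantity": 2, "unit_price": 10.0},
    {"order_id": 1, "quantity": 1, "unit_price": 5.5},
    {"order_id": 2, "quantity": 3, "unit_price": 4.0},
    {"order_id": 3, "quantity": 1, "unit_price": 99.0},
]
# Mart outputs as the pipeline would be expected to produce them for this fixture.
daily_revenue = [
    {"day": "2024-01-01", "revenue": 25.5},
    {"day": "2024-01-02", "revenue": 12.0},
]
funnel = {"view": 5, "purchase": 2}

result = reconcile(orders, items, daily_revenue, funnel)
```

If the cancelled order leaked into either mart, the corresponding entry in `result` would be `False`.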

Quickstart

# 1. install (lean: duckdb, pandas, pyarrow, streamlit, pytest)
pip install -r requirements.txt

# 2. run the whole pipeline: ingest -> load -> transform -> quality gate
make pipeline          # or: python -m pipeline run

# 3. run the tests
make test              # or: pytest -q

# 4. launch the dashboard (http://localhost:8501)
make dashboard         # or: streamlit run dashboard/app.py

# ...or do the pipeline + dashboard in one shot
make demo

Individual stages are addressable too: python -m pipeline {ingest,load,transform,quality}. Dataset size and seed are configurable via env vars, e.g. CP_N_ORDERS=50000 CP_SEED=7 make pipeline.
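
One plausible way to wire up the `CP_N_ORDERS` and `CP_SEED` variables is shown below. This is a sketch under assumptions: the `Config` class, `load_config`, and `generate_order_totals` are hypothetical names, the default seed of 42 is a placeholder (the README does not state the real default), and the standard-library `random` module stands in for the numpy-based generator.

```python
import os
import random
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Config:
    n_orders: int
    seed: int


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    # Assumed defaults: 12,000 orders matches the default run described above;
    # the seed value is a placeholder, not the project's documented default.
    return Config(
        n_orders=int(env.get("CP_N_ORDERS", "12000")),
        seed=int(env.get("CP_SEED", "42")),
    )


def generate_order_totals(cfg: Config) -> List[float]:
    # A private Random instance keeps the output a pure function of the seed.
    rng = random.Random(cfg.seed)
    return [round(rng.uniform(5.0, 500.0), 2) for _ in range(cfg.n_orders)]
```

Because the generator is seeded per run rather than relying on global state, two runs with the same `CP_SEED` yield identical data, which is what lets tests assert exact aggregates.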

Tech stack

Python 3.9+ — typed, dependency-light standard-library code
DuckDB — in-process analytical warehouse (no server to run)
SQL — layered staging → intermediate → mart models in plain .sql files
pandas / pyarrow — synthetic data generation and Parquet I/O
Streamlit + Altair — the analytics dashboard
pytest — fixture-based transform tests + data-quality gate tests
Make — the orchestration DAG (optional Prefect flow included)
Docker / docker-compose and GitHub Actions — reproducible build + CI
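
The staging → intermediate → mart layering listed above can be illustrated with a few views. To keep the sketch runnable without third-party installs it uses the standard-library `sqlite3` module as a stand-in for DuckDB; the `raw_` table prefix (SQLite has no `raw` schema here), the column names, and the sample rows are assumptions, and only the model names `stg_orders`, `stg_order_items`, `int_order_revenue`, and `daily_revenue` come from the README.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- raw layer: data as loaded, inconsistencies included
    CREATE TABLE raw_orders (order_id INTEGER, order_date TEXT, status TEXT);
    CREATE TABLE raw_order_items (order_id INTEGER, quantity INTEGER, unit_price REAL);
    INSERT INTO raw_orders VALUES
        (1, '2024-01-01', 'COMPLETED'),
        (2, '2024-01-01', 'completed'),
        (3, '2024-01-02', 'cancelled');
    INSERT INTO raw_order_items VALUES (1, 2, 10.0), (2, 1, 5.0), (3, 4, 2.5);

    -- staging: light cleanup only (here, normalising status casing)
    CREATE VIEW stg_orders AS
        SELECT order_id, order_date, lower(status) AS status FROM raw_orders;
    CREATE VIEW stg_order_items AS
        SELECT order_id, quantity, unit_price FROM raw_order_items;

    -- intermediate: one row per order with its revenue
    CREATE VIEW int_order_revenue AS
        SELECT o.order_id, o.order_date, o.status,
               SUM(i.quantity * i.unit_price) AS revenue
        FROM stg_orders AS o
        JOIN stg_order_items AS i ON i.order_id = o.order_id
        GROUP BY o.order_id, o.order_date, o.status;

    -- mart: business-facing aggregate over completed orders
    CREATE VIEW daily_revenue AS
        SELECT order_date, SUM(revenue) AS revenue, COUNT(*) AS orders
        FROM int_order_revenue
        WHERE status = 'completed'
        GROUP BY order_date;
    """
)

rows = con.execute(
    "SELECT order_date, revenue, orders FROM daily_revenue ORDER BY order_date"
).fetchall()
```

Each layer reads only from the one beneath it, so a fix to status normalisation in staging propagates to every mart without touching the mart SQL.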

Deploy

Container (one command):

docker compose up --build      # builds the warehouse, serves the dashboard on :8501

The image runs the full pipeline at build time and again at container start, then serves the Streamlit app with a health check on /_stcore/health.
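
For scripts that need to wait for the container, a small probe of the health endpoint might look like the sketch below. The `health_url` and `is_healthy` helpers are hypothetical and not part of the repository; only the `/_stcore/health` path and port 8501 come from the README.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Append the Streamlit health path to a base URL."""
    return base_url.rstrip("/") + "/_stcore/health"


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError,
        # all of which derive from OSError.
        return False
```

For example, `is_healthy("http://localhost:8501")` could be polled in a loop after `docker compose up` before opening the dashboard.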

Streamlit Community Cloud: push this repo to GitHub, create an app pointing at dashboard/app.py, and add requirements.txt as the dependency file. The app builds the warehouse on first load if one isn't present, so no external storage is required. (For a heavier dataset, run the pipeline in a build step and commit the data/warehouse/commerce.duckdb artifact.)
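
The build-on-first-load behaviour could be implemented along these lines. This is a guess at the shape, not the app's actual code: `ensure_warehouse` and the `build` callback are hypothetical, and only the `data/warehouse/commerce.duckdb` path is taken from the README.

```python
from pathlib import Path
from typing import Callable

WAREHOUSE = Path("data/warehouse/commerce.duckdb")


def ensure_warehouse(path: Path, build: Callable[[Path], None]) -> bool:
    """Build the warehouse only when the file is absent; return True if built."""
    if path.exists():
        return False
    # Create data/warehouse/ (or the equivalent parent) before building into it.
    path.parent.mkdir(parents=True, exist_ok=True)
    build(path)
    return True
```

A committed warehouse artifact makes `path.exists()` true from the start, so the build step is skipped, which matches the suggestion for heavier datasets.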

CI: .github/workflows/ci.yml runs the pipeline and the test suite on Python 3.9 / 3.11 / 3.12 on every push and PR, and uploads the generated warehouse as a build artifact.

Dashboard

The Streamlit app reads the marts the quality gate signed off on and presents them as a data-forward BI product. It has a branded header and a bento-style KPI grid in which pipeline-health proof cards (rows processed, marts built, data-quality gates passing) sit alongside the business KPIs. Every figure uses monospace numerals, the Altair charts share one cohesive teal palette, and a lineage view shows the ingest → load → transform → quality-gate flow together with per-check quality-gate results. The theme (a teal accent on slate neutrals) lives in .streamlit/config.toml.

make dashboard          # or: streamlit run dashboard/app.py  →  http://localhost:8501

Add dashboard screenshots here (revenue trend, top products, conversion funnel, cohort retention heatmap).