	# Per-query inference energy ledger

	Riprap surfaces the energy and token cost of every inference call it
	makes during a briefing. The numbers are measured off the L4 GPU
	when the inference Space is reachable — not data-sheet estimates.

	```
	5 Stones · 21 fired · 11 evidence cards · 14.0s wall-clock · ✓ 1.4 Wh / 6.9K tok inference
	```

	The chip on the Findings status row reports total energy (Wh) plus
	total tokens. The leading icon discloses how the number was derived:

	| Icon | Meaning |
	|---|---|
	| `✓` | All recorded calls came back with a real NVML reading from the L4 GPU |
	| `◐` | Some calls measured, others fell back to the data-sheet estimate |
	| `~` | All calls used the data-sheet estimate (proxy unreachable, NVML disabled, or local-only run) |

	Hover the chip for the full breakdown — call count, hardware, prompt
	vs completion split, and the method.
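
The icon rule in the table can be sketched as a small function over the per-call `measured` flags. This is a minimal illustration, not Riprap's actual code: the name `ledger_icon` is hypothetical, and treating an empty ledger as `~` is an assumption.

```python
from typing import Iterable


def ledger_icon(measured_flags: Iterable[bool]) -> str:
    """Pick the chip's leading icon from the per-call `measured` flags."""
    flags = list(measured_flags)
    if flags and all(flags):
        return "✓"  # every recorded call had a real NVML reading
    if any(flags):
        return "◐"  # mix of measured and data-sheet-estimated calls
    return "~"      # nothing measured (assumption: also used for an empty ledger)
```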

	---

	## What's measured vs. what's estimated

	| Field | Source |
	|---|---|
	| `duration_s` | Real wall-clock time on the client side (`time.monotonic` around each call) |
	| `prompt_tokens`, `completion_tokens` | Reported by the model server (LiteLLM `usage` block) for non-stream LLM calls |
	| `completion_tokens` (streaming) | Estimated as `len(response_text) / 4` when the backend doesn't surface a final usage block (Ollama path) |
	| `power_w` | Measured: `nvmlDeviceGetPowerUsage` on the L4 inference Space, sampled every 100 ms, mean of samples bracketing each call |
	| `joules`, `wh` | `joules = power_w × duration_s` (when `measured: true`) or `data-sheet_W × duration_s` (when `measured: false`); `wh = joules / 3600` |

	Each call record on the ledger carries a `measured: bool` flag plus
	the exact `power_w` value used so a reviewer can audit any row.
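
As a rough sketch of how one ledger row relates these fields, the dataclass below derives `joules` and `wh` from `power_w` and `duration_s`. The class name `CallRecord` and the helper `estimate_completion_tokens` are hypothetical, the real record layout may differ, and integer division is used here only to return a whole token count.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    duration_s: float          # client-side wall-clock seconds
    power_w: float             # measured mean draw, or the data-sheet figure
    measured: bool             # True when power_w came from NVML
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def joules(self) -> float:
        return self.power_w * self.duration_s

    @property
    def wh(self) -> float:
        return self.joules / 3600.0  # 1 Wh = 3600 J


def estimate_completion_tokens(response_text: str) -> int:
    """Streaming fallback: roughly four characters per token."""
    return len(response_text) // 4
```

Because each row keeps both `measured` and the exact `power_w`, a reviewer can recompute `joules` for any row and compare it with the reported value.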

	---

	## How the measurement works

	The L4 inference Space (`msradam/riprap-vllm`) runs a FastAPI proxy
	in front of vLLM (port 8000) and the riprap-models EO service
	(port 7861). The proxy initialises NVML at startup and runs a
	background sampler that reads `nvmlDeviceGetPowerUsage` every
	100 ms into a 60-second ring buffer.

	```
	inference-vllm/proxy.py::_power_sampler
	├── NVML init at startup, single L4 device handle
	├── 100 ms ring buffer (600 samples = 60 s of history)
	└── degrades to no-op if NVML init fails
	```
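
A minimal sketch of such a sampler follows, assuming the `pynvml` bindings for the NVML calls. The class and function names are hypothetical and the real `_power_sampler` may be structured differently. The power reader is injected so the ring buffer can be exercised without a GPU, and passing `None` models the no-op degradation.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class PowerSampler:
    """Background sampler writing (timestamp, watts) into a fixed-size ring buffer."""

    def __init__(self, read_power_w: Optional[Callable[[], float]],
                 interval_s: float = 0.1, history_s: float = 60.0) -> None:
        self._read = read_power_w
        self._interval_s = interval_s
        # 60 s / 0.1 s = 600 slots; the deque drops the oldest sample when full.
        self.samples: Deque[Tuple[float, float]] = deque(
            maxlen=round(history_s / interval_s))
        self._stop = threading.Event()

    def sample_once(self) -> None:
        if self._read is not None:  # None means NVML init failed: do nothing
            self.samples.append((time.monotonic(), self._read()))

    def run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self.sample_once()

    def start(self) -> threading.Thread:
        thread = threading.Thread(target=self.run, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()


def make_nvml_reader() -> Optional[Callable[[], float]]:
    """Return a watts reader for GPU 0, or None if NVML is unavailable."""
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    except Exception:
        return None
    # nvmlDeviceGetPowerUsage reports milliwatts.
    return lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
```

In this sketch, `PowerSampler(make_nvml_reader()).start()` would be called once at proxy startup.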

	When the proxy forwards a POST to vLLM or riprap-models, it stamps
	the upstream call window `(t0, t1)` and computes the mean power
	across the samples that fall inside that window. The result lands
	on the response as headers:

	```
	X-GPU-Power-W      mean draw in watts
	X-GPU-Energy-J     energy in joules over the window
	X-GPU-Duration-S   forwarded-call duration in seconds
	X-GPU-Device       "NVIDIA L4"
	```
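
The window computation can be sketched as below: keep the samples whose timestamps fall inside `(t0, t1)`, average them, and multiply by the window length. The function name `window_headers`, the three-decimal formatting, and returning `None` when no sample falls in the window are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def window_headers(samples: Iterable[Tuple[float, float]], t0: float, t1: float,
                   device: str = "NVIDIA L4") -> Optional[Dict[str, str]]:
    """Build the X-GPU-* headers from (timestamp, watts) samples inside [t0, t1]."""
    in_window = [watts for ts, watts in samples if t0 <= ts <= t1]
    if not in_window:
        return None  # assumption: no headers, so the client falls back to an estimate
    mean_w = sum(in_window) / len(in_window)
    duration_s = t1 - t0
    return {
        "X-GPU-Power-W": f"{mean_w:.3f}",
        "X-GPU-Energy-J": f"{mean_w * duration_s:.3f}",
        "X-GPU-Duration-S": f"{duration_s:.3f}",
        "X-GPU-Device": device,
    }
```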

	`app/inference.py::_post()` reads those headers off the proxy
	response and forwards them into `emissions.Tracker.record_ml`. The
	tracker stamps `measured=True` and uses the exact joule value.
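
On the client side, reading the headers might look like the following sketch. The helper name `read_gpu_headers` is hypothetical, and treating a missing or malformed header as "not measured" is an assumption about how the fallback is triggered.

```python
from typing import Mapping, Optional, Tuple


def read_gpu_headers(headers: Mapping[str, str]) -> Optional[Tuple[float, float, float]]:
    """Return (power_w, joules, duration_s) from X-GPU-* headers, or None."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        return (float(lowered["x-gpu-power-w"]),
                float(lowered["x-gpu-energy-j"]),
                float(lowered["x-gpu-duration-s"]))
    except (KeyError, ValueError):
        return None  # caller records the call with measured=False
```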

	For the LLM client path (`app/llm.py::chat()`) we route through
	LiteLLM, which doesn't surface response headers. So instead the
	client brackets the call with two GETs to `/v1/power`:

	```python
	import time

	p0 = _sample_gpu_power_w()        # GET /v1/power: ~50 ms, returns the 1 s average
	t0 = time.monotonic()
	resp = _router.completion(...)    # the actual LLM call
	duration_s = time.monotonic() - t0
	p1 = _sample_gpu_power_w()        # second GET /v1/power, after the call
	avg = (p0 + p1) / 2               # mean of the two bracket samples, in watts
	```

	`avg` approximates the mean power during the call from the two
	bracketing samples; `avg × duration_s` gives joules. The tracker
	records `power_w_real=avg`, `joules_real=avg×duration_s`, and
	`measured=True`.
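
A self-contained version of the bracket pattern, with the sampler and the call injected so it runs without a GPU, could look like this. The function name `bracketed_call` and the returned dict shape are hypothetical, and falling back to a data-sheet wattage when either sample is missing is an assumption, not confirmed behaviour.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def bracketed_call(sample_power_w: Callable[[], Optional[float]],
                   call: Callable[[], T],
                   fallback_w: float) -> Tuple[T, dict]:
    """Run `call` between two power samples and return (result, energy record)."""
    p0 = sample_power_w()
    t0 = time.monotonic()
    result = call()
    duration_s = time.monotonic() - t0
    p1 = sample_power_w()
    measured = p0 is not None and p1 is not None
    # Assumption: if either bracket sample is missing, use the data-sheet figure.
    power_w = (p0 + p1) / 2 if measured else fallback_w
    record = {
        "power_w": power_w,
        "duration_s": duration_s,
        "joules": power_w * duration_s,
        "measured": measured,
    }
    return result, record
```

Two endpoint samples cannot see power spikes in the middle of a long call, which is why this path is an approximation rather than a windowed mean.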

	---

	## Hardware profiles (`app/emissions.HARDWARE`)

	The fallback path uses a sustained-power figure from the hardware
	data sheet when no real measurement is available:

	| Key | Label | Sustained W | Source |
	|---|---|---|---|
	| `nvidia_l4` | NVIDIA L4 | 60 | L4 data sheet (72 W TGP, Ada Lovelace) |
	| `amd_mi300x` | AMD MI300X | 600 | MI300X data sheet (750 W TDP); used when `RIPRAP_HARDWARE_LABEL=AMD MI300X` |
	| `nvidia_t4` | NVIDIA T4 | 50 | T4 data sheet (70 W max) |
	| `apple_m` | Apple M-series | 20 | ml.energy / community measurements |
	| `cpu_server` | x86 CPU | 30 | Typical sustained server-core load |

	The fallback fires only when the proxy is unreachable, NVML
	initialisation has failed, or the call was streamed. Streamed LLM
	calls are not yet measured precisely; they are bracket-sampled on a
	best-effort basis.
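
The fallback arithmetic can be sketched with the table's values. The dict layout below is an assumption about how `app/emissions.HARDWARE` might be shaped, and `estimated_energy` is a hypothetical helper.

```python
from typing import Tuple

# Sustained-power figures copied from the table above; the dict shape is assumed.
HARDWARE = {
    "nvidia_l4": {"label": "NVIDIA L4", "sustained_w": 60.0},
    "amd_mi300x": {"label": "AMD MI300X", "sustained_w": 600.0},
    "nvidia_t4": {"label": "NVIDIA T4", "sustained_w": 50.0},
    "apple_m": {"label": "Apple M-series", "sustained_w": 20.0},
    "cpu_server": {"label": "x86 CPU", "sustained_w": 30.0},
}


def estimated_energy(hardware_key: str, duration_s: float) -> Tuple[float, float]:
    """Return (joules, wh) for an unmeasured call using the data-sheet wattage."""
    joules = HARDWARE[hardware_key]["sustained_w"] * duration_s
    return joules, joules / 3600.0
```

For example, a 14 s call on the `nvidia_l4` profile would be estimated at 60 W × 14 s = 840 J, or about 0.23 Wh.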

	---

	## End-to-end shape

	```
	Lablab UI Space (cpu-basic, FastAPI + SvelteKit)
	│
	│   Tracker installed per-query in web/main.py:
	│       install(Tracker())
	│
	├── planner — app/llm.py::chat
	│     ├─ GET  /v1/power               (bracket-start)
	│     ├─ POST /v1/chat/completions
	│     └─ GET  /v1/power               (bracket-end)
	│
	├── FSM specialists — app/inference.py::_post
	│     POST /v1/{prithvi-pluvial, terramind, ...}
	│     ← X-GPU-Power-W, X-GPU-Energy-J headers
	│
	└── reconciler — app/llm.py::chat (Mellea-validated)
	      same bracket pattern as planner
	        │
	        ▼
	Tracker.summarize() → emissions block on /api/agent/stream final
	        │
	        ▼
	SvelteKit RunHealthStrip — chip rendered with measured-icon
	```
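
To make the per-query flow concrete, here is a minimal sketch of a tracker that is installed once per request and summarised at the end. Only the names `Tracker`, `install`, and `summarize` come from this document. The `record` method, the `current` helper, the summary keys, and the use of `contextvars` are assumptions for illustration.

```python
import contextvars
from typing import List, Optional


class Tracker:
    """Hypothetical per-query ledger; the real class lives in app/emissions."""

    def __init__(self) -> None:
        self.calls: List[dict] = []

    def record(self, *, joules: float, measured: bool,
               prompt_tokens: int = 0, completion_tokens: int = 0) -> None:
        self.calls.append({
            "joules": joules,
            "measured": measured,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
        })

    def summarize(self) -> dict:
        total_j = sum(c["joules"] for c in self.calls)
        return {
            "wh": total_j / 3600.0,
            "tokens": sum(c["prompt_tokens"] + c["completion_tokens"]
                          for c in self.calls),
            "n_calls": len(self.calls),
            "n_measured": sum(1 for c in self.calls if c["measured"]),
            "calls": list(self.calls),
        }


# Assumption: a context variable keeps one tracker per in-flight query.
_current: contextvars.ContextVar = contextvars.ContextVar(
    "riprap_tracker", default=None)


def install(tracker: Tracker) -> Tracker:
    _current.set(tracker)
    return tracker


def current() -> Optional[Tracker]:
    return _current.get()
```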

	---

	## Verifying

	`scripts/probe_stones_fire.py` runs an end-to-end address query
	against the lablab UI and asserts:

	1. All five Stones fire
	2. No specialist returns the legacy dep-regression strings
	   (`torchvision::nms`, `deps unavailable on this deployment: terratorch`)
	3. The final `emissions` block carries `nvidia_l4` hardware and
	   non-zero tokens

	```bash
	PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600
	```
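
Checks 2 and 3 could be expressed roughly as follows. This is a sketch, not the probe script itself: the function names and the `hardware` / `tokens` keys on the emissions block are hypothetical, and only the two legacy strings and the `nvidia_l4` value are taken from the list above.

```python
from typing import List, Mapping, Sequence

LEGACY_ERRORS = (
    "torchvision::nms",
    "deps unavailable on this deployment: terratorch",
)


def legacy_errors_in(outputs: Sequence[str]) -> List[str]:
    """Return the legacy dep-regression strings found in any specialist output."""
    return [err for err in LEGACY_ERRORS if any(err in out for out in outputs)]


def emissions_ok(block: Mapping[str, object]) -> bool:
    """True when the block names the L4 and reports a non-zero token count."""
    # Key names `hardware` and `tokens` are assumptions for this sketch.
    return block.get("hardware") == "nvidia_l4" and int(block.get("tokens", 0)) > 0
```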

	The first call after a Space restart pays a ~120 s vLLM CUDA-graph
	compile penalty; warm queries land at < 0.5 Wh / ~7 K tokens.

	---

	## Why this matters

	Inference cost is usually invisible. AI tools that publish a
	"green" or "low-energy" claim mostly cite a vendor data sheet or a
	research mean. Riprap reports the actual joules drawn off the
	device under the load of a single user query — auditable down to
	the row.

	The raw ledger is shipped on the SSE `final` event under
	`emissions.calls`, so any consumer (dashboard, billing model,
	reproducibility check) can reuse the data without round-tripping
	back through Riprap.
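
A consumer could pull the ledger out of the stream along these lines. The sketch assumes standard Server-Sent Events framing (`event:` and `data:` lines, with a blank line ending each event) and a JSON payload whose `emissions.calls` key holds the rows. The function name and the sample row fields in the usage note are illustrative.

```python
import json
from typing import Iterable, List, Optional


def calls_from_sse(lines: Iterable[str]) -> Optional[List[dict]]:
    """Return emissions.calls from the `final` event of an SSE stream, if present."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates one event
            if event == "final" and data:
                payload = json.loads("\n".join(data))
                return payload.get("emissions", {}).get("calls")
            event, data = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    return None  # stream ended without a complete `final` event
```

Feeding it the decoded lines of the `/api/agent/stream` response would return the list of call rows, each of which can then be filtered on `measured` or summed independently.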