# Per-query inference energy ledger
Riprap surfaces the energy and token cost of every inference call it
makes during a briefing. The numbers are **measured off the L4 GPU**
when the inference Space is reachable, not data-sheet estimates.
```
5 Stones · 21 fired · 11 evidence cards · 14.0s wall-clock · ✓ 1.4 Wh / 6.9K tok inference
```
The chip on the Findings status row reports total energy (Wh) plus
total tokens. The leading icon discloses how the number was derived:
| Icon | Meaning |
|---|---|
| `✓` | All recorded calls came back with a real NVML reading from the L4 GPU |
| `◐` | Some calls measured, others fell back to the data-sheet estimate |
| `~` | All calls used the data-sheet estimate (proxy unreachable, NVML disabled, or local-only run) |
Hover over the chip for the full breakdown: call count, hardware,
prompt vs completion split, and the method.
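The icon rule in the table can be sketched as a small function over the per-call `measured` flags. This is an illustration only: `measurement_icon` is a hypothetical name, and mapping an empty ledger to `~` is an assumption rather than documented behaviour.

```python
from typing import Sequence


def measurement_icon(measured_flags: Sequence[bool]) -> str:
    """Pick the chip icon from the per-call `measured` flags."""
    if measured_flags and all(measured_flags):
        return "✓"  # every call carried a real NVML reading
    if any(measured_flags):
        return "◐"  # mix of measured and estimated calls
    return "~"      # estimate only (assumed for an empty ledger too)


print(measurement_icon([True, True]))
print(measurement_icon([True, False]))
print(measurement_icon([False, False]))
```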
---
## What's measured vs. what's estimated
| Field | Source |
|---|---|
| `duration_s` | Wall-clock time measured on the client side (`time.monotonic` around each call) |
| `prompt_tokens`, `completion_tokens` | Reported by the model server (LiteLLM `usage` block) for non-stream LLM calls |
| `completion_tokens` (streaming) | Estimated as `len(response_text) / 4` when the backend doesn't surface a final usage block (Ollama path) |
| `power_w` | **Measured**: `nvmlDeviceGetPowerUsage` on the L4 inference Space, sampled every 100 ms, mean of the samples bracketing each call |
| `wh`, `joules` | `joules` = `power_w × duration_s` (when `measured: true`) or `data-sheet_W × duration_s` (when `measured: false`); `wh` = `joules / 3600` |
Each call record on the ledger carries a `measured: bool` flag plus
the exact `power_w` value used so a reviewer can audit any row.
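As a rough sketch of what one ledger row might look like, the snippet below uses the field names from the table. The `CallRecord` class and `estimate_completion_tokens` helper are hypothetical names, not taken from `app/emissions`, and rounding the streaming estimate down to an integer is an assumption.

```python
from dataclasses import dataclass


def estimate_completion_tokens(response_text: str) -> int:
    """Streaming fallback: roughly four characters per token (rounded down)."""
    return len(response_text) // 4


@dataclass
class CallRecord:
    duration_s: float
    prompt_tokens: int
    completion_tokens: int
    power_w: float   # exact wattage used for this row
    measured: bool   # True = NVML reading, False = data-sheet figure

    @property
    def joules(self) -> float:
        return self.power_w * self.duration_s

    @property
    def wh(self) -> float:
        return self.joules / 3600.0  # 1 Wh = 3600 J


row = CallRecord(
    duration_s=4.0,
    prompt_tokens=900,
    completion_tokens=estimate_completion_tokens("x" * 400),
    power_w=45.0,
    measured=True,
)
print(row.completion_tokens, row.joules, row.wh)
```

Keeping `power_w` and `measured` on every row is what lets a reviewer recompute `joules` and `wh` independently.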
---
## How the measurement works
The L4 inference Space (`msradam/riprap-vllm`) runs a FastAPI proxy
in front of vLLM (port 8000) and the riprap-models EO service
(port 7861). The proxy initialises NVML at startup and runs a
background sampler that reads `nvmlDeviceGetPowerUsage` every
100 ms into a 60-second ring buffer.
```
inference-vllm/proxy.py::_power_sampler
├── NVML init at startup, single L4 device handle
├── 100 ms ring buffer (600 samples = 60 s of history)
└── degrades to no-op if NVML init fails
```
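A minimal sketch of such a sampler, assuming a `deque`-backed ring buffer and a background thread, is shown below. `PowerSampler` and `read_watts` are hypothetical names. On the real proxy the reader would presumably wrap `nvmlDeviceGetPowerUsage`, which reports milliwatts, so the value would need dividing by 1000; a constant stub is used here so the example runs without a GPU.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, Tuple


class PowerSampler:
    """Background sampler writing (timestamp, watts) into a fixed-size ring buffer."""

    def __init__(self, read_watts: Callable[[], float],
                 interval_s: float = 0.1, history_s: float = 60.0) -> None:
        self._read_watts = read_watts
        self._interval_s = interval_s
        # 60 s / 100 ms = 600 slots; the deque drops the oldest sample when full.
        self.samples: Deque[Tuple[float, float]] = deque(
            maxlen=round(history_s / interval_s))
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self.samples.append((time.monotonic(), self._read_watts()))
            except Exception:
                return  # degrade to a no-op if the reader fails
            self._stop.wait(self._interval_s)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


# Stub reader and a tiny buffer so the demo finishes quickly.
sampler = PowerSampler(read_watts=lambda: 42.0, interval_s=0.01, history_s=0.05)
sampler.start()
time.sleep(0.2)
sampler.stop()
print(len(sampler.samples), sampler.samples.maxlen)
```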
When the proxy forwards a POST to vLLM or riprap-models, it stamps
the upstream call window `(t0, t1)` and computes the mean power
across the samples that fall inside that window. The result lands
on the response as headers:
```
X-GPU-Power-W mean draw in watts
X-GPU-Energy-J energy in joules over the window
X-GPU-Duration-S forwarded-call duration in seconds
X-GPU-Device "NVIDIA L4"
```
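The window averaging can be sketched as follows, assuming the ring buffer holds `(timestamp, watts)` pairs. `window_headers` is a hypothetical helper, and the number formatting and the choice to return `None` for an empty window are assumptions, not the proxy's documented behaviour.

```python
from typing import Dict, Iterable, Optional, Tuple


def window_headers(samples: Iterable[Tuple[float, float]],
                   t0: float, t1: float,
                   device: str = "NVIDIA L4") -> Optional[Dict[str, str]]:
    """Mean power over [t0, t1] from (timestamp, watts) samples, as headers."""
    in_window = [w for (t, w) in samples if t0 <= t <= t1]
    if not in_window:
        return None  # no samples in the window: caller would fall back
    mean_w = sum(in_window) / len(in_window)
    duration_s = t1 - t0
    return {
        "X-GPU-Power-W": f"{mean_w:.2f}",
        "X-GPU-Energy-J": f"{mean_w * duration_s:.2f}",
        "X-GPU-Duration-S": f"{duration_s:.3f}",
        "X-GPU-Device": device,
    }


samples = [(0.0, 20.0), (0.1, 40.0), (0.2, 60.0), (0.3, 20.0)]
print(window_headers(samples, t0=0.1, t1=0.2))
```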
`app/inference.py::_post()` reads those headers off the proxy
response and forwards them into `emissions.Tracker.record_ml`. The
tracker stamps `measured=True` and uses the exact joule value.
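On the client side, reading those headers might look like the sketch below. `read_gpu_headers` is a hypothetical helper, not the actual `_post()` code, and treating missing or malformed headers as "no measurement" is an assumption.

```python
from typing import Mapping, Optional, Tuple


def read_gpu_headers(headers: Mapping[str, str]) -> Optional[Tuple[float, float]]:
    """Return (power_w, joules) from the proxy headers, or None if absent or malformed."""
    try:
        return float(headers["X-GPU-Power-W"]), float(headers["X-GPU-Energy-J"])
    except (KeyError, ValueError):
        return None


print(read_gpu_headers({"X-GPU-Power-W": "48.5", "X-GPU-Energy-J": "97.0"}))
print(read_gpu_headers({}))
```

A plain `dict` is case-sensitive, whereas the header mappings of common HTTP clients are usually not, so the exact lookup keys matter only in this standalone form.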
For the LLM client path (`app/llm.py::chat()`) we route through
LiteLLM, which doesn't surface response headers. So instead the
client brackets the call with two GETs to `/v1/power`:
```python
p0 = _sample_gpu_power_w() # ~50 ms, returns 1 s avg
t0 = time.monotonic()
resp = _router.completion(...) # the actual LLM call
duration_s = time.monotonic() - t0
p1 = _sample_gpu_power_w() # ~50 ms, returns 1 s avg
avg = (p0 + p1) / 2
```
`avg` approximates the average power during the call from its two
endpoint samples; `avg × duration_s` gives joules. The tracker records
`power_w_real=avg`, `joules_real=avg×duration_s`, and `measured=True`.
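The bracket arithmetic can be checked in isolation with a small helper. `bracket_energy` is a hypothetical name, and returning `None` when either sample is missing is an assumption about how an unreachable `/v1/power` endpoint would be handled.

```python
from typing import Optional, Tuple


def bracket_energy(p0: Optional[float], p1: Optional[float],
                   duration_s: float) -> Optional[Tuple[float, float]]:
    """Return (avg_watts, joules) from two bracket samples, or None if one is missing."""
    if p0 is None or p1 is None:
        return None  # assumption: a missing sample means "use the estimate"
    avg = (p0 + p1) / 2
    return avg, avg * duration_s


avg_w, joules = bracket_energy(30.0, 50.0, duration_s=6.0)
print(avg_w, joules, joules / 3600)  # watts, joules, watt-hours
```

With 30 W before and 50 W after a 6 s call, the average is 40 W and the energy is 240 J, or about 0.067 Wh.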
---
## Hardware profiles (`app/emissions.HARDWARE`)
The fallback path uses a sustained-power figure from the hardware
data sheet when no real measurement is available:
| Key | Label | Sustained W | Source |
|---|---|---|---|
| `nvidia_l4` | NVIDIA L4 | 60 | L4 data sheet (72 W TGP, Ada Lovelace) |
| `amd_mi300x` | AMD MI300X | 600 | MI300X data sheet (750 W TDP); used when `RIPRAP_HARDWARE_LABEL=AMD MI300X` |
| `nvidia_t4` | NVIDIA T4 | 50 | T4 data sheet (70 W max) |
| `apple_m` | Apple M-series | 20 | ml.energy / community measurements |
| `cpu_server` | x86 CPU | 30 | Typical sustained server-core load |
The fallback fires only when the proxy is unreachable, NVML
initialisation has failed, or the call was streamed (streamed LLM
calls are not yet measured precisely; they are bracket-sampled on a
best-effort basis).
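To make the fallback arithmetic concrete, here is a sketch that uses the sustained-power figures from the table. The dictionary shape and the `estimate_wh` helper are assumptions for illustration; the real `app/emissions.HARDWARE` structure may differ.

```python
# Sustained-power figures copied from the table; the dict shape is assumed.
HARDWARE_SUSTAINED_W = {
    "nvidia_l4": 60.0,
    "amd_mi300x": 600.0,
    "nvidia_t4": 50.0,
    "apple_m": 20.0,
    "cpu_server": 30.0,
}


def estimate_wh(hardware_key: str, duration_s: float) -> float:
    """Data-sheet estimate: sustained watts x seconds, converted from joules to Wh."""
    joules = HARDWARE_SUSTAINED_W[hardware_key] * duration_s
    return joules / 3600.0


print(estimate_wh("nvidia_l4", 14.0))
```

A 14 s call on the L4 profile is estimated at 60 W × 14 s = 840 J, roughly 0.23 Wh.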
---
## End-to-end shape
```
Lablab UI Space (cpu-basic, FastAPI + SvelteKit)
│
│   Tracker installed per-query in web/main.py:
│       install(Tracker())
│
├── planner: app/llm.py::chat
│     ├─ GET  /v1/power   (bracket-start)
│     ├─ POST /v1/chat/completions
│     └─ GET  /v1/power   (bracket-end)
│
├── FSM specialists: app/inference.py::_post
│     POST /v1/{prithvi-pluvial, terramind, ...}
│     ← X-GPU-Power-W, X-GPU-Energy-J headers
│
└── reconciler: app/llm.py::chat (Mellea-validated)
      same bracket pattern as planner
│
▼
Tracker.summarize() → emissions block on /api/agent/stream final
│
▼
SvelteKit RunHealthStrip: chip rendered with measured-icon
```
---
## Verifying
`scripts/probe_stones_fire.py` runs an end-to-end address query
against the lablab UI and asserts:
1. All five Stones fire
2. No specialist returns the legacy dep-regression strings
(`torchvision::nms`, `deps unavailable on this deployment:
terratorch`)
3. The final `emissions` block carries `nvidia_l4` hardware and
non-zero tokens
```bash
PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600
```
The first call after a Space restart pays a ~120 s vLLM CUDA-graph
compile penalty; warm queries land at < 0.5 Wh / ~7 K tokens.
---
## Why this matters
Inference cost is usually invisible. AI tools that publish a
"green" or "low-energy" claim mostly cite a vendor data sheet or a
research mean. Riprap reports the actual joules drawn off the
device under the load of a single user query, auditable down to
the row.
The raw ledger is shipped on the SSE `final` event under
`emissions.calls`, so any consumer (dashboard, billing model,
reproducibility check) can reuse the data without round-tripping
back through Riprap.
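A downstream consumer could aggregate the ledger along these lines. The payload below is hypothetical: the field names follow the ledger table in this document, but the exact JSON shape of the `final` event and all values are assumptions for illustration.

```python
import json

# Hypothetical payload: field names follow the ledger table, values are made up.
final_event = json.loads("""
{"emissions": {"calls": [
  {"duration_s": 3.0, "power_w": 48.0, "joules": 144.0, "measured": true,
   "prompt_tokens": 1200, "completion_tokens": 300},
  {"duration_s": 2.0, "power_w": 60.0, "joules": 120.0, "measured": false,
   "prompt_tokens": 800, "completion_tokens": 200}
]}}
""")

calls = final_event["emissions"]["calls"]
total_wh = sum(c["joules"] for c in calls) / 3600.0
total_tokens = sum(c["prompt_tokens"] + c["completion_tokens"] for c in calls)
measured_share = sum(c["measured"] for c in calls) / len(calls)

print(f"{total_wh:.3f} Wh, {total_tokens} tokens, {measured_share:.0%} measured")
```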