# Per-query inference energy ledger

Riprap surfaces the energy and token cost of every inference call it makes during a briefing. The numbers are **measured off the L4 GPU** when the inference Space is reachable, not taken from data-sheet estimates.

```
5 Stones · 21 fired · 11 evidence cards · 14.0s wall-clock · ✓ 1.4 Wh / 6.9K tok inference
```

The chip on the Findings status row reports total energy (Wh) plus total tokens. The leading icon discloses how the number was derived:

| Icon | Meaning |
|---|---|
| `✓` | All recorded calls came back with a real NVML reading from the L4 GPU |
| `◐` | Some calls measured, others fell back to the data-sheet estimate |
| `~` | All calls used the data-sheet estimate (proxy unreachable, NVML disabled, or local-only run) |

Hover the chip for the full breakdown: call count, hardware, prompt vs completion split, and the method.

---

## What's measured vs. what's estimated

| Field | Source |
|---|---|
| `duration_s` | Real wall-clock time on the client side (`time.monotonic` around each call) |
| `prompt_tokens`, `completion_tokens` | Reported by the model server (LiteLLM `usage` block) for non-stream LLM calls |
| `completion_tokens` (streaming) | Estimated as `len(response_text) / 4` when the backend doesn't surface a final usage block (Ollama path) |
| `power_w` | **Measured**: `nvmlDeviceGetPowerUsage` on the L4 inference Space, sampled every 100 ms, mean of samples bracketing each call |
| `wh`, `joules` | `joules` is `power_w × duration_s` (when `measured: true`) or `data-sheet_W × duration_s` (when `measured: false`); `wh` is `joules / 3600` |

Each call record on the ledger carries a `measured: bool` flag plus the exact `power_w` value used, so a reviewer can audit any row.

---

## How the measurement works

The L4 inference Space (`msradam/riprap-vllm`) runs a FastAPI proxy in front of vLLM (port 8000) and the riprap-models EO service (port 7861).
The proxy initialises NVML at startup and runs a background sampler that reads `nvmlDeviceGetPowerUsage` every 100 ms into a 60-second ring buffer.

```
inference-vllm/proxy.py::_power_sampler
  ├── NVML init at startup, single L4 device handle
  ├── 100 ms ring buffer (600 samples = 60 s of history)
  └── degrades to no-op if NVML init fails
```

When the proxy forwards a POST to vLLM or riprap-models, it stamps the upstream call window `(t0, t1)` and computes the mean power across the samples that fall inside that window. The result lands on the response as headers:

```
X-GPU-Power-W       mean draw in watts
X-GPU-Energy-J      energy in joules over the window
X-GPU-Duration-S    forwarded-call duration in seconds
X-GPU-Device        "NVIDIA L4"
```

`app/inference.py::_post()` reads those headers off the proxy response and forwards them into `emissions.Tracker.record_ml`. The tracker stamps `measured=True` and uses the exact joule value.

The LLM client path (`app/llm.py::chat()`) routes through LiteLLM, which doesn't surface response headers. Instead, the client brackets the call with two GETs to `/v1/power`:

```python
p0 = _sample_gpu_power_w()           # ~50 ms, returns 1 s avg
t0 = time.monotonic()
resp = _router.completion(...)       # the actual LLM call
duration_s = time.monotonic() - t0
p1 = _sample_gpu_power_w()           # ~50 ms, returns 1 s avg
avg = (p0 + p1) / 2
```

`avg` approximates the mean power during the call as the midpoint of the two bracketing 1 s averages; `avg × duration_s` gives joules. The tracker records `power_w_real=avg`, `joules_real=avg×duration_s`, and `measured=True`.
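The window-mean step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in `inference-vllm/proxy.py`: the names `RING_SIZE`, `make_ring_buffer`, `mean_power_w`, and `energy_joules`, and the `(timestamp, watts)` sample layout, are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Tuple

# Hypothetical sample layout: (timestamp in seconds, power in watts).
Sample = Tuple[float, float]

# 100 ms sampling with 60 s of history gives 600 samples, as in the text.
RING_SIZE = 600


def make_ring_buffer() -> Deque[Sample]:
    """Bounded buffer: once full, each append drops the oldest sample."""
    return deque(maxlen=RING_SIZE)


def mean_power_w(samples: Iterable[Sample], t0: float, t1: float) -> Optional[float]:
    """Mean power over samples whose timestamp falls inside [t0, t1].

    Returns None when no sample lands in the window, so the caller can
    fall back to an estimate instead of reporting a bogus measurement.
    """
    in_window = [watts for (ts, watts) in samples if t0 <= ts <= t1]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


def energy_joules(samples: Iterable[Sample], t0: float, t1: float) -> Optional[float]:
    """Energy over the call window: mean watts times window length in seconds."""
    mean_w = mean_power_w(samples, t0, t1)
    if mean_w is None:
        return None
    return mean_w * (t1 - t0)
```

Returning `None` for an empty window is one possible way to keep a missing measurement distinguishable from a zero reading; how the real proxy handles that case is not stated here.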

---

## Hardware profiles (`app/emissions.HARDWARE`)

The fallback path uses a sustained-power figure from the hardware data sheet when no real measurement is available:

| Key | Label | Sustained W | Source |
|---|---|---|---|
| `nvidia_l4` | NVIDIA L4 | 60 | L4 data sheet (72 W TGP, Ada Lovelace) |
| `amd_mi300x` | AMD MI300X | 600 | MI300X data sheet (750 W TDP); used when `RIPRAP_HARDWARE_LABEL=AMD MI300X` |
| `nvidia_t4` | NVIDIA T4 | 50 | T4 data sheet (70 W max) |
| `apple_m` | Apple M-series | 20 | ml.energy / community measurements |
| `cpu_server` | x86 CPU | 30 | Typical sustained server-core load |

The fallback only fires when the proxy is unreachable, NVML init fails, or the call is streamed (streamed LLM calls are not yet measured precisely; they bracket-sample as best-effort).

---

## End-to-end shape

```
Lablab UI Space (cpu-basic, FastAPI + SvelteKit)
  │
  │  Tracker installed per-query in web/main.py:
  │      install(Tracker())
  │
  ├── planner: app/llm.py::chat
  │     ├─ GET  /v1/power  (bracket-start)
  │     ├─ POST /v1/chat/completions
  │     └─ GET  /v1/power  (bracket-end)
  │
  ├── FSM specialists: app/inference.py::_post
  │     POST /v1/{prithvi-pluvial, terramind, ...}
  │     ← X-GPU-Power-W, X-GPU-Energy-J headers
  │
  └── reconciler: app/llm.py::chat (Mellea-validated)
        same bracket pattern as planner
  │
  ▼
Tracker.summarize()  →  emissions block on /api/agent/stream final
  │
  ▼
SvelteKit RunHealthStrip: chip rendered with measured-icon
```

---

## Verifying

`scripts/probe_stones_fire.py` runs an end-to-end address query against the lablab UI and asserts:

1. All five Stones fire
2. No specialist returns the legacy dep-regression strings (`torchvision::nms`, `deps unavailable on this deployment: terratorch`)
3. The final `emissions` block carries `nvidia_l4` hardware and non-zero tokens

```bash
PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600
```

The first call after a Space restart pays a ~120 s vLLM CUDA-graph compile penalty; warm queries land at < 0.5 Wh / ~7 K tokens.

---

## Why this matters

Inference cost is usually invisible. AI tools that publish a "green" or "low-energy" claim mostly cite a vendor data sheet or a research mean. Riprap reports the actual joules drawn off the device under the load of a single user query, auditable down to the row.

The raw ledger is shipped on the SSE `final` event under `emissions.calls`, so any consumer (dashboard, billing model, reproducibility check) can reuse the data without round-tripping back through Riprap.
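
As an illustration of that reuse, a consumer could roll the ledger up into the same totals and icon the chip shows. This is a sketch under assumptions rather than Riprap's own summariser: it assumes each entry in `emissions.calls` is a dict keyed by the field names from the table earlier (`joules`, `measured`, `prompt_tokens`, `completion_tokens`), and the function name `summarize_calls` is hypothetical.

```python
from typing import Any, Dict, List


def summarize_calls(calls: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate call records into totals plus a provenance icon.

    Assumed record shape (hypothetical): a dict with `joules`, `measured`,
    and optionally `prompt_tokens` / `completion_tokens`.
    """
    total_joules = sum(c["joules"] for c in calls)
    # Token fields are treated as optional; a missing field counts as 0.
    total_tokens = sum(
        c.get("prompt_tokens", 0) + c.get("completion_tokens", 0) for c in calls
    )
    measured_count = sum(1 for c in calls if c["measured"])

    if calls and measured_count == len(calls):
        icon = "✓"   # every call carried a real reading
    elif measured_count > 0:
        icon = "◐"   # mix of measured and estimated calls
    else:
        icon = "~"   # estimates only (or no calls at all)

    return {
        "wh": total_joules / 3600.0,   # 1 Wh = 3600 J
        "tokens": total_tokens,
        "measured_calls": measured_count,
        "calls": len(calls),
        "icon": icon,
    }
```

Treating an empty ledger as `~` is a choice made for this sketch; the source does not say how Riprap labels a run with no recorded calls.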