| # Per-query inference energy ledger |
|
|
| Riprap surfaces the energy and token cost of every inference call it |
| makes during a briefing. The numbers are **measured off the L4 GPU** |
| when the inference Space is reachable, not data-sheet estimates. |
|
|
| ``` |
| 5 Stones · 21 fired · 11 evidence cards · 14.0s wall-clock · β 1.4 Wh / 6.9K tok inference |
| ``` |
|
|
| The chip on the Findings status row reports total energy (Wh) plus |
| total tokens. The leading icon discloses how the number was derived: |
|
|
| | Icon | Meaning | |
| |---|---| |
| | `β` | All recorded calls came back with a real NVML reading from the L4 GPU | |
| | `β` | Some calls measured, others fell back to the data-sheet estimate | |
| | `~` | All calls used the data-sheet estimate (proxy unreachable, NVML disabled, or local-only run) | |
|
|
| Hover over the chip for the full breakdown: call count, hardware, prompt |
| vs. completion split, and the method. |
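
The icon choice described above reduces to a three-way classification over the per-call `measured` flags. A minimal sketch, assuming each call record is a mapping with a boolean `measured` key; the function name and the returned status strings are hypothetical and stand in for the actual icons:

```python
from typing import Iterable, Mapping


def derivation_status(calls: Iterable[Mapping]) -> str:
    """Classify a ledger by how its energy numbers were derived.

    Returns "measured" when every call carries a real reading, "mixed" when
    only some do, and "estimated" when none do. An empty ledger is treated
    as "estimated" (an assumption, not confirmed by the source).
    """
    flags = [bool(call.get("measured", False)) for call in calls]
    if flags and all(flags):
        return "measured"
    if any(flags):
        return "mixed"
    return "estimated"
```

A UI layer could then map each status string to the corresponding chip icon.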
|
|
| --- |
|
|
| ## What's measured vs. what's estimated |
|
|
| | Field | Source | |
| |---|---| |
| `duration_s` | Wall-clock time measured on the client side (`time.monotonic` around each call) | |
| | `prompt_tokens`, `completion_tokens` | Reported by the model server (LiteLLM `usage` block) for non-stream LLM calls | |
| | `completion_tokens` (streaming) | Estimated as `len(response_text) / 4` when the backend doesn't surface a final usage block (Ollama path) | |
| `power_w` | **Measured**: `nvmlDeviceGetPowerUsage` on the L4 inference Space, sampled every 100 ms, mean of samples bracketing each call | |
| `wh`, `joules` | `joules = power_w × duration_s` (when `measured: true`) or `data-sheet_W × duration_s` (when `measured: false`); `wh = joules / 3600` | |
|
|
| Each call record on the ledger carries a `measured: bool` flag plus |
| the exact `power_w` value used so a reviewer can audit any row. |
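
To make the field relationships concrete, here is a minimal sketch of a call record. The class name `CallRecord` and the helper `estimate_completion_tokens` are hypothetical; only the field names come from the table above. The streaming estimate uses integer division, which is an assumption about how `len(response_text) / 4` is rounded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One ledger row (hypothetical shape, field names from the table)."""

    duration_s: float
    power_w: float          # measured mean draw, or the data-sheet figure
    measured: bool          # True when power_w came from a real reading
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def joules(self) -> float:
        # Energy in joules is power (W) times time (s).
        return self.power_w * self.duration_s

    @property
    def wh(self) -> float:
        # One watt-hour is 3600 joules.
        return self.joules / 3600.0


def estimate_completion_tokens(response_text: str) -> int:
    """Rough streaming fallback: about four characters per token."""
    return len(response_text) // 4
```

Because `joules` and `wh` are derived from `power_w` and `duration_s`, a reviewer can recompute any row from the two stored inputs.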
|
|
| --- |
|
|
| ## How the measurement works |
|
|
| The L4 inference Space (`msradam/riprap-vllm`) runs a FastAPI proxy |
| in front of vLLM (port 8000) and the riprap-models EO service |
| (port 7861). The proxy initialises NVML at startup and runs a |
| background sampler that reads `nvmlDeviceGetPowerUsage` every |
| 100 ms into a 60-second ring buffer. |
|
|
| ``` |
| inference-vllm/proxy.py::_power_sampler |
| ├── NVML init at startup, single L4 device handle |
| ├── 100 ms ring buffer (600 samples = 60 s of history) |
| └── degrades to no-op if NVML init fails |
| ``` |
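
The sampler outlined above can be sketched with a bounded `deque` and a daemon thread. This is an illustration under stated assumptions, not the code in `proxy.py`: the class name `PowerSampler` and the injected `read_watts` callable are hypothetical. In a real proxy the callable would wrap NVML, whose `nvmlDeviceGetPowerUsage` reports milliwatts, so the wrapper would divide by 1000.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, Tuple


class PowerSampler:
    """Background sampler writing (timestamp, watts) into a ring buffer."""

    def __init__(
        self,
        read_watts: Callable[[], float],
        interval_s: float = 0.1,
        history_s: float = 60.0,
    ) -> None:
        self._read = read_watts
        self._interval = interval_s
        # 60 s of history at 100 ms per sample gives 600 slots.
        self.samples: Deque[Tuple[float, float]] = deque(
            maxlen=round(history_s / interval_s)
        )
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self.samples.append((time.monotonic(), self._read()))
            except Exception:
                # Degrade to a no-op if the reading fails.
                return
            self._stop.wait(self._interval)
```

Once the buffer is full, `deque(maxlen=...)` discards the oldest sample on each append, so memory use stays constant.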
|
|
| When the proxy forwards a POST to vLLM or riprap-models, it stamps |
| the upstream call window `(t0, t1)` and computes the mean power |
| across the samples that fall inside that window. The result lands |
| on the response as headers: |
|
|
| ``` |
| X-GPU-Power-W mean draw in watts |
| X-GPU-Energy-J energy in joules over the window |
| X-GPU-Duration-S forwarded-call duration in seconds |
| X-GPU-Device "NVIDIA L4" |
| ``` |
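
The window computation and the resulting headers can be sketched as follows. The function names and the numeric formatting are assumptions; only the header names come from the list above. Samples are `(timestamp, watts)` pairs as a ring-buffer sampler would store them.

```python
from typing import Dict, Iterable, Optional, Tuple


def window_mean_watts(
    samples: Iterable[Tuple[float, float]], t0: float, t1: float
) -> Optional[float]:
    """Mean of the samples whose timestamp falls inside [t0, t1]."""
    inside = [watts for ts, watts in samples if t0 <= ts <= t1]
    if not inside:
        return None
    return sum(inside) / len(inside)


def gpu_headers(
    samples: Iterable[Tuple[float, float]],
    t0: float,
    t1: float,
    device: str = "NVIDIA L4",
) -> Dict[str, str]:
    """Build the X-GPU-* headers, or no headers when no sample is available."""
    mean_w = window_mean_watts(samples, t0, t1)
    if mean_w is None:
        return {}
    duration_s = t1 - t0
    return {
        "X-GPU-Power-W": f"{mean_w:.2f}",
        "X-GPU-Energy-J": f"{mean_w * duration_s:.2f}",
        "X-GPU-Duration-S": f"{duration_s:.3f}",
        "X-GPU-Device": device,
    }
```

Returning an empty mapping when the window holds no samples lets the client detect the missing headers and fall back to an estimate.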
|
|
| `app/inference.py::_post()` reads those headers off the proxy |
| response and forwards them into `emissions.Tracker.record_ml`. The |
| tracker stamps `measured=True` and uses the exact joule value. |
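
On the client side, reading those headers might look like the sketch below. The function name `parse_gpu_headers` and the returned dictionary shape are hypothetical, and the real `record_ml` signature is not shown here. A plain `dict` is case-sensitive, whereas the header mappings of common HTTP clients are case-insensitive.

```python
from typing import Mapping, Optional


def parse_gpu_headers(headers: Mapping[str, str]) -> Optional[dict]:
    """Extract a measured reading from X-GPU-* headers, or None if absent."""
    try:
        return {
            "power_w": float(headers["X-GPU-Power-W"]),
            "joules": float(headers["X-GPU-Energy-J"]),
            "duration_s": float(headers["X-GPU-Duration-S"]),
            "device": headers.get("X-GPU-Device", ""),
            "measured": True,
        }
    except (KeyError, ValueError):
        # Missing or malformed headers: the caller should use the fallback.
        return None
```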
|
|
| For the LLM client path (`app/llm.py::chat()`) we route through |
| LiteLLM, which doesn't surface response headers. So instead the |
| client brackets the call with two GETs to `/v1/power`: |
|
|
| ```python |
| p0 = _sample_gpu_power_w() # ~50 ms, returns 1 s avg |
| t0 = time.monotonic() |
| resp = _router.completion(...) # the actual LLM call |
| duration_s = time.monotonic() - t0 |
| p1 = _sample_gpu_power_w() # ~50 ms, returns 1 s avg |
| avg = (p0 + p1) / 2 |
| ``` |
|
|
| `avg` approximates the mean power during the call; `avg × duration_s` |
| gives joules. The tracker records `power_w_real=avg`, |
| `joules_real=avg × duration_s`, and `measured=True`. |
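
The bracket pattern generalises to a small wrapper around any call. A minimal sketch, assuming the power sampler returns watts or `None` when the endpoint is unreachable; the names `bracketed_call` and `sample_power_w` are hypothetical:

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def bracketed_call(
    call: Callable[[], T],
    sample_power_w: Callable[[], Optional[float]],
) -> Tuple[T, dict]:
    """Run `call` between two power samples and derive an energy reading."""
    p0 = sample_power_w()
    t0 = time.monotonic()
    result = call()
    duration_s = time.monotonic() - t0
    p1 = sample_power_w()

    if p0 is None or p1 is None:
        # No usable reading at one end: leave the row for the fallback path.
        return result, {"duration_s": duration_s, "measured": False}

    avg = (p0 + p1) / 2  # two-point approximation of mean draw
    return result, {
        "duration_s": duration_s,
        "power_w": avg,
        "joules": avg * duration_s,
        "measured": True,
    }
```

Taking the duration before the second sample keeps the sampling round trip out of the timed window, matching the ordering in the snippet above.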
|
|
| --- |
|
|
| ## Hardware profiles (`app/emissions.HARDWARE`) |
|
|
| The fallback path uses a sustained-power figure from the hardware |
| data sheet when no real measurement is available: |
|
|
| | Key | Label | Sustained W | Source | |
| |---|---|---|---| |
| | `nvidia_l4` | NVIDIA L4 | 60 | L4 data sheet (72 W TGP, Ada Lovelace) | |
| | `amd_mi300x` | AMD MI300X | 600 | MI300X data sheet (750 W TDP); used when `RIPRAP_HARDWARE_LABEL=AMD MI300X` | |
| | `nvidia_t4` | NVIDIA T4 | 50 | T4 data sheet (70 W max) | |
| | `apple_m` | Apple M-series | 20 | ml.energy / community measurements | |
| | `cpu_server` | x86 CPU | 30 | Typical sustained server-core load | |
|
|
| The fallback fires only when the proxy is unreachable, NVML init |
| failed, or the call streamed (streamed LLM calls are not yet |
| measured precisely; they are bracket-sampled on a best-effort basis). |
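
As a worked example of the fallback arithmetic, the sketch below multiplies the sustained-power figure from the table by the call duration. The dictionary name and function are hypothetical; the actual structure of `app/emissions.HARDWARE` may differ.

```python
# Sustained-power figures in watts, copied from the table above.
HARDWARE_SUSTAINED_W = {
    "nvidia_l4": 60.0,
    "amd_mi300x": 600.0,
    "nvidia_t4": 50.0,
    "apple_m": 20.0,
    "cpu_server": 30.0,
}


def fallback_energy(hardware_key: str, duration_s: float) -> dict:
    """Estimate energy from the data-sheet figure when nothing was measured."""
    power_w = HARDWARE_SUSTAINED_W[hardware_key]
    joules = power_w * duration_s
    return {
        "power_w": power_w,
        "joules": joules,
        "wh": joules / 3600.0,  # 1 Wh = 3600 J
        "measured": False,
    }
```

For instance, a 60-second call on the `nvidia_l4` profile is estimated at 60 W × 60 s = 3600 J, or 1 Wh.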
|
|
| --- |
|
|
| ## End-to-end shape |
|
|
| ``` |
| Lablab UI Space (cpu-basic, FastAPI + SvelteKit) |
|   │ |
|   │  Tracker installed per-query in web/main.py: |
|   │      install(Tracker()) |
|   │ |
|   ├── planner → app/llm.py::chat |
|   │     ├─ GET  /v1/power (bracket-start) |
|   │     ├─ POST /v1/chat/completions |
|   │     └─ GET  /v1/power (bracket-end) |
|   │ |
|   ├── FSM specialists → app/inference.py::_post |
|   │     POST /v1/{prithvi-pluvial, terramind, ...} |
|   │       ← X-GPU-Power-W, X-GPU-Energy-J headers |
|   │ |
|   └── reconciler → app/llm.py::chat (Mellea-validated) |
|         same bracket pattern as planner |
|   │ |
|   ▼ |
| Tracker.summarize() → emissions block on /api/agent/stream final |
|   │ |
|   ▼ |
| SvelteKit RunHealthStrip → chip rendered with measured-icon |
|
|
| --- |
|
|
| ## Verifying |
|
|
| `scripts/probe_stones_fire.py` runs an end-to-end address query |
| against the Lablab UI and asserts: |
|
|
| 1. All five Stones fire |
| 2. No specialist returns the legacy dep-regression strings |
| (`torchvision::nms`, `deps unavailable on this deployment: |
| terratorch`) |
| 3. The final `emissions` block carries `nvidia_l4` hardware and |
| non-zero tokens |
|
|
| ```bash |
| PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600 |
| ``` |
|
|
| The first call after a Space restart pays a ~120 s vLLM CUDA-graph |
| compile penalty; warm queries land at < 0.5 Wh / ~7 K tokens. |
|
|
| --- |
|
|
| ## Why this matters |
|
|
| Inference cost is usually invisible. AI tools that publish a |
| "green" or "low-energy" claim mostly cite a vendor data sheet or a |
| research mean. Riprap reports the actual joules drawn off the |
| device under the load of a single user query, auditable down to |
| the row. |
|
|
| The raw ledger is shipped on the SSE `final` event under |
| `emissions.calls`, so any consumer (dashboard, billing model, |
| reproducibility check) can reuse the data without round-tripping |
| back through Riprap. |
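
A consumer of that event might extract and total the ledger as sketched below. The frame layout follows the standard Server-Sent Events text format; the payload shape beyond `emissions.calls` and the per-call `wh` field is an assumption, and both function names are hypothetical.

```python
import json
from typing import List


def emissions_calls_from_sse(frame: str) -> List[dict]:
    """Return emissions.calls from one SSE frame if it is a `final` event."""
    event = "message"  # SSE default when no event field is present
    data_lines = []
    for line in frame.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    if event != "final" or not data_lines:
        return []
    payload = json.loads("\n".join(data_lines))
    return payload.get("emissions", {}).get("calls", [])


def total_wh(calls: List[dict]) -> float:
    """Sum the per-call watt-hours, treating missing values as zero."""
    return sum(call.get("wh", 0.0) for call in calls)
```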
|
|