Per-query inference energy ledger

Riprap surfaces the energy and token cost of every inference call it makes during a briefing. When the inference Space is reachable, the numbers are measured on the L4 GPU rather than taken from data-sheet estimates.

5 Stones · 21 fired · 11 evidence cards · 14.0s wall-clock · ✓ 1.4 Wh / 6.9K tok inference

The chip on the Findings status row reports total energy (Wh) plus total tokens. The leading icon discloses how the number was derived:

| Icon | Meaning |
| --- | --- |
| ✓ | All recorded calls came back with a real NVML reading from the L4 GPU |
| ◐ | Some calls measured, others fell back to the data-sheet estimate |
| ~ | All calls used the data-sheet estimate (proxy unreachable, NVML disabled, or local-only run) |

Hover over the chip for the full breakdown: call count, hardware, prompt vs. completion split, and the method.
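
The icon rule in the table above can be sketched as a small function over the per-call measured flags. This is a minimal illustration, not Riprap's actual code: the name ledger_icon is hypothetical, and treating an empty ledger as "estimated" is an assumption.

```python
def ledger_icon(measured_flags: list[bool]) -> str:
    """Pick the chip icon from the per-call `measured` flags (hypothetical helper)."""
    if measured_flags and all(measured_flags):
        return "✓"  # every recorded call carried a real NVML reading
    if any(measured_flags):
        return "◐"  # some calls measured, others estimated
    # No measured calls at all. An empty ledger also lands here (assumption).
    return "~"


if __name__ == "__main__":
    print(ledger_icon([True, True, False]))
```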


What's measured vs. what's estimated

| Field | Source |
| --- | --- |
| duration_s | Real wall-clock time on the client side (time.monotonic around each call) |
| prompt_tokens, completion_tokens | Reported by the model server (LiteLLM usage block) for non-streaming LLM calls |
| completion_tokens (streaming) | Estimated as len(response_text) / 4 when the backend doesn't surface a final usage block (Ollama path) |
| power_w | Measured: nvmlDeviceGetPowerUsage on the L4 inference Space, sampled every 100 ms, mean of samples bracketing each call |
| wh, joules | power_w × duration_s (when measured: true) or data-sheet_W × duration_s (when measured: false) |

Each call record on the ledger carries a measured: bool flag plus the exact power_w value used so a reviewer can audit any row.
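
As a rough sketch of how one ledger row relates its fields, the following derives joules and Wh from power_w and duration_s and applies the four-characters-per-token estimate for streamed completions. The CallRecord class and estimate_stream_tokens are hypothetical names for illustration, and rounding the token estimate down is an assumption; only the field names come from the table above.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One ledger row (illustrative shape, not the real Tracker record)."""
    duration_s: float
    power_w: float   # measured mean draw, or the data-sheet figure
    measured: bool   # True when power_w came from NVML

    @property
    def joules(self) -> float:
        # 1 W sustained for 1 s is 1 J.
        return self.power_w * self.duration_s

    @property
    def wh(self) -> float:
        # 1 Wh = 3600 J.
        return self.joules / 3600.0


def estimate_stream_tokens(response_text: str) -> int:
    """Approximate completion tokens as len(text) / 4, rounded down (assumption)."""
    return len(response_text) // 4


if __name__ == "__main__":
    row = CallRecord(duration_s=14.0, power_w=60.0, measured=False)
    print(f"{row.joules:.0f} J = {row.wh:.3f} Wh")
```

Keeping power_w and measured on the row is what lets a reviewer recompute joules for any call independently.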


How the measurement works

The L4 inference Space (msradam/riprap-vllm) runs a FastAPI proxy in front of vLLM (port 8000) and the riprap-models EO service (port 7861). The proxy initialises NVML at startup and runs a background sampler that reads nvmlDeviceGetPowerUsage every 100 ms into a 60-second ring buffer.

inference-vllm/proxy.py::_power_sampler
  ├── NVML init at startup, single L4 device handle
  ├── 100 ms ring buffer (600 samples = 60 s of history)
  └── degrades to no-op if NVML init fails
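
A minimal sketch of such a sampler, assuming a bounded deque as the ring buffer. The read_watts callable stands in for the NVML read (nvmlDeviceGetPowerUsage reports milliwatts, so a real reader would divide by 1000); sample_loop and its parameters are hypothetical and the loop is bounded here only so the example terminates.

```python
import time
from collections import deque
from typing import Callable, Optional

SAMPLE_PERIOD_S = 0.1   # 100 ms between reads
RING_SIZE = 600         # 600 samples x 100 ms = 60 s of history


def sample_loop(
    read_watts: Optional[Callable[[], float]],
    ring: deque,
    n_samples: int,
    period_s: float = SAMPLE_PERIOD_S,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Append (timestamp, watts) pairs to the ring; the oldest entries fall off."""
    if read_watts is None:
        # Stand-in for "NVML init failed": the sampler becomes a no-op.
        return
    for _ in range(n_samples):
        ring.append((clock(), read_watts()))
        sleep(period_s)


if __name__ == "__main__":
    ring = deque(maxlen=RING_SIZE)
    sample_loop(lambda: 55.0, ring, n_samples=5)
    print(len(ring), "samples buffered")
```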

When the proxy forwards a POST to vLLM or riprap-models, it stamps the upstream call window (t0, t1) and computes the mean power across the samples that fall inside that window. The result lands on the response as headers:

X-GPU-Power-W      mean draw in watts
X-GPU-Energy-J     energy in joules over the window
X-GPU-Duration-S   forwarded-call duration in seconds
X-GPU-Device       "NVIDIA L4"
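
The window computation can be sketched as follows: select the buffered samples whose timestamps fall inside (t0, t1), average them, and multiply by the window length. The header names come from the list above; the function name, the number formatting, and returning no headers when the window is empty are assumptions.

```python
from typing import Iterable, Tuple


def power_headers(
    samples: Iterable[Tuple[float, float]],  # (timestamp_s, watts) pairs
    t0: float,
    t1: float,
    device: str = "NVIDIA L4",
) -> dict:
    """Build the X-GPU-* headers for one forwarded call window."""
    window = [w for t, w in samples if t0 <= t <= t1]
    if not window:
        return {}  # nothing sampled in the window: emit no headers (assumption)
    mean_w = sum(window) / len(window)
    duration_s = t1 - t0
    return {
        "X-GPU-Power-W": f"{mean_w:.2f}",
        "X-GPU-Energy-J": f"{mean_w * duration_s:.2f}",
        "X-GPU-Duration-S": f"{duration_s:.3f}",
        "X-GPU-Device": device,
    }


if __name__ == "__main__":
    buffered = [(0.0, 40.0), (1.0, 60.0), (2.0, 80.0), (5.0, 10.0)]
    print(power_headers(buffered, t0=0.5, t1=2.5))
```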

app/inference.py::_post() reads those headers off the proxy response and forwards them into emissions.Tracker.record_ml. The tracker stamps measured=True and uses the exact joule value.
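
On the client side, reading those headers might look like the sketch below. The function name and the returned dict shape are hypothetical; the real record_ml signature is not shown in this document. A plain dict is case-sensitive, whereas the header mappings of common HTTP clients are not, so the lookup here assumes the exact casing listed above.

```python
from typing import Mapping, Optional


def read_gpu_headers(headers: Mapping[str, str]) -> Optional[dict]:
    """Return a measured reading from the proxy headers, or None if absent."""
    try:
        return {
            "power_w": float(headers["X-GPU-Power-W"]),
            "joules": float(headers["X-GPU-Energy-J"]),
            "duration_s": float(headers["X-GPU-Duration-S"]),
            "device": headers.get("X-GPU-Device", "unknown"),
            "measured": True,
        }
    except (KeyError, ValueError):
        # Missing or malformed headers: the caller falls back to an estimate.
        return None


if __name__ == "__main__":
    print(read_gpu_headers({
        "X-GPU-Power-W": "61.5",
        "X-GPU-Energy-J": "123.0",
        "X-GPU-Duration-S": "2.0",
        "X-GPU-Device": "NVIDIA L4",
    }))
```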

The LLM client path (app/llm.py::chat()) routes through LiteLLM, which doesn't surface response headers. Instead, the client brackets the call with two GET requests to /v1/power:

p0 = _sample_gpu_power_w()                # ~50 ms, returns 1 s avg
t0 = time.monotonic()
resp = _router.completion(...)            # the actual LLM call
duration_s = time.monotonic() - t0
p1 = _sample_gpu_power_w()                # ~50 ms, returns 1 s avg
avg = (p0 + p1) / 2

avg approximates the mean power during the call: it is the midpoint of the two 1 s averages taken just before and just after it. avg × duration_s gives joules. The tracker records power_w_real=avg, joules_real=avg × duration_s, and measured=True.
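
The bracket pattern generalises to any callable. The sketch below is an illustration under assumptions: bracketed_call is a hypothetical helper, the sampler is injected rather than fetched from /v1/power, and returning measured=False when either sample is missing is an assumed behaviour.

```python
import time
from typing import Any, Callable, Optional


def bracketed_call(
    sample_power_w: Callable[[], Optional[float]],
    call: Callable[[], Any],
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Run `call` between two power samples and derive energy from their midpoint."""
    p0 = sample_power_w()            # bracket-start sample
    t0 = clock()
    result = call()                  # the actual LLM call
    duration_s = clock() - t0
    p1 = sample_power_w()            # bracket-end sample
    if p0 is None or p1 is None:
        # Power endpoint unreachable: leave energy to the fallback path.
        return {"result": result, "duration_s": duration_s,
                "power_w": None, "joules": None, "measured": False}
    avg_w = (p0 + p1) / 2
    return {"result": result, "duration_s": duration_s,
            "power_w": avg_w, "joules": avg_w * duration_s, "measured": True}


if __name__ == "__main__":
    out = bracketed_call(lambda: 50.0, lambda: time.sleep(0.01))
    print(out["measured"], round(out["power_w"], 1))
```

Two endpoint samples cannot see a power spike in the middle of a long call, which is why this path is coarser than the proxy's per-window mean.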


Hardware profiles (app/emissions.HARDWARE)

The fallback path uses a sustained-power figure from the hardware data sheet when no real measurement is available:

| Key | Label | Sustained W | Source |
| --- | --- | --- | --- |
| nvidia_l4 | NVIDIA L4 | 60 | L4 data sheet (72 W TGP, Ada Lovelace) |
| amd_mi300x | AMD MI300X | 600 | MI300X data sheet (750 W TDP); used when RIPRAP_HARDWARE_LABEL=AMD MI300X |
| nvidia_t4 | NVIDIA T4 | 50 | T4 data sheet (70 W max) |
| apple_m | Apple M-series | 20 | ml.energy / community measurements |
| cpu_server | x86 CPU | 30 | Typical sustained server-core load |

The fallback only fires when the proxy is unreachable, NVML initialisation failed, or the call was streamed. Streamed LLM calls are not yet measured precisely; they are bracket-sampled on a best-effort basis.
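
A sketch of the fallback selection, using the sustained-power figures from the table above. The dictionary and function names are hypothetical stand-ins; the actual structure of app/emissions.HARDWARE is not shown here.

```python
from typing import Optional

# Sustained-power figures (watts) copied from the table above.
HARDWARE_SUSTAINED_W = {
    "nvidia_l4": 60.0,
    "amd_mi300x": 600.0,
    "nvidia_t4": 50.0,
    "apple_m": 20.0,
    "cpu_server": 30.0,
}


def energy_for_call(
    duration_s: float,
    hardware_key: str,
    measured_power_w: Optional[float] = None,
) -> dict:
    """Use the measured power when present, else the data-sheet figure."""
    measured = measured_power_w is not None
    power_w = measured_power_w if measured else HARDWARE_SUSTAINED_W[hardware_key]
    joules = power_w * duration_s
    return {"power_w": power_w, "joules": joules,
            "wh": joules / 3600.0, "measured": measured}


if __name__ == "__main__":
    # A 14 s call with no measurement on an L4: 60 W x 14 s = 840 J.
    print(energy_for_call(14.0, "nvidia_l4"))
```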


End-to-end shape

Lablab UI Space (cpu-basic, FastAPI + SvelteKit)
   │
   │  Tracker installed per-query in web/main.py:
   │  install(Tracker())
   │
   ├── planner         - app/llm.py::chat
   │                     ├─ GET /v1/power  (bracket-start)
   │                     ├─ POST /v1/chat/completions
   │                     └─ GET /v1/power  (bracket-end)
   │
   ├── FSM specialists - app/inference.py::_post
   │                     POST /v1/{prithvi-pluvial, terramind, ...}
   │                     ← X-GPU-Power-W, X-GPU-Energy-J headers
   │
   └── reconciler      - app/llm.py::chat (Mellea-validated)
                         same bracket pattern as planner
                  │
                  ▼
       Tracker.summarize() → emissions block on /api/agent/stream final
                  │
                  ▼
       SvelteKit RunHealthStrip: chip rendered with measured-icon

Verifying

scripts/probe_stones_fire.py runs an end-to-end address query against the Lablab UI and asserts that:

  1. All five Stones fire
  2. No specialist returns the legacy dep-regression strings (torchvision::nms, deps unavailable on this deployment: terratorch)
  3. The final emissions block carries nvidia_l4 hardware and non-zero tokens

Run it with:

PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600

The first call after a Space restart pays a ~120 s vLLM CUDA-graph compile penalty; warm queries land at < 0.5 Wh / ~7 K tokens.


Why this matters

Inference cost is usually invisible. AI tools that publish a "green" or "low-energy" claim mostly cite a vendor data sheet or a research mean. Riprap reports the actual joules drawn from the device under the load of a single user query, auditable down to the row.

The raw ledger is shipped on the SSE final event under emissions.calls, so any consumer (dashboard, billing model, reproducibility check) can reuse the data without round-tripping back through Riprap.
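
As an illustration of reuse, a consumer could total the ledger from the final event's JSON payload. The payload layout assumed here (an emissions object holding a calls list whose rows carry wh, measured, and the token counts) is inferred from the field names in this document and is not a confirmed schema; summarize_ledger is a hypothetical name.

```python
import json


def summarize_ledger(final_event_data: str) -> dict:
    """Total energy and tokens from the `emissions.calls` rows (assumed layout)."""
    calls = json.loads(final_event_data)["emissions"]["calls"]
    return {
        "calls": len(calls),
        "wh": sum(c["wh"] for c in calls),
        "tokens": sum(
            c.get("prompt_tokens", 0) + c.get("completion_tokens", 0)
            for c in calls
        ),
        "measured_calls": sum(1 for c in calls if c["measured"]),
    }


if __name__ == "__main__":
    payload = json.dumps({"emissions": {"calls": [
        {"wh": 0.25, "measured": True, "prompt_tokens": 100, "completion_tokens": 20},
        {"wh": 0.5, "measured": False, "prompt_tokens": 300, "completion_tokens": 80},
    ]}})
    print(summarize_ledger(payload))
```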