| # Deployment topology |
|
|
| Riprap runs as two HF Spaces in production. The **UI Space** is |
| CPU-only and serves the FastAPI + SvelteKit front-end; the |
| **inference Space** runs on an L4 GPU and hosts vLLM (Granite 4.1 8B |
| FP8) alongside the EO model stack. |
|
|
| ``` |
|          ┌──────────────────────────────────────────────┐ |
|          │  msradam/riprap-vllm   (NVIDIA L4, 24 GB)    │ |
|          │                                              │ |
|          │  :7860  proxy.py   bearer-auth FastAPI       │ |
|          │   ├─ /v1/chat/*  /v1/embeddings  → :8000     │ |
|          │   ├─ /v1/{prithvi,terramind,...} → :7861     │ |
|          │   └─ /v1/power   NVML readings               │ |
|          │                                              │ |
|          │  :8000  vLLM       Granite 4.1 8B FP8        │ |
|          │  :7861  riprap-models                        │ |
|          │           Prithvi-EO 2.0 NYC-Pluvial         │ |
|          │           TerraMind LULC + Buildings         │ |
|          │           Granite TTM r2                     │ |
|          │           GLiNER + Granite Embedding 278M    │ |
|          └────────────────────▲─────────────────────────┘ |
|                               │  bearer-auth HTTPS |
|                               │ |
| ┌─────────────────────────────┴──────────────────────────┐ |
| │  lablab-ai-amd-developer-hackathon/riprap-nyc          │ |
| │  Hackathon submission UI  ·  cpu-basic                 │ |
| │                                                        │ |
| │  FastAPI (web/main.py) + SvelteKit static build        │ |
| │  Burr FSM (app/fsm.py)                                 │ |
| │                                                        │ |
| │  RIPRAP_LLM_BASE_URL = …/v1                            │ |
| │  RIPRAP_ML_BASE_URL  = …                               │ |
| └────────────────────────────────────────────────────────┘ |
| ``` |
|
|
| The UI Space holds no GPU weights and contacts no commercial APIs. |
| Every model call routes through the bearer-authenticated proxy on |
| the inference Space. |
|
|
| --- |
|
|
| ## Hugging Face Spaces |
|
|
| ### `lablab-ai-amd-developer-hackathon/riprap-nyc` — UI Space |
|
|
| The hackathon submission. CPU-basic tier. Image built from the root |
| `Dockerfile`. Holds no model weights; every inference call goes to |
| the remote endpoints named in the variables below. |
|
|
| **Required Space variables:** |
|
|
| ``` |
| RIPRAP_LLM_PRIMARY         = vllm |
| RIPRAP_LLM_BASE_URL        = https://msradam-riprap-vllm.hf.space/v1 |
| RIPRAP_LLM_VLLM_8B_NAME    = granite4.1:8b |
| RIPRAP_ML_BACKEND          = remote |
| RIPRAP_ML_BASE_URL         = https://msradam-riprap-vllm.hf.space |
| RIPRAP_NYCHA_REGISTERS     = 1 |
| RIPRAP_HEAVY_SPECIALISTS   = 1 |
| RIPRAP_PRITHVI_LIVE_ENABLE = 1 |
| RIPRAP_TERRAMIND_ENABLE    = 1 |
| RIPRAP_EO_CHIP_ENABLE      = 1 |
| ``` |
|
|
| **Required secrets** (set via Settings → Variables and secrets): |
|
|
| ``` |
| RIPRAP_LLM_API_KEY   bearer token shared with the inference Space |
| RIPRAP_ML_API_KEY    bearer token shared with the inference Space |
| HF_TOKEN             for register / catalog downloads |
| ``` |
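
A missing variable or secret tends to surface only as a failed remote call, so a start-up check can save some debugging. The sketch below is not part of the repository: `missing_settings` is a hypothetical helper that treats every name listed above as required and reports those that are unset or blank.

```python
import os

# Names copied from the lists above; the helper itself is hypothetical.
REQUIRED_VARIABLES = (
    "RIPRAP_LLM_PRIMARY",
    "RIPRAP_LLM_BASE_URL",
    "RIPRAP_LLM_VLLM_8B_NAME",
    "RIPRAP_ML_BACKEND",
    "RIPRAP_ML_BASE_URL",
    "RIPRAP_NYCHA_REGISTERS",
    "RIPRAP_HEAVY_SPECIALISTS",
    "RIPRAP_PRITHVI_LIVE_ENABLE",
    "RIPRAP_TERRAMIND_ENABLE",
    "RIPRAP_EO_CHIP_ENABLE",
)
REQUIRED_SECRETS = ("RIPRAP_LLM_API_KEY", "RIPRAP_ML_API_KEY", "HF_TOKEN")


def missing_settings(env=None):
    """Return the required names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    names = REQUIRED_VARIABLES + REQUIRED_SECRETS
    return [name for name in names if not str(env.get(name, "")).strip()]
```

Calling `missing_settings()` with no argument reads the process environment; an empty list means every setting is present.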
|
|
| ### `msradam/riprap-vllm` — Inference Space |
|
|
| L4 (`l4x1`) tier. Image built from `inference-vllm/Dockerfile`. |
| Bakes Granite 4.1 8B FP8 weights and the EO model dependencies |
| (terratorch + peft + diffusers + segmentation-models-pytorch + |
| nvidia-ml-py for NVML power sampling). |
|
|
| **Required secret:** |
|
|
| ``` |
| RIPRAP_PROXY_TOKEN   bearer token; must match RIPRAP_LLM_API_KEY / |
|                      RIPRAP_ML_API_KEY on the UI Space |
| ``` |
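
To illustrate what "must match" means in practice, here is a minimal sketch of a bearer-token check. It is an assumption about how such a proxy could validate requests, not a copy of `proxy.py`; `is_authorized` is a hypothetical name.

```python
import hmac


def is_authorized(authorization_header, proxy_token):
    """Return True when the header carries exactly the expected bearer token."""
    if not proxy_token or not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(presented.strip().encode(), proxy_token.encode())
```

Under this scheme a UI Space whose `RIPRAP_LLM_API_KEY` differs from `RIPRAP_PROXY_TOKEN` by even one character would be rejected on every call.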
|
|
| **Endpoints:** |
|
|
| | Path | Routes to | Notes | |
| |---|---|---| |
| | `POST /v1/chat/completions` | vLLM | Granite 4.1 8B FP8, OpenAI-compat | |
| | `POST /v1/completions` | vLLM | OpenAI-compat | |
| | `GET /v1/models` | vLLM | served-model-name family | |
| | `POST /v1/embeddings` | riprap-models | Granite Embedding 278M | |
| | `POST /v1/prithvi-pluvial` | riprap-models | Prithvi-EO 2.0 NYC-Pluvial | |
| | `POST /v1/terramind` | riprap-models | TerraMind LULC / Buildings / synthesis | |
| | `POST /v1/ttm-forecast` | riprap-models | Granite TTM r2 + Battery surge | |
| | `POST /v1/gliner-extract` | riprap-models | GLiNER typed-entity | |
| | `GET /v1/power` | proxy | Real NVML power (W) — see `docs/EMISSIONS.md` | |
| | `GET /healthz` | proxy + both backends | Aggregates health status | |
|
|
| All `/v1/*` endpoints require `Authorization: Bearer <PROXY_TOKEN>`, |
| where the token is the value of `RIPRAP_PROXY_TOKEN`. `/v1/power` and |
| the bracket-sampling LLM client path are described in |
| [`docs/EMISSIONS.md`](EMISSIONS.md). |
|
|
| --- |
|
|
| ## Personal mirror — `msradam/riprap` |
|
|
| Self-contained L4 mirror that runs the full stack (UI + vLLM + EO |
| models) in a single container. Used for parallel demos when the |
| shared inference Space is busy. Built from `Dockerfile.l4`. |
|
|
| ```bash |
| scripts/deploy_personal_space.sh |
| ``` |
|
|
| The mirror is paused by default during the hackathon period to |
| reserve the L4 budget for the primary inference Space. |
|
|
| --- |
|
|
| ## Local development |
|
|
| ### Pure local (Ollama) |
|
|
| ```bash |
| uv venv && uv pip install -r requirements.txt |
| cd web/sveltekit && npm ci && npm run build && cd ../.. |
| ollama pull granite4.1:3b |
| ollama pull granite4.1:8b |
| .venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860 |
| ``` |
|
|
| Visit `http://127.0.0.1:7860`. Inference runs locally, so no GPU |
| power readings are available; the chip displays the data-sheet |
| estimate with a `~` icon instead. |
|
|
| ### Local UI, remote inference |
|
|
| ```bash |
| RIPRAP_LLM_PRIMARY=vllm \ |
| RIPRAP_LLM_BASE_URL=https://msradam-riprap-vllm.hf.space/v1 \ |
| RIPRAP_LLM_API_KEY=<token> \ |
| RIPRAP_ML_BACKEND=remote \ |
| RIPRAP_ML_BASE_URL=https://msradam-riprap-vllm.hf.space \ |
| RIPRAP_ML_API_KEY=<token> \ |
| .venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860 |
| ``` |
|
|
| Same flow as the hosted UI Space, but served locally. Real NVML |
| power readings come back through the proxy headers and bracket |
| samples, just as in production. |
|
|
| --- |
|
|
| ## Deploy commands |
|
|
| | Target | Script | Notes | |
| |---|---|---| |
| | Inference Space (`msradam/riprap-vllm`) | `scripts/deploy_vllm_space.sh` | Orphan-branch push from `inference-vllm/` | |
| | UI Space (`lablab-ai-amd-developer-hackathon/riprap-nyc`) | cherry-pick onto `huggingface/main` then `git push huggingface` | HF Spaces' xet hook rejects pushes that walk through commits with binaries; cherry-picking from a clean ancestor avoids it | |
| | Personal mirror (`msradam/riprap`) | `scripts/deploy_personal_space.sh` | Orphan-branch push from `Dockerfile.l4` | |
| | Inference fallback (`msradam/riprap-inference`) | `scripts/deploy_inference_space.sh` | Ollama-backed mirror; redundant when riprap-vllm is up | |
|
|
| --- |
|
|
| ## Verifying a deploy |
|
|
| ```bash |
| PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600 |
| ``` |
|
|
| The probe asserts that all five Stones fire, that there is no |
| torchvision/terratorch dependency regression, that the `emissions` |
| block reports `nvidia_l4` hardware, and that real NVML measurements |
| come through (`n_measured` ≈ `n_calls`). |
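
The emissions part of that check can be sketched as a small function. This is an illustration under stated assumptions, not the code in `scripts/probe_stones_fire.py`: the `hardware` key name and the 10% tolerance used to interpret "≈" are guesses, while `n_measured` and `n_calls` come from the description above.

```python
def check_emissions(block, expected_hardware="nvidia_l4", tolerance=0.1):
    """Return a list of problems found in an emissions block (empty if OK)."""
    problems = []
    # "hardware" is an assumed key name; n_measured / n_calls come from the text.
    if block.get("hardware") != expected_hardware:
        problems.append(f"hardware is {block.get('hardware')!r}")
    n_calls = block.get("n_calls", 0)
    n_measured = block.get("n_measured", 0)
    if n_calls <= 0:
        problems.append("no LLM calls recorded")
    elif abs(n_calls - n_measured) > tolerance * n_calls:
        problems.append(f"only {n_measured} of {n_calls} calls were measured")
    return problems
```

A large gap between the two counters would suggest that the client is falling back to data-sheet estimates rather than receiving NVML readings.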
|
|
| The address probe sweeps the full canonical set of five NYC addresses: |
|
|
| ```bash |
| .venv/bin/python scripts/probe_addresses.py \ |
| --base https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space |
| ``` |
|
|
| --- |
|
|
| ## Historical notes |
|
|
| The hackathon submission was originally built against an AMD MI300X |
| DigitalOcean droplet that ran both vLLM and the EO model service. |
| The droplet was decommissioned on **2026-05-06** and inference moved |
| to the L4 HF Spaces above. The bring-up runbook for the MI300X |
| droplet is preserved in [`docs/DROPLET-RUNBOOK.md`](DROPLET-RUNBOOK.md) |
| for anyone reproducing the original AMD-judging setup. Setting |
| `RIPRAP_HARDWARE_LABEL=AMD MI300X` on a droplet redeploy switches |
| the emissions ledger back to the MI300X data-sheet figures. |
|
|