riprap-nyc / docs /DEPLOY.md
seriffic's picture
docs: full README + new EMISSIONS / DEPLOY / CHANGELOG / CONTRIBUTING
7cb5930
# Deployment topology
Riprap is composed of two HF Spaces in production. The **UI Space**
is CPU-only and contains the FastAPI + SvelteKit front-end; the
**inference Space** is an L4 GPU and runs vLLM (Granite 4.1 8B FP8)
plus the EO model stack co-resident.
```
┌──────────────────────────────────────────────┐
│ msradam/riprap-vllm (NVIDIA L4, 24 GB) │
│ │
│ :7860 proxy.py bearer-auth FastAPI │
│ ├─ /v1/chat/* /v1/embeddings → :8000 │
│ └─ /v1/{prithvi,terramind,...} → :7861 │
│ └─ /v1/power NVML readings │
│ │
│ :8000 vLLM Granite 4.1 8B FP8 │
│ :7861 riprap-models │
│ Prithvi-EO 2.0 NYC-Pluvial │
│ TerraMind LULC + Buildings │
│ Granite TTM r2 │
│ GLiNER + Granite Embedding 278M │
└────────────────────▲─────────────────────────┘
│ bearer-auth HTTPS
┌─────────────────────────────┴──────────────────────────┐
│ lablab-ai-amd-developer-hackathon/riprap-nyc │
│ Hackathon submission UI · cpu-basic │
│ │
│ FastAPI (web/main.py) + SvelteKit static build │
│ Burr FSM (app/fsm.py) │
│ │
│ RIPRAP_LLM_BASE_URL = …/v1 │
│ RIPRAP_ML_BASE_URL = … │
└────────────────────────────────────────────────────────┘
```
The UI Space holds no GPU weights and contacts no commercial APIs.
Every model call routes through the bearer-authenticated proxy on
the inference Space.
---
## Hugging Face Spaces
### `lablab-ai-amd-developer-hackathon/riprap-nyc` — UI Space
The hackathon submission. CPU-basic tier. Image built from the root
`Dockerfile`. Holds no model weights — every inference call goes
remote via env vars.
**Required Space variables:**
```
RIPRAP_LLM_PRIMARY = vllm
RIPRAP_LLM_BASE_URL = https://msradam-riprap-vllm.hf.space/v1
RIPRAP_LLM_VLLM_8B_NAME = granite4.1:8b
RIPRAP_ML_BACKEND = remote
RIPRAP_ML_BASE_URL = https://msradam-riprap-vllm.hf.space
RIPRAP_NYCHA_REGISTERS = 1
RIPRAP_HEAVY_SPECIALISTS = 1
RIPRAP_PRITHVI_LIVE_ENABLE= 1
RIPRAP_TERRAMIND_ENABLE = 1
RIPRAP_EO_CHIP_ENABLE = 1
```
**Required secrets** (set via Settings → Variables and secrets):
```
RIPRAP_LLM_API_KEY bearer token shared with the inference Space
RIPRAP_ML_API_KEY bearer token shared with the inference Space
HF_TOKEN for register / catalog downloads
```
### `msradam/riprap-vllm` — Inference Space
L4 (`l4x1`) tier. Image built from `inference-vllm/Dockerfile`.
Bakes Granite 4.1 8B FP8 weights and the EO model dependencies
(terratorch + peft + diffusers + segmentation-models-pytorch +
nvidia-ml-py for NVML power sampling).
**Required secret:**
```
RIPRAP_PROXY_TOKEN bearer token; must match RIPRAP_LLM_API_KEY /
RIPRAP_ML_API_KEY on the UI Spaces
```
**Endpoints:**
| Path | Routes to | Notes |
|---|---|---|
| `POST /v1/chat/completions` | vLLM | Granite 4.1 8B FP8, OpenAI-compat |
| `POST /v1/completions` | vLLM | OpenAI-compat |
| `GET /v1/models` | vLLM | served-model-name family |
| `POST /v1/embeddings` | riprap-models | Granite Embedding 278M |
| `POST /v1/prithvi-pluvial` | riprap-models | Prithvi-EO 2.0 NYC-Pluvial |
| `POST /v1/terramind` | riprap-models | TerraMind LULC / Buildings / synthesis |
| `POST /v1/ttm-forecast` | riprap-models | Granite TTM r2 + Battery surge |
| `POST /v1/gliner-extract` | riprap-models | GLiNER typed-entity |
| `GET /v1/power` | proxy | Real NVML power (W) — see `docs/EMISSIONS.md` |
| `GET /healthz` | proxy + both backends | Aggregates health status |
All `/v1/*` endpoints require `Authorization: Bearer <PROXY_TOKEN>`.
`/v1/power` and the bracket-sampling LLM client path are described
in [`docs/EMISSIONS.md`](EMISSIONS.md).
---
## Personal mirror — `msradam/riprap`
Self-contained L4 mirror that runs the full stack (UI + vLLM + EO
models) in a single container. Used for parallel demos when the
shared inference Space is busy. Built from `Dockerfile.l4`.
```bash
scripts/deploy_personal_space.sh
```
This is paused by default for the hackathon period to keep the L4
budget on the primary inference Space.
---
## Local development
### Pure local (Ollama)
```bash
uv venv && uv pip install -r requirements.txt
cd web/sveltekit && npm ci && npm run build && cd ../..
ollama pull granite4.1:3b
ollama pull granite4.1:8b
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860
```
Visit `http://127.0.0.1:7860`. Inference runs locally — no GPU
power readings (the chip will display the data-sheet estimate with
a `~` icon).
### Local UI, remote inference
```bash
RIPRAP_LLM_PRIMARY=vllm \
RIPRAP_LLM_BASE_URL=https://msradam-riprap-vllm.hf.space/v1 \
RIPRAP_LLM_API_KEY=<token> \
RIPRAP_ML_BACKEND=remote \
RIPRAP_ML_BASE_URL=https://msradam-riprap-vllm.hf.space \
RIPRAP_ML_API_KEY=<token> \
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860
```
Same flow as the hosted UI Space, but rendered locally. Real NVML
power readings come back through the proxy headers and bracket
samples just like in production.
---
## Deploy commands
| Target | Script | Notes |
|---|---|---|
| Inference Space (`msradam/riprap-vllm`) | `scripts/deploy_vllm_space.sh` | Orphan-branch push from `inference-vllm/` |
| UI Space (`lablab-ai-amd-developer-hackathon/riprap-nyc`) | cherry-pick onto `huggingface/main` then `git push huggingface` | HF Spaces' xet hook rejects pushes that walk through commits with binaries; cherry-picking from a clean ancestor avoids it |
| Personal mirror (`msradam/riprap`) | `scripts/deploy_personal_space.sh` | Orphan-branch push from `Dockerfile.l4` |
| Inference fallback (`msradam/riprap-inference`) | `scripts/deploy_inference_space.sh` | Ollama-backed mirror; redundant when riprap-vllm is up |
---
## Verifying a deploy
```bash
PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600
```
Asserts: all five Stones fire, no torchvision/terratorch dep
regression, the `emissions` block reports `nvidia_l4` hardware, and
real NVML measurements come through (`n_measured``n_calls`).
The address probe sweeps the full canonical set (5 NYC addresses):
```bash
.venv/bin/python scripts/probe_addresses.py \
--base https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space
```
---
## Historical notes
The hackathon submission was originally built against an AMD MI300X
DigitalOcean droplet (running both vLLM and the EO model service).
The droplet was decommissioned **2026-05-06** and inference moved
to the L4 HF Spaces above. The bring-up runbook for the MI300X
droplet is preserved in [`docs/DROPLET-RUNBOOK.md`](DROPLET-RUNBOOK.md)
for anyone reproducing the original AMD-judging setup; setting
`RIPRAP_HARDWARE_LABEL=AMD MI300X` on a droplet redeploy will swap
the emissions ledger back to the MI300X data-sheet figures.