# Deployment topology

Riprap is composed of two HF Spaces in production. The **UI Space**
is CPU-only and contains the FastAPI + SvelteKit front-end; the
**inference Space** is an L4 GPU and runs vLLM (Granite 4.1 8B FP8)
plus the EO model stack co-resident.

```
                ┌──────────────────────────────────────────────┐
                │   msradam/riprap-vllm  (NVIDIA L4, 24 GB)    │
                │                                              │
                │   :7860  proxy.py    bearer-auth FastAPI     │
                │      ├─ /v1/chat/* /v1/embeddings  → :8000   │
                │      └─ /v1/{prithvi,terramind,...} → :7861  │
                │      └─ /v1/power   NVML readings            │
                │                                              │
                │   :8000  vLLM  Granite 4.1 8B FP8            │
                │   :7861  riprap-models                       │
                │          Prithvi-EO 2.0 NYC-Pluvial          │
                │          TerraMind LULC + Buildings          │
                │          Granite TTM r2                      │
                │          GLiNER + Granite Embedding 278M     │
                └────────────────────▲─────────────────────────┘
                                     │ bearer-auth HTTPS
                                     │
       ┌─────────────────────────────┴──────────────────────────┐
       │  lablab-ai-amd-developer-hackathon/riprap-nyc          │
       │  Hackathon submission UI · cpu-basic                   │
       │                                                        │
       │  FastAPI (web/main.py)  +  SvelteKit static build      │
       │  Burr FSM (app/fsm.py)                                 │
       │                                                        │
       │  RIPRAP_LLM_BASE_URL = …/v1                            │
       │  RIPRAP_ML_BASE_URL  = …                               │
       └────────────────────────────────────────────────────────┘
```

The UI Space holds no GPU weights and contacts no commercial APIs.
Every model call routes through the bearer-authenticated proxy on
the inference Space.

---

## Hugging Face Spaces

### `lablab-ai-amd-developer-hackathon/riprap-nyc` — UI Space

The hackathon submission. CPU-basic tier. Image built from the root
`Dockerfile`. Holds no model weights — every inference call goes
remote via env vars.

**Required Space variables:**

```
RIPRAP_LLM_PRIMARY        = vllm
RIPRAP_LLM_BASE_URL       = https://msradam-riprap-vllm.hf.space/v1
RIPRAP_LLM_VLLM_8B_NAME   = granite4.1:8b
RIPRAP_ML_BACKEND         = remote
RIPRAP_ML_BASE_URL        = https://msradam-riprap-vllm.hf.space
RIPRAP_NYCHA_REGISTERS    = 1
RIPRAP_HEAVY_SPECIALISTS  = 1
RIPRAP_PRITHVI_LIVE_ENABLE= 1
RIPRAP_TERRAMIND_ENABLE   = 1
RIPRAP_EO_CHIP_ENABLE     = 1
```

**Required secrets** (set via Settings → Variables and secrets):

```
RIPRAP_LLM_API_KEY        bearer token shared with the inference Space
RIPRAP_ML_API_KEY         bearer token shared with the inference Space
HF_TOKEN                  for register / catalog downloads
```

### `msradam/riprap-vllm` — Inference Space

L4 (`l4x1`) tier. Image built from `inference-vllm/Dockerfile`.
Bakes Granite 4.1 8B FP8 weights and the EO model dependencies
(terratorch + peft + diffusers + segmentation-models-pytorch +
nvidia-ml-py for NVML power sampling).

**Required secret:**

```
RIPRAP_PROXY_TOKEN        bearer token; must match RIPRAP_LLM_API_KEY /
                          RIPRAP_ML_API_KEY on the UI Spaces
```

**Endpoints:**

| Path | Routes to | Notes |
|---|---|---|
| `POST /v1/chat/completions` | vLLM | Granite 4.1 8B FP8, OpenAI-compat |
| `POST /v1/completions` | vLLM | OpenAI-compat |
| `GET  /v1/models` | vLLM | served-model-name family |
| `POST /v1/embeddings` | riprap-models | Granite Embedding 278M |
| `POST /v1/prithvi-pluvial` | riprap-models | Prithvi-EO 2.0 NYC-Pluvial |
| `POST /v1/terramind` | riprap-models | TerraMind LULC / Buildings / synthesis |
| `POST /v1/ttm-forecast` | riprap-models | Granite TTM r2 + Battery surge |
| `POST /v1/gliner-extract` | riprap-models | GLiNER typed-entity |
| `GET  /v1/power` | proxy | Real NVML power (W) — see `docs/EMISSIONS.md` |
| `GET  /healthz` | proxy + both backends | Aggregates health status |

All `/v1/*` endpoints require `Authorization: Bearer <PROXY_TOKEN>`.
`/v1/power` and the bracket-sampling LLM client path are described
in [`docs/EMISSIONS.md`](EMISSIONS.md).

---

## Personal mirror — `msradam/riprap`

Self-contained L4 mirror that runs the full stack (UI + vLLM + EO
models) in a single container. Used for parallel demos when the
shared inference Space is busy. Built from `Dockerfile.l4`.

```bash
scripts/deploy_personal_space.sh
```

This is paused by default for the hackathon period to keep the L4
budget on the primary inference Space.

---

## Local development

### Pure local (Ollama)

```bash
uv venv && uv pip install -r requirements.txt
cd web/sveltekit && npm ci && npm run build && cd ../..
ollama pull granite4.1:3b
ollama pull granite4.1:8b
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860
```

Visit `http://127.0.0.1:7860`. Inference runs locally — no GPU
power readings (the chip will display the data-sheet estimate with
a `~` icon).

### Local UI, remote inference

```bash
RIPRAP_LLM_PRIMARY=vllm \
RIPRAP_LLM_BASE_URL=https://msradam-riprap-vllm.hf.space/v1 \
RIPRAP_LLM_API_KEY=<token> \
RIPRAP_ML_BACKEND=remote \
RIPRAP_ML_BASE_URL=https://msradam-riprap-vllm.hf.space \
RIPRAP_ML_API_KEY=<token> \
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860
```

Same flow as the hosted UI Space, but rendered locally. Real NVML
power readings come back through the proxy headers and bracket
samples just like in production.

---

## Deploy commands

| Target | Script | Notes |
|---|---|---|
| Inference Space (`msradam/riprap-vllm`) | `scripts/deploy_vllm_space.sh` | Orphan-branch push from `inference-vllm/` |
| UI Space (`lablab-ai-amd-developer-hackathon/riprap-nyc`) | cherry-pick onto `huggingface/main` then `git push huggingface` | HF Spaces' xet hook rejects pushes that walk through commits with binaries; cherry-picking from a clean ancestor avoids it |
| Personal mirror (`msradam/riprap`) | `scripts/deploy_personal_space.sh` | Orphan-branch push from `Dockerfile.l4` |
| Inference fallback (`msradam/riprap-inference`) | `scripts/deploy_inference_space.sh` | Ollama-backed mirror; redundant when riprap-vllm is up |

---

## Verifying a deploy

```bash
PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600
```

Asserts: all five Stones fire, no torchvision/terratorch dep
regression, the `emissions` block reports `nvidia_l4` hardware, and
real NVML measurements come through (`n_measured` ≈ `n_calls`).

The address probe sweeps the full canonical set (5 NYC addresses):

```bash
.venv/bin/python scripts/probe_addresses.py \
    --base https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space
```

---

## Historical notes

The hackathon submission was originally built against an AMD MI300X
DigitalOcean droplet (running both vLLM and the EO model service).
The droplet was decommissioned **2026-05-06** and inference moved
to the L4 HF Spaces above. The bring-up runbook for the MI300X
droplet is preserved in [`docs/DROPLET-RUNBOOK.md`](DROPLET-RUNBOOK.md)
for anyone reproducing the original AMD-judging setup; setting
`RIPRAP_HARDWARE_LABEL=AMD MI300X` on a droplet redeploy will swap
the emissions ledger back to the MI300X data-sheet figures.