File size: 8,108 Bytes
7cb5930 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 | # Deployment topology
Riprap is composed of two HF Spaces in production. The **UI Space**
is CPU-only and contains the FastAPI + SvelteKit front-end; the
**inference Space** is an L4 GPU and runs vLLM (Granite 4.1 8B FP8)
plus the EO model stack co-resident.
```
ββββββββββββββββββββββββββββββββββββββββββββββββ
β msradam/riprap-vllm (NVIDIA L4, 24 GB) β
β β
β :7860 proxy.py bearer-auth FastAPI β
β ββ /v1/chat/* /v1/embeddings β :8000 β
β ββ /v1/{prithvi,terramind,...} β :7861 β
β ββ /v1/power NVML readings β
β β
β :8000 vLLM Granite 4.1 8B FP8 β
β :7861 riprap-models β
β Prithvi-EO 2.0 NYC-Pluvial β
β TerraMind LULC + Buildings β
β Granite TTM r2 β
β GLiNER + Granite Embedding 278M β
ββββββββββββββββββββββ²ββββββββββββββββββββββββββ
β bearer-auth HTTPS
β
βββββββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββ
β lablab-ai-amd-developer-hackathon/riprap-nyc β
β Hackathon submission UI Β· cpu-basic β
β β
β FastAPI (web/main.py) + SvelteKit static build β
β Burr FSM (app/fsm.py) β
β β
β RIPRAP_LLM_BASE_URL = β¦/v1 β
β RIPRAP_ML_BASE_URL = β¦ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
```
The UI Space holds no GPU weights and contacts no commercial APIs.
Every model call routes through the bearer-authenticated proxy on
the inference Space.
---
## Hugging Face Spaces
### `lablab-ai-amd-developer-hackathon/riprap-nyc` β UI Space
The hackathon submission. CPU-basic tier. Image built from the root
`Dockerfile`. Holds no model weights β every inference call goes
remote via env vars.
**Required Space variables:**
```
RIPRAP_LLM_PRIMARY = vllm
RIPRAP_LLM_BASE_URL = https://msradam-riprap-vllm.hf.space/v1
RIPRAP_LLM_VLLM_8B_NAME = granite4.1:8b
RIPRAP_ML_BACKEND = remote
RIPRAP_ML_BASE_URL = https://msradam-riprap-vllm.hf.space
RIPRAP_NYCHA_REGISTERS = 1
RIPRAP_HEAVY_SPECIALISTS = 1
RIPRAP_PRITHVI_LIVE_ENABLE= 1
RIPRAP_TERRAMIND_ENABLE = 1
RIPRAP_EO_CHIP_ENABLE = 1
```
**Required secrets** (set via Settings β Variables and secrets):
```
RIPRAP_LLM_API_KEY bearer token shared with the inference Space
RIPRAP_ML_API_KEY bearer token shared with the inference Space
HF_TOKEN for register / catalog downloads
```
### `msradam/riprap-vllm` β Inference Space
L4 (`l4x1`) tier. Image built from `inference-vllm/Dockerfile`.
Bakes Granite 4.1 8B FP8 weights and the EO model dependencies
(terratorch + peft + diffusers + segmentation-models-pytorch +
nvidia-ml-py for NVML power sampling).
**Required secret:**
```
RIPRAP_PROXY_TOKEN bearer token; must match RIPRAP_LLM_API_KEY /
RIPRAP_ML_API_KEY on the UI Spaces
```
**Endpoints:**
| Path | Routes to | Notes |
|---|---|---|
| `POST /v1/chat/completions` | vLLM | Granite 4.1 8B FP8, OpenAI-compat |
| `POST /v1/completions` | vLLM | OpenAI-compat |
| `GET /v1/models` | vLLM | served-model-name family |
| `POST /v1/embeddings` | riprap-models | Granite Embedding 278M |
| `POST /v1/prithvi-pluvial` | riprap-models | Prithvi-EO 2.0 NYC-Pluvial |
| `POST /v1/terramind` | riprap-models | TerraMind LULC / Buildings / synthesis |
| `POST /v1/ttm-forecast` | riprap-models | Granite TTM r2 + Battery surge |
| `POST /v1/gliner-extract` | riprap-models | GLiNER typed-entity |
| `GET /v1/power` | proxy | Real NVML power (W) β see `docs/EMISSIONS.md` |
| `GET /healthz` | proxy + both backends | Aggregates health status |
All `/v1/*` endpoints require `Authorization: Bearer <PROXY_TOKEN>`.
`/v1/power` and the bracket-sampling LLM client path are described
in [`docs/EMISSIONS.md`](EMISSIONS.md).
---
## Personal mirror β `msradam/riprap`
Self-contained L4 mirror that runs the full stack (UI + vLLM + EO
models) in a single container. Used for parallel demos when the
shared inference Space is busy. Built from `Dockerfile.l4`.
```bash
scripts/deploy_personal_space.sh
```
This is paused by default for the hackathon period to keep the L4
budget on the primary inference Space.
---
## Local development
### Pure local (Ollama)
```bash
uv venv && uv pip install -r requirements.txt
cd web/sveltekit && npm ci && npm run build && cd ../..
ollama pull granite4.1:3b
ollama pull granite4.1:8b
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860
```
Visit `http://127.0.0.1:7860`. Inference runs locally β no GPU
power readings (the chip will display the data-sheet estimate with
a `~` icon).
### Local UI, remote inference
```bash
RIPRAP_LLM_PRIMARY=vllm \
RIPRAP_LLM_BASE_URL=https://msradam-riprap-vllm.hf.space/v1 \
RIPRAP_LLM_API_KEY=<token> \
RIPRAP_ML_BACKEND=remote \
RIPRAP_ML_BASE_URL=https://msradam-riprap-vllm.hf.space \
RIPRAP_ML_API_KEY=<token> \
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860
```
Same flow as the hosted UI Space, but rendered locally. Real NVML
power readings come back through the proxy headers and bracket
samples just like in production.
---
## Deploy commands
| Target | Script | Notes |
|---|---|---|
| Inference Space (`msradam/riprap-vllm`) | `scripts/deploy_vllm_space.sh` | Orphan-branch push from `inference-vllm/` |
| UI Space (`lablab-ai-amd-developer-hackathon/riprap-nyc`) | cherry-pick onto `huggingface/main` then `git push huggingface` | HF Spaces' xet hook rejects pushes that walk through commits with binaries; cherry-picking from a clean ancestor avoids it |
| Personal mirror (`msradam/riprap`) | `scripts/deploy_personal_space.sh` | Orphan-branch push from `Dockerfile.l4` |
| Inference fallback (`msradam/riprap-inference`) | `scripts/deploy_inference_space.sh` | Ollama-backed mirror; redundant when riprap-vllm is up |
---
## Verifying a deploy
```bash
PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600
```
Asserts: all five Stones fire, no torchvision/terratorch dep
regression, the `emissions` block reports `nvidia_l4` hardware, and
real NVML measurements come through (`n_measured` β `n_calls`).
The address probe sweeps the full canonical set (5 NYC addresses):
```bash
.venv/bin/python scripts/probe_addresses.py \
--base https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space
```
---
## Historical notes
The hackathon submission was originally built against an AMD MI300X
DigitalOcean droplet (running both vLLM and the EO model service).
The droplet was decommissioned **2026-05-06** and inference moved
to the L4 HF Spaces above. The bring-up runbook for the MI300X
droplet is preserved in [`docs/DROPLET-RUNBOOK.md`](DROPLET-RUNBOOK.md)
for anyone reproducing the original AMD-judging setup; setting
`RIPRAP_HARDWARE_LABEL=AMD MI300X` on a droplet redeploy will swap
the emissions ledger back to the MI300X data-sheet figures.
|