File size: 8,108 Bytes
7cb5930
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
# Deployment topology

Riprap is composed of two HF Spaces in production. The **UI Space**
is CPU-only and contains the FastAPI + SvelteKit front-end; the
**inference Space** is an L4 GPU and runs vLLM (Granite 4.1 8B FP8)
plus the EO model stack co-resident.

```
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚   msradam/riprap-vllm  (NVIDIA L4, 24 GB)    β”‚
                β”‚                                              β”‚
                β”‚   :7860  proxy.py    bearer-auth FastAPI     β”‚
                β”‚      β”œβ”€ /v1/chat/* /v1/embeddings  β†’ :8000   β”‚
                β”‚      └─ /v1/{prithvi,terramind,...} β†’ :7861  β”‚
                β”‚      └─ /v1/power   NVML readings            β”‚
                β”‚                                              β”‚
                β”‚   :8000  vLLM  Granite 4.1 8B FP8            β”‚
                β”‚   :7861  riprap-models                       β”‚
                β”‚          Prithvi-EO 2.0 NYC-Pluvial          β”‚
                β”‚          TerraMind LULC + Buildings          β”‚
                β”‚          Granite TTM r2                      β”‚
                β”‚          GLiNER + Granite Embedding 278M     β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–²β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     β”‚ bearer-auth HTTPS
                                     β”‚
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β”‚  lablab-ai-amd-developer-hackathon/riprap-nyc          β”‚
       β”‚  Hackathon submission UI Β· cpu-basic                   β”‚
       β”‚                                                        β”‚
       β”‚  FastAPI (web/main.py)  +  SvelteKit static build      β”‚
       β”‚  Burr FSM (app/fsm.py)                                 β”‚
       β”‚                                                        β”‚
       β”‚  RIPRAP_LLM_BASE_URL = …/v1                            β”‚
       β”‚  RIPRAP_ML_BASE_URL  = …                               β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

The UI Space holds no GPU weights and contacts no commercial APIs.
Every model call routes through the bearer-authenticated proxy on
the inference Space.

---

## Hugging Face Spaces

### `lablab-ai-amd-developer-hackathon/riprap-nyc` β€” UI Space

The hackathon submission. CPU-basic tier. Image built from the root
`Dockerfile`. Holds no model weights β€” every inference call goes
remote via env vars.

**Required Space variables:**

```
RIPRAP_LLM_PRIMARY        = vllm
RIPRAP_LLM_BASE_URL       = https://msradam-riprap-vllm.hf.space/v1
RIPRAP_LLM_VLLM_8B_NAME   = granite4.1:8b
RIPRAP_ML_BACKEND         = remote
RIPRAP_ML_BASE_URL        = https://msradam-riprap-vllm.hf.space
RIPRAP_NYCHA_REGISTERS    = 1
RIPRAP_HEAVY_SPECIALISTS  = 1
RIPRAP_PRITHVI_LIVE_ENABLE= 1
RIPRAP_TERRAMIND_ENABLE   = 1
RIPRAP_EO_CHIP_ENABLE     = 1
```

**Required secrets** (set via Settings β†’ Variables and secrets):

```
RIPRAP_LLM_API_KEY        bearer token shared with the inference Space
RIPRAP_ML_API_KEY         bearer token shared with the inference Space
HF_TOKEN                  for register / catalog downloads
```

### `msradam/riprap-vllm` β€” Inference Space

L4 (`l4x1`) tier. Image built from `inference-vllm/Dockerfile`.
Bakes Granite 4.1 8B FP8 weights and the EO model dependencies
(terratorch + peft + diffusers + segmentation-models-pytorch +
nvidia-ml-py for NVML power sampling).

**Required secret:**

```
RIPRAP_PROXY_TOKEN        bearer token; must match RIPRAP_LLM_API_KEY /
                          RIPRAP_ML_API_KEY on the UI Spaces
```

**Endpoints:**

| Path | Routes to | Notes |
|---|---|---|
| `POST /v1/chat/completions` | vLLM | Granite 4.1 8B FP8, OpenAI-compat |
| `POST /v1/completions` | vLLM | OpenAI-compat |
| `GET  /v1/models` | vLLM | served-model-name family |
| `POST /v1/embeddings` | riprap-models | Granite Embedding 278M |
| `POST /v1/prithvi-pluvial` | riprap-models | Prithvi-EO 2.0 NYC-Pluvial |
| `POST /v1/terramind` | riprap-models | TerraMind LULC / Buildings / synthesis |
| `POST /v1/ttm-forecast` | riprap-models | Granite TTM r2 + Battery surge |
| `POST /v1/gliner-extract` | riprap-models | GLiNER typed-entity |
| `GET  /v1/power` | proxy | Real NVML power (W) β€” see `docs/EMISSIONS.md` |
| `GET  /healthz` | proxy + both backends | Aggregates health status |

All `/v1/*` endpoints require `Authorization: Bearer <PROXY_TOKEN>`.
`/v1/power` and the bracket-sampling LLM client path are described
in [`docs/EMISSIONS.md`](EMISSIONS.md).

---

## Personal mirror β€” `msradam/riprap`

Self-contained L4 mirror that runs the full stack (UI + vLLM + EO
models) in a single container. Used for parallel demos when the
shared inference Space is busy. Built from `Dockerfile.l4`.

```bash
scripts/deploy_personal_space.sh
```

This is paused by default for the hackathon period to keep the L4
budget on the primary inference Space.

---

## Local development

### Pure local (Ollama)

```bash
uv venv && uv pip install -r requirements.txt
cd web/sveltekit && npm ci && npm run build && cd ../..
ollama pull granite4.1:3b
ollama pull granite4.1:8b
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860
```

Visit `http://127.0.0.1:7860`. Inference runs locally β€” no GPU
power readings (the chip will display the data-sheet estimate with
a `~` icon).

### Local UI, remote inference

```bash
RIPRAP_LLM_PRIMARY=vllm \
RIPRAP_LLM_BASE_URL=https://msradam-riprap-vllm.hf.space/v1 \
RIPRAP_LLM_API_KEY=<token> \
RIPRAP_ML_BACKEND=remote \
RIPRAP_ML_BASE_URL=https://msradam-riprap-vllm.hf.space \
RIPRAP_ML_API_KEY=<token> \
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860
```

Same flow as the hosted UI Space, but rendered locally. Real NVML
power readings come back through the proxy headers and bracket
samples just like in production.

---

## Deploy commands

| Target | Script | Notes |
|---|---|---|
| Inference Space (`msradam/riprap-vllm`) | `scripts/deploy_vllm_space.sh` | Orphan-branch push from `inference-vllm/` |
| UI Space (`lablab-ai-amd-developer-hackathon/riprap-nyc`) | cherry-pick onto `huggingface/main` then `git push huggingface` | HF Spaces' xet hook rejects pushes that walk through commits with binaries; cherry-picking from a clean ancestor avoids it |
| Personal mirror (`msradam/riprap`) | `scripts/deploy_personal_space.sh` | Orphan-branch push from `Dockerfile.l4` |
| Inference fallback (`msradam/riprap-inference`) | `scripts/deploy_inference_space.sh` | Ollama-backed mirror; redundant when riprap-vllm is up |

---

## Verifying a deploy

```bash
PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600
```

Asserts: all five Stones fire, no torchvision/terratorch dep
regression, the `emissions` block reports `nvidia_l4` hardware, and
real NVML measurements come through (`n_measured` β‰ˆ `n_calls`).

The address probe sweeps the full canonical set (5 NYC addresses):

```bash
.venv/bin/python scripts/probe_addresses.py \
    --base https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space
```

---

## Historical notes

The hackathon submission was originally built against an AMD MI300X
DigitalOcean droplet (running both vLLM and the EO model service).
The droplet was decommissioned **2026-05-06** and inference moved
to the L4 HF Spaces above. The bring-up runbook for the MI300X
droplet is preserved in [`docs/DROPLET-RUNBOOK.md`](DROPLET-RUNBOOK.md)
for anyone reproducing the original AMD-judging setup; setting
`RIPRAP_HARDWARE_LABEL=AMD MI300X` on a droplet redeploy will swap
the emissions ledger back to the MI300X data-sheet figures.