Deployment topology

In production, Riprap runs as two HF Spaces. The UI Space is CPU-only and serves the FastAPI app together with the SvelteKit front-end; the inference Space runs on an L4 GPU and hosts vLLM (Granite 4.1 8B FP8) with the EO model stack co-resident.

                ┌──────────────────────────────────────────────┐
                │   msradam/riprap-vllm  (NVIDIA L4, 24 GB)    │
                │                                              │
                │   :7860  proxy.py    bearer-auth FastAPI     │
                │      ├─ /v1/chat/* /v1/embeddings  → :8000   │
                │      ├─ /v1/{prithvi,terramind,...} → :7861  │
                │      └─ /v1/power   NVML readings            │
                │                                              │
                │   :8000  vLLM  Granite 4.1 8B FP8            │
                │   :7861  riprap-models                       │
                │          Prithvi-EO 2.0 NYC-Pluvial          │
                │          TerraMind LULC + Buildings          │
                │          Granite TTM r2                      │
                │          GLiNER + Granite Embedding 278M     │
                └────────────────────▲─────────────────────────┘
                                     │ bearer-auth HTTPS
                                     │
       ┌─────────────────────────────┴──────────────────────────┐
       │  lablab-ai-amd-developer-hackathon/riprap-nyc          │
       │  Hackathon submission UI · cpu-basic                   │
       │                                                        │
       │  FastAPI (web/main.py)  +  SvelteKit static build      │
       │  Burr FSM (app/fsm.py)                                 │
       │                                                        │
       │  RIPRAP_LLM_BASE_URL = …/v1                            │
       │  RIPRAP_ML_BASE_URL  = …                               │
       └────────────────────────────────────────────────────────┘

The UI Space holds no GPU weights and contacts no commercial APIs. Every model call routes through the bearer-authenticated proxy on the inference Space.


Hugging Face Spaces

lablab-ai-amd-developer-hackathon/riprap-nyc (UI Space)

The hackathon submission. CPU-basic tier. Image built from the root Dockerfile. Holds no model weights; every inference call goes to the remote endpoints named in the environment variables.

Required Space variables:

RIPRAP_LLM_PRIMARY        = vllm
RIPRAP_LLM_BASE_URL       = https://msradam-riprap-vllm.hf.space/v1
RIPRAP_LLM_VLLM_8B_NAME   = granite4.1:8b
RIPRAP_ML_BACKEND         = remote
RIPRAP_ML_BASE_URL        = https://msradam-riprap-vllm.hf.space
RIPRAP_NYCHA_REGISTERS    = 1
RIPRAP_HEAVY_SPECIALISTS  = 1
RIPRAP_PRITHVI_LIVE_ENABLE= 1
RIPRAP_TERRAMIND_ENABLE   = 1
RIPRAP_EO_CHIP_ENABLE     = 1

Required secrets (set via Settings → Variables and secrets):

RIPRAP_LLM_API_KEY        bearer token shared with the inference Space
RIPRAP_ML_API_KEY         bearer token shared with the inference Space
HF_TOKEN                  for register / catalog downloads
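
A startup check along these lines can catch a missing variable or secret before the first inference call fails. This is a minimal sketch: the names are copied from the two lists above, but `missing_settings` is a hypothetical helper, not something the repository is known to ship.

```python
import os

# Names copied from the "Required Space variables" and "Required secrets" lists.
REQUIRED_VARIABLES = (
    "RIPRAP_LLM_PRIMARY",
    "RIPRAP_LLM_BASE_URL",
    "RIPRAP_LLM_VLLM_8B_NAME",
    "RIPRAP_ML_BACKEND",
    "RIPRAP_ML_BASE_URL",
    "RIPRAP_NYCHA_REGISTERS",
    "RIPRAP_HEAVY_SPECIALISTS",
    "RIPRAP_PRITHVI_LIVE_ENABLE",
    "RIPRAP_TERRAMIND_ENABLE",
    "RIPRAP_EO_CHIP_ENABLE",
)
REQUIRED_SECRETS = ("RIPRAP_LLM_API_KEY", "RIPRAP_ML_API_KEY", "HF_TOKEN")


def missing_settings(env=None):
    """Return the required names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    names = REQUIRED_VARIABLES + REQUIRED_SECRETS
    return [name for name in names if not str(env.get(name, "")).strip()]
```

Calling `missing_settings()` with no argument inspects the process environment; passing a dict makes the check easy to exercise in tests.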

msradam/riprap-vllm (Inference Space)

L4 (l4x1) tier. Image built from inference-vllm/Dockerfile. Bakes Granite 4.1 8B FP8 weights and the EO model dependencies (terratorch + peft + diffusers + segmentation-models-pytorch + nvidia-ml-py for NVML power sampling).

Required secret:

RIPRAP_PROXY_TOKEN        bearer token; must match RIPRAP_LLM_API_KEY /
                          RIPRAP_ML_API_KEY on the UI Space
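
The proxy's actual check lives in proxy.py and is not reproduced here. As a sketch of the usual technique, assuming the proxy compares the presented bearer token against RIPRAP_PROXY_TOKEN, a constant-time comparison avoids leaking information about the token through response timing. `is_authorized` is a hypothetical name.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Check an 'Authorization: Bearer <token>' header value in constant time."""
    if not expected_token or not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest takes time independent of where the two inputs differ.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

For example, `is_authorized("Bearer " + token, token)` is true for any non-empty token, while a `Basic` scheme or a missing header is rejected.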

Endpoints:

| Path | Routes to | Notes |
|---|---|---|
| POST /v1/chat/completions | vLLM | Granite 4.1 8B FP8, OpenAI-compat |
| POST /v1/completions | vLLM | OpenAI-compat |
| GET /v1/models | vLLM | served-model-name family |
| POST /v1/embeddings | riprap-models | Granite Embedding 278M |
| POST /v1/prithvi-pluvial | riprap-models | Prithvi-EO 2.0 NYC-Pluvial |
| POST /v1/terramind | riprap-models | TerraMind LULC / Buildings / synthesis |
| POST /v1/ttm-forecast | riprap-models | Granite TTM r2 + Battery surge |
| POST /v1/gliner-extract | riprap-models | GLiNER typed-entity |
| GET /v1/power | proxy | Real NVML power (W); see docs/EMISSIONS.md |
| GET /healthz | proxy + both backends | Aggregates health status |

All /v1/* endpoints require Authorization: Bearer <PROXY_TOKEN>. /v1/power and the bracket-sampling LLM client path are described in docs/EMISSIONS.md.
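
The endpoint list can be read as a lookup from path to backend. The sketch below transcribes it as data; the ports come from the topology diagram, and the path-to-backend assignments follow the Endpoints list (which places /v1/embeddings under riprap-models). The labels and both helper functions are illustrative assumptions, not the contents of proxy.py.

```python
# Backend labels are illustrative; ports are the ones shown in the topology diagram.
VLLM = "vllm:8000"
MODELS = "riprap-models:7861"
PROXY = "proxy:7860"  # answered by the proxy itself

ROUTES = {
    "/v1/chat/completions": VLLM,
    "/v1/completions": VLLM,
    "/v1/models": VLLM,
    "/v1/embeddings": MODELS,
    "/v1/prithvi-pluvial": MODELS,
    "/v1/terramind": MODELS,
    "/v1/ttm-forecast": MODELS,
    "/v1/gliner-extract": MODELS,
    "/v1/power": PROXY,
    "/healthz": PROXY,  # the proxy aggregates health from both backends
}


def backend_for(path):
    """Return the backend label serving path, or None for unlisted paths."""
    return ROUTES.get(path.rstrip("/") or "/")


def needs_bearer(path):
    """Encode only the stated rule: every /v1/* endpoint requires the token.

    Nothing is stated here about /healthz, so it falls outside this rule.
    """
    return path.startswith("/v1/")
```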


Personal mirror (msradam/riprap)

Self-contained L4 mirror that runs the full stack (UI + vLLM + EO models) in a single container. Used for parallel demos when the shared inference Space is busy. Built from Dockerfile.l4.

scripts/deploy_personal_space.sh

The mirror is paused by default during the hackathon period so that the L4 budget goes to the primary inference Space.


Local development

Pure local (Ollama)

uv venv && uv pip install -r requirements.txt
cd web/sveltekit && npm ci && npm run build && cd ../..
ollama pull granite4.1:3b
ollama pull granite4.1:8b
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860

Visit http://127.0.0.1:7860. Inference runs locally, so there are no GPU power readings (the chip displays the data-sheet estimate with a ~ icon).

Local UI, remote inference

RIPRAP_LLM_PRIMARY=vllm \
RIPRAP_LLM_BASE_URL=https://msradam-riprap-vllm.hf.space/v1 \
RIPRAP_LLM_API_KEY=<token> \
RIPRAP_ML_BACKEND=remote \
RIPRAP_ML_BASE_URL=https://msradam-riprap-vllm.hf.space \
RIPRAP_ML_API_KEY=<token> \
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860

Same flow as the hosted UI Space, but rendered locally. Real NVML power readings come back through the proxy headers and bracket samples just like in production.
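
Before starting the UI against the remote Space, it can help to confirm that the token is accepted. The sketch below only builds the request for GET /v1/models so it can be inspected offline; `models_request` is a hypothetical helper, and it assumes the base URL already ends in /v1 as in the variables above.

```python
import urllib.request


def models_request(base_url, token):
    """Build (but do not send) an authenticated GET for the model list."""
    url = base_url.rstrip("/") + "/models"  # base_url is expected to end in /v1
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

To send it, pass the result to `urllib.request.urlopen(..., timeout=30)` with the values of RIPRAP_LLM_BASE_URL and RIPRAP_LLM_API_KEY. An HTTP error at this step most likely points to a token mismatch or a Space that is not running, though other causes are possible.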


Deploy commands

| Target | Script | Notes |
|---|---|---|
| Inference Space (msradam/riprap-vllm) | scripts/deploy_vllm_space.sh | Orphan-branch push from inference-vllm/ |
| UI Space (lablab-ai-amd-developer-hackathon/riprap-nyc) | cherry-pick onto huggingface/main then git push huggingface | HF Spaces' xet hook rejects pushes that walk through commits with binaries; cherry-picking from a clean ancestor avoids it |
| Personal mirror (msradam/riprap) | scripts/deploy_personal_space.sh | Orphan-branch push from Dockerfile.l4 |
| Inference fallback (msradam/riprap-inference) | scripts/deploy_inference_space.sh | Ollama-backed mirror; redundant when riprap-vllm is up |

Verifying a deploy

PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600

Asserts: all five Stones fire, no torchvision/terratorch dep regression, the emissions block reports nvidia_l4 hardware, and real NVML measurements come through (n_measured ≈ n_calls).
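
The emissions part of that check can be pictured as follows. This is a sketch under stated assumptions, not the probe's code: the dict keys `hardware`, `n_measured`, and `n_calls` and the 0.9 threshold are hypothetical, chosen only to illustrate one reading of "n_measured ≈ n_calls".

```python
def emissions_look_measured(block, expected_hardware="nvidia_l4", min_ratio=0.9):
    """Return True when an emissions block matches the probe's stated expectations.

    Assumed (hypothetical) keys: 'hardware', 'n_measured', 'n_calls'.
    """
    n_calls = block.get("n_calls", 0)
    if block.get("hardware") != expected_hardware or n_calls <= 0:
        return False
    # "Approximately equal" is read here as: at least min_ratio of calls were measured.
    return block.get("n_measured", 0) / n_calls >= min_ratio
```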

The address probe sweeps the full canonical set (5 NYC addresses):

.venv/bin/python scripts/probe_addresses.py \
    --base https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space

Historical notes

The hackathon submission was originally built against an AMD MI300X DigitalOcean droplet (running both vLLM and the EO model service). The droplet was decommissioned 2026-05-06 and inference moved to the L4 HF Spaces above. The bring-up runbook for the MI300X droplet is preserved in docs/DROPLET-RUNBOOK.md for anyone reproducing the original AMD-judging setup; setting RIPRAP_HARDWARE_LABEL=AMD MI300X on a droplet redeploy will swap the emissions ledger back to the MI300X data-sheet figures.