Spaces:

lablab-ai-amd-developer-hackathon
/

riprap-nyc

Running

App Files Files Community

riprap-nyc / docs /DEPLOY.md

seriffic

docs: full README + new EMISSIONS / DEPLOY / CHANGELOG / CONTRIBUTING

7cb5930 14 days ago

preview code

raw

history blame contribute delete

8.11 kB

Deployment topology

Riprap is composed of two HF Spaces in production. The UI Space is CPU-only and contains the FastAPI + SvelteKit front-end; the inference Space is an L4 GPU and runs vLLM (Granite 4.1 8B FP8) plus the EO model stack co-resident.

                ┌──────────────────────────────────────────────┐
                │   msradam/riprap-vllm  (NVIDIA L4, 24 GB)    │
                │                                              │
                │   :7860  proxy.py    bearer-auth FastAPI     │
                │      ├─ /v1/chat/* /v1/embeddings  → :8000   │
                │      └─ /v1/{prithvi,terramind,...} → :7861  │
                │      └─ /v1/power   NVML readings            │
                │                                              │
                │   :8000  vLLM  Granite 4.1 8B FP8            │
                │   :7861  riprap-models                       │
                │          Prithvi-EO 2.0 NYC-Pluvial          │
                │          TerraMind LULC + Buildings          │
                │          Granite TTM r2                      │
                │          GLiNER + Granite Embedding 278M     │
                └────────────────────▲─────────────────────────┘
                                     │ bearer-auth HTTPS
                                     │
       ┌─────────────────────────────┴──────────────────────────┐
       │  lablab-ai-amd-developer-hackathon/riprap-nyc          │
       │  Hackathon submission UI · cpu-basic                   │
       │                                                        │
       │  FastAPI (web/main.py)  +  SvelteKit static build      │
       │  Burr FSM (app/fsm.py)                                 │
       │                                                        │
       │  RIPRAP_LLM_BASE_URL = …/v1                            │
       │  RIPRAP_ML_BASE_URL  = …                               │
       └────────────────────────────────────────────────────────┘

The UI Space holds no GPU weights and contacts no commercial APIs. Every model call routes through the bearer-authenticated proxy on the inference Space.

Hugging Face Spaces

`lablab-ai-amd-developer-hackathon/riprap-nyc` — UI Space

The hackathon submission. CPU-basic tier. Image built from the root Dockerfile. Holds no model weights — every inference call goes remote via env vars.

Required Space variables:

RIPRAP_LLM_PRIMARY        = vllm
RIPRAP_LLM_BASE_URL       = https://msradam-riprap-vllm.hf.space/v1
RIPRAP_LLM_VLLM_8B_NAME   = granite4.1:8b
RIPRAP_ML_BACKEND         = remote
RIPRAP_ML_BASE_URL        = https://msradam-riprap-vllm.hf.space
RIPRAP_NYCHA_REGISTERS    = 1
RIPRAP_HEAVY_SPECIALISTS  = 1
RIPRAP_PRITHVI_LIVE_ENABLE= 1
RIPRAP_TERRAMIND_ENABLE   = 1
RIPRAP_EO_CHIP_ENABLE     = 1

Required secrets (set via Settings → Variables and secrets):

RIPRAP_LLM_API_KEY        bearer token shared with the inference Space
RIPRAP_ML_API_KEY         bearer token shared with the inference Space
HF_TOKEN                  for register / catalog downloads

`msradam/riprap-vllm` — Inference Space

L4 (l4x1) tier. Image built from inference-vllm/Dockerfile. Bakes Granite 4.1 8B FP8 weights and the EO model dependencies (terratorch + peft + diffusers + segmentation-models-pytorch + nvidia-ml-py for NVML power sampling).

Required secret:

RIPRAP_PROXY_TOKEN        bearer token; must match RIPRAP_LLM_API_KEY /
                          RIPRAP_ML_API_KEY on the UI Spaces

Endpoints:

Path	Routes to	Notes
`POST /v1/chat/completions`	vLLM	Granite 4.1 8B FP8, OpenAI-compat
`POST /v1/completions`	vLLM	OpenAI-compat
`GET /v1/models`	vLLM	served-model-name family
`POST /v1/embeddings`	riprap-models	Granite Embedding 278M
`POST /v1/prithvi-pluvial`	riprap-models	Prithvi-EO 2.0 NYC-Pluvial
`POST /v1/terramind`	riprap-models	TerraMind LULC / Buildings / synthesis
`POST /v1/ttm-forecast`	riprap-models	Granite TTM r2 + Battery surge
`POST /v1/gliner-extract`	riprap-models	GLiNER typed-entity
`GET /v1/power`	proxy	Real NVML power (W) — see `docs/EMISSIONS.md`
`GET /healthz`	proxy + both backends	Aggregates health status

All /v1/* endpoints require Authorization: Bearer <PROXY_TOKEN>. /v1/power and the bracket-sampling LLM client path are described in docs/EMISSIONS.md.

Personal mirror — `msradam/riprap`

Self-contained L4 mirror that runs the full stack (UI + vLLM + EO models) in a single container. Used for parallel demos when the shared inference Space is busy. Built from Dockerfile.l4.

scripts/deploy_personal_space.sh

This is paused by default for the hackathon period to keep the L4 budget on the primary inference Space.

Local development

Pure local (Ollama)

uv venv && uv pip install -r requirements.txt
cd web/sveltekit && npm ci && npm run build && cd ../..
ollama pull granite4.1:3b
ollama pull granite4.1:8b
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860

Visit http://127.0.0.1:7860. Inference runs locally — no GPU power readings (the chip will display the data-sheet estimate with a ~ icon).

Local UI, remote inference

RIPRAP_LLM_PRIMARY=vllm \
RIPRAP_LLM_BASE_URL=https://msradam-riprap-vllm.hf.space/v1 \
RIPRAP_LLM_API_KEY=<token> \
RIPRAP_ML_BACKEND=remote \
RIPRAP_ML_BASE_URL=https://msradam-riprap-vllm.hf.space \
RIPRAP_ML_API_KEY=<token> \
.venv/bin/uvicorn web.main:app --host 127.0.0.1 --port 7860

Same flow as the hosted UI Space, but rendered locally. Real NVML power readings come back through the proxy headers and bracket samples just like in production.

Deploy commands

Target	Script	Notes
Inference Space (`msradam/riprap-vllm`)	`scripts/deploy_vllm_space.sh`	Orphan-branch push from `inference-vllm/`
UI Space (`lablab-ai-amd-developer-hackathon/riprap-nyc`)	cherry-pick onto `huggingface/main` then `git push huggingface`	HF Spaces' xet hook rejects pushes that walk through commits with binaries; cherry-picking from a clean ancestor avoids it
Personal mirror (`msradam/riprap`)	`scripts/deploy_personal_space.sh`	Orphan-branch push from `Dockerfile.l4`
Inference fallback (`msradam/riprap-inference`)	`scripts/deploy_inference_space.sh`	Ollama-backed mirror; redundant when riprap-vllm is up

Verifying a deploy

PYTHONPATH=. uv run python scripts/probe_stones_fire.py --timeout 600

Asserts: all five Stones fire, no torchvision/terratorch dep regression, the emissions block reports nvidia_l4 hardware, and real NVML measurements come through (n_measured ≈ n_calls).

The address probe sweeps the full canonical set (5 NYC addresses):

.venv/bin/python scripts/probe_addresses.py \
    --base https://lablab-ai-amd-developer-hackathon-riprap-nyc.hf.space

Historical notes

The hackathon submission was originally built against an AMD MI300X DigitalOcean droplet (running both vLLM and the EO model service). The droplet was decommissioned 2026-05-06 and inference moved to the L4 HF Spaces above. The bring-up runbook for the MI300X droplet is preserved in docs/DROPLET-RUNBOOK.md for anyone reproducing the original AMD-judging setup; setting RIPRAP_HARDWARE_LABEL=AMD MI300X on a droplet redeploy will swap the emissions ledger back to the MI300X data-sheet figures.