Deploying Her · हेर to a Hugging Face ZeroGPU Space
A one-shot runbook for standing up Her on a private or public Hugging Face ZeroGPU
Space, written so another operator (human or agent) can do it end-to-end without
prior context. scripts/deploy.py automates the whole deployment; the rest of this
doc explains what it does and how to verify and troubleshoot it.
0. What gets deployed
Her runs in Gradio Server mode (gradio.Server) because ZeroGPU only supports
the Gradio SDK and its GPU quota needs the HF iframe auth headers forwarded:
- Deterministic engine endpoints (/api/health|sessions|upload|analyze|project|clear|consent) are plain FastAPI routes the React UI calls with fetch.
- GPU narration (overview/advice/chat/project_chat/project_narrative) are Gradio API endpoints (@app.api) the browser calls via @gradio/client (forwards auth).
- The built React SPA (ui/dist) is served from /.
- Uploaded sessions persist on an HF storage bucket mounted at /data, namespaced per browser (/data/<sha256(token)>/…), auto-deleted after 24h / on Clear / on exit.
- A shared binary registry lives at /data/_registry/ (outside all user namespaces) and is enriched over time (local bundled DB → Nemotron → public registries).
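The per-browser namespacing and the 24h expiry described above can be sketched as follows. This is a minimal illustration, not Her's actual code: the function names, the UTF-8 encoding of the token, and the use of file modification time for expiry are all assumptions.

```python
import hashlib
import time
from pathlib import Path
from typing import Optional

DATA_ROOT = Path("/data")      # bucket mount point named in the text
TTL_SECONDS = 24 * 60 * 60     # the 24h retention window


def namespace_dir(client_token: str, root: Path = DATA_ROOT) -> Path:
    """Map a browser token to its private directory under the data root."""
    # Assumption: the token is hashed as UTF-8 and stored as a hex digest.
    digest = hashlib.sha256(client_token.encode("utf-8")).hexdigest()
    return root / digest


def is_expired(mtime: float, now: Optional[float] = None) -> bool:
    """True once a file's modification time is more than 24h in the past."""
    now = time.time() if now is None else now
    return now - mtime > TTL_SECONDS
```

A SHA-256 hex digest contains only the characters 0-9 and a-f, so a user directory named this way can never collide with the shared _registry directory.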
1. Prerequisites
- A Hugging Face PRO account (required for private ZeroGPU Spaces and ZeroGPU quota). Org Team/Enterprise plan if deploying under an org.
- A write token: https://huggingface.co/settings/tokens → run hf auth login (or export HF_TOKEN=hf_…).
- Python 3.10+ with huggingface_hub (use the deploy venv: python3.10 -m venv .venv-deploy && . .venv-deploy/bin/activate && pip install "huggingface_hub>=1.0").
- Node 18/20 + npm (to build the UI once).
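The locally checkable prerequisites could be verified before deploying with a small preflight along these lines. It is a sketch using only the standard library; the function name and messages are hypothetical, and it cannot confirm the PRO account or a token stored by hf auth login.

```python
import os
import shutil
import sys


def preflight(version=sys.version_info, env=os.environ, which=shutil.which):
    """Return a list of human-readable problems; empty means ready to deploy."""
    problems = []
    if tuple(version[:2]) < (3, 10):
        problems.append("Python 3.10+ is required")
    for tool in ("node", "npm"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if not env.get("HF_TOKEN"):
        # Not necessarily fatal: `hf auth login` stores the token on disk instead.
        problems.append("HF_TOKEN is not set (fine if you ran hf auth login)")
    return problems
```

Calling preflight() with no arguments inspects the current interpreter, environment, and PATH; the parameters exist only so the check can be exercised with fake inputs.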
2. Build the UI (required — the Space does NOT run npm)
cd ui && npm install && npm run build && cd ..
# produces ui/dist (git-ignored, but shipped by deploy.py)
Optional — refresh the local binary registry (top CLI tools from Homebrew + npm + PyPI):
python3 scripts/build_binaries_db.py # writes narrator/knowledge/binaries.bundled.json
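Because the Space never runs npm, a missing build only surfaces after deployment. A guard like the following could catch it earlier; it is a sketch that assumes the build emits ui/dist/index.html (typical for React bundlers, but not stated in this doc), and the function name is hypothetical.

```python
from pathlib import Path


def ui_is_built(repo_root: str = ".") -> bool:
    """True when ui/dist exists and contains an index.html entry point."""
    dist = Path(repo_root) / "ui" / "dist"
    # Assumption: index.html is the SPA entry point produced by `npm run build`.
    return (dist / "index.html").is_file()
```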
3. Deploy (one command)
# PRIVATE test space (update an existing one — bucket already mounted):
python scripts/deploy.py --space <owner>/her
# PUBLIC hackathon space from scratch (creates space + bucket, mounts it, makes it public):
python scripts/deploy.py --space <org>/her --public --create
deploy.py is idempotent and does all of:
- create_repo(space_sdk="gradio", private=…, exist_ok=True) + enforce visibility.
- create_bucket(<owner>/<name>-data) (only with --create/--factory).
- Set the four Space variables:
  - SPACE_MODEL_REPO=nvidia/Nemotron-Mini-4B-Instruct
  - HER_DATA_DIR=/data
  - HER_EXTRA_ROOT=/data
  - HER_LEARNED_PATH=/data/_registry/binaries.learned.json
- Attach the bucket volume at /data (only with --create/--factory).
- upload_folder (ships ui/dist + the bundled DB; excludes trace content, venvs, node_modules, .git, *.gguf).
- request_space_hardware("zero-a10g") (ZeroGPU).
- restart_space(factory_reboot=True) (with --create/--factory): required the first time a bucket is attached; otherwise /data is ephemeral container disk.
Why --create matters: a plain restart does NOT mount a newly-attached bucket. The factory reboot does. For later code-only updates, drop --create (faster).
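The code-and-settings part of that sequence can be sketched with huggingface_hub's HfApi methods (create_repo, add_space_variable, upload_folder, request_space_hardware, restart_space). This is an illustration, not the contents of scripts/deploy.py: the function name and ignore patterns are assumptions, and bucket creation, bucket mounting, and visibility enforcement are left out.

```python
SPACE_VARIABLES = {
    "SPACE_MODEL_REPO": "nvidia/Nemotron-Mini-4B-Instruct",
    "HER_DATA_DIR": "/data",
    "HER_EXTRA_ROOT": "/data",
    "HER_LEARNED_PATH": "/data/_registry/binaries.learned.json",
}

# Illustrative patterns only; the real exclusion list lives in deploy.py.
IGNORE = ["*.gguf", "node_modules/*", "*/node_modules/*", ".venv*/*"]


def deploy_code(api, space: str, private: bool, factory: bool = False) -> None:
    """Push code and settings to a Space; `api` is a huggingface_hub.HfApi."""
    # exist_ok=True keeps the call safe to repeat on an existing Space.
    api.create_repo(space, repo_type="space", space_sdk="gradio",
                    private=private, exist_ok=True)
    for key, value in SPACE_VARIABLES.items():
        api.add_space_variable(space, key, value)
    api.upload_folder(repo_id=space, repo_type="space", folder_path=".",
                      ignore_patterns=IGNORE)
    api.request_space_hardware(space, "zero-a10g")
    if factory:
        # Only a factory reboot mounts a newly attached bucket.
        api.restart_space(space, factory_reboot=True)
```

With the deploy venv active, this would be called as deploy_code(HfApi(), "<owner>/her", private=True) after importing HfApi from huggingface_hub. Every call either tolerates an existing resource or overwrites a value, which is what makes a rerun harmless.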
4. Verify
# in the deploy venv; uses your HF token for the (possibly private) Space
import httpx, time
from huggingface_hub import get_token
H = {"Authorization": f"Bearer {get_token()}"}
BASE = "https://<owner>-her.hf.space" # owner/name -> owner-name.hf.space
c = httpx.Client(headers=H, timeout=180, follow_redirects=True)
print(c.get(BASE + "/api/health").json()) # -> {"ok":true,"llama":true,"gpu":true}
# upload a .jsonl, then:
# GET /api/analyze?path=<returned path> with header X-Her-Client: <any token>
# -> engine JSON (turns/tools/cost/binaries)
# GPU narration (forwards auth):
from gradio_client import Client
gc = Client("<owner>/her", token=get_token())
print(gc.predict("<uploaded path>", "<client token>", api_name="/overview"))
Checklist:
- health.llama == true → the model loaded (watch build logs if not, see below).
- Upload → /api/sessions (with X-Her-Client) shows your sessions grouped into projects.
- A different X-Her-Client sees nothing (per-user isolation).
- gradio_client.predict(..., api_name="/overview") returns grounded prose.
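The first checklist item can be automated by polling the health endpoint until every flag is true, since a freshly restarted Space may need time to load the model. The sketch below assumes the response shape shown earlier ({"ok", "llama", "gpu"}); the function names, retry count, and delay are hypothetical.

```python
import time

REQUIRED = ("ok", "llama", "gpu")


def failed_flags(health: dict) -> list:
    """Return the health keys that are missing or not exactly True."""
    return [key for key in REQUIRED if health.get(key) is not True]


def wait_until_healthy(fetch, attempts=10, delay=30.0, sleep=time.sleep):
    """Poll `fetch()` (returns the health dict) until every flag is True.

    Returns an empty list on success, else the flags still failing.
    """
    failed = list(REQUIRED)
    for attempt in range(attempts):
        try:
            failed = failed_flags(fetch())
        except Exception:
            failed = list(REQUIRED)  # Space still booting or unreachable
        if not failed:
            return []
        if attempt < attempts - 1:
            sleep(delay)
    return failed
```

With the httpx client from the snippet above, fetch would be lambda: c.get(BASE + "/api/health").json().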
5. Troubleshooting
- health.llama == false (model didn't load). Read logs: api.fetch_space_logs("<owner>/her"). The default model nvidia/Nemotron-Mini-4B-Instruct is a standard arch (loads natively). Do not use the Mamba-hybrid Nemotron-Nano-9B-v2; its remote code needs mamba-ssm/causal-conv1d CUDA kernels that don't build on ZeroGPU. Swap models by setting the SPACE_MODEL_REPO variable (no redeploy needed; it restarts).
- Uploads vanish on restart / bucket empty. The bucket wasn't mounted: re-run with --factory (or restart_space(factory_reboot=True)). Confirm with api.list_bucket_tree("<owner>/her-data") after an upload.
- ZeroGPU not requestable via API. Set it in Space → Settings → Hardware → ZeroGPU.
- Blank UI / 503. ui/dist wasn't built/shipped: run npm run build then redeploy.
- GPU calls 401/fail from the browser. They must go through @gradio/client (not raw fetch) so the iframe auth forwards; this is already how the React app calls them.
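Swapping the model through the SPACE_MODEL_REPO variable could be wrapped in a guard that rejects the model this section warns about. This is a sketch: the helper name and blocklist are hypothetical, and api is expected to be a huggingface_hub.HfApi, whose add_space_variable method sets the variable.

```python
# Substrings of model names the text says cannot run on ZeroGPU.
UNSUPPORTED_MODELS = ("nemotron-nano-9b-v2",)


def swap_model(api, space: str, model_repo: str) -> None:
    """Point the Space at another model via the SPACE_MODEL_REPO variable."""
    lowered = model_repo.lower()
    if any(bad in lowered for bad in UNSUPPORTED_MODELS):
        raise ValueError(
            f"{model_repo} needs mamba-ssm/causal-conv1d kernels that do not build on ZeroGPU"
        )
    # Changing a Space variable restarts the Space; no redeploy is needed.
    api.add_space_variable(space, "SPACE_MODEL_REPO", model_repo)
```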
6. Pinned versions & key facts
- sdk: gradio, sdk_version: 6.16.0 (Server mode), python_version: "3.10.13", app_file: app.py, hardware zero-a10g.
- requirements.txt: gradio==6.16.0, spaces, python-multipart, torch, transformers>=4.48.3,<5, accelerate, sentencepiece, einops, huggingface_hub.
- Model: nvidia/Nemotron-Mini-4B-Instruct (swap via SPACE_MODEL_REPO).
- @gradio/client (JS) pinned to match (^2.2.1).
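Since the Gradio version is pinned in two places (sdk_version and requirements.txt), a small consistency check may help when bumping it. The sketch below is an assumption about how one might do this, not part of the repo; the function names are hypothetical.

```python
import re


def gradio_pin(requirements_text: str):
    """Return the exact gradio version pinned in requirements.txt, or None."""
    for line in requirements_text.splitlines():
        spec = line.split("#")[0]  # drop trailing comments
        match = re.fullmatch(r"\s*gradio\s*==\s*([\w.]+)\s*", spec)
        if match:
            return match.group(1)
    return None


def pins_agree(requirements_text: str, sdk_version: str) -> bool:
    """True when requirements.txt pins gradio to the Space's sdk_version."""
    return gradio_pin(requirements_text) == sdk_version
```

For example, pins_agree(open("requirements.txt").read(), "6.16.0") would return False after bumping only one of the two pins.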
7. Before a PUBLIC launch
- Privacy disclosure: show the "we never store your sessions; only anonymous tool names are kept" copy in the first-run disclaimer (ui/src/components/DisclaimerModal.jsx).
- ZeroGPU quota: public visitors draw on the owner's ZeroGPU quota (then pre-paid credits). Consider a soft cap if traffic is high.
- Per-user isolation + 24h auto-clear are already on and public-safe.
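One possible shape for the soft cap mentioned above is a per-client sliding-window counter checked before each GPU narration call. This is a sketch under stated assumptions: the class name, the limit of 20 calls per hour, and keying on the client token are all hypothetical, and the state is in-memory, so it resets when the Space restarts.

```python
import time
from collections import defaultdict, deque


class SoftCap:
    """Allow at most `limit` GPU calls per client within a sliding window."""

    def __init__(self, limit: int = 20, window_s: float = 3600.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._calls = defaultdict(deque)  # client id -> timestamps of recent calls

    def allow(self, client_id: str) -> bool:
        """Record the call and return True, or return False if over the cap."""
        now = self.clock()
        calls = self._calls[client_id]
        while calls and now - calls[0] >= self.window_s:
            calls.popleft()  # forget calls that fell out of the window
        if len(calls) >= self.limit:
            return False
        calls.append(now)
        return True
```

A narration endpoint would call cap.allow(client_token) first and return a "try again later" message instead of requesting the GPU when it is False.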