Deploying Her · हेर to a Hugging Face ZeroGPU Space

A one-shot runbook for standing up Her on a private or public Hugging Face ZeroGPU Space, written so that another operator (human or agent) can do it end-to-end without prior context. The whole process is automated by scripts/deploy.py; the rest of this doc explains what the script does and how to verify and troubleshoot the result.


0. What gets deployed

Her runs in Gradio Server mode (gradio.Server) because ZeroGPU only supports the Gradio SDK and its GPU quota needs the HF iframe auth headers forwarded:

  • Deterministic engine endpoints (/api/health|sessions|upload|analyze|project|clear|consent) are plain FastAPI routes the React UI calls with fetch.
  • GPU narration endpoints (overview/advice/chat/project_chat/project_narrative) are Gradio API endpoints (@app.api) that the browser calls via @gradio/client, which forwards the auth headers.
  • The built React SPA (ui/dist) is served from /.
  • Uploaded sessions persist on an HF storage bucket mounted at /data, namespaced per browser (/data/<sha256(token)>/…), auto-deleted after 24h / on Clear / on exit.
  • A shared binary registry lives at /data/_registry/ (outside all user namespaces) and is enriched over time (local bundled DB → Nemotron → public registries).
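
The per-browser namespacing and the 24h retention window described above can be sketched as follows. This is a minimal illustration, not the code Her ships: the function names namespace_dir and is_expired are hypothetical, and treating the token as UTF-8 before hashing is an assumption.

```python
import hashlib
import time
from pathlib import PurePosixPath
from typing import Optional

DATA_ROOT = PurePosixPath("/data")   # bucket mount point, from the runbook
TTL_SECONDS = 24 * 60 * 60           # 24h retention window, from the runbook


def namespace_dir(client_token: str) -> PurePosixPath:
    """Map a browser token to its private directory, /data/<sha256(token)>.

    Hypothetical helper: UTF-8 encoding of the token is an assumption.
    """
    digest = hashlib.sha256(client_token.encode("utf-8")).hexdigest()
    return DATA_ROOT / digest


def is_expired(uploaded_at: float, now: Optional[float] = None) -> bool:
    """Return True once an upload is older than the 24h retention window."""
    current = time.time() if now is None else now
    return current - uploaded_at > TTL_SECONDS
```

Because a SHA-256 hex digest is always 64 hexadecimal characters, a user directory can never collide with the shared /data/_registry/ path.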

1. Prerequisites

  • A Hugging Face PRO account (required for private ZeroGPU Spaces and ZeroGPU quota). Org Team/Enterprise plan if deploying under an org.
  • A write token from https://huggingface.co/settings/tokens; then run hf auth login (or export HF_TOKEN=hf_…).
  • Python 3.10+ with huggingface_hub (use the deploy venv: python3.10 -m venv .venv-deploy && . .venv-deploy/bin/activate && pip install "huggingface_hub>=1.0").
  • Node 18/20 + npm (to build the UI once).

2. Build the UI (required — the Space does NOT run npm)

cd ui && npm install && npm run build && cd ..
# produces ui/dist (git-ignored, but shipped by deploy.py)

Optional — refresh the local binary registry (top CLI tools from Homebrew + npm + PyPI):

python3 scripts/build_binaries_db.py        # writes narrator/knowledge/binaries.bundled.json

3. Deploy (one command)

# PRIVATE test space (update an existing one — bucket already mounted):
python scripts/deploy.py --space <owner>/her

# PUBLIC hackathon space from scratch (creates space + bucket, mounts it, makes it public):
python scripts/deploy.py --space <org>/her --public --create

deploy.py is idempotent and does all of:

  1. create_repo(space_sdk="gradio", private=…, exist_ok=True) + enforce visibility.
  2. create_bucket(<owner>/<name>-data) (only with --create/--factory).
  3. Set the four Space variables:
    • SPACE_MODEL_REPO=nvidia/Nemotron-Mini-4B-Instruct
    • HER_DATA_DIR=/data
    • HER_EXTRA_ROOT=/data
    • HER_LEARNED_PATH=/data/_registry/binaries.learned.json
  4. Attach the bucket volume at /data (only with --create/--factory).
  5. upload_folder (ships ui/dist + the bundled DB; excludes trace content, venvs, node_modules, .git, *.gguf).
  6. request_space_hardware("zero-a10g") (ZeroGPU).
  7. restart_space(factory_reboot=True) (only with --create/--factory). Required the first time a bucket is attached; otherwise /data is ephemeral container disk.

Why --create matters: a plain restart does NOT mount a newly-attached bucket. The factory reboot does. For later code-only updates, drop --create (faster).
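
To make the difference between a code-only update and a --create run concrete, here is a small sketch of which steps from the list above run in each mode. It does not call the Hub and is not the logic inside scripts/deploy.py; the function name deploy_steps and the step labels are illustrative assumptions. The variable values are copied from step 3.

```python
# The four Space variables from step 3 of the runbook.
SPACE_VARIABLES = {
    "SPACE_MODEL_REPO": "nvidia/Nemotron-Mini-4B-Instruct",
    "HER_DATA_DIR": "/data",
    "HER_EXTRA_ROOT": "/data",
    "HER_LEARNED_PATH": "/data/_registry/binaries.learned.json",
}


def deploy_steps(create: bool = False) -> list[str]:
    """List the deploy steps in order for the given mode.

    Hypothetical helper: `create` stands in for --create/--factory.
    """
    steps = ["create_repo"]
    if create:
        steps.append("create_bucket")
    steps.append("set_space_variables")
    if create:
        steps.append("attach_bucket_volume")
    steps += ["upload_folder", "request_space_hardware"]
    if create:
        # Only the factory reboot mounts a newly attached bucket.
        steps.append("restart_space(factory_reboot=True)")
    return steps


print(deploy_steps(create=False))
print(deploy_steps(create=True))
```

A code-only update skips the three bucket-related steps, which is why dropping --create is faster.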


4. Verify

# in the deploy venv; uses your HF token for the (possibly private) Space
import httpx
from huggingface_hub import get_token
H = {"Authorization": f"Bearer {get_token()}"}
BASE = "https://<owner>-her.hf.space"           # owner/name -> owner-name.hf.space
c = httpx.Client(headers=H, timeout=180, follow_redirects=True)

print(c.get(BASE + "/api/health").json())        # -> {"ok":true,"llama":true,"gpu":true}
# upload a .jsonl, then:
#   GET /api/analyze?path=<returned path>  with header X-Her-Client: <any token>
#   -> engine JSON (turns/tools/cost/binaries)
# GPU narration (forwards auth):
from gradio_client import Client
gc = Client("<owner>/her", token=get_token())
print(gc.predict("<uploaded path>", "<client token>", api_name="/overview"))

Checklist:

  • health.llama == true → the model loaded (watch build logs if not, see below).
  • Upload → /api/sessions (with X-Her-Client) shows your sessions grouped into projects.
  • A different X-Her-Client sees nothing (per-user isolation).
  • gradio_client.predict(..., api_name="/overview") returns grounded prose.
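
The BASE URL and the request headers used in the checks above can be derived with two small helpers. This is a sketch under assumptions: space_base_url and request_headers are hypothetical names, and the rule of lowercasing and replacing other characters with hyphens goes beyond the "owner/name -> owner-name.hf.space" mapping stated in the snippet, so confirm the real URL on the Space page.

```python
import re


def space_base_url(repo_id: str) -> str:
    """Turn 'owner/name' into 'https://owner-name.hf.space'.

    Assumption: the subdomain is lowercased and any character other than
    a-z, 0-9 or '-' becomes '-'.
    """
    owner, name = repo_id.split("/", 1)
    slug = re.sub(r"[^a-z0-9-]+", "-", f"{owner}-{name}".lower())
    return f"https://{slug}.hf.space"


def request_headers(hf_token: str, client_token: str) -> dict[str, str]:
    """Headers for the engine routes: HF auth plus the per-browser token."""
    return {
        "Authorization": f"Bearer {hf_token}",
        "X-Her-Client": client_token,
    }
```

To run the isolation check, call /api/sessions twice with different client_token values; the second call should list none of the first token's sessions.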

5. Troubleshooting

  • health.llama == false (model didn't load). Read the logs: api.fetch_space_logs("<owner>/her"). The default model nvidia/Nemotron-Mini-4B-Instruct uses a standard architecture and loads natively. Do not use the Mamba-hybrid Nemotron-Nano-9B-v2: its remote code needs mamba-ssm/causal-conv1d CUDA kernels that don't build on ZeroGPU. Swap models by setting the SPACE_MODEL_REPO variable (no redeploy needed; the Space restarts on its own).
  • Uploads vanish on restart / bucket empty. The bucket wasn't mounted — re-run with --factory (or restart_space(factory_reboot=True)). Confirm with api.list_bucket_tree("<owner>/her-data") after an upload.
  • ZeroGPU not requestable via API. Set it in Space → Settings → Hardware → ZeroGPU.
  • Blank UI / 503. ui/dist wasn't built/shipped — run npm run build then redeploy.
  • GPU calls return 401 or fail from the browser. They must go through @gradio/client (not raw fetch) so that the iframe auth is forwarded; this is already how the React app calls them.

6. Pinned versions & key facts

  • sdk: gradio, sdk_version: 6.16.0 (Server mode), python_version: "3.10.13", app_file: app.py, hardware zero-a10g.
  • requirements.txt: gradio==6.16.0, spaces, python-multipart, torch, transformers>=4.48.3,<5, accelerate, sentencepiece, einops, huggingface_hub.
  • Model: nvidia/Nemotron-Mini-4B-Instruct (swap via SPACE_MODEL_REPO).
  • @gradio/client (JS) pinned to match (^2.2.1).
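
Since the Space's sdk_version and the gradio pin in requirements.txt are both listed as 6.16.0, a quick pre-deploy check can catch them drifting apart. The sketch below is an illustration only: gradio_pin and pins_match are hypothetical helpers, and treating a mismatch as an error is an assumption about intent, not something scripts/deploy.py is known to enforce.

```python
import re
from typing import Optional


def gradio_pin(requirements_text: str) -> Optional[str]:
    """Return the version from a 'gradio==X.Y.Z' line, or None if not pinned."""
    pattern = re.compile(r"\s*gradio\s*==\s*([0-9][^\s#;]*)\s*(?:#.*)?")
    for line in requirements_text.splitlines():
        match = pattern.fullmatch(line)
        if match:
            return match.group(1)
    return None


def pins_match(requirements_text: str, sdk_version: str) -> bool:
    """True when requirements.txt pins gradio to exactly sdk_version."""
    return gradio_pin(requirements_text) == sdk_version


example = "gradio==6.16.0\nspaces\npython-multipart\n"
print(pins_match(example, "6.16.0"))
```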

7. Before a PUBLIC launch

  • Privacy disclosure: show the "we never store your sessions; only anonymous tool names are kept" copy in the first-run disclaimer (ui/src/components/DisclaimerModal.jsx).
  • ZeroGPU quota: public visitors draw on the owner's ZeroGPU quota (then pre-paid credits). Consider a soft cap if traffic is high.
  • Per-user isolation + 24h auto-clear are already on and public-safe.