Deploying SCRYPT to HuggingFace — ZeroGPU plan

Target: build-small-hackathon/scrypt. We deploy on ZeroGPU (HF Pro): the Warden runs on the Space itself — no third-party API key, no OpenRouter. The Space becomes self-hosted inference, which also answers "isn't the Warden supposed to stay local?" — it is; the Space is just someone else's localhost.

Hard constraints (verified against HF docs, 2026-06-12)

  • ZeroGPU Spaces are Gradio SDK only — no Docker Spaces. Our gradio.Server web layer survives, but it must run inside the Gradio SDK runtime (sdk: gradio frontmatter + requirements.txt), not our Dockerfile.
  • Hosting under an org requires the org to have ZeroGPU enabled (Team/Enterprise — hackathon orgs often get it granted). Personal PRO accounts can host up to 10 ZeroGPU Spaces. → Decision gate #1, check first: create the Space under build-small-hackathon and see if ZeroGPU appears in the hardware options. If not: host under your account, transfer to the org later (or keep the org Space as a CPU/API mirror).
  • GPU = RTX Pro 6000 Blackwell slice: large 48GB VRAM (1× quota) or xlarge 96GB (2× quota). BF16 30B (60GB) needs xlarge; 4-bit (18GB) fits large. The local game runs a Q4 GGUF anyway, so 4-bit is quality-parity, not a downgrade. Start 4-bit/large.
  • Model must be moved to cuda at module level (ZeroGPU emulates CUDA at startup); GPU attaches only inside @spaces.GPU(duration=...) functions. Default 60s/call; visitors burn daily quota (2 min unauthenticated / 5 min free / 40 min PRO) — our calls are ~1.5K-token prefill + short generations on a 3.5B-active MoE, i.e. seconds per call. Plenty.
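The sizing and quota figures in the list above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a measurement. It counts raw weights only, so 4-bit comes out at 15GB; reading the 18GB quoted above as weights plus quantization overhead is an assumption. The 5-second GPU time per call is also an assumed placeholder for "seconds per call", and treating "2× quota" as a doubled burn rate is our interpretation.

```python
# Back-of-envelope sizing for the ZeroGPU tiers and daily quotas quoted above.
# Weights only: KV cache, activations and quantization overhead are NOT included.

TIERS_GB = {"large": 48, "xlarge": 96}       # VRAM per tier, from the list above
QUOTA_COST = {"large": 1, "xlarge": 2}       # assumed: "2x quota" = burns twice as fast
DAILY_QUOTA_S = {"unauthenticated": 2 * 60, "free": 5 * 60, "pro": 40 * 60}


def weights_gb(n_params: float, bits_per_param: float) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def pick_tier(required_gb: float) -> str:
    """Return the smallest tier whose VRAM covers required_gb."""
    for name, vram in sorted(TIERS_GB.items(), key=lambda kv: kv[1]):
        if required_gb <= vram:
            return name
    raise ValueError(f"{required_gb} GB does not fit any ZeroGPU tier")


def calls_per_day(quota_s: int, gpu_seconds_per_call: float, tier: str = "large") -> int:
    """How many calls fit in a daily quota, given an ASSUMED GPU time per call."""
    return int(quota_s // (gpu_seconds_per_call * QUOTA_COST[tier]))


if __name__ == "__main__":
    print("BF16 weights:", weights_gb(30e9, 16), "GB ->", pick_tier(weights_gb(30e9, 16)))
    print("4-bit weights:", weights_gb(30e9, 4), "GB ->", pick_tier(18))
    for visitor, quota in DAILY_QUOTA_S.items():
        print(visitor, calls_per_day(quota, 5.0), "calls/day at an assumed 5 s each")
```

Under these assumptions even an unauthenticated visitor gets a couple of dozen Warden calls per day on the large tier, which is the basis for starting at 4-bit/large.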

Architecture changes (space/ rework)

  1. app.py stays a Gradio-hosted FastAPI: build a minimal gr.Blocks (the "engine room" — can be a hidden status page), launch it, and attach our routes (/ CRT page, /api/whisper, /play, WS /pty) to gradio's FastAPI app. Spike risk: validate custom routes + websocket survive the Space proxy in a throwaway ZeroGPU Space before porting everything.
  2. Inference: per-visitor game PTYs are subprocesses and cannot call @spaces.GPU themselves. The main process exposes an internal OpenAI-style endpoint (POST /v1/internal/generate) whose handler is the @spaces.GPU generator (transformers + TextIteratorStreamer, bitsandbytes 4-bit, trust_remote_code). Game subprocesses run SCRYPT_BACKEND=api pointed at http://127.0.0.1:7860/... — the existing api backend, new base URL. No game-code changes.
  3. Model source: until the finetune ships, load nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 quantized 4-bit at startup. After the merge→export→eval gate passes, upload the merged Warden to build-small-hackathon/warden-nemotron-30b and point the Space there — the Space then runs the finetuned Warden.
  4. Keep the root Dockerfile (it no longer drives the Space, but stays the local/self-host path); frontmatter flips sdk: docker → sdk: gradio + app_file: space/app.py + python_version: 3.12 when we cut over.
  5. SCRYPT_API_KEY is no longer needed on the Space. Scripted-Warden fallback stays as the safety net (quota exhausted / GPU queue too long).
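Items 2 and 5 together describe a subprocess that calls the main process over loopback and drops to the scripted Warden when the GPU path is unavailable. The sketch below illustrates that shape using only the standard library. It is not the game's existing api backend. The JSON schema (`prompt`, `max_tokens`, `text`), the non-streaming response, and the `scripted_warden` helper are all hypothetical stand-ins; only the base URL and the endpoint path come from the plan.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:7860"         # main process, per item 2
GENERATE_PATH = "/v1/internal/generate"    # endpoint named in item 2


def scripted_warden(prompt: str) -> str:
    """Hypothetical stand-in for the scripted-Warden fallback (item 5)."""
    return "[scripted] The Warden does not answer. Try again."


def build_request(prompt: str, base_url: str = BASE_URL,
                  max_tokens: int = 128) -> urllib.request.Request:
    """Build the POST request; the payload schema is an assumption."""
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url + GENERATE_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_warden(prompt: str, base_url: str = BASE_URL, timeout: float = 30.0) -> str:
    """Ask the on-Space model; fall back to the scripted Warden on any failure."""
    try:
        request = build_request(prompt, base_url)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        return body["text"]                # assumed response field
    except (OSError, ValueError, KeyError, TypeError):
        # OSError covers connection errors, timeouts and HTTP error statuses
        # (quota exhausted, queue too long); the rest cover malformed bodies.
        return scripted_warden(prompt)
```

Catching errors at the call site means a visitor who has run out of quota still gets a playable, if scripted, session instead of a stalled terminal.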

Your steps (interactive auth — run with ! prefix)

# once: fix the root-owned HF cache that crashes the hf CLI, then login
sudo chown -R imjonezz:imjonezz ~/.cache/huggingface
hf auth login

# decision gate #1: try creating under the org with Gradio SDK
hf repo create build-small-hackathon/scrypt --repo-type space --space-sdk gradio
# then in Space settings → check whether ZeroGPU is offered under Hardware.
# If not offered: hf repo create scrypt --repo-type space --space-sdk gradio (personal)

git push space main   # remote already wired; update URL first if personal

No secrets to set in the ZeroGPU plan. First build ~5 min (model download ~17GB happens at first startup). Smoke test: / CRT page, /api/whisper, full run via /play, and watch quota burn in the Space's ZeroGPU panel.
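The HTTP part of that smoke test can be scripted. The sketch below is a minimal probe under stated assumptions: it sends plain GET requests, so a POST-only route such as /api/whisper may answer 405, which still shows the route got through the Space proxy. It does not cover WS /pty, which needs a websocket client. The function names are illustrative.

```python
import urllib.error
import urllib.request

SMOKE_PATHS = ["/", "/api/whisper", "/play"]   # HTTP routes named in this plan


def smoke_urls(base_url: str) -> list:
    """Full URLs to probe for a given Space base URL."""
    return [base_url.rstrip("/") + path for path in SMOKE_PATHS]


def probe(url: str, timeout: float = 10.0):
    """Return the HTTP status of a GET, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code    # route answered, just not with 2xx
    except OSError:
        return None        # DNS failure, refused connection, or timeout


def run_smoke(base_url: str, timeout: float = 10.0) -> dict:
    """Probe every smoke path and map URL -> status (or None)."""
    return {url: probe(url, timeout) for url in smoke_urls(base_url)}
```

Usage would be `run_smoke("https://build-small-hackathon-scrypt.hf.space")`, assuming the Space is served at the usual owner-name.hf.space address; substitute the personal-account URL if decision gate #1 fails.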