Spaces:
Running on Zero
Running on Zero
| # Deploying SCRYPT to HuggingFace — ZeroGPU plan | |
| Target: `build-small-hackathon/scrypt`. We deploy on **ZeroGPU** (HF Pro): | |
| the Warden runs *on the Space itself* — no third-party API key, no OpenRouter. | |
| The Space becomes self-hosted inference, which also answers "isn't the Warden | |
| supposed to stay local?" — it is; the Space is just someone else's localhost. | |
| ## Hard constraints (verified against HF docs, 2026-06-12) | |
| - ZeroGPU Spaces are **Gradio SDK only** — no Docker Spaces. Our gradio.Server | |
| web layer survives, but it must run inside the Gradio SDK runtime | |
| (`sdk: gradio` frontmatter + `requirements.txt`), not our Dockerfile. | |
| - **Hosting under an org requires the org to have ZeroGPU enabled** | |
| (Team/Enterprise — hackathon orgs often get it granted). Personal PRO | |
| accounts can host up to 10 ZeroGPU Spaces. → **Decision gate #1, check | |
| first:** create the Space under `build-small-hackathon` and see if ZeroGPU | |
| appears in the hardware options. If not: host under your account, transfer | |
| to the org later (or keep the org Space as a CPU/API mirror). | |
| - GPU = RTX Pro 6000 Blackwell slice: `large` 48GB VRAM (1× quota) or | |
| `xlarge` 96GB (2× quota). BF16 30B (~60GB) needs `xlarge`; **4-bit (~18GB) | |
| fits `large`** — and the local game runs a Q4 GGUF anyway, so 4-bit is | |
| quality-parity, not a downgrade. Start 4-bit/`large`. | |
| - Model must be moved to `cuda` at module level (ZeroGPU emulates CUDA at | |
| startup); GPU attaches only inside `@spaces.GPU(duration=...)` functions. | |
| Default 60s/call; visitors burn daily quota (2 min unauthenticated / | |
| 5 min free / 40 min PRO) — our calls are ~1.5K-token prefill + short | |
| generations on a 3.5B-active MoE, i.e. seconds per call. Plenty. | |
| ## Architecture changes (space/ rework) | |
| 1. `app.py` stays a Gradio-hosted FastAPI: build a minimal `gr.Blocks` (the | |
| "engine room" — can be a hidden status page), launch it, and attach our | |
| routes (`/` CRT page, `/api/whisper`, `/play`, `WS /pty`) to gradio's | |
| FastAPI app. **Spike risk:** validate custom routes + websocket survive the | |
| Space proxy in a throwaway ZeroGPU Space before porting everything. | |
| 2. Inference: per-visitor game PTYs are subprocesses and **cannot** call | |
| `@spaces.GPU` themselves. The main process exposes an internal | |
| OpenAI-style endpoint (`POST /v1/internal/generate`) whose handler is the | |
| `@spaces.GPU` generator (transformers + `TextIteratorStreamer`, | |
| bitsandbytes 4-bit, `trust_remote_code`). Game subprocesses run | |
| `SCRYPT_BACKEND=api` pointed at `http://127.0.0.1:7860/...` — the existing | |
| api backend, new base URL. No game-code changes. | |
| 3. Model source: until the finetune ships, load | |
| `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` quantized 4-bit at startup. | |
| After the merge→export→eval gate passes, upload the merged Warden to | |
| `build-small-hackathon/warden-nemotron-30b` and point the Space there — | |
| the Space then runs the *finetuned* Warden. | |
| 4. Keep the root `Dockerfile` (it no longer drives the Space, but stays the | |
| local/self-host path); frontmatter flips `sdk: docker` → `sdk: gradio` + | |
| `app_file: space/app.py` + `python_version: 3.12` when we cut over. | |
| 5. `SCRYPT_API_KEY` is **no longer needed** on the Space. Scripted-Warden | |
| fallback stays as the safety net (quota exhausted / GPU queue too long). | |
| ## Your steps (interactive auth — run with `!` prefix) | |
| ```bash | |
| # once: fix the root-owned HF cache that crashes the hf CLI, then login | |
| sudo chown -R imjonezz:imjonezz ~/.cache/huggingface | |
| hf auth login | |
| # decision gate #1: try creating under the org with Gradio SDK | |
| hf repo create build-small-hackathon/scrypt --repo-type space --space-sdk gradio | |
| # then in Space settings → check whether ZeroGPU is offered under Hardware. | |
| # If not offered: hf repo create scrypt --repo-type space --space-sdk gradio (personal) | |
| git push space main # remote already wired; update URL first if personal | |
| ``` | |
| No secrets to set in the ZeroGPU plan. First build ~5 min (model download | |
| ~17GB happens at first startup). Smoke test: `/` CRT page, `/api/whisper`, | |
| full run via `/play`, and watch quota burn in the Space's ZeroGPU panel. | |