IMJONEZZ

DEPLOY: switch plan to ZeroGPU (Gradio SDK, on-Space inference, no API key)

11143b6 18 days ago


	# Deploying SCRYPT to Hugging Face: ZeroGPU plan

	Target: `build-small-hackathon/scrypt`. We deploy on ZeroGPU (HF PRO):
	the Warden runs on the Space itself, with no third-party API key and no
	OpenRouter. The Space does its own inference, which also answers "isn't the
	Warden supposed to stay local?": it is; the Space is just someone else's localhost.

	## Hard constraints (verified against HF docs, 2026-06-12)

	- ZeroGPU Spaces are Gradio SDK only — no Docker Spaces. Our gradio.Server
	web layer survives, but it must run inside the Gradio SDK runtime
	(`sdk: gradio` frontmatter + `requirements.txt`), not our Dockerfile.
	- Hosting under an org requires the org to have ZeroGPU enabled
	(Team/Enterprise — hackathon orgs often get it granted). Personal PRO
	accounts can host up to 10 ZeroGPU Spaces. → **Decision gate #1, check
	first:** create the Space under `build-small-hackathon` and see if ZeroGPU
	appears in the hardware options. If not: host under your account, transfer
	to the org later (or keep the org Space as a CPU/API mirror).
	- GPU = RTX Pro 6000 Blackwell slice: `large` 48GB VRAM (1× quota) or
	`xlarge` 96GB (2× quota). BF16 30B (~60GB) needs `xlarge`; **4-bit (~18GB)
	fits `large`** — and the local game runs a Q4 GGUF anyway, so 4-bit is
	quality-parity, not a downgrade. Start 4-bit/`large`.
	- Model must be moved to `cuda` at module level (ZeroGPU emulates CUDA at
	startup); the GPU attaches only inside `@spaces.GPU(duration=...)` functions.
	The default duration is 60s per call; visitors burn daily quota (2 min
	unauthenticated / 5 min free / 40 min PRO). Our calls are a ~1.5K-token
	prefill plus short generations on a 3.5B-active MoE, i.e. seconds per call. Plenty.
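
The sizing and quota arithmetic in the constraints above can be checked with a few lines of Python. This is a rough sketch, not a measurement: the tier sizes and quota minutes are the figures quoted above, while the 20% headroom factor and the 5 seconds per call used in the usage note are assumptions chosen for illustration.

```python
from typing import Optional

# ZeroGPU slice sizes quoted in the constraints above (GB of VRAM).
TIERS_GB = {"large": 48, "xlarge": 96}


def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Raw weight footprint in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return params_billion * bits_per_param / 8


def smallest_tier(required_gb: float, headroom: float = 1.2) -> Optional[str]:
    """Smallest tier that fits required_gb with a safety margin.

    The 1.2 headroom factor is an assumption, not an HF figure.
    """
    for name, capacity in sorted(TIERS_GB.items(), key=lambda item: item[1]):
        if required_gb * headroom <= capacity:
            return name
    return None


def calls_per_day(quota_minutes: float, seconds_per_call: float) -> int:
    """How many GPU calls a visitor's daily quota covers at a given call length."""
    return int(quota_minutes * 60 // seconds_per_call)


bf16_gb = weight_gb(30, 16)  # 60.0, matches the ~60GB BF16 figure
q4_gb = weight_gb(30, 4)     # 15.0 raw; the plan budgets ~18GB with overhead
```

With these helpers, `smallest_tier(60)` returns `"xlarge"` and `smallest_tier(18)` returns `"large"`, consistent with the tier choice above. At a hypothetical 5 seconds per call, `calls_per_day(2, 5)` gives 24 calls for an unauthenticated visitor.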

	## Architecture changes (space/ rework)

	1. `app.py` stays a Gradio-hosted FastAPI app: build a minimal `gr.Blocks` (the
	"engine room" — can be a hidden status page), launch it, and attach our
	routes (`/` CRT page, `/api/whisper`, `/play`, `WS /pty`) to gradio's
	FastAPI app. Spike risk: validate custom routes + websocket survive the
	Space proxy in a throwaway ZeroGPU Space before porting everything.
	2. Inference: per-visitor game PTYs are subprocesses and cannot call
	`@spaces.GPU` themselves. The main process exposes an internal
	OpenAI-style endpoint (`POST /v1/internal/generate`) whose handler is the
	`@spaces.GPU` generator (transformers + `TextIteratorStreamer`,
	bitsandbytes 4-bit, `trust_remote_code`). Game subprocesses run
	`SCRYPT_BACKEND=api` pointed at `http://127.0.0.1:7860/...` — the existing
	api backend, new base URL. No game-code changes.
	3. Model source: until the finetune ships, load
	`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` quantized 4-bit at startup.
	After the merge→export→eval gate passes, upload the merged Warden to
	`build-small-hackathon/warden-nemotron-30b` and point the Space there —
	the Space then runs the finetuned Warden.
	4. Keep the root `Dockerfile` (it no longer drives the Space, but stays the
	local/self-host path); frontmatter flips `sdk: docker` → `sdk: gradio` +
	`app_file: space/app.py` + `python_version: 3.12` when we cut over.
	5. `SCRYPT_API_KEY` is no longer needed on the Space. Scripted-Warden
	fallback stays as the safety net (quota exhausted / GPU queue too long).
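
Items 2 and 5 together imply a client-side pattern: call the internal endpoint, and drop to the scripted Warden when the call fails. A minimal sketch of that pattern follows, using only the standard library. Only `SCRYPT_BACKEND`, the `/v1/internal/generate` path, and port 7860 come from the plan; the `SCRYPT_API_BASE` variable, the `{"prompt": ...}` payload, the `"text"` response key, and the fallback text are hypothetical. The sketch is also non-streaming, unlike the `TextIteratorStreamer` handler described above.

```python
import http.client
import json
import os
import urllib.request

# Assumption: SCRYPT_API_BASE is a hypothetical variable name for the base URL.
DEFAULT_BASE = os.environ.get("SCRYPT_API_BASE", "http://127.0.0.1:7860")
ENDPOINT = "/v1/internal/generate"  # path taken from the plan above


def scripted_warden(prompt: str) -> str:
    """Stand-in for the scripted-Warden fallback (placeholder text only)."""
    return "[scripted] The Warden does not answer."


def ask_warden(prompt: str, base_url: str = DEFAULT_BASE, timeout: float = 30.0) -> str:
    """POST the prompt to the internal endpoint; use the script on any failure."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")  # assumed payload shape
    request = urllib.request.Request(
        base_url + ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
            return payload["text"]  # assumed response key
    except (OSError, http.client.HTTPException, ValueError, KeyError, TypeError):
        # URLError, HTTPError, and socket timeouts are OSError subclasses;
        # malformed JSON raises ValueError; an unexpected body shape raises
        # KeyError or TypeError.
        return scripted_warden(prompt)
```

Catching broadly here is deliberate: from the game's point of view, quota exhaustion, a long GPU queue, and a crashed handler all look the same, and each should degrade to the scripted path rather than end the session.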

	## Your steps (interactive auth — run with `!` prefix)

	```bash
	# once: fix the root-owned HF cache that crashes the hf CLI, then login
	sudo chown -R imjonezz:imjonezz ~/.cache/huggingface
	hf auth login

	# decision gate #1: try creating under the org with Gradio SDK
	hf repo create build-small-hackathon/scrypt --repo-type space --space-sdk gradio
	# then in Space settings → check whether ZeroGPU is offered under Hardware.
	# If not offered: hf repo create scrypt --repo-type space --space-sdk gradio (personal)

	git push space main # remote already wired; update URL first if personal
	```

	No secrets to set in the ZeroGPU plan. The first build takes ~5 min; the
	~17GB model download happens at first startup. Smoke test: `/` CRT page,
	`/api/whisper`, full run via `/play`, and watch quota burn in the Space's
	ZeroGPU panel.
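
The HTTP part of the smoke test could be scripted roughly as follows. This sketch only issues GET requests and records status codes. The plan does not say which method `/api/whisper` expects, so that route is left for the caller to add, and the `WS /pty` websocket is not covered.

```python
import urllib.error
import urllib.request


def smoke_test(base_url: str, paths: list, timeout: float = 10.0) -> dict:
    """GET each path; record the HTTP status code, or an error string if unreachable."""
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                results[path] = response.status
        except urllib.error.HTTPError as err:
            # The server answered, but with a 4xx/5xx status.
            results[path] = err.code
        except OSError as err:
            # DNS failure, refused connection, timeout, and similar.
            results[path] = "unreachable: %s" % err
    return results
```

A call such as `smoke_test("https://<space-host>", ["/", "/play"])` returns a mapping like `{"/": 200, ...}`; the host is a placeholder for whichever URL the Space ends up with (org or personal).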