Spaces:
Running on Zero
Running on Zero
docs: align frontend streaming architecture
Browse filesCo-authored-by: Codex <noreply@openai.com>
DESIGN.md
CHANGED
|
@@ -32,9 +32,11 @@ authoritative decision log; the rest of the doc is written to be consistent with
|
|
| 32 |
**Verified corrections:**
|
| 33 |
- **Drop SGLang.** It needs a persistent GPU process β incompatible with ZeroGPU (same root cause as vLLM). Run
|
| 34 |
MiniCPM5 via plain `transformers` inside `@spaces.GPU` and parse its XML tool calls in our own code.
|
| 35 |
-
- **gr.Server
|
| 36 |
-
|
| 37 |
-
|
|
|
|
|
|
|
| 38 |
- **OpenAI Track has NO model requirement** ("OpenAI's own podium across all submissions") β auto-entered; a free
|
| 39 |
lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
|
| 40 |
- **Badges = 6 total** (Tiny Titan is a $1.5k *special award*, not a badge). Decision #3 takes us from 5/6 β 6/6.
|
|
@@ -54,8 +56,8 @@ authoritative decision log; the rest of the doc is written to be consistent with
|
|
| 54 |
|
| 55 |
**Day-1 go/no-go spikes (before any feature work):**
|
| 56 |
- Trivial `@spaces.GPU` hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
|
| 57 |
-
- `gr.Server` minimal: static `index.html` + one `
|
| 58 |
-
on the real ZeroGPU Space.
|
| 59 |
- Nemotron `nemo_toolkit[asr]` install + one batch `transcribe()` inside `@spaces.GPU` (decision #4; else Parakeet).
|
| 60 |
|
| 61 |
---
|
|
@@ -143,8 +145,9 @@ With **text-first + batch ASR**, the old "streaming ASR vs ZeroGPU" Config A/B t
|
|
| 143 |
|
| 144 |
- **ZeroGPU Gradio-SDK Space** (free). GPU is attached only inside `@spaces.GPU` calls (default 60s, max ~120s,
|
| 145 |
RTX Pro 6000 Blackwell, `large`=48 GB). Per-turn inference fits this model exactly.
|
| 146 |
-
- **Text-first runtime loop:** user types β
|
| 147 |
-
|
|
|
|
| 148 |
- **Voice (later bonus):** push-to-talk records an utterance β POST blob β the same `@spaces.GPU` call also runs
|
| 149 |
Nemotron/Parakeet ASR (batch) before the brain. No persistent stream, no WebRTC, **no TURN server**.
|
| 150 |
- **Modal (build-time only):** crawl the org + build the EmbeddingGemma index offline; the Space ships with the index
|
|
@@ -324,8 +327,9 @@ score`) into one *code* "research" action the model calls once. The degradation
|
|
| 324 |
- `import spaces; @spaces.GPU(duration=β¦)`. GPU only inside decorated fns; **Gradio-SDK Space only** (no Docker ZeroGPU).
|
| 325 |
- Load models at **module level**, `.to('cuda')` once (emulated until first real GPU call); real compute inside the
|
| 326 |
decorator. torch 2.8+; **no `torch.compile`** (use AOT). Quota PRO ~40 min/day β never idle-hold the GPU.
|
| 327 |
-
- **Frontend β backend via `
|
| 328 |
-
|
|
|
|
| 329 |
- All four models fit in `large` (48 GB). Keep each `@spaces.GPU` call short for queue priority.
|
| 330 |
|
| 331 |
---
|
|
@@ -381,8 +385,9 @@ No TTS β the **visual output is the agent's "voice"**; it must carry the delig
|
|
| 381 |
TTW polish + Best Demo score). The visual world is **The Unwritten Almanac** (Β§2): a candlelit tree-hollow with a heavy
|
| 382 |
open grimoire as the hero component.
|
| 383 |
|
| 384 |
-
- `gradio.Server` is a FastAPI subclass
|
| 385 |
-
|
|
|
|
| 386 |
```python
|
| 387 |
from gradio import Server
|
| 388 |
from fastapi.responses import HTMLResponse
|
|
@@ -397,7 +402,8 @@ open grimoire as the hero component.
|
|
| 397 |
async def home(): return open("index.html").read()
|
| 398 |
app.launch()
|
| 399 |
```
|
| 400 |
-
- Frontend calls via `
|
|
|
|
| 401 |
- **UI surfaces (the grimoire is the canvas):** streaming reply = ink writing itself (typewriter on already-streaming
|
| 402 |
tokens); `search_projects`/overlap β **bleed** animation + page-number citations (real titles on hover);
|
| 403 |
`find_whitespace` β **gold bloom** + sprouting leaf + a one-shaft light-mask ("the page chooses you");
|
|
@@ -441,8 +447,8 @@ lever is Β§11 custom-UI polish.
|
|
| 441 |
|
| 442 |
## 13. Risks / open items
|
| 443 |
|
| 444 |
-
1. **Day-1 spikes are go/no-go** (Β§1): ZeroGPU hello-cuda build; gr.Server
|
| 445 |
-
batch in `@spaces.GPU` (else Parakeet). Do these before feature work.
|
| 446 |
2. **EmbeddingGemma is gated** β accept Gemma terms + `HF_TOKEN` before any crawl/build.
|
| 447 |
3. **MiniCPM5 tool-call reliability at 1B** β covered by the degradation ladder (Β§8); validate name+args in code.
|
| 448 |
4. **NVIDIA Quest brand** β Parakeet is not "Nemotron"-branded; confirm eligibility with organizers or keep Nemotron
|
|
@@ -464,7 +470,7 @@ lever is Β§11 custom-UI polish.
|
|
| 464 |
2. **`tools.py`** β research + ideation tools + the hardcoded `score_idea` rubric + the jargon alias layer, over the index.
|
| 465 |
3. **`agent.py`** β 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via `transformers` (self-parsed XML).
|
| 466 |
4. **`app.py`** β `gr.Server` custom frontend (idea board, project/whitespace wall, streaming text), called via
|
| 467 |
-
`
|
| 468 |
5. **Well-Tuned LoRA** β small fine-tune on Modal β publish to Hub (β 6/6 badges).
|
| 469 |
6. **Polish + submission** β demo video + social post (Best Demo / Community Choice), publish agent trace (π‘),
|
| 470 |
write up Field Notes (π).
|
|
|
|
| 32 |
**Verified corrections:**
|
| 33 |
- **Drop SGLang.** It needs a persistent GPU process β incompatible with ZeroGPU (same root cause as vLLM). Run
|
| 34 |
MiniCPM5 via plain `transformers` inside `@spaces.GPU` and parse its XML tool calls in our own code.
|
| 35 |
+
- **gr.Server custom UI streaming IS shipped** (the launch blog only deferred the *explanation*). The deployed browser
|
| 36 |
+
UI calls our own same-origin `/api/agent-turn` NDJSON stream with `fetch`; `_engine_turn` itself is wrapped in
|
| 37 |
+
`@spaces.GPU`, so the real MiniCPM5 + LoRA path still runs on ZeroGPU. The `@app.api("/agent_turn")` generator stays
|
| 38 |
+
available for Gradio/Python clients and contract checks, but the visible app no longer depends on the CDN
|
| 39 |
+
`@gradio/client` path after real Space testing showed that browser turn could hang while the backend completed.
|
| 40 |
- **OpenAI Track has NO model requirement** ("OpenAI's own podium across all submissions") β auto-entered; a free
|
| 41 |
lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
|
| 42 |
- **Badges = 6 total** (Tiny Titan is a $1.5k *special award*, not a badge). Decision #3 takes us from 5/6 β 6/6.
|
|
|
|
| 56 |
|
| 57 |
**Day-1 go/no-go spikes (before any feature work):**
|
| 58 |
- Trivial `@spaces.GPU` hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
|
| 59 |
+
- `gr.Server` minimal: static `index.html` + one same-origin `/api/agent-turn` NDJSON stream, plus the retained
|
| 60 |
+
`@app.api()` generator for external clients, on the real ZeroGPU Space.
|
| 61 |
- Nemotron `nemo_toolkit[asr]` install + one batch `transcribe()` inside `@spaces.GPU` (decision #4; else Parakeet).
|
| 62 |
|
| 63 |
---
|
|
|
|
| 145 |
|
| 146 |
- **ZeroGPU Gradio-SDK Space** (free). GPU is attached only inside `@spaces.GPU` calls (default 60s, max ~120s,
|
| 147 |
RTX Pro 6000 Blackwell, `large`=48 GB). Per-turn inference fits this model exactly.
|
| 148 |
+
- **Text-first runtime loop:** user types β custom `/api/agent-turn` NDJSON endpoint β one `@spaces.GPU` call runs
|
| 149 |
+
MiniCPM5 (tool loop, in `transformers`) β streamed text tokens + live visual updates. The `@app.api()` endpoint
|
| 150 |
+
remains as the Gradio-client contract for external checks.
|
| 151 |
- **Voice (later bonus):** push-to-talk records an utterance β POST blob β the same `@spaces.GPU` call also runs
|
| 152 |
Nemotron/Parakeet ASR (batch) before the brain. No persistent stream, no WebRTC, **no TURN server**.
|
| 153 |
- **Modal (build-time only):** crawl the org + build the EmbeddingGemma index offline; the Space ships with the index
|
|
|
|
| 327 |
- `import spaces; @spaces.GPU(duration=β¦)`. GPU only inside decorated fns; **Gradio-SDK Space only** (no Docker ZeroGPU).
|
| 328 |
- Load models at **module level**, `.to('cuda')` once (emulated until first real GPU call); real compute inside the
|
| 329 |
decorator. torch 2.8+; **no `torch.compile`** (use AOT). Quota PRO ~40 min/day β never idle-hold the GPU.
|
| 330 |
+
- **Frontend β backend via same-origin `fetch("/api/agent-turn")`** reading NDJSON from our FastAPI route. The GPU
|
| 331 |
+
boundary is `_engine_turn`, decorated with `@spaces.GPU`; `@app.api()` endpoints remain available for Gradio-client
|
| 332 |
+
tests and external callers.
|
| 333 |
- All four models fit in `large` (48 GB). Keep each `@spaces.GPU` call short for queue priority.
|
| 334 |
|
| 335 |
---
|
|
|
|
| 385 |
TTW polish + Best Demo score). The visual world is **The Unwritten Almanac** (Β§2): a candlelit tree-hollow with a heavy
|
| 386 |
open grimoire as the hero component.
|
| 387 |
|
| 388 |
+
- `gradio.Server` is a FastAPI subclass serving **your own frontend** while still exposing `@app.api(name=...)`
|
| 389 |
+
functions for Gradio/Python clients. The visible app uses first-party `@app.post()` endpoints for deterministic
|
| 390 |
+
browser behavior; the GPU boundary stays in the decorated engine function.
|
| 391 |
```python
|
| 392 |
from gradio import Server
|
| 393 |
from fastapi.responses import HTMLResponse
|
|
|
|
| 402 |
async def home(): return open("index.html").read()
|
| 403 |
app.launch()
|
| 404 |
```
|
| 405 |
+
- Frontend calls via `fetch("/api/agent-turn")`, parses newline-delimited JSON events, and updates the grimoire as
|
| 406 |
+
`start` / `token` / `done` messages arrive. Notes and chapter exports use `/api/field-notes` and `/api/chapter`.
|
| 407 |
- **UI surfaces (the grimoire is the canvas):** streaming reply = ink writing itself (typewriter on already-streaming
|
| 408 |
tokens); `search_projects`/overlap β **bleed** animation + page-number citations (real titles on hover);
|
| 409 |
`find_whitespace` β **gold bloom** + sprouting leaf + a one-shaft light-mask ("the page chooses you");
|
|
|
|
| 447 |
|
| 448 |
## 13. Risks / open items
|
| 449 |
|
| 450 |
+
1. **Day-1 spikes are go/no-go** (Β§1): ZeroGPU hello-cuda build; gr.Server same-origin NDJSON browser streaming;
|
| 451 |
+
Nemotron batch in `@spaces.GPU` (else Parakeet). Do these before feature work.
|
| 452 |
2. **EmbeddingGemma is gated** β accept Gemma terms + `HF_TOKEN` before any crawl/build.
|
| 453 |
3. **MiniCPM5 tool-call reliability at 1B** β covered by the degradation ladder (Β§8); validate name+args in code.
|
| 454 |
4. **NVIDIA Quest brand** β Parakeet is not "Nemotron"-branded; confirm eligibility with organizers or keep Nemotron
|
|
|
|
| 470 |
2. **`tools.py`** β research + ideation tools + the hardcoded `score_idea` rubric + the jargon alias layer, over the index.
|
| 471 |
3. **`agent.py`** β 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via `transformers` (self-parsed XML).
|
| 472 |
4. **`app.py`** β `gr.Server` custom frontend (idea board, project/whitespace wall, streaming text), called via
|
| 473 |
+
first-party `/api/...` endpoints; concept skin applied.
|
| 474 |
5. **Well-Tuned LoRA** β small fine-tune on Modal β publish to Hub (β 6/6 badges).
|
| 475 |
6. **Polish + submission** β demo video + social post (Best Demo / Community Choice), publish agent trace (π‘),
|
| 476 |
write up Field Notes (π).
|