JacobLinCool Codex committed on
Commit 3ee3ed0 · verified · 1 Parent(s): 151c180

docs: align frontend streaming architecture

Co-authored-by: Codex <noreply@openai.com>

Files changed (1)
  1. DESIGN.md +21 -15
DESIGN.md CHANGED
@@ -32,9 +32,11 @@ authoritative decision log; the rest of the doc is written to be consistent with
  **Verified corrections:**
  - **Drop SGLang.** It needs a persistent GPU process → incompatible with ZeroGPU (same root cause as vLLM). Run
  MiniCPM5 via plain `transformers` inside `@spaces.GPU` and parse its XML tool calls in our own code.
- - **gr.Server SSE generator streaming IS shipped** (the launch blog only deferred the *explanation*). On ZeroGPU the
- browser MUST call endpoints via **`@gradio/client`** (`client.predict`/`submit`): it forwards the HF iframe auth
- headers for GPU quota; a raw `fetch`/`EventSource` POST silently breaks quota.
  - **OpenAI Track has NO model requirement** ("OpenAI's own podium across all submissions") → auto-entered; a free
  lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
  - **Badges = 6 total** (Tiny Titan is a $1.5k *special award*, not a badge). Decision #3 takes us from 5/6 → 6/6.
@@ -54,8 +56,8 @@ authoritative decision log; the rest of the doc is written to be consistent with

  **Day-1 go/no-go spikes (before any feature work):**
  - Trivial `@spaces.GPU` hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
- - `gr.Server` minimal: static `index.html` + one `@app.api()` generator streaming tokens, called via `@gradio/client`,
- on the real ZeroGPU Space.
  - Nemotron `nemo_toolkit[asr]` install + one batch `transcribe()` inside `@spaces.GPU` (decision #4; else Parakeet).

  ---
@@ -143,8 +145,9 @@ With **text-first + batch ASR**, the old "streaming ASR vs ZeroGPU" Config A/B t

  - **ZeroGPU Gradio-SDK Space** (free). GPU is attached only inside `@spaces.GPU` calls (default 60s, max ~120s,
  RTX Pro 6000 Blackwell, `large`=48 GB). Per-turn inference fits this model exactly.
- - **Text-first runtime loop:** user types → `gr.Server` `@app.api()` endpoint (called via `@gradio/client`) → one
- `@spaces.GPU` call runs MiniCPM5 (tool loop, in `transformers`) → SSE-stream text tokens + drive live visuals.
  - **Voice (later bonus):** push-to-talk records an utterance → POST blob → the same `@spaces.GPU` call also runs
  Nemotron/Parakeet ASR (batch) before the brain. No persistent stream, no WebRTC, **no TURN server**.
  - **Modal (build-time only):** crawl the org + build the EmbeddingGemma index offline; the Space ships with the index
@@ -324,8 +327,9 @@ score`) into one *code* "research" action the model calls once. The degradation
  - `import spaces; @spaces.GPU(duration=…)`. GPU only inside decorated fns; **Gradio-SDK Space only** (no Docker ZeroGPU).
  - Load models at **module level**, `.to('cuda')` once (emulated until first real GPU call); real compute inside the
  decorator. torch 2.8+; **no `torch.compile`** (use AOT). Quota PRO ~40 min/day → never idle-hold the GPU.
- - **Frontend → backend via `@gradio/client`** (`client.predict`/`submit`), NOT raw fetch: it forwards the HF iframe
- auth headers ZeroGPU needs for quota. Generator `@app.api()` endpoints stream tokens over SSE.
  - All four models fit in `large` (48 GB). Keep each `@spaces.GPU` call short for queue priority.

  ---
@@ -381,8 +385,9 @@ No TTS → the **visual output is the agent's "voice"**; it must carry the delig
  TTW polish + Best Demo score). The visual world is **The Unwritten Almanac** (§2): a candlelit tree-hollow with a heavy
  open grimoire as the hero component.

- - `gradio.Server` is a FastAPI subclass keeping Gradio's queue/SSE/ZeroGPU/`gradio_client` engine while serving **your
- own frontend**. `@app.api(name=...)` fns are queued + client-callable + ZeroGPU-aware; plain `@app.post()` are not.
  ```python
  from gradio import Server
  from fastapi.responses import HTMLResponse
@@ -397,7 +402,8 @@ open grimoire as the hero component.
  async def home(): return open("index.html").read()
  app.launch()
  ```
- - Frontend calls via `@gradio/client`: `client.predict("/agent_turn", { message })` (NOT raw fetch: ZeroGPU auth).
  - **UI surfaces (the grimoire is the canvas):** streaming reply = ink writing itself (typewriter on already-streaming
  tokens); `search_projects`/overlap → **bleed** animation + page-number citations (real titles on hover);
  `find_whitespace` → **gold bloom** + sprouting leaf + a one-shaft light-mask ("the page chooses you");
@@ -441,8 +447,8 @@ lever is §11 custom-UI polish.

  ## 13. Risks / open items

- 1. **Day-1 spikes are go/no-go** (§1): ZeroGPU hello-cuda build; gr.Server `@gradio/client` SSE streaming; Nemotron
- batch in `@spaces.GPU` (else Parakeet). Do these before feature work.
  2. **EmbeddingGemma is gated**: accept Gemma terms + `HF_TOKEN` before any crawl/build.
  3. **MiniCPM5 tool-call reliability at 1B**: covered by the degradation ladder (§8); validate name+args in code.
  4. **NVIDIA Quest brand**: Parakeet is not "Nemotron"-branded; confirm eligibility with organizers or keep Nemotron
@@ -464,7 +470,7 @@ lever is §11 custom-UI polish.
  2. **`tools.py`**: research + ideation tools + the hardcoded `score_idea` rubric + the jargon alias layer, over the index.
  3. **`agent.py`**: 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via `transformers` (self-parsed XML).
  4. **`app.py`**: `gr.Server` custom frontend (idea board, project/whitespace wall, streaming text), called via
- `@gradio/client`; concept skin applied.
  5. **Well-Tuned LoRA**: small fine-tune on Modal → publish to Hub (→ 6/6 badges).
  6. **Polish + submission**: demo video + social post (Best Demo / Community Choice), publish agent trace (📑),
  write up Field Notes (📓).
 
  **Verified corrections:**
  - **Drop SGLang.** It needs a persistent GPU process → incompatible with ZeroGPU (same root cause as vLLM). Run
  MiniCPM5 via plain `transformers` inside `@spaces.GPU` and parse its XML tool calls in our own code.
+ - **gr.Server custom UI streaming IS shipped** (the launch blog only deferred the *explanation*). The deployed browser
+ UI calls our own same-origin `/api/agent-turn` NDJSON stream with `fetch`; `_engine_turn` itself is wrapped in
+ `@spaces.GPU`, so the real MiniCPM5 + LoRA path still runs on ZeroGPU. The `@app.api("/agent_turn")` generator stays
+ available for Gradio/Python clients and contract checks, but the visible app no longer depends on the CDN
+ `@gradio/client` path after real Space testing showed that browser turn could hang while the backend completed.
  - **OpenAI Track has NO model requirement** ("OpenAI's own podium across all submissions") → auto-entered; a free
  lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
  - **Badges = 6 total** (Tiny Titan is a $1.5k *special award*, not a badge). Decision #3 takes us from 5/6 → 6/6.
 

  **Day-1 go/no-go spikes (before any feature work):**
  - Trivial `@spaces.GPU` hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
+ - `gr.Server` minimal: static `index.html` + one same-origin `/api/agent-turn` NDJSON stream, plus the retained
+ `@app.api()` generator for external clients, on the real ZeroGPU Space.
  - Nemotron `nemo_toolkit[asr]` install + one batch `transcribe()` inside `@spaces.GPU` (decision #4; else Parakeet).

  ---
 

  - **ZeroGPU Gradio-SDK Space** (free). GPU is attached only inside `@spaces.GPU` calls (default 60s, max ~120s,
  RTX Pro 6000 Blackwell, `large`=48 GB). Per-turn inference fits this model exactly.
+ - **Text-first runtime loop:** user types → custom `/api/agent-turn` NDJSON endpoint → one `@spaces.GPU` call runs
+ MiniCPM5 (tool loop, in `transformers`) → streamed text tokens + live visual updates. The `@app.api()` endpoint
+ remains as the Gradio-client contract for external checks.
  - **Voice (later bonus):** push-to-talk records an utterance → POST blob → the same `@spaces.GPU` call also runs
  Nemotron/Parakeet ASR (batch) before the brain. No persistent stream, no WebRTC, **no TURN server**.
  - **Modal (build-time only):** crawl the org + build the EmbeddingGemma index offline; the Space ships with the index
 
  - `import spaces; @spaces.GPU(duration=…)`. GPU only inside decorated fns; **Gradio-SDK Space only** (no Docker ZeroGPU).
  - Load models at **module level**, `.to('cuda')` once (emulated until first real GPU call); real compute inside the
  decorator. torch 2.8+; **no `torch.compile`** (use AOT). Quota PRO ~40 min/day → never idle-hold the GPU.
+ - **Frontend → backend via same-origin `fetch("/api/agent-turn")`** reading NDJSON from our FastAPI route. The GPU
+ boundary is `_engine_turn`, decorated with `@spaces.GPU`; `@app.api()` endpoints remain available for Gradio-client
+ tests and external callers.
  - All four models fit in `large` (48 GB). Keep each `@spaces.GPU` call short for queue priority.

  ---
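
The server side of the NDJSON route described in the added lines above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this commit: `engine_turn`, `gpu_boundary`, `agent_turn_ndjson`, and the `type`/`text` event fields are hypothetical names, and the model call is replaced by a word echo. Only the `start` / `token` / `done` event names and the `@spaces.GPU` boundary come from the design doc.

```python
import json
from typing import Iterator


def gpu_boundary(fn):
    """Apply `spaces.GPU` when the `spaces` package is importable, else no-op.

    Hypothetical helper so the sketch also runs off-Space; the design doc
    simply decorates `_engine_turn` with `@spaces.GPU`.
    """
    try:
        import spaces  # present on Hugging Face ZeroGPU Spaces
    except ImportError:
        return fn
    return spaces.GPU(fn)


@gpu_boundary
def engine_turn(message: str) -> Iterator[str]:
    """Hypothetical stand-in for the model call: echoes words as tokens."""
    for word in message.split():
        yield word + " "


def agent_turn_ndjson(message: str) -> Iterator[str]:
    """Yield one JSON object per line: a start event, tokens, then done."""
    yield json.dumps({"type": "start"}) + "\n"
    for token in engine_turn(message):
        yield json.dumps({"type": "token", "text": token}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"


if __name__ == "__main__":
    for line in agent_turn_ndjson("the page chooses you"):
        print(line, end="")
```

In a FastAPI route, a generator like this could be wrapped in `StreamingResponse(..., media_type="application/x-ndjson")`; whether the actual `/api/agent-turn` handler does so is not shown in this diff.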
 
  TTW polish + Best Demo score). The visual world is **The Unwritten Almanac** (§2): a candlelit tree-hollow with a heavy
  open grimoire as the hero component.

+ - `gradio.Server` is a FastAPI subclass serving **your own frontend** while still exposing `@app.api(name=...)`
+ functions for Gradio/Python clients. The visible app uses first-party `@app.post()` endpoints for deterministic
+ browser behavior; the GPU boundary stays in the decorated engine function.
  ```python
  from gradio import Server
  from fastapi.responses import HTMLResponse
 
  async def home(): return open("index.html").read()
  app.launch()
  ```
+ - Frontend calls via `fetch("/api/agent-turn")`, parses newline-delimited JSON events, and updates the grimoire as
+ `start` / `token` / `done` messages arrive. Notes and chapter exports use `/api/field-notes` and `/api/chapter`.
  - **UI surfaces (the grimoire is the canvas):** streaming reply = ink writing itself (typewriter on already-streaming
  tokens); `search_projects`/overlap → **bleed** animation + page-number citations (real titles on hover);
  `find_whitespace` → **gold bloom** + sprouting leaf + a one-shaft light-mask ("the page chooses you");
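
The client-side parsing of `start` / `token` / `done` events mentioned above has one subtlety: a network chunk can end in the middle of a JSON line, so the reader has to buffer until a newline arrives. The browser would do this with `fetch` and a stream reader; the sketch below shows the same buffering logic in Python to match the document's existing code block. `iter_ndjson`, `render`, and the `type`/`text` field names are assumptions for illustration, not names taken from the repository.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Decode newline-delimited JSON from arbitrarily split byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line; keep the trailing partial line buffered.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    # Tolerate a final event that arrives without a trailing newline.
    if buffer.strip():
        yield json.loads(buffer)


def render(events: Iterable[dict]) -> str:
    """Concatenate token text until a done event, like the typewriter effect."""
    page = []
    for event in events:
        if event["type"] == "token":
            page.append(event["text"])
        elif event["type"] == "done":
            break
    return "".join(page)


if __name__ == "__main__":
    demo = [b'{"type": "start"}\n{"type": "to', b'ken", "text": "ink"}\n{"type": "done"}\n']
    print(render(iter_ndjson(demo)))
```

Splitting on bytes rather than decoded text keeps a multi-byte UTF-8 character from being cut in half when it straddles two chunks.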
 

  ## 13. Risks / open items

+ 1. **Day-1 spikes are go/no-go** (§1): ZeroGPU hello-cuda build; gr.Server same-origin NDJSON browser streaming;
+ Nemotron batch in `@spaces.GPU` (else Parakeet). Do these before feature work.
  2. **EmbeddingGemma is gated**: accept Gemma terms + `HF_TOKEN` before any crawl/build.
  3. **MiniCPM5 tool-call reliability at 1B**: covered by the degradation ladder (§8); validate name+args in code.
  4. **NVIDIA Quest brand**: Parakeet is not "Nemotron"-branded; confirm eligibility with organizers or keep Nemotron
 
  2. **`tools.py`**: research + ideation tools + the hardcoded `score_idea` rubric + the jargon alias layer, over the index.
  3. **`agent.py`**: 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via `transformers` (self-parsed XML).
  4. **`app.py`**: `gr.Server` custom frontend (idea board, project/whitespace wall, streaming text), called via
+ first-party `/api/...` endpoints; concept skin applied.
  5. **Well-Tuned LoRA**: small fine-tune on Modal → publish to Hub (→ 6/6 badges).
  6. **Polish + submission**: demo video + social post (Best Demo / Community Choice), publish agent trace (📑),
  write up Field Notes (📓).