# Build Small Hackathon Advisor: Design & Implementation Notes
> A **small-model agent** with text and voice input that investigates what other people have already built
> for the [Build Small Hackathon](https://huggingface.co/build-small-hackathon) and brainstorms an original new design
> *with you*. Output = streaming text + live visuals (no TTS). All models small, open-weight, run locally.
>
> The literal "advisor" is the **engine**; the user-facing experience is **The Unwritten Almanac**: Mothback, an
> owl-moth archivist, keeps the Wood's book of fates and divines you a still-unwritten project page (ink **bleeds +
> cites real Spaces** if you overlap, **blooms gold** if it's new). This project is itself a Build Small submission
> (hack window 2026-06-05 to 2026-06-15).
---
## 1. Locked decisions & review corrections (2026-06-07)
A multi-agent adversarial review (5 dimensions, web-verified) set the direction below. **This section is the
authoritative decision log; the rest of the doc is written to be consistent with it.**
**Locked decisions (Jacob):**
1. **Concept = The Unwritten Almanac** (chosen 2026-06-07 from a 12-concept brainstorm). Mothback the owl-moth
archivist divines a fate-page; ink **bleeds and cites the real Spaces** you overlap (page 47, page 112…), or
**blooms gold + sprouts a leaf** when it's unwritten. Engine unchanged underneath (crawl → whitespace/originality →
score). The dry "advisor" stays under the hood. Full spec + de-risking grafts in §2.
2. **Text-first with voice input.** The core workflow remains typed/editable text. Voice records or uploads a note,
transcribes it with batch ASR, and places the transcript in the same idea box. Real-time streaming + in-browser turn
detection are **deferred**.
3. **Add a Well-Tuned fine-tune**: a small LoRA (MiniCPM5 advisor persona / tool-calling), trained on Modal,
published to the Hub → 6/6 badges → strong shot at Bonus Quest Champion ($2,000).
4. **ASR = Nemotron batch.** `nvidia/nemotron-speech-streaming-en-0.6b` runs through NVIDIA NeMo in a ZeroGPU function.
Audio is normalized to mono WAV before calling `transcribe([wav])`.
**Verified corrections:**
- **Drop SGLang.** It needs a persistent GPU process, which is incompatible with ZeroGPU (same root cause as vLLM). Run
MiniCPM5 via plain `transformers` inside `@spaces.GPU` and parse its XML tool calls in our own code.
- **gr.Server custom UI streaming IS shipped** (the launch blog only deferred the *explanation*). The deployed browser
UI calls our own same-origin `/api/agent-turn` NDJSON stream with `fetch`; `_engine_turn` itself is wrapped in
`@spaces.GPU`, so the real MiniCPM5 + LoRA path still runs on ZeroGPU. The `@app.api("/agent_turn")` generator stays
available for Gradio/Python clients and contract checks, but the visible app no longer depends on the CDN
`@gradio/client` path after real Space testing showed that browser turn could hang while the backend completed.
- **OpenAI Track has NO model requirement** ("OpenAI's own podium across all submissions"): auto-entered; a free
lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
- **Badges = 6 total** (Tiny Titan is a $1.5k *special award*, not a badge). Decision #3 takes us from 5/6 → 6/6.
- **Tiny Titan** = "best ≤4B model"; our largest single model is MiniCPM5 (1.08B), total stack ~1.98B, so eligible.
**New build requirements surfaced by the review (designed into the sections below):**
- **Jargon alias layer (§7):** a 0.6B ASR mistranscribes our own vocab (Nemotron, MiniCPM5, EmbeddingGemma, ZeroGPU…).
Deterministic code-side fuzzy/alias map over our small CLOSED vocab, applied before any tool call and before display.
Surface "heard: neutron → Nemotron" as a delightful trust moment. (Active once voice is added.)
- **Tool-call degradation ladder (§8):** the 1B brain WILL emit broken tool calls (MiniCPM5-1B has a documented
"broken tool calling" report). Wrap parse in try/except, retry once at low temp, validate name+args vs JSON-Schema in
code (reject-and-repair), canned lines for empty results, a token **watchdog** that shows "trying again" instead of
dead air (the screen is the only feedback channel; there is no TTS).
- **Latency / optimistic UI (§9/§11):** ZeroGPU cold start + 1B generation = seconds of potential dead air. Optimistic
UI on submit, pre-animate the project wall, set a latency budget. (The torch.compile cold-start penalty does NOT
apply because we don't use it.)
**Day-1 go/no-go spikes (before any feature work):**
- Trivial `@spaces.GPU` hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
- `gr.Server` minimal: static `index.html` + one same-origin `/api/agent-turn` NDJSON stream, plus the retained
`@app.api()` generator for external clients, on the real ZeroGPU Space.
- Nemotron `nemo_toolkit[asr]` install + one batch `transcribe()` inside `@spaces.GPU` (decision #4).
---
## 2. Concept: The Unwritten Almanac (text-first)
The engine, regardless of skin:
1. **Investigate** the `build-small-hackathon` HF org (what Spaces exist, which models, what's saturated, and where
the **whitespace** is) using a local EmbeddingGemma index.
2. **Brainstorm** with the user: propose ideas, **score** them against a fixed rubric (originality vs. existing
projects, delight, AI-necessity, feasibility, param budget, prize-fit), and maintain an **idea board**.
3. **Respond** as streaming text + live visuals in a custom `gr.Server` frontend (no TTS; the visual is the "voice").
**The skin (chosen): The Unwritten Almanac.** **Mothback**, a dusty owl-moth archivist, keeps the Wood's *book of
fates*. Every project already built in the org is an inked page; she divines you a destined entry on a still-blank page,
the ink writing itself live.
**The two-beat wow (this IS the engine, rendered):**
- You type one line about yourself / your idea. Inked pages riffle past (each = a real crawled Space).
- **Bleed:** if your idea overlaps existing work, the ink **seeps blood-red** and cites the exact real Spaces: "the
Wood already wrote this, on page 47 and page 112" (= `get_project` overlap on the top retrieval hits). The burn is
**factual**, so it can't fall flat the way a 1B's invented joke can.
- **Bloom:** you say "write bolder"; the next entry flows **gold**, a green leaf sprouts: "this page has never been
inked" (= a `find_whitespace` gold candidate).
- A **wax seal** presses in, lighting five quadrants as the idea qualifies (= `score_idea`: Originality, Delight,
AI-Necessity, Feasibility, Prize-Fit).
**Engine → skin mapping:** `search_projects`/`get_project` overlap → the bleed + citations; `find_whitespace` → the
blank/gold pages; `score_idea` → the wax-seal quadrants; `save_idea` → the written fate-page; agent persona =
**Mothback** (Layer A system prompt + the Well-Tuned LoRA = her voice).
**Shareable artifact (Community Choice):** the page exports as a PNG that looks **torn from an ancient grimoire**:
aged parchment, a coined fate-name as title, the self-written prophecy, the five-quadrant seal, and a verdict stamp
(**"UNWRITTEN · 0 echoes"** vs **"ECHO ×3"**). Built-in caption: "Mothback inked my fate page for #BuildSmall:
UNWRITTEN." People compile draws into a "chapter" and dare friends to get a page that doesn't bleed.
**Grafted de-risking (from runner-up concepts):**
- **Tone = dry-but-benevolent** (Roastleaf's whiplash): the bleed-citation gently stings, the gold-bloom is sincerely
delighted; the burn is true-by-construction (real cited Spaces).
- **Templated structure (key risk-killer):** bank entry/roast templates (citation + dry verdict + redemptive branch);
the 1B only fills in real Space titles + the idea and **never improvises whole comedy**.
- **Latin-binomial fate-names** (e.g. "Ludus Vocalis Infantium") via templated scaffolds: built-in wit that backstops a
1B that might produce corny names.
- **"You vs the Wood" margin glyph:** a tiny cluster-dot thumbnail on the page showing your gold page among the inked
crowd; cheap SVG, visual PROOF the gap is real.
- **Thin-org mitigation (load-bearing):** precompute whitespace clusters at Modal build-time and pin several DISTINCT
blank-page candidates so "write bolder" always lands on a real, varied gap (the org may be only ~30-60 Spaces). Tune
the echo threshold toward *more frequent bleed* so the demo always has its "low" before the "wow".
**Defaults (revisit if time):** single-page artifact first (chapter compiler later); page-numbers visible, real titles
on hover (keep the burn aimed at the idea, not a named builder); seal animation = safe typewriter + static-stamp floor
first, bespoke ink-reveal last. Voice input is batch ASR that fills the same idea box before the user presses Ink.
Input is **text-first**; the experience is fully delightful with typed input alone.
AI is genuinely **load-bearing**: embeddings power the whitespace/originality analysis and the LLM drives the
investigate → ideate → score loop. The experience collapses without the models (supports Best Agent + TTW
"AI necessity").
---
## 3. Model stack (confirmed exact repo IDs)
| Role | Model | Params | Runtime | License | Prize hook |
|---|---|---|---|---|---|
| STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | NVIDIA Nemotron Quest |
| LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | OpenBMB |
| Embedder | **`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`** | ~300M | llama.cpp / llama-cpp-python | Gemma | Off the Grid · Llama Champion · Modal |
| Fine-tune | LoRA on MiniCPM5 → published to Hub | n/a | PEFT / HF Jobs | n/a | Well-Tuned |
**Total ≈ 1.98B params → ≤4B → Tiny Titan eligible.** All open-weight, all runnable locally → Off the Grid.
> Naming: "OpenCPM5 1B" = `openbmb/MiniCPM5-1B` (MiniCPM 5.0, ~May 2026). "EmbeddingGemma 270M" =
> `google/embeddinggemma-300m` (308M total; 270M = non-embedding transformer params). **SGLang dropped** (ZeroGPU
> incompatible). STT is used in **batch voice-note** mode, not a persistent stream.
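The parameter total can be checked with a few lines of arithmetic. This is a minimal sketch using the per-model figures from the table above; the dictionary keys are labels chosen for illustration, and the LoRA adapter is left out because the table lists no parameter count for it.

```python
# Parameter counts in billions, copied from the model-stack table.
PARAMS_B = {
    "nemotron-asr": 0.6,
    "minicpm5-1b": 1.08,
    "embeddinggemma": 0.3,  # the table lists ~300M
}
TINY_TITAN_LIMIT_B = 4.0  # "best <=4B model" per the decision log

total_b = round(sum(PARAMS_B.values()), 2)
largest = max(PARAMS_B, key=PARAMS_B.get)

print(f"total = {total_b}B, largest = {largest} ({PARAMS_B[largest]}B)")
print("every model within limit:", all(p <= TINY_TITAN_LIMIT_B for p in PARAMS_B.values()))
```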
---
## 4. Deployment & architecture (single path)
With **text-first + batch ASR**, the old "streaming ASR vs ZeroGPU" Config A/B tension dissolves; there is one path:
- **ZeroGPU Gradio-SDK Space** (free). GPU is attached only inside `@spaces.GPU` calls (default 60s, max ~120s,
RTX Pro 6000 Blackwell, `large`=48 GB). Per-turn inference fits this model exactly.
- **Text-first runtime loop:** user types → custom `/api/agent-turn` NDJSON endpoint → one `@spaces.GPU` call runs
MiniCPM5 (tool loop, in `transformers`) → streamed text tokens + live visual updates. The `@app.api()` endpoint
remains as the Gradio-client contract for external checks.
- **Voice input:** push-to-talk records an utterance or uploads a voice note → `/api/transcribe` normalizes audio with
ffmpeg → one `@spaces.GPU` call runs Nemotron ASR through NeMo → transcript fills the idea box. No persistent stream,
no WebRTC, **no TURN server**.
- **Modal (build-time only):** crawl the org + build the llama.cpp EmbeddingGemma vector index offline; the Space ships
with checked-in project vectors. Runtime never calls Modal, so Off the Grid holds (see §10).
> Off the Grid = no proprietary cloud inference APIs. Open weights on an HF GPU Space / local box / Modal all qualify.
**Deferred:** real-time streaming ASR and turn detection are not part of the shipped app.
---
## 5. Per-model implementation notes
### 5.1 ASR: `nvidia/nemotron-speech-streaming-en-0.6b` (batch)
- **Primary, batch usage (simple):**
```python
import nemo.collections.asr as nemo_asr
asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-speech-streaming-en-0.6b")
text = asr.transcribe(["utterance.wav"]) # 16 kHz mono WAV in; punctuated EN text out
```
Runtime install: `packages.txt` provides `ffmpeg` and `libsndfile1`; `requirements.txt` pins
`nemo_toolkit[asr]==2.7.3` plus Cython and packaging. The app records or uploads audio, normalizes it to mono
16 kHz WAV, runs NeMo in a ZeroGPU function, then returns the transcript to the idea box. Hosted NVIDIA NIM API would
break Off the Grid, so it is not used.
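The normalization step could look roughly like the sketch below. It assumes `ffmpeg` is on the PATH (as `packages.txt` provides) and uses standard ffmpeg audio flags; the function names `ffmpeg_normalize_cmd` and `normalize_audio` are hypothetical and not taken from the app.

```python
import subprocess
from pathlib import Path


def ffmpeg_normalize_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg argv that converts any input to mono 16 kHz 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",        # overwrite dst if it already exists
        "-i", src,
        "-ac", "1",            # one audio channel (mono)
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        dst,
    ]


def normalize_audio(src: str, dst: str) -> Path:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_normalize_cmd(src, dst), check=True, capture_output=True)
    return Path(dst)
```

The resulting WAV path is what would be handed to `transcribe([wav])` inside the ZeroGPU function.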
### 5.2 MiniCPM5-1B brain: `openbmb/MiniCPM5-1B` (transformers, self-parsed XML)
- Context 128K, bilingual (EN/ZH), Apache-2.0. `enable_thinking=False`, `temperature=0.7, top_p=0.95` for fast tool calls.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM5-1B")
model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM5-1B", torch_dtype="auto", device_map="auto")
inputs = tok.apply_chat_template(messages, tools=TOOLS, add_generation_prompt=True, enable_thinking=False,
tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
```
- **Tool calling:** pass JSON-Schema tools via the chat template `tools=` arg; the model emits **XML**
`<function name="get_weather">{"city":"New York"}</function>`. **Parse this ourselves** (SGLang dropped). Wrap parse
in try/except and validate against the schema; see the degradation ladder (§8).
- **Local / CPU & llama.cpp (Off the Grid · Llama Champion):** `openbmb/MiniCPM5-1B-GGUF:Q4_K_M` (688 MB) via llama.cpp
or Ollama (CPU-viable). fp16 ≈ 3-4 GB VRAM. `openbmb/MiniCPM5-1B-MLX` for Apple Silicon. (llama.cpp MiniCPM5
tool-calling is a pending PR; verify before relying on it for the badge runtime.)
- **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
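Self-parsing the XML tool call might be sketched as follows. This assumes the `<function name="...">{json}</function>` shape shown above. The `TOOL_ARGS` registry is a hypothetical stand-in for the JSON-Schema tool definitions: it checks only argument names and Python types, and requires every listed argument to be present.

```python
import json
import re

# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOL_ARGS = {
    "search_projects": {"query": str},
    "get_project": {"id": str},
    "find_whitespace": {},
}

CALL_RE = re.compile(r'<function\s+name="([^"]+)"\s*>(.*?)</function>', re.DOTALL)


def parse_tool_call(text: str):
    """Return (name, args) for the first valid tool call in text, else None."""
    m = CALL_RE.search(text)
    if m is None:
        return None  # plain reply, no tool call present
    name, raw = m.group(1), m.group(2).strip()
    if name not in TOOL_ARGS:
        return None  # unknown tool name
    try:
        args = json.loads(raw) if raw else {}
    except json.JSONDecodeError:
        return None  # malformed JSON body
    spec = TOOL_ARGS[name]
    if not isinstance(args, dict) or set(args) != set(spec):
        return None  # missing or unexpected arguments
    if any(not isinstance(args[k], t) for k, t in spec.items()):
        return None  # wrong argument type
    return name, args
```

Returning `None` for every failure mode keeps the caller simple: anything that is not a `(name, args)` pair goes down the degradation ladder.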
### 5.4 EmbeddingGemma GGUF: `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`
- Active retrieval model: `embeddinggemma-300m-qat-Q8_0.gguf`, 768-dimensional normalized embeddings.
- Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
- Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
over checked-in project vectors.
- Evidence is recorded in index metadata: model repo, GGUF filename, runtime, dimensions, build source, builder script,
llama-cpp-python version, and Modal app name.
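The local cosine search over normalized vectors reduces to a dot product and a sort. The sketch below is pure Python and assumes the index is available as `(project_id, unit_vector)` pairs; that in-memory shape is an assumption, not the schema of `data/project_index.json`.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, index, k=5):
    """Rank (project_id, unit_vector) pairs by cosine similarity to query_vec."""
    q = normalize(query_vec)
    # With unit vectors on both sides, cosine similarity is just the dot product.
    scored = [(sum(a * b for a, b in zip(q, vec)), pid) for pid, vec in index]
    scored.sort(reverse=True)
    return [(pid, score) for score, pid in scored[:k]]


# Toy 3-dimensional index; real vectors are 768-dimensional.
index = [
    ("space-a", normalize([1.0, 0.0, 0.0])),
    ("space-b", normalize([0.6, 0.8, 0.0])),
    ("space-c", normalize([0.0, 0.0, 1.0])),
]
hits = top_k([1.0, 0.1, 0.0], index, k=2)
print(hits)
```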
### 5.5 llama.cpp support (Llama Champion)
The active Llama Champion path is the retrieval model: the project index is built with EmbeddingGemma GGUF through
llama.cpp on Modal, and runtime query embeddings use the same llama.cpp path.
| Model | llama.cpp? | Runtime | Notes |
|---|---|---|---|
| `openbmb/MiniCPM5-1B` | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
| `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | ✅ active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
| ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |
The checked-in index and runtime query embedder must stay on the same GGUF file.
---
## 6. Agent context design (built for a 1B brain)
Core principle: **the 1B model is a router + arg-filler. All heavy work (crawl, summarize, score, rank, dedup) lives in
code.** Keep live context to ~800-1200 tokens of *curated* view, never raw data.
- **Layer A - System (static, ~250 tok):** identity/character; hackathon hard rules (≤32B, Gradio Space, demo video) so
it self-filters infeasible ideas; targeted prizes (biases ideation); reply style (short, one question at a time);
explicit tool-use instructions + the canonical jargon vocabulary (so it can self-correct, §7).
- **Layer B - Session state (re-rendered each turn by code, ~300 tok):** user profile; locked decisions (track, side
quests, models); **idea board** (2-3 candidates, one line + scores); compact "projects already seen" summary.
- **Layer C - Ephemeral (~300 tok):** last 2-3 turns; the most recent tool result as a **refined card** (not raw JSON).
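The three layers can be assembled by a small code-side renderer. In the sketch below, the 4-characters-per-token estimate, the field names (`id`, `title`, `score`), and the helper names are all assumptions for illustration; a real build would count tokens with the model's tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate (about 4 characters per token); not the real tokenizer."""
    return max(1, len(text) // 4)


def clip(text: str, budget: int) -> str:
    """Trim text so its estimated token count stays within budget."""
    return text if approx_tokens(text) <= budget else text[: budget * 4]


def render_state(profile: dict, board: list[dict]) -> str:
    """Layer B: re-rendered every turn from structured state, never from raw data."""
    lines = ["PROFILE: " + ", ".join(f"{k}={v}" for k, v in profile.items())]
    for idea in board[:3]:  # the board holds 2-3 candidates
        lines.append(f"IDEA {idea['id']}: {idea['title']} (score {idea['score']})")
    return "\n".join(lines)


def build_context(system: str, profile: dict, board: list[dict],
                  turns: list[str], tool_card: str) -> str:
    """Concatenate Layer A, B, and C, each clipped to its own budget."""
    layer_a = clip(system, 250)
    layer_b = clip(render_state(profile, board), 300)
    layer_c = clip("\n".join(turns[-3:] + [tool_card]), 300)
    return "\n\n".join([layer_a, layer_b, layer_c])
```

Clipping each layer separately keeps one oversized tool card from crowding out the system prompt.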
---
## 7. Agent tool design
Few tools, few params each, short descriptions (1B-friendly). Heavy logic in code.
**Jargon alias layer (input normalization).** Before any tool call and before display, run ASR/user text through a
deterministic fuzzy/alias map over our small CLOSED vocab (model names and goal names), e.g. RapidFuzz
`token_set_ratio` / double-metaphone, mapping "neutron"/"nemo tron" → Nemotron, "mini cpm" → MiniCPM5, "zero gpu" →
ZeroGPU. Surface the correction ("heard: neutron → Nemotron") as a trust-building, slightly delightful moment.
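A dependency-free sketch of the alias layer is shown below. It swaps in the standard library's `difflib` for RapidFuzz and double-metaphone so that it runs anywhere. The alias lists and the 0.85 cutoff are illustrative assumptions, and punctuation attached to a matched phrase is dropped in this simplified version.

```python
import difflib
import re

# Closed vocabulary: canonical term -> spoken or mistranscribed aliases (illustrative).
ALIASES = {
    "Nemotron": ["neutron", "nemo tron", "nemotron"],
    "MiniCPM5": ["mini cpm", "mini cpm five", "minicpm5"],
    "ZeroGPU": ["zero gpu", "zerogpu"],
    "EmbeddingGemma": ["embedding gemma", "embedding gamma"],
}
LOOKUP = {alias: canon for canon, aliases in ALIASES.items() for alias in aliases}
MAX_WORDS = max(len(alias.split()) for alias in LOOKUP)


def normalize_jargon(text: str, cutoff: float = 0.85):
    """Replace alias phrases with canonical terms; return (text, corrections)."""
    words = text.split()
    out, corrections, i = [], [], 0
    while i < len(words):
        # Try the longest phrase first so "zero gpu" wins over "zero".
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            key = re.sub(r"[^a-z0-9 ]", "", phrase.lower())
            match = difflib.get_close_matches(key, list(LOOKUP), n=1, cutoff=cutoff)
            if match:
                canon = LOOKUP[match[0]]
                if phrase != canon:
                    corrections.append(f"heard: {phrase} → {canon}")
                out.append(canon)
                i += n
                break
        else:
            out.append(words[i])  # no alias matched at this position
            i += 1
    return " ".join(out), corrections
```

The returned `corrections` list is what the UI would surface as the "heard: ..." note.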
**Research: investigate existing projects (the core value).** Data = `build-small-hackathon` org Spaces, pre-crawled
into a local snapshot + EmbeddingGemma index (keeps Off the Grid at runtime).
| Tool | Signature | Returns (refined) | Heavy work |
|---|---|---|---|
| `list_projects` | `(track?, sort?)` | top-N project cards | HF Hub API + summarize |
| `search_projects` | `(query)` | top 5 cards | EmbeddingGemma retrieval |
| `get_project` | `(id)` | card + overlap-vs-board verdict | code computes overlap |
| `find_whitespace` | `()` | under-explored niches | cluster the index, find gaps |
`find_whitespace` is the originality engine (TTW judges originality): it names where nobody has built yet.
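One simple way to approximate gap-finding is nearest-neighbour scoring: a candidate niche counts as whitespace when no existing project embedding comes close to it. The sketch below uses that simplification rather than clustering. The candidate niches, the 0.5 echo threshold, and the input shapes are assumptions for illustration.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def find_whitespace(candidates, project_vecs, echo_threshold=0.5):
    """Return (niche, nearest_similarity) for niches no project comes close to.

    candidates: {niche_name: unit_vector}; project_vecs: list of unit vectors.
    A niche is whitespace when its best cosine match is below echo_threshold.
    """
    gaps = []
    for name, vec in candidates.items():
        nearest = max((dot(vec, p) for p in project_vecs), default=0.0)
        if nearest < echo_threshold:
            gaps.append((name, nearest))
    return sorted(gaps, key=lambda g: g[1])  # most open niche first


projects = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
niches = {
    "voice games": [0.6, 0.8, 0.0],
    "bird song": [0.0, 0.0, 1.0],
    "chat bots": [1.0, 0.0, 0.0],
}
gaps = find_whitespace(niches, projects)
print(gaps)
```

Raising `echo_threshold` shrinks the set of overlapping niches and yields more gold pages; lowering it produces more frequent bleed.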
**Ideation / state.**
| Tool | Signature | Purpose |
|---|---|---|
| `save_idea` | `(title, pitch)` | add/update a candidate on the idea board |
| `score_idea` | `(id)` | fixed (hardcoded) rubric → scores + gaps; the 1B only triggers + verbalizes |
| `compare_ideas` | `()` | rank the board, articulate tradeoffs |
| `make_plan` | `(id)` | build plan + goals the current direction can support |
| `update_profile` | `(field, value)` | record skills/time/prefs → Layer B |
| `set_goals` | `(goals[])` | change selected goals → updates Layer A bias |
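A hardcoded rubric keeps scoring out of the 1B model's hands. The sketch below takes precomputed per-dimension signals rather than an idea `id`, to stay self-contained. The five dimension names follow the wax-seal quadrants in §2, while the 0-10 scale and the pass mark of 6 are hypothetical.

```python
RUBRIC = ("originality", "delight", "ai_necessity", "feasibility", "prize_fit")
PASS_MARK = 6  # hypothetical threshold on a 0-10 scale


def score_idea(signals: dict) -> dict:
    """Apply the fixed rubric to per-dimension signals (0-10) computed in code."""
    # Clamp every dimension to 0-10; a missing dimension scores 0.
    scores = {dim: max(0, min(10, int(signals.get(dim, 0)))) for dim in RUBRIC}
    gaps = [dim for dim in RUBRIC if scores[dim] < PASS_MARK]
    return {
        "scores": scores,
        "total": sum(scores.values()),
        "gaps": gaps,  # what the model is asked to verbalize
        "seal_lit": [dim for dim in RUBRIC if dim not in gaps],
    }


card = score_idea({"originality": 9, "delight": 7, "ai_necessity": 8,
                   "feasibility": 4, "prize_fit": 12})
print(card)
```

The model only triggers the tool and phrases the result; the numbers never come from generation.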
---
## 8. Agent loop (single-hop + degradation ladder)
```
on user input (text; or voice → batch ASR → text):
    normalize via jargon alias layer
    ctx = LayerA + render_state(LayerB) + last_turns + last_tool_card
    out = MiniCPM5(ctx, tools=TOOLS, enable_thinking=False, temp=0.7)   # → tool_call | reply
    try: parse XML tool call
    except / invalid name|args (vs JSON-Schema):                        # degradation ladder
        retry once (temp → 0.3, "emit ONLY one valid tool call")
        still bad → run a safe default tool (find_whitespace) so the screen never freezes
    if tool_call: card = run_tool(out); reply = MiniCPM5(ctx + card)    # single follow-up, no long ReAct
    empty/zero result → canned advisor line (never say nothing)
    stream reply tokens → custom UI | token watchdog: no token in N s → "trying again" visual (not dead air)
    update_state(LayerB)
```
**Max one tool-call then reply.** A 1B can't sustain multi-step ReAct; wrap multi-step flows (`search → get_project →
score`) into one *code* "research" action the model calls once. The degradation ladder is a **first-class UX surface**
(§11), not an error branch, because the screen is the only feedback channel (no TTS).
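The retry-then-default part of the ladder can be expressed as a small function with the model and parser injected as callables. The sketch assumes a tool call is expected on this turn. `generate` and `parse` are hypothetical hooks standing in for the MiniCPM5 call and the XML parser, and the returned rung label is there so the UI can pick the matching visual state.

```python
from typing import Callable, Optional, Tuple

ToolCall = Tuple[str, dict]
SAFE_DEFAULT: ToolCall = ("find_whitespace", {})


def resolve_tool_call(
    generate: Callable[[float, str], str],
    parse: Callable[[str], Optional[ToolCall]],
) -> Tuple[ToolCall, str]:
    """Return (tool_call, rung), where rung names the ladder step that succeeded."""
    # Rung 1: normal generation at the default temperature.
    call = parse(generate(0.7, ""))
    if call is not None:
        return call, "first_try"
    # Rung 2: one retry at lower temperature with a stricter instruction.
    call = parse(generate(0.3, "emit ONLY one valid tool call"))
    if call is not None:
        return call, "retry"
    # Rung 3: safe default so the screen never freezes.
    return SAFE_DEFAULT, "safe_default"
```

Because the rung is returned rather than logged, the frontend can show the "trying again" state on `retry` and a canned line on `safe_default`.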
---
## 9. ZeroGPU deployment notes
- `import spaces; @spaces.GPU(duration=…)`. GPU only inside decorated fns; **Gradio-SDK Space only** (no Docker ZeroGPU).
- Load models at **module level**, `.to('cuda')` once (emulated until first real GPU call); real compute inside the
decorator. torch 2.8+; **no `torch.compile`** (use AOT). Quota PRO ~40 min/day, so never idle-hold the GPU.
- **Frontend → backend via same-origin `fetch("/api/agent-turn")`** reading NDJSON from our FastAPI route. The GPU
boundary is `_engine_turn`, decorated with `@spaces.GPU`; `@app.api()` endpoints remain available for Gradio-client
tests and external callers.
- All four models fit in `large` (48 GB). Keep each `@spaces.GPU` call short for queue priority.
---
## 10. Modal: offline pipeline (build-time only, preserves Off the Grid)
Modal = build-time; runtime never calls it. This is how the app claims **both** Modal and Off the Grid. The
canonical command is:
```bash
.venv/bin/modal run scripts/modal_build_project_index.py \
--projects data/projects.json \
--out data/project_index.json
```
The remote function installs `llama-cpp-python`, downloads
`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf`, embeds every project card through
llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized
index at `2026-06-07T08:16:19+00:00`.
---
## 11. Frontend: `gr.Server` custom UI (Off-Brand)
No TTS: the **visual output is the agent's "voice"**; it must carry the delight (this is what earns Off-Brand, and the
TTW polish + Best Demo score). The visual world is **The Unwritten Almanac** (§2): a candlelit tree-hollow with a heavy
open grimoire as the hero component.
- `gradio.Server` is a FastAPI subclass serving **your own frontend** while still exposing `@app.api(name=...)`
functions for Gradio/Python clients. The visible app uses first-party `@app.post()` endpoints for deterministic
browser behavior; the GPU boundary stays in the decorated engine function.
```python
from gradio import Server
from fastapi.responses import HTMLResponse

app = Server()

@app.api(name="agent_turn", concurrency_limit=2)
async def agent_turn(message: str):
    # run_agent_stream is the app's own token generator, defined elsewhere.
    for token in run_agent_stream(message):  # generator → SSE
        yield token

@app.get("/", response_class=HTMLResponse)  # custom UI replaces Gradio's default page
async def home():
    return open("index.html").read()

app.launch()
```
- Frontend calls via `fetch("/api/agent-turn")`, parses newline-delimited JSON events, and updates the grimoire as
`start` / `token` / `done` messages arrive. Notes and chapter exports use `/api/field-notes` and `/api/chapter`.
- **UI surfaces (the grimoire is the canvas):** streaming reply = ink writing itself (typewriter on already-streaming
tokens); `search_projects`/overlap → **bleed** animation + page-number citations (real titles on hover);
`find_whitespace` → **gold bloom** + sprouting leaf + a one-shaft light-mask ("the page chooses you");
`score_idea` → **wax-seal** five-quadrant stamp; the riffling inked pages (fast page-flip of real Spaces) double as
the project-wall; export = the torn-grimoire PNG artifact (§2). Jargon-correction toasts (§7) read as Mothback's
margin notes; optimistic-UI loading + watchdog states (§8) are her "the page is choosing its words…". Cheap SFX:
page-flip, quill scratch, wax-seal thunk.
- **Build the animation floor first:** safe typewriter + static stamp ships first (graceful degradation, which the judges
credited); upgrade the ink-bleed / gold-bloom / seal-press last.
- **Fallback:** the backend (`tools.py`/`agent.py`) is UI-agnostic: if gr.Server misbehaves, fall back to
`gr.Blocks` + `gr.HTML`, losing only the $1500 Off-Brand badge, never the submission.
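The NDJSON contract between `/api/agent-turn` and the browser can be illustrated in Python, with a generator producing the lines and a reader reassembling events from unevenly split chunks. The `start` / `token` / `done` event names come from the notes above; the `type` and `text` field names are assumptions for this sketch.

```python
import json
from typing import Iterable, Iterator


def ndjson_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap a token stream as newline-delimited JSON: start, token..., done."""
    yield json.dumps({"type": "start"}) + "\n"
    text = []
    for tok in tokens:
        text.append(tok)
        yield json.dumps({"type": "token", "text": tok}) + "\n"
    yield json.dumps({"type": "done", "text": "".join(text)}) + "\n"


def read_ndjson(chunks: Iterable[str]) -> Iterator[dict]:
    """Reassemble events from arbitrarily split chunks, as a streaming client must."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)


wire = "".join(ndjson_events(["The ", "page ", "blooms."]))
chunks = [wire[i:i + 7] for i in range(0, len(wire), 7)]  # simulate uneven network chunks
events = list(read_ndjson(chunks))
print([e["type"] for e in events])
```

The browser-side `fetch` reader would follow the same buffer-and-split logic, since network chunks rarely align with line boundaries.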
---
## 12. Prize mapping
| Target | How it's earned |
|---|---|
| Thousand Token Wood | **The Unwritten Almanac** (§2): the bleed-citation wow IS the engine rendered; AI load-bearing; original |
| Tiny Titan (special, $1.5k) | total ~1.98B, every model ≤4B; largest single = MiniCPM5 1.08B |
| Off the Grid (badge) | all open weights run locally; offline index; no cloud inference at runtime |
| Well-Tuned (badge) | published LoRA fine-tune of MiniCPM5 on the Hub (§10) → **6/6 badges** |
| Off-Brand (badge + $1.5k) | `gr.Server` custom UI is the agent's output surface |
| OpenBMB ($10k) | brain = MiniCPM5-1B ("OpenBMB pick") |
| NVIDIA Quest (2× RTX 5080) | ASR = Nemotron (§5.1) |
| Llama Champion (badge) | EmbeddingGemma GGUF retrieval index and runtime query embeddings run through llama.cpp (§5.5) |
| Sharing is Caring (badge) | publish the agent's tool-call trace to the Hub |
| Field Notes (badge) | this DESIGN.md → a build blog post |
| Bonus Quest Champion ($2k) | 6/6 badges (needs the Well-Tuned fine-tune) |
| Best Agent ($1k) | real multi-tool loop: investigate → ideate → score → plan |
| Modal ($20k credits) | offline crawl+embed + LoRA training on Modal (build-time, separated from runtime) |
| Best Demo ($1k) | the mandatory demo video, made to sing (shared artifact + wow beat) |
| OpenAI ($10k) | auto-entered ("across all submissions"); free lottery ticket, not a target |
| Community Choice ($2k) | shareable tweetable artifact from the experience |
**6 badges** = Off the Grid, Well-Tuned, Off-Brand, Llama Champion, Sharing is Caring, Field Notes. Awards stack across
categories. For single-winner awards (Tiny Titan, Best Agent, Off-Brand, Best Demo), eligibility ≠ win; the shared
lever is §11 custom-UI polish.
---
## 13. Risks / open items
1. **Deployment smoke tests are mandatory:** ZeroGPU Space build, same-origin NDJSON browser streaming, and Nemotron
batch ASR in `@spaces.GPU` must be verified after every runtime dependency change.
2. **EmbeddingGemma is gated:** accept Gemma terms + `HF_TOKEN` before any crawl/build.
3. **MiniCPM5 tool-call reliability at 1B:** covered by the degradation ladder (§8); validate name+args in code.
4. **Concept skin:** **chosen: The Unwritten Almanac** (§2). Make-or-break is the bleed/bloom hero animation; build the
safe typewriter + static-stamp floor first (graceful degradation), upgrade ink last. Watch the thin-org echo
threshold + the dry-but-benevolent tone (real cited Spaces, never punch at a named builder).
5. **Param-budget claim:** document the 1.98B total in the README/Space card for Tiny Titan judging.
---
## 14. Build order
**Text-first vertical slice first; voice input is now part of the app.** Always keep a demoable artifact.
0. **Day-1 spikes** (§1): get the three go/no-go builds green.
1. **`crawler.py` + Modal index:** crawl the org, embed with EmbeddingGemma, build the local index. *You immediately
see what everyone's building and where the whitespace is.*
2. **`tools.py`:** research + ideation tools + the hardcoded `score_idea` rubric + the jargon alias layer, over the index.
3. **`agent.py`:** 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via `transformers` (self-parsed XML).
4. **`app.py`:** `gr.Server` custom frontend (idea board, project/whitespace wall, streaming text), called via
first-party `/api/...` endpoints; concept skin applied.
5. **Well-Tuned LoRA:** small fine-tune on Modal → publish to Hub (→ 6/6 badges).
6. **Voice input:** push-to-talk record and voice-note upload through Nemotron batch ASR in `/api/transcribe`.
7. **Polish + submission:** demo video + social post (Best Demo / Community Choice), publish agent trace (Sharing is
Caring), write up Field Notes.
**Deferred:** real-time streaming ASR and turn detection. The shipped path stays batch audio → transcript → editable idea.
---
## 15. Sources
**Models:** [nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) ·
[MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B) · [MiniCPM5-1B-GGUF](https://huggingface.co/openbmb/MiniCPM5-1B-GGUF) ·
[embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
**Platforms:** [ZeroGPU docs](https://huggingface.co/docs/hub/spaces-zerogpu) ·
[Introducing gradio.Server](https://huggingface.co/blog/introducing-gradio-server) · [Gradio Server Mode guide](https://www.gradio.app/guides/server-mode) ·
[Modal GPU](https://modal.com/docs/guide/gpu) · [Modal model weights](https://modal.com/docs/guide/model-weights) · [Modal pricing](https://modal.com/pricing) ·
[Build Small Hackathon](https://huggingface.co/build-small-hackathon)
*Verify-before-ship: Nemotron-in-ZeroGPU after dependency changes; MiniCPM5 license on the live card; llama.cpp MiniCPM5
tool-calling remains planned only and is not used by the deployed brain.*