hackathon-advisor / DESIGN.md
JacobLinCool's picture
fix: stabilize llama embedding runtime
ca766b5 verified
# Build Small Hackathon Advisor โ€” Design & Implementation Notes
> A **small-model agent** with text and voice input that investigates what other people have already built
> for the [Build Small Hackathon](https://huggingface.co/build-small-hackathon) and brainstorms an original new design
> *with you*. Output = streaming text + live visuals (no TTS). All models small, open-weight, run locally.
>
> The literal "advisor" is the **engine**; the user-facing experience is **The Unwritten Almanac** โ€” Mothback, an
> owl-moth archivist, keeps the Wood's book of fates and divines you a still-unwritten project page (ink **bleeds +
> cites real Spaces** if you overlap, **blooms gold** if it's new). This project is itself a Build Small submission
> (hack window 2026-06-05 โ†’ 2026-06-15).
---
## 1. Locked decisions & review corrections (2026-06-07)
A multi-agent adversarial review (5 dimensions, web-verified) set the direction below. **This section is the
authoritative decision log; the rest of the doc is written to be consistent with it.**
**Locked decisions (Jacob):**
1. **Concept = The Unwritten Almanac** (chosen 2026-06-07 from a 12-concept brainstorm). Mothback the owl-moth
archivist divines a fate-page; ink **bleeds and cites the real Spaces** you overlap (page 47, page 112โ€ฆ), or
**blooms gold + sprouts a leaf** when it's unwritten. Engine unchanged underneath (crawl โ†’ whitespace/originality โ†’
score). The dry "advisor" stays under the hood. Full spec + de-risking grafts in ยง2.
2. **Text-first with voice input.** The core workflow remains typed/editable text. Voice records or uploads a note,
transcribes it with batch ASR, and places the transcript in the same idea box. Real-time streaming + in-browser turn
detection are **deferred**.
3. **Add a ๐ŸŽฏ Well-Tuned fine-tune** โ€” a small LoRA (MiniCPM5 advisor persona / tool-calling), trained on Modal,
published to the Hub โ†’ 6/6 badges โ†’ strong shot at ๐ŸŽ–๏ธ Bonus Quest Champion ($2,000).
4. **ASR = Nemotron batch.** `nvidia/nemotron-speech-streaming-en-0.6b` runs through NVIDIA NeMo in a ZeroGPU function.
Audio is normalized to mono WAV before calling `transcribe([wav])`.
**Verified corrections:**
- **Drop SGLang.** It needs a persistent GPU process โ†’ incompatible with ZeroGPU (same root cause as vLLM). Run
MiniCPM5 via plain `transformers` inside `@spaces.GPU` and parse its XML tool calls in our own code.
- **gr.Server custom UI streaming IS shipped** (the launch blog only deferred the *explanation*). The deployed browser
UI calls our own same-origin `/api/agent-turn` NDJSON stream with `fetch`; `_engine_turn` itself is wrapped in
`@spaces.GPU`, so the real MiniCPM5 + LoRA path still runs on ZeroGPU. The `@app.api("/agent_turn")` generator stays
available for Gradio/Python clients and contract checks, but the visible app no longer depends on the CDN
`@gradio/client` path after real Space testing showed that browser turn could hang while the backend completed.
- **OpenAI Track has NO model requirement** ("OpenAI's own podium across all submissions") โ†’ auto-entered; a free
lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
- **Badges = 6 total** (Tiny Titan is a $1.5k *special award*, not a badge). Decision #3 takes us from 5/6 โ†’ 6/6.
- **Tiny Titan** = "best โ‰ค4B model"; our largest single model is MiniCPM5 (1.08B), total stack ~1.9B โ†’ eligible.
**New build requirements surfaced by the review (designed into the sections below):**
- **Jargon alias layer (ยง7):** a 0.6B ASR mistranscribes our own vocab (Nemotron, MiniCPM5, EmbeddingGemma, ZeroGPUโ€ฆ).
Deterministic code-side fuzzy/alias map over our small CLOSED vocab, applied before any tool call and before display.
Surface "heard: neutron โ†’ Nemotron" as a delightful trust moment. (Active once voice is added.)
- **Tool-call degradation ladder (ยง8):** the 1B brain WILL emit broken tool calls (MiniCPM5-1B has a documented
"broken tool calling" report). Wrap parse in try/except, retry once at low temp, validate name+args vs JSON-Schema in
code (reject-and-repair), canned lines for empty results, a token **watchdog** that shows "trying again" instead of
dead air (the screen is the only feedback channel โ€” no TTS).
- **Latency / optimistic UI (ยง9/ยง11):** ZeroGPU cold start + 1B generation = seconds of potential dead air. Optimistic
UI on submit, pre-animate the project wall, set a latency budget. (The torch.compile cold-start penalty does NOT
apply โ€” we don't use it.)
**Day-1 go/no-go spikes (before any feature work):**
- Trivial `@spaces.GPU` hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
- `gr.Server` minimal: static `index.html` + one same-origin `/api/agent-turn` NDJSON stream, plus the retained
`@app.api()` generator for external clients, on the real ZeroGPU Space.
- Nemotron `nemo_toolkit[asr]` install + one batch `transcribe()` inside `@spaces.GPU` (decision #4).
---
## 2. Concept โ€” The Unwritten Almanac (text-first)
The engine, regardless of skin:
1. **Investigate** the `build-small-hackathon` HF org โ€” what Spaces exist, which models, what's saturated, and where
the **whitespace** is โ€” using a local EmbeddingGemma index.
2. **Brainstorm** with the user: propose ideas, **score** them against a fixed rubric (originality vs. existing
projects, delight, AI-necessity, feasibility, param budget, prize-fit), and maintain an **idea board**.
3. **Respond** as streaming text + live visuals in a custom `gr.Server` frontend (no TTS โ€” the visual is the "voice").
**The skin (chosen): The Unwritten Almanac.** **Mothback**, a dusty owl-moth archivist, keeps the Wood's *book of
fates*. Every project already built in the org is an inked page; she divines you a destined entry on a still-blank page,
the ink writing itself live.
**The two-beat wow (this IS the engine, rendered):**
- You type one line about yourself / your idea. Inked pages riffle past (each = a real crawled Space).
- **Bleed:** if your idea overlaps existing work, the ink **seeps blood-red** and cites the exact real Spaces โ€” "the
Wood already wrote this, on page 47 and page 112" (= `get_project` overlap on the top retrieval hits). The burn is
**factual**, so it can't fall flat the way a 1B's invented joke can.
- **Bloom:** you say "write bolder"; the next entry flows **gold**, a green leaf sprouts โ€” "this page has never been
inked" (= a `find_whitespace` gold candidate).
- A **wax seal** presses in, lighting five quadrants as the idea qualifies (= `score_idea`: Originality, Delight,
AI-Necessity, Feasibility, Prize-Fit).
**Engine โ†” skin mapping:** `search_projects`/`get_project` overlap โ†’ the bleed + citations; `find_whitespace` โ†’ the
blank/gold pages; `score_idea` โ†’ the wax-seal quadrants; `save_idea` โ†’ the written fate-page; agent persona =
**Mothback** (Layer A system prompt + the ๐ŸŽฏ Well-Tuned LoRA = her voice).
**Shareable artifact (Community Choice):** the page exports as a PNG that looks **torn from an ancient grimoire** โ€”
aged parchment, a coined fate-name as title, the self-written prophecy, the five-quadrant seal, and a verdict stamp
(**"UNWRITTEN ยท 0 echoes"** vs **"ECHO ร—3"**). Built-in caption: "Mothback inked my fate page for #BuildSmall โ€”
UNWRITTEN." People compile draws into a "chapter" and dare friends to get a page that doesn't bleed.
**Grafted de-risking (from runner-up concepts):**
- **Tone = dry-but-benevolent** (Roastleaf's whiplash): the bleed-citation gently stings, the gold-bloom is sincerely
delighted; the burn is true-by-construction (real cited Spaces).
- **Templated structure (key risk-killer):** bank entry/roast templates (citation + dry verdict + redemptive branch);
the 1B only fills in real Space titles + the idea โ€” **never improvises whole comedy**.
- **Latin-binomial fate-names** (e.g. "Ludus Vocalis Infantium") via templated scaffolds โ€” built-in wit, backstops a
1B that might produce corny names.
- **"You vs the Wood" margin glyph:** a tiny cluster-dot thumbnail on the page showing your gold page among the inked
crowd โ€” cheap SVG, visual PROOF the gap is real.
- **Thin-org mitigation (load-bearing):** precompute whitespace clusters at Modal build-time and pin several DISTINCT
blank-page candidates so "write bolder" always lands on a real, varied gap (the org may be only ~30โ€“60 Spaces). Tune
the echo threshold toward *more frequent bleed* so the demo always has its "low" before the "wow".
**Defaults (revisit if time):** single-page artifact first (chapter compiler later); page-numbers visible, real titles
on hover (keep the burn aimed at the idea, not a named builder); seal animation = safe typewriter + static-stamp floor
first, bespoke ink-reveal last. Voice input is batch ASR that fills the same idea box before the user presses Ink.
Input is **text-first**; the experience is fully delightful with typed input alone.
AI is genuinely **load-bearing**: embeddings power the whitespace/originality analysis and the LLM drives the
investigate โ†’ ideate โ†’ score loop โ€” the experience collapses without the models (supports ๐Ÿค– Best Agent + TTW
"AI necessity").
---
## 3. Model stack (confirmed exact repo IDs)
| Role | Model | Params | Runtime | License | Prize hook |
|---|---|---|---|---|---|
| STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | ๐ŸŸฉ NVIDIA Nemotron Quest |
| LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | ๐Ÿฎ OpenBMB |
| Embedder | **`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`** | ~300M | llama.cpp / llama-cpp-python | Gemma | ๐Ÿ”Œ Off the Grid ยท ๐Ÿฆ™ Llama Champion ยท ๐ŸŸข Modal |
| Fine-tune | LoRA on MiniCPM5 โ†’ published to Hub | โ€” | PEFT / HF Jobs | โ€” | ๐ŸŽฏ Well-Tuned |
**Total โ‰ˆ 1.98B params โ†’ โ‰ค4B โ†’ ๐Ÿœ Tiny Titan eligible.** All open-weight, all runnable locally โ†’ ๐Ÿ”Œ Off the Grid.
> Naming: "OpenCPM5 1B" = `openbmb/MiniCPM5-1B` (MiniCPM 5.0, ~May 2026). "EmbeddingGemma 270M" =
> `google/embeddinggemma-300m` (308M total; 270M = non-embedding transformer params). **SGLang dropped** (ZeroGPU
> incompatible). STT is used in **batch voice-note** mode, not a persistent stream.
---
## 4. Deployment & architecture (single path)
With **text-first + batch ASR**, the old "streaming ASR vs ZeroGPU" Config A/B tension dissolves โ€” there is one path:
- **ZeroGPU Gradio-SDK Space** (free). GPU is attached only inside `@spaces.GPU` calls (default 60s, max ~120s,
RTX Pro 6000 Blackwell, `large`=48 GB). Per-turn inference fits this model exactly.
- **Text-first runtime loop:** user types โ†’ custom `/api/agent-turn` NDJSON endpoint โ†’ one `@spaces.GPU` call runs
MiniCPM5 (tool loop, in `transformers`) โ†’ streamed text tokens + live visual updates. The `@app.api()` endpoint
remains as the Gradio-client contract for external checks.
- **Voice input:** push-to-talk records an utterance or uploads a voice note โ†’ `/api/transcribe` normalizes audio with
ffmpeg โ†’ one `@spaces.GPU` call runs Nemotron ASR through NeMo โ†’ transcript fills the idea box. No persistent stream,
no WebRTC, **no TURN server**.
- **Modal (build-time only):** crawl the org + build the llama.cpp EmbeddingGemma vector index offline; the Space ships
with checked-in project vectors. Runtime never calls Modal โ†’ ๐Ÿ”Œ Off the Grid holds (see ยง10).
> Off the Grid = no proprietary cloud inference APIs. Open weights on an HF GPU Space / local box / Modal all qualify.
**Deferred:** real-time streaming ASR and turn detection are not part of the shipped app.
---
## 5. Per-model implementation notes
### 5.1 ASR โ€” `nvidia/nemotron-speech-streaming-en-0.6b` (batch)
- **Primary, batch usage (simple):**
```python
import nemo.collections.asr as nemo_asr
asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-speech-streaming-en-0.6b")
text = asr.transcribe(["utterance.wav"]) # 16 kHz mono WAV in; punctuated EN text out
```
Runtime install: `packages.txt` provides `ffmpeg` and `libsndfile1`; `requirements.txt` pins
`nemo_toolkit[asr]==2.7.3` plus Cython and packaging. The app records or uploads audio, normalizes it to mono
16 kHz WAV, runs NeMo in a ZeroGPU function, then returns the transcript to the idea box. Hosted NVIDIA NIM API would
break Off the Grid, so it is not used.
### 5.2 MiniCPM5-1B brain โ€” `openbmb/MiniCPM5-1B` (transformers, self-parsed XML)
- Context 128K, bilingual (EN/ZH), Apache-2.0. `enable_thinking=False`, `temperature=0.7, top_p=0.95` for fast tool calls.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM5-1B")
model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM5-1B", torch_dtype="auto", device_map="auto")
inputs = tok.apply_chat_template(messages, tools=TOOLS, add_generation_prompt=True, enable_thinking=False,
tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
```
- **Tool calling:** pass JSON-Schema tools via the chat template `tools=` arg; the model emits **XML**
`<function name="get_weather">{"city":"New York"}</function>`. **Parse this ourselves** (SGLang dropped). Wrap parse
in try/except and validate against the schema โ€” see the degradation ladder (ยง8).
- **Local / CPU & llama.cpp (Off the Grid ยท Llama Champion):** `openbmb/MiniCPM5-1B-GGUF:Q4_K_M` (688 MB) via llama.cpp
or Ollama (CPU-viable). fp16 โ‰ˆ 3โ€“4 GB VRAM. `openbmb/MiniCPM5-1B-MLX` for Apple Silicon. (llama.cpp MiniCPM5
tool-calling is a pending PR โ€” verify before relying on it for the badge runtime.)
- **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
### 5.4 EmbeddingGemma GGUF โ€” `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`
- Active retrieval model: `embeddinggemma-300m-qat-Q8_0.gguf`, 768-dimensional normalized embeddings.
- Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
- Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
over checked-in project vectors.
- Evidence is recorded in index metadata: model repo, GGUF filename, runtime, dimensions, build source, builder script,
llama-cpp-python version, and Modal app name.
### 5.5 llama.cpp support (๐Ÿฆ™ Llama Champion)
The active Llama Champion path is the retrieval model: the project index is built with EmbeddingGemma GGUF through
llama.cpp on Modal, and runtime query embeddings use the same llama.cpp path.
| Model | llama.cpp? | Runtime | Notes |
|---|---|---|---|
| `openbmb/MiniCPM5-1B` | โœ… planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
| `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | โœ… active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
| ASR (Nemotron) | โŒ | NeMo | FastConformer-RNNT |
The checked-in index and runtime query embedder must stay on the same GGUF file.
---
## 6. Agent context design (built for a 1B brain)
Core principle: **the 1B model is a router + arg-filler. All heavy work (crawl, summarize, score, rank, dedup) lives in
code.** Keep live context to ~800โ€“1200 tokens of *curated* view, never raw data.
- **Layer A โ€” System (static, ~250 tok):** identity/character; hackathon hard rules (โ‰ค32B, Gradio Space, demo video) so
it self-filters infeasible ideas; targeted prizes (biases ideation); reply style (short, one question at a time);
explicit tool-use instructions + the canonical jargon vocabulary (so it can self-correct, ยง7).
- **Layer B โ€” Session state (re-rendered each turn by code, ~300 tok):** user profile; locked decisions (track, side
quests, models); **idea board** (2โ€“3 candidates, one line + scores); compact "projects already seen" summary.
- **Layer C โ€” Ephemeral (~300 tok):** last 2โ€“3 turns; the most recent tool result as a **refined card** (not raw JSON).
---
## 7. Agent tool design
Few tools, few params each, short descriptions (1B-friendly). Heavy logic in code.
**Jargon alias layer (input normalization).** Before any tool call and before display, run ASR/user text through a
deterministic fuzzy/alias map over our small CLOSED vocab (model names and goal names) โ€” e.g. RapidFuzz
`token_set_ratio` / double-metaphone โ€” mapping "neutron"/"nemo tron" โ†’ Nemotron, "mini cpm" โ†’ MiniCPM5, "zero gpu" โ†’
ZeroGPU. Surface the correction ("heard: neutron โ†’ Nemotron") as a trust-building, slightly delightful moment.
**Research โ€” investigate existing projects (the core value).** Data = `build-small-hackathon` org Spaces, pre-crawled
into a local snapshot + EmbeddingGemma index (keeps Off the Grid at runtime).
| Tool | Signature | Returns (refined) | Heavy work |
|---|---|---|---|
| `list_projects` | `(track?, sort?)` | top-N project cards | HF Hub API + summarize |
| `search_projects` | `(query)` | top 5 cards | EmbeddingGemma retrieval |
| `get_project` | `(id)` | card + overlap-vs-board verdict | code computes overlap |
| `find_whitespace` | `()` | under-explored niches | cluster the index, find gaps |
`find_whitespace` is the originality engine (TTW judges originality) โ€” it names where nobody has built yet.
**Ideation / state.**
| Tool | Signature | Purpose |
|---|---|---|
| `save_idea` | `(title, pitch)` | add/update a candidate on the idea board |
| `score_idea` | `(id)` | fixed (hardcoded) rubric โ†’ scores + gaps; the 1B only triggers + verbalizes |
| `compare_ideas` | `()` | rank the board, articulate tradeoffs |
| `make_plan` | `(id)` | build plan + goals the current direction can support |
| `update_profile` | `(field, value)` | record skills/time/prefs โ†’ Layer B |
| `set_goals` | `(goals[])` | change selected goals โ†’ updates Layer A bias |
---
## 8. Agent loop (single-hop + degradation ladder)
```
on user input (text; or voice โ†’ batch ASR โ†’ text):
normalize via jargon alias layer
ctx = LayerA + render_state(LayerB) + last_turns + last_tool_card
out = MiniCPM5(ctx, tools=TOOLS, enable_thinking=False, temp=0.7) # โ†’ tool_call | reply
try: parse XML tool call
except / invalid name|args (vs JSON-Schema): # degradation ladder
retry once (tempโ‰ˆ0.3, "emit ONLY one valid tool call")
still bad โ†’ run a safe default tool (find_whitespace) so the screen never freezes
if tool_call: card = run_tool(out); reply = MiniCPM5(ctx + card) # single follow-up, no long ReAct
empty/zero result โ†’ canned advisor line (never say nothing)
stream reply tokens โ†’ custom UI | token watchdog: no token in N s โ†’ "trying again" visual (not dead air)
update_state(LayerB)
```
**Max one tool-call then reply.** A 1B can't sustain multi-step ReAct; wrap multi-step flows (`search โ†’ get_project โ†’
score`) into one *code* "research" action the model calls once. The degradation ladder is a **first-class UX surface**
(ยง11), not an error branch โ€” the screen is the only feedback channel (no TTS).
---
## 9. ZeroGPU deployment notes
- `import spaces; @spaces.GPU(duration=โ€ฆ)`. GPU only inside decorated fns; **Gradio-SDK Space only** (no Docker ZeroGPU).
- Load models at **module level**, `.to('cuda')` once (emulated until first real GPU call); real compute inside the
decorator. torch 2.8+; **no `torch.compile`** (use AOT). Quota PRO ~40 min/day โ†’ never idle-hold the GPU.
- **Frontend โ†’ backend via same-origin `fetch("/api/agent-turn")`** reading NDJSON from our FastAPI route. The GPU
boundary is `_engine_turn`, decorated with `@spaces.GPU`; `@app.api()` endpoints remain available for Gradio-client
tests and external callers.
- All four models fit in `large` (48 GB). Keep each `@spaces.GPU` call short for queue priority.
---
## 10. Modal โ€” offline pipeline (build-time only โ†’ preserves Off the Grid)
Modal = build-time; runtime never calls it. This is how the app claims **both** ๐ŸŸข Modal and ๐Ÿ”Œ Off the Grid. The
canonical command is:
```bash
.venv/bin/modal run scripts/modal_build_project_index.py \
--projects data/projects.json \
--out data/project_index.json
```
The remote function installs `llama-cpp-python`, downloads
`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf`, embeds every project card through
llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized
index at `2026-06-07T08:16:19+00:00`.
---
## 11. Frontend โ€” `gr.Server` custom UI (๐ŸŽจ Off-Brand)
No TTS โ†’ the **visual output is the agent's "voice"**; it must carry the delight (this is what earns Off-Brand, and the
TTW polish + Best Demo score). The visual world is **The Unwritten Almanac** (ยง2): a candlelit tree-hollow with a heavy
open grimoire as the hero component.
- `gradio.Server` is a FastAPI subclass serving **your own frontend** while still exposing `@app.api(name=...)`
functions for Gradio/Python clients. The visible app uses first-party `@app.post()` endpoints for deterministic
browser behavior; the GPU boundary stays in the decorated engine function.
```python
from gradio import Server
from fastapi.responses import HTMLResponse
app = Server()
@app.api(name="agent_turn", concurrency_limit=2)
async def agent_turn(message: str):
for token in run_agent_stream(message): # generator โ†’ SSE
yield token
@app.get("/", response_class=HTMLResponse) # custom UI replaces Gradio's default page
async def home(): return open("index.html").read()
app.launch()
```
- Frontend calls via `fetch("/api/agent-turn")`, parses newline-delimited JSON events, and updates the grimoire as
`start` / `token` / `done` messages arrive. Notes and chapter exports use `/api/field-notes` and `/api/chapter`.
- **UI surfaces (the grimoire is the canvas):** streaming reply = ink writing itself (typewriter on already-streaming
tokens); `search_projects`/overlap โ†’ **bleed** animation + page-number citations (real titles on hover);
`find_whitespace` โ†’ **gold bloom** + sprouting leaf + a one-shaft light-mask ("the page chooses you");
`score_idea` โ†’ **wax-seal** five-quadrant stamp; the riffling inked pages (fast page-flip of real Spaces) double as
the project-wall; export = the torn-grimoire PNG artifact (ยง2). Jargon-correction toasts (ยง7) read as Mothback's
margin notes; optimistic-UI loading + watchdog states (ยง8) are her "the page is choosing its wordsโ€ฆ". Cheap SFX:
page-flip, quill scratch, wax-seal thunk.
- **Build the animation floor first:** safe typewriter + static stamp ships first (graceful degradation โ€” the judges
credited this); upgrade the ink-bleed / gold-bloom / seal-press last.
- **Fallback:** the backend (`tools.py`/`agent.py`) is UI-agnostic โ€” if gr.Server misbehaves, fall back to
`gr.Blocks` + `gr.HTML`, losing only the $1500 Off-Brand badge, never the submission.
---
## 12. Prize mapping
| Target | How it's earned |
|---|---|
| ๐Ÿ„ Thousand Token Wood | **The Unwritten Almanac** (ยง2) โ€” the bleed-citation wow IS the engine rendered; AI load-bearing; original |
| ๐Ÿœ Tiny Titan (special, $1.5k) | total ~1.98B, every model โ‰ค4B; largest single = MiniCPM5 1.08B |
| ๐Ÿ”Œ Off the Grid (badge) | all open weights run locally; offline index; no cloud inference at runtime |
| ๐ŸŽฏ Well-Tuned (badge) | published LoRA fine-tune of MiniCPM5 on the Hub (ยง10) โ†’ **6/6 badges** |
| ๐ŸŽจ Off-Brand (badge + $1.5k) | `gr.Server` custom UI is the agent's output surface |
| ๐Ÿฎ OpenBMB ($10k) | brain = MiniCPM5-1B ("OpenBMB pick") |
| ๐ŸŸฉ NVIDIA Quest (2ร— RTX 5080) | ASR = Nemotron (ยง5.1) |
| ๐Ÿฆ™ Llama Champion (badge) | EmbeddingGemma GGUF retrieval index and runtime query embeddings run through llama.cpp (ยง5.5) |
| ๐Ÿ“ก Sharing is Caring (badge) | publish the agent's tool-call trace to the Hub |
| ๐Ÿ““ Field Notes (badge) | this DESIGN.md โ†’ a build blog post |
| ๐ŸŽ–๏ธ Bonus Quest Champion ($2k) | 6/6 badges (needs the Well-Tuned fine-tune) |
| ๐Ÿค– Best Agent ($1k) | real multi-tool loop: investigate โ†’ ideate โ†’ score โ†’ plan |
| ๐ŸŸข Modal ($20k credits) | offline crawl+embed + LoRA training on Modal (build-time, separated from runtime) |
| ๐ŸŽฌ Best Demo ($1k) | the mandatory demo video, made to sing (shared artifact + wow beat) |
| ๐ŸŒ€ OpenAI ($10k) | auto-entered ("across all submissions"); free lottery ticket, not a target |
| โค๏ธ Community Choice ($2k) | shareable tweetable artifact from the experience |
**6 badges** = Off the Grid, Well-Tuned, Off-Brand, Llama Champion, Sharing is Caring, Field Notes. Awards stack across
categories. Single-winner awards (Tiny Titan, Best Agent, Off-Brand, Best Demo) are eligibility โ‰  win โ€” the shared
lever is ยง11 custom-UI polish.
---
## 13. Risks / open items
1. **Deployment smoke tests are mandatory:** ZeroGPU Space build, same-origin NDJSON browser streaming, and Nemotron
batch ASR in `@spaces.GPU` must be verified after every runtime dependency change.
2. **EmbeddingGemma is gated** โ€” accept Gemma terms + `HF_TOKEN` before any crawl/build.
3. **MiniCPM5 tool-call reliability at 1B** โ€” covered by the degradation ladder (ยง8); validate name+args in code.
4. **Concept skin** โ€” **chosen: The Unwritten Almanac** (ยง2). Make-or-break is the bleed/bloom hero animation; build the
safe typewriter + static-stamp floor first (graceful degradation), upgrade ink last. Watch the thin-org echo
threshold + the dry-but-benevolent tone (real cited Spaces, never punch at a named builder).
5. **Param-budget claim** โ€” document the 1.98B total in the README/Space card for Tiny Titan judging.
---
## 14. Build order
**Text-first vertical slice first; voice input is now part of the app.** Always keep a demoable artifact.
0. **Day-1 spikes** (ยง1) โ€” get the three go/no-go builds green.
1. **`crawler.py` + Modal index** โ€” crawl the org, embed with EmbeddingGemma, build the local index. *You immediately
see what everyone's building and where the whitespace is.*
2. **`tools.py`** โ€” research + ideation tools + the hardcoded `score_idea` rubric + the jargon alias layer, over the index.
3. **`agent.py`** โ€” 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via `transformers` (self-parsed XML).
4. **`app.py`** โ€” `gr.Server` custom frontend (idea board, project/whitespace wall, streaming text), called via
first-party `/api/...` endpoints; concept skin applied.
5. **Well-Tuned LoRA** โ€” small fine-tune on Modal โ†’ publish to Hub (โ†’ 6/6 badges).
6. **Voice input** โ€” push-to-talk record and voice-note upload through Nemotron batch ASR in `/api/transcribe`.
7. **Polish + submission** โ€” demo video + social post (Best Demo / Community Choice), publish agent trace (๐Ÿ“ก),
write up Field Notes (๐Ÿ““).
**Deferred:** real-time streaming ASR and turn detection. The shipped path stays batch audio โ†’ transcript โ†’ editable idea.
---
## 15. Sources
**Models:** [nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) ยท
[MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B) ยท [MiniCPM5-1B-GGUF](https://huggingface.co/openbmb/MiniCPM5-1B-GGUF) ยท
[embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
**Platforms:** [ZeroGPU docs](https://huggingface.co/docs/hub/spaces-zerogpu) ยท
[Introducing gradio.Server](https://huggingface.co/blog/introducing-gradio-server) ยท [Gradio Server Mode guide](https://www.gradio.app/guides/server-mode) ยท
[Modal GPU](https://modal.com/docs/guide/gpu) ยท [Modal model weights](https://modal.com/docs/guide/model-weights) ยท [Modal pricing](https://modal.com/pricing) ยท
[Build Small Hackathon](https://huggingface.co/build-small-hackathon)
*Verify-before-ship: Nemotron-in-ZeroGPU after dependency changes; MiniCPM5 license on the live card; llama.cpp MiniCPM5
tool-calling remains planned only and is not used by the deployed brain.*