# Build Small Hackathon Advisor — Design & Implementation Notes

> A **small-model agent** with text and voice input that investigates what other people have already built
> for the [Build Small Hackathon](https://huggingface.co/build-small-hackathon) and brainstorms an original new design
> *with you*. Output = streaming text + live visuals (no TTS). All models small, open-weight, run locally.
>
> The literal "advisor" is the **engine**; the user-facing experience is **The Unwritten Almanac** — Mothback, an
> owl-moth archivist, keeps the Wood's book of fates and divines you a still-unwritten project page (ink **bleeds +
> cites real Spaces** if you overlap, **blooms gold** if it's new). This project is itself a Build Small submission
> (hack window 2026-06-05 → 2026-06-15).

---

## 1. Locked decisions & review corrections (2026-06-07)

A multi-agent adversarial review (5 dimensions, web-verified) set the direction below. **This section is the
authoritative decision log; the rest of the doc is written to be consistent with it.**

**Locked decisions (Jacob):**
1. **Concept = The Unwritten Almanac** (chosen 2026-06-07 from a 12-concept brainstorm). Mothback the owl-moth
   archivist divines a fate-page; ink **bleeds and cites the real Spaces** you overlap (page 47, page 112…), or
   **blooms gold + sprouts a leaf** when it's unwritten. Engine unchanged underneath (crawl → whitespace/originality →
   score). The dry "advisor" stays under the hood. Full spec + de-risking grafts in §2.
2. **Text-first with voice input.** The core workflow remains typed/editable text. Voice records or uploads a note,
   transcribes it with batch ASR, and places the transcript in the same idea box. Real-time streaming + in-browser turn
   detection are **deferred**.
3. **Add a 🎯 Well-Tuned fine-tune** — a small LoRA (MiniCPM5 advisor persona / tool-calling), trained on Modal,
   published to the Hub → 6/6 badges → strong shot at 🎖️ Bonus Quest Champion ($2,000).
4. **ASR = Nemotron batch.** `nvidia/nemotron-speech-streaming-en-0.6b` runs through NVIDIA NeMo in a ZeroGPU function.
   Audio is normalized to mono WAV before calling `transcribe([wav])`.

**Verified corrections:**
- **Drop SGLang.** It needs a persistent GPU process → incompatible with ZeroGPU (same root cause as vLLM). Run
  MiniCPM5 via plain `transformers` inside `@spaces.GPU` and parse its XML tool calls in our own code.
- **gr.Server custom UI streaming IS shipped** (the launch blog only deferred the *explanation*). The deployed browser
  UI calls our own same-origin `/api/agent-turn` NDJSON stream with `fetch`; `_engine_turn` itself is wrapped in
  `@spaces.GPU`, so the real MiniCPM5 + LoRA path still runs on ZeroGPU. The `@app.api("/agent_turn")` generator stays
  available for Gradio/Python clients and contract checks, but the visible app no longer depends on the CDN
  `@gradio/client` path after real Space testing showed that browser turn could hang while the backend completed.
- **OpenAI Track has NO model requirement** ("OpenAI's own podium across all submissions") → auto-entered; a free
  lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
- **Badges = 6 total** (Tiny Titan is a $1.5k *special award*, not a badge). Decision #3 takes us from 5/6 → 6/6.
- **Tiny Titan** = "best ≤4B model"; our largest single model is MiniCPM5 (1.08B), total stack ~1.98B → eligible.

**New build requirements surfaced by the review (designed into the sections below):**
- **Jargon alias layer (§7):** a 0.6B ASR mistranscribes our own vocab (Nemotron, MiniCPM5, EmbeddingGemma, ZeroGPU…).
  Deterministic code-side fuzzy/alias map over our small CLOSED vocab, applied before any tool call and before display.
  Surface "heard: neutron → Nemotron" as a delightful trust moment. (Active once voice is added.)
- **Tool-call degradation ladder (§8):** the 1B brain WILL emit broken tool calls (MiniCPM5-1B has a documented
  "broken tool calling" report). Wrap parse in try/except, retry once at low temp, validate name+args vs JSON-Schema in
  code (reject-and-repair), canned lines for empty results, a token **watchdog** that shows "trying again" instead of
  dead air (the screen is the only feedback channel — no TTS).
- **Latency / optimistic UI (§9/§11):** ZeroGPU cold start + 1B generation = seconds of potential dead air. Optimistic
  UI on submit, pre-animate the project wall, set a latency budget. (The torch.compile cold-start penalty does NOT
  apply — we don't use it.)

**Day-1 go/no-go spikes (before any feature work):**
- Trivial `@spaces.GPU` hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
- `gr.Server` minimal: static `index.html` + one same-origin `/api/agent-turn` NDJSON stream, plus the retained
  `@app.api()` generator for external clients, on the real ZeroGPU Space.
- Nemotron `nemo_toolkit[asr]` install + one batch `transcribe()` inside `@spaces.GPU` (decision #4).

---

## 2. Concept — The Unwritten Almanac (text-first)

The engine, regardless of skin:

1. **Investigate** the `build-small-hackathon` HF org — what Spaces exist, which models, what's saturated, and where
   the **whitespace** is — using a local EmbeddingGemma index.
2. **Brainstorm** with the user: propose ideas, **score** them against a fixed rubric (originality vs. existing
   projects, delight, AI-necessity, feasibility, param budget, prize-fit), and maintain an **idea board**.
3. **Respond** as streaming text + live visuals in a custom `gr.Server` frontend (no TTS — the visual is the "voice").

**The skin (chosen): The Unwritten Almanac.** **Mothback**, a dusty owl-moth archivist, keeps the Wood's *book of
fates*. Every project already built in the org is an inked page; she divines you a destined entry on a still-blank page,
the ink writing itself live.

**The two-beat wow (this IS the engine, rendered):**
- You type one line about yourself / your idea. Inked pages riffle past (each = a real crawled Space).
- **Bleed:** if your idea overlaps existing work, the ink **seeps blood-red** and cites the exact real Spaces — "the
  Wood already wrote this, on page 47 and page 112" (= `get_project` overlap on the top retrieval hits). The burn is
  **factual**, so it can't fall flat the way a 1B's invented joke can.
- **Bloom:** you say "write bolder"; the next entry flows **gold**, a green leaf sprouts — "this page has never been
  inked" (= a `find_whitespace` gold candidate).
- A **wax seal** presses in, lighting five quadrants as the idea qualifies (= `score_idea`: Originality, Delight,
  AI-Necessity, Feasibility, Prize-Fit).

**Engine ↔ skin mapping:** `search_projects`/`get_project` overlap → the bleed + citations; `find_whitespace` → the
blank/gold pages; `score_idea` → the wax-seal quadrants; `save_idea` → the written fate-page; agent persona =
**Mothback** (Layer A system prompt + the 🎯 Well-Tuned LoRA = her voice).

**Shareable artifact (Community Choice):** the page exports as a PNG that looks **torn from an ancient grimoire** —
aged parchment, a coined fate-name as title, the self-written prophecy, the five-quadrant seal, and a verdict stamp
(**"UNWRITTEN · 0 echoes"** vs **"ECHO ×3"**). Built-in caption: "Mothback inked my fate page for #BuildSmall —
UNWRITTEN." People compile draws into a "chapter" and dare friends to get a page that doesn't bleed.

**Grafted de-risking (from runner-up concepts):**
- **Tone = dry-but-benevolent** (Roastleaf's whiplash): the bleed-citation gently stings, the gold-bloom is sincerely
  delighted; the burn is true-by-construction (real cited Spaces).
- **Templated structure (key risk-killer):** bank entry/roast templates (citation + dry verdict + redemptive branch);
  the 1B only fills in real Space titles + the idea — **never improvises whole comedy**.
- **Latin-binomial fate-names** (e.g. "Ludus Vocalis Infantium") via templated scaffolds — built-in wit, backstops a
  1B that might produce corny names.
- **"You vs the Wood" margin glyph:** a tiny cluster-dot thumbnail on the page showing your gold page among the inked
  crowd — cheap SVG, visual PROOF the gap is real.
- **Thin-org mitigation (load-bearing):** precompute whitespace clusters at Modal build-time and pin several DISTINCT
  blank-page candidates so "write bolder" always lands on a real, varied gap (the org may be only ~30–60 Spaces). Tune
  the echo threshold toward *more frequent bleed* so the demo always has its "low" before the "wow".

**Defaults (revisit if time):** single-page artifact first (chapter compiler later); page-numbers visible, real titles
on hover (keep the burn aimed at the idea, not a named builder); seal animation = safe typewriter + static-stamp floor
first, bespoke ink-reveal last. Voice input is batch ASR that fills the same idea box before the user presses Ink.

Input is **text-first**; the experience is fully delightful with typed input alone.

AI is genuinely **load-bearing**: embeddings power the whitespace/originality analysis and the LLM drives the
investigate → ideate → score loop — the experience collapses without the models (supports 🤖 Best Agent + TTW
"AI necessity").

---

## 3. Model stack (confirmed exact repo IDs)

| Role | Model | Params | Runtime | License | Prize hook |
|---|---|---|---|---|---|
| STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | 🟩 NVIDIA Nemotron Quest |
| LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | 🏮 OpenBMB |
| Embedder | **`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`** | ~300M | llama.cpp / llama-cpp-python | Gemma | 🔌 Off the Grid · 🦙 Llama Champion · 🟢 Modal |
| Fine-tune | LoRA on MiniCPM5 → published to Hub | — | PEFT / HF Jobs | — | 🎯 Well-Tuned |

**Total ≈ 1.98B params → ≤4B → 🐜 Tiny Titan eligible.** All open-weight, all runnable locally → 🔌 Off the Grid.
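
The budget arithmetic can be checked in a few lines. This is a minimal sketch that uses the rounded figures from the table above; the LoRA row is omitted because the table lists no separate parameter count for it, and `budget_summary` is a hypothetical helper name.

```python
# Parameter counts in billions, rounded as in the model-stack table.
STACK_PARAMS_B = {
    "nvidia/nemotron-speech-streaming-en-0.6b": 0.6,
    "openbmb/MiniCPM5-1B": 1.08,
    "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF": 0.3,
}
TINY_TITAN_LIMIT_B = 4.0  # "best <=4B model", as recorded in section 1


def budget_summary(params: dict[str, float]) -> tuple[float, str]:
    """Return (total in billions, id of the largest model)."""
    total = round(sum(params.values()), 2)
    largest = max(params, key=params.get)
    return total, largest


total, largest = budget_summary(STACK_PARAMS_B)
print(f"total={total}B largest={largest} eligible={total <= TINY_TITAN_LIMIT_B}")
```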

> Naming: "OpenCPM5 1B" = `openbmb/MiniCPM5-1B` (MiniCPM 5.0, ~May 2026). "EmbeddingGemma 270M" =
> `google/embeddinggemma-300m` (308M total; 270M = non-embedding transformer params). **SGLang dropped** (ZeroGPU
> incompatible). STT is used in **batch voice-note** mode, not a persistent stream.

---

## 4. Deployment & architecture (single path)

With **text-first + batch ASR**, the old "streaming ASR vs ZeroGPU" Config A/B tension dissolves — there is one path:

- **ZeroGPU Gradio-SDK Space** (free). GPU is attached only inside `@spaces.GPU` calls (default 60s, max ~120s,
  RTX Pro 6000 Blackwell, `large`=48 GB). Per-turn inference fits this model exactly.
- **Text-first runtime loop:** user types → custom `/api/agent-turn` NDJSON endpoint → one `@spaces.GPU` call runs
  MiniCPM5 (tool loop, in `transformers`) → streamed text tokens + live visual updates. The `@app.api()` endpoint
  remains as the Gradio-client contract for external checks.
- **Voice input:** push-to-talk records an utterance or uploads a voice note → `/api/transcribe` normalizes audio with
  ffmpeg → one `@spaces.GPU` call runs Nemotron ASR through NeMo → transcript fills the idea box. No persistent stream,
  no WebRTC, **no TURN server**.
- **Modal (build-time only):** crawl the org + build the llama.cpp EmbeddingGemma vector index offline; the Space ships
  with checked-in project vectors. Runtime never calls Modal → 🔌 Off the Grid holds (see §10).

> Off the Grid = no proprietary cloud inference APIs. Open weights on an HF GPU Space / local box / Modal all qualify.

**Deferred:** real-time streaming ASR and turn detection are not part of the shipped app.

---

## 5. Per-model implementation notes

### 5.1 ASR — `nvidia/nemotron-speech-streaming-en-0.6b` (batch)

- **Primary, batch usage (simple):**
  ```python
  import nemo.collections.asr as nemo_asr
  asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-speech-streaming-en-0.6b")
  results = asr.transcribe(["utterance.wav"])   # 16 kHz mono WAV in; a list with one punctuated EN result per file
  ```
  Runtime install: `packages.txt` provides `ffmpeg` and `libsndfile1`; `requirements.txt` pins
  `nemo_toolkit[asr]==2.7.3` plus Cython and packaging. The app records or uploads audio, normalizes it to mono
  16 kHz WAV, runs NeMo in a ZeroGPU function, then returns the transcript to the idea box. Hosted NVIDIA NIM API would
  break Off the Grid, so it is not used.
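
The normalization step before `transcribe()` could look like the sketch below. It assumes `ffmpeg` is on the PATH (as provided by `packages.txt`) and uses only the standard `-ac` (channels) and `-ar` (sample rate) options; the function names are hypothetical, not the app's actual helpers.

```python
import subprocess
from pathlib import Path


def ffmpeg_normalize_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that converts any input to mono 16 kHz WAV."""
    return [
        "ffmpeg",
        "-y",             # overwrite dst if it already exists
        "-i", str(src),   # input: a browser recording or an uploaded voice note
        "-ac", "1",       # one audio channel (mono)
        "-ar", "16000",   # 16 kHz sample rate
        str(dst),
    ]


def normalize_audio(src: Path, dst: Path) -> Path:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_normalize_cmd(src, dst), check=True, capture_output=True)
    return dst
```

Keeping the command builder separate from the subprocess call makes the argument list easy to unit-test without audio fixtures.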

### 5.2 MiniCPM5-1B brain — `openbmb/MiniCPM5-1B` (transformers, self-parsed XML)

- Context 128K, bilingual (EN/ZH), Apache-2.0. `enable_thinking=False`, `temperature=0.7, top_p=0.95` for fast tool calls.
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM5-1B")
  model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM5-1B", torch_dtype="auto", device_map="auto")
  inputs = tok.apply_chat_template(messages, tools=TOOLS, add_generation_prompt=True, enable_thinking=False,
                                   tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
  ```
- **Tool calling:** pass JSON-Schema tools via the chat template `tools=` arg; the model emits **XML**
  `<function name="get_weather">{"city":"New York"}</function>`. **Parse this ourselves** (SGLang dropped). Wrap parse
  in try/except and validate against the schema — see the degradation ladder (§8).
- **Local / CPU & llama.cpp (Off the Grid · Llama Champion):** `openbmb/MiniCPM5-1B-GGUF:Q4_K_M` (688 MB) via llama.cpp
  or Ollama (CPU-viable). fp16 ≈ 3–4 GB VRAM. `openbmb/MiniCPM5-1B-MLX` for Apple Silicon. (llama.cpp MiniCPM5
  tool-calling is a pending PR — verify before relying on it for the badge runtime.)
- **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
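
A self-parse step for the XML shape quoted above might be sketched as follows. The regex, the `TOOL_SPECS` registry, and the simple name/type check are assumptions for illustration; the type map stands in for full JSON-Schema validation, and how the model formats a zero-argument call is not confirmed here.

```python
import json
import re
from typing import Optional

# Matches the shape quoted above: <function name="...">{...}</function>
TOOL_CALL_RE = re.compile(
    r'<function\s+name="(?P<name>[^"]+)">\s*(?P<args>\{.*?\})\s*</function>',
    re.DOTALL,
)

# Hypothetical registry: tool name -> required argument names and types.
TOOL_SPECS = {
    "search_projects": {"query": str},
    "get_project": {"id": str},
    "find_whitespace": {},
}


def parse_tool_call(text: str) -> Optional[tuple[str, dict]]:
    """Return (name, args) for the first valid tool call, else None."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None  # no well-formed <function> element
    name = match.group("name")
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return None  # unknown tool name
    try:
        args = json.loads(match.group("args"))
    except json.JSONDecodeError:
        return None  # malformed JSON body
    if not isinstance(args, dict) or set(args) != set(spec):
        return None  # missing or unexpected arguments
    if any(not isinstance(args[key], typ) for key, typ in spec.items()):
        return None  # wrong argument type
    return name, args
```

Returning `None` for every failure mode keeps the caller simple: any `None` after a tool-call attempt feeds the retry rung of the degradation ladder.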

### 5.4 EmbeddingGemma GGUF — `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`

- Active retrieval model: `embeddinggemma-300m-qat-Q8_0.gguf`, 768-dimensional normalized embeddings.
- Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
- Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
  over checked-in project vectors.
- Evidence is recorded in index metadata: model repo, GGUF filename, runtime, dimensions, build source, builder script,
  llama-cpp-python version, and Modal app name.
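
Because the stored vectors are normalized, the runtime search reduces to a dot product. The sketch below uses plain Python on a toy 3-dimensional index; the entry layout (`id`, `vector`) is an assumed shape, not the actual `project_index.json` schema.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]


def top_k(query: list[float], index: list[dict], k: int = 5) -> list[tuple[str, float]]:
    """Rank index entries by cosine similarity to the query.

    Each entry is assumed to look like {"id": str, "vector": [float, ...]}
    with a unit-length vector, so the dot product equals cosine similarity.
    """
    q = normalize(query)
    scored = [
        (entry["id"], sum(a * b for a, b in zip(q, entry["vector"])))
        for entry in index
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional index; the real vectors are 768-dimensional.
toy_index = [
    {"id": "space-a", "vector": normalize([1.0, 0.0, 0.0])},
    {"id": "space-b", "vector": normalize([0.0, 1.0, 0.0])},
    {"id": "space-c", "vector": normalize([1.0, 1.0, 0.0])},
]
hits = top_k([2.0, 0.1, 0.0], toy_index, k=2)
```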

### 5.5 llama.cpp support (🦙 Llama Champion)

The active Llama Champion path is the retrieval model: the project index is built with EmbeddingGemma GGUF through
llama.cpp on Modal, and runtime query embeddings use the same llama.cpp path.

| Model | llama.cpp? | Runtime | Notes |
|---|---|---|---|
| `openbmb/MiniCPM5-1B` | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
| `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | ✅ active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
| ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |

The checked-in index and runtime query embedder must stay on the same GGUF file.

---

## 6. Agent context design (built for a 1B brain)

Core principle: **the 1B model is a router + arg-filler. All heavy work (crawl, summarize, score, rank, dedup) lives in
code.** Keep live context to ~800–1200 tokens of *curated* view, never raw data.

- **Layer A — System (static, ~250 tok):** identity/character; hackathon hard rules (≤32B, Gradio Space, demo video) so
  it self-filters infeasible ideas; targeted prizes (biases ideation); reply style (short, one question at a time);
  explicit tool-use instructions + the canonical jargon vocabulary (so it can self-correct, §7).
- **Layer B — Session state (re-rendered each turn by code, ~300 tok):** user profile; locked decisions (track, side
  quests, models); **idea board** (2–3 candidates, one line + scores); compact "projects already seen" summary.
- **Layer C — Ephemeral (~300 tok):** last 2–3 turns; the most recent tool result as a **refined card** (not raw JSON).
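
One way to assemble the three layers under a budget is sketched below. The whitespace word count is a crude stand-in for a real tokenizer, and `build_context` with its parameters is a hypothetical name; the only policy taken from the notes is that ephemeral turns are the cheapest thing to drop.

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption, not a tokenizer)."""
    return len(text.split())


def build_context(system: str, state_lines: list[str], turns: list[str],
                  tool_card: str = "", budget: int = 1200, keep_turns: int = 3) -> str:
    """Join Layer A, Layer B, and Layer C, shedding the oldest turns if over budget."""
    recent = turns[-keep_turns:]
    while True:
        parts = [system, "\n".join(state_lines), "\n".join(recent), tool_card]
        context = "\n\n".join(p for p in parts if p)
        if rough_tokens(context) <= budget or not recent:
            return context
        recent = recent[1:]  # drop the oldest ephemeral turn first
```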

---

## 7. Agent tool design

Few tools, few params each, short descriptions (1B-friendly). Heavy logic in code.

**Jargon alias layer (input normalization).** Before any tool call and before display, run ASR/user text through a
deterministic fuzzy/alias map over our small CLOSED vocab (model names and goal names) — e.g. RapidFuzz
`token_set_ratio` / double-metaphone — mapping "neutron"/"nemo tron" → Nemotron, "mini cpm" → MiniCPM5, "zero gpu" →
ZeroGPU. Surface the correction ("heard: neutron → Nemotron") as a trust-building, slightly delightful moment.
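
A minimal sketch of the alias pass follows. It swaps in the standard library's `difflib` for RapidFuzz so the example has no dependencies; the alias lists, the 0.9 cutoff, and the function name are illustrative assumptions. Punctuation attached to a replaced phrase is dropped, which a production version would need to handle.

```python
import difflib
import re

# Canonical closed vocabulary -> spoken or misheard variants (illustrative).
ALIASES = {
    "Nemotron": ["neutron", "nemo tron", "nemotron"],
    "MiniCPM5": ["mini cpm", "mini cpm 5", "minicpm5"],
    "ZeroGPU": ["zero gpu", "zerogpu"],
    "EmbeddingGemma": ["embedding gemma", "embedding gamma"],
}
VARIANT_TO_CANON = {v: canon for canon, vs in ALIASES.items() for v in vs}


def normalize_jargon(text: str, cutoff: float = 0.9) -> tuple[str, list[tuple[str, str]]]:
    """Return (normalized text, [(heard, canonical), ...]) for display as margin notes."""
    words = text.split()
    out: list[str] = []
    corrections: list[tuple[str, str]] = []
    i = 0
    while i < len(words):
        for n in (3, 2, 1):  # prefer the longest matching phrase
            chunk = words[i:i + n]
            if len(chunk) < n:
                continue
            phrase = " ".join(chunk)
            key = re.sub(r"[^a-z0-9 ]", "", phrase.lower())
            match = difflib.get_close_matches(key, VARIANT_TO_CANON, n=1, cutoff=cutoff)
            if match:
                canon = VARIANT_TO_CANON[match[0]]
                if key != canon.lower():  # do not report mere case fixes
                    corrections.append((phrase, canon))
                out.append(canon)
                i += n
                break
        else:
            out.append(words[i])  # no alias matched at this position
            i += 1
    return " ".join(out), corrections
```

The returned correction pairs are what the UI would render as "heard: neutron → Nemotron".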

**Research — investigate existing projects (the core value).** Data = `build-small-hackathon` org Spaces, pre-crawled
into a local snapshot + EmbeddingGemma index (keeps Off the Grid at runtime).

| Tool | Signature | Returns (refined) | Heavy work |
|---|---|---|---|
| `list_projects` | `(track?, sort?)` | top-N project cards | HF Hub API + summarize |
| `search_projects` | `(query)` | top 5 cards | EmbeddingGemma retrieval |
| `get_project` | `(id)` | card + overlap-vs-board verdict | code computes overlap |
| `find_whitespace` | `()` | under-explored niches | cluster the index, find gaps |

`find_whitespace` is the originality engine (TTW judges originality) — it names where nobody has built yet.
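
The overlap and whitespace checks can be approximated with nearest-neighbour similarity. This sketch is a simplification of "cluster the index, find gaps": it scores each precomputed candidate niche by how close its nearest existing project is. The 0.75 echo threshold, the ids, and both function names are hypothetical, and vectors are assumed to be unit length.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def echoes(idea_vec: list[float], projects: dict[str, list[float]],
           threshold: float = 0.75) -> list[str]:
    """Ids of projects similar enough to the idea to trigger the bleed."""
    return [pid for pid, vec in projects.items() if dot(idea_vec, vec) >= threshold]


def rank_whitespace(candidates: dict[str, list[float]],
                    projects: dict[str, list[float]]) -> list[str]:
    """Sort candidate niches from most open to most crowded."""
    def crowding(vec: list[float]) -> float:
        return max(dot(vec, p) for p in projects.values())
    return sorted(candidates, key=lambda name: crowding(candidates[name]))


# Toy 2-dimensional unit vectors.
projects = {"page-47": [1.0, 0.0], "page-112": [0.8, 0.6]}
candidates = {"niche-a": [0.96, 0.28], "niche-b": [0.0, 1.0]}
ranked = rank_whitespace(candidates, projects)
```

Lowering `threshold` makes the bleed fire more often, which is the knob the thin-org mitigation in §2 refers to.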

**Ideation / state.**

| Tool | Signature | Purpose |
|---|---|---|
| `save_idea` | `(title, pitch)` | add/update a candidate on the idea board |
| `score_idea` | `(id)` | fixed (hardcoded) rubric → scores + gaps; the 1B only triggers + verbalizes |
| `compare_ideas` | `()` | rank the board, articulate tradeoffs |
| `make_plan` | `(id)` | build plan + goals the current direction can support |
| `update_profile` | `(field, value)` | record skills/time/prefs → Layer B |
| `set_goals` | `(goals[])` | change selected goals → updates Layer A bias |

---

## 8. Agent loop (single-hop + degradation ladder)

```
on user input (text; or voice → batch ASR → text):
  normalize via jargon alias layer
  ctx = LayerA + render_state(LayerB) + last_turns + last_tool_card
  out = MiniCPM5(ctx, tools=TOOLS, enable_thinking=False, temp=0.7)   # → tool_call | reply
  try: parse XML tool call
  except / invalid name|args (vs JSON-Schema):                         # degradation ladder
      retry once (temp≈0.3, "emit ONLY one valid tool call")
      still bad → run a safe default tool (find_whitespace) so the screen never freezes
  if tool_call: card = run_tool(out); reply = MiniCPM5(ctx + card)     # single follow-up, no long ReAct
  empty/zero result → canned advisor line (never say nothing)
  stream reply tokens → custom UI   |   token watchdog: no token in N s → "trying again" visual (not dead air)
  update_state(LayerB)
```

**Max one tool-call then reply.** A 1B can't sustain multi-step ReAct; wrap multi-step flows (`search → get_project →
score`) into one *code* "research" action the model calls once. The degradation ladder is a **first-class UX surface**
(§11), not an error branch — the screen is the only feedback channel (no TTS).
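
The loop above might translate to Python roughly as follows. This is a sketch under stated assumptions: `generate` is an injected callable standing in for MiniCPM5, tools are plain functions returning a card string, the `"<function"` substring test is a heuristic for "the model tried to call a tool", and the canned line is placeholder copy. Streaming, the watchdog, and state updates are left out.

```python
import json
import re
from typing import Callable, Optional

CALL_RE = re.compile(r'<function\s+name="([^"]+)">\s*(\{.*?\})\s*</function>', re.DOTALL)
CANNED_LINE = "The page is blank here. Tell me a little more."  # placeholder copy


def parse_call(text: str, tools: dict) -> Optional[tuple[str, dict]]:
    """Return (name, args) if text holds a call to a known tool with a JSON object body."""
    m = CALL_RE.search(text)
    if not m or m.group(1) not in tools:
        return None
    try:
        args = json.loads(m.group(2))
    except json.JSONDecodeError:
        return None
    return (m.group(1), args) if isinstance(args, dict) else None


def run_turn(ctx: str, generate: Callable[[str, float], str],
             tools: dict[str, Callable[..., str]]) -> str:
    """One single-hop turn: at most one tool call, then one reply."""
    out = generate(ctx, 0.7)
    call = parse_call(out, tools)
    if call is None and "<function" in out:
        # Rung 1: broken call, retry once at low temperature.
        out = generate(ctx + "\nEmit ONLY one valid tool call.", 0.3)
        call = parse_call(out, tools)
        if call is None:
            # Rung 2: safe default tool so the screen never freezes.
            call = ("find_whitespace", {})
    if call is None:
        return out or CANNED_LINE  # plain reply, no tool needed
    name, args = call
    try:
        card = tools[name](**args)
    except TypeError:
        card = ""  # arguments did not match the tool's signature
    if not card:
        return CANNED_LINE  # empty result: never say nothing
    return generate(ctx + "\n" + card, 0.7) or CANNED_LINE
```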

---

## 9. ZeroGPU deployment notes

- `import spaces; @spaces.GPU(duration=…)`. GPU only inside decorated fns; **Gradio-SDK Space only** (no Docker ZeroGPU).
- Load models at **module level**, `.to('cuda')` once (emulated until first real GPU call); real compute inside the
  decorator. torch 2.8+; **no `torch.compile`** (use AOT instead). The PRO quota is ~40 min/day, so never idle-hold the GPU.
- **Frontend → backend via same-origin `fetch("/api/agent-turn")`** reading NDJSON from our FastAPI route. The GPU
  boundary is `_engine_turn`, decorated with `@spaces.GPU`; `@app.api()` endpoints remain available for Gradio-client
  tests and external callers.
- All three models plus the LoRA adapter fit in `large` (48 GB). Keep each `@spaces.GPU` call short for queue priority.

---

## 10. Modal — offline pipeline (build-time only → preserves Off the Grid)

Modal = build-time; runtime never calls it. This is how the app claims **both** 🟢 Modal and 🔌 Off the Grid. The
canonical command is:

```bash
.venv/bin/modal run scripts/modal_build_project_index.py \
  --projects data/projects.json \
  --out data/project_index.json
```

The remote function installs `llama-cpp-python`, downloads
`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf`, embeds every project card through
llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.

Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized
index at `2026-06-07T08:16:19+00:00`.

---

## 11. Frontend — `gr.Server` custom UI (🎨 Off-Brand)

No TTS → the **visual output is the agent's "voice"**; it must carry the delight (this is what earns Off-Brand, and the
TTW polish + Best Demo score). The visual world is **The Unwritten Almanac** (§2): a candlelit tree-hollow with a heavy
open grimoire as the hero component.

- `gradio.Server` is a FastAPI subclass serving **your own frontend** while still exposing `@app.api(name=...)`
  functions for Gradio/Python clients. The visible app uses first-party `@app.post()` endpoints for deterministic
  browser behavior; the GPU boundary stays in the decorated engine function.
  ```python
  from gradio import Server
  from fastapi.responses import HTMLResponse
  app = Server()

  @app.api(name="agent_turn", concurrency_limit=2)
  async def agent_turn(message: str):
      for token in run_agent_stream(message):   # generator → SSE
          yield token

  @app.get("/", response_class=HTMLResponse)     # custom UI replaces Gradio's default page
  async def home(): return open("index.html").read()
  app.launch()
  ```
- Frontend calls via `fetch("/api/agent-turn")`, parses newline-delimited JSON events, and updates the grimoire as
  `start` / `token` / `done` messages arrive. Notes and chapter exports use `/api/field-notes` and `/api/chapter`.
- **UI surfaces (the grimoire is the canvas):** streaming reply = ink writing itself (typewriter on already-streaming
  tokens); `search_projects`/overlap → **bleed** animation + page-number citations (real titles on hover);
  `find_whitespace` → **gold bloom** + sprouting leaf + a one-shaft light-mask ("the page chooses you");
  `score_idea` → **wax-seal** five-quadrant stamp; the riffling inked pages (fast page-flip of real Spaces) double as
  the project-wall; export = the torn-grimoire PNG artifact (§2). Jargon-correction toasts (§7) read as Mothback's
  margin notes; optimistic-UI loading + watchdog states (§8) are her "the page is choosing its words…". Cheap SFX:
  page-flip, quill scratch, wax-seal thunk.
- **Build the animation floor first:** safe typewriter + static stamp ships first (graceful degradation — the judges
  credited this); upgrade the ink-bleed / gold-bloom / seal-press last.
- **Fallback:** the backend (`tools.py`/`agent.py`) is UI-agnostic — if gr.Server misbehaves, fall back to
  `gr.Blocks` + `gr.HTML`, losing only the Off-Brand badge and its $1.5k prize, never the submission.
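
The NDJSON framing used by `/api/agent-turn` can be sketched in Python on both ends. The event names `start` / `token` / `done` come from the notes above; the `type` and `text` field names are assumptions, and `read_ndjson` only mirrors what the browser-side reader has to do when the network splits lines across chunks.

```python
import json
from typing import Iterable, Iterator


def ndjson_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap a token stream in start / token / done events, one JSON object per line."""
    yield json.dumps({"type": "start"}) + "\n"
    for token in tokens:
        yield json.dumps({"type": "token", "text": token}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"


def read_ndjson(chunks: Iterable[str]) -> Iterator[dict]:
    """Reassemble events from arbitrarily split chunks, as a client would."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)


wire = "".join(ndjson_events(["ink", "well"]))
events = list(read_ndjson([wire[:10], wire[10:]]))  # simulate a mid-line split
```

On the server, a generator like `ndjson_events` could be handed to FastAPI's `StreamingResponse` with an NDJSON media type; that wiring is not shown here.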

---

## 12. Prize mapping

| Target | How it's earned |
|---|---|
| 🍄 Thousand Token Wood | **The Unwritten Almanac** (§2) — the bleed-citation wow IS the engine rendered; AI load-bearing; original |
| 🐜 Tiny Titan (special, $1.5k) | total ~1.98B, every model ≤4B; largest single = MiniCPM5 1.08B |
| 🔌 Off the Grid (badge) | all open weights run locally; offline index; no cloud inference at runtime |
| 🎯 Well-Tuned (badge) | published LoRA fine-tune of MiniCPM5 on the Hub (§1, decision #3) → **6/6 badges** |
| 🎨 Off-Brand (badge + $1.5k) | `gr.Server` custom UI is the agent's output surface |
| 🏮 OpenBMB ($10k) | brain = MiniCPM5-1B ("OpenBMB pick") |
| 🟩 NVIDIA Quest (2× RTX 5080) | ASR = Nemotron (§5.1) |
| 🦙 Llama Champion (badge) | EmbeddingGemma GGUF retrieval index and runtime query embeddings run through llama.cpp (§5.5) |
| 📡 Sharing is Caring (badge) | publish the agent's tool-call trace to the Hub |
| 📓 Field Notes (badge) | this DESIGN.md → a build blog post |
| 🎖️ Bonus Quest Champion ($2k) | 6/6 badges (needs the Well-Tuned fine-tune) |
| 🤖 Best Agent ($1k) | real multi-tool loop: investigate → ideate → score → plan |
| 🟢 Modal ($20k credits) | offline crawl+embed + LoRA training on Modal (build-time, separated from runtime) |
| 🎬 Best Demo ($1k) | the mandatory demo video, made to sing (shared artifact + wow beat) |
| 🌀 OpenAI ($10k) | auto-entered ("across all submissions"); free lottery ticket, not a target |
| ❤️ Community Choice ($2k) | shareable tweetable artifact from the experience |

**6 badges** = Off the Grid, Well-Tuned, Off-Brand, Llama Champion, Sharing is Caring, Field Notes. Awards stack across
categories. For the single-winner awards (Tiny Titan, Best Agent, Off-Brand, Best Demo), eligibility ≠ win; the shared
lever is §11 custom-UI polish.

---

## 13. Risks / open items

1. **Deployment smoke tests are mandatory:** ZeroGPU Space build, same-origin NDJSON browser streaming, and Nemotron
   batch ASR in `@spaces.GPU` must be verified after every runtime dependency change.
2. **EmbeddingGemma is gated** — accept Gemma terms + `HF_TOKEN` before any crawl/build.
3. **MiniCPM5 tool-call reliability at 1B** — covered by the degradation ladder (§8); validate name+args in code.
4. **Concept skin** — **chosen: The Unwritten Almanac** (§2). Make-or-break is the bleed/bloom hero animation; build the
   safe typewriter + static-stamp floor first (graceful degradation), upgrade ink last. Watch the thin-org echo
   threshold + the dry-but-benevolent tone (real cited Spaces, never punch at a named builder).
5. **Param-budget claim** — document the 1.98B total in the README/Space card for Tiny Titan judging.

---

## 14. Build order

**Text-first vertical slice first; voice input is now part of the app.** Always keep a demoable artifact.

0. **Day-1 spikes** (§1) — get the three go/no-go builds green.
1. **`crawler.py` + Modal index** — crawl the org, embed with EmbeddingGemma, build the local index. *You immediately
   see what everyone's building and where the whitespace is.*
2. **`tools.py`** — research + ideation tools + the hardcoded `score_idea` rubric + the jargon alias layer, over the index.
3. **`agent.py`** — 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via `transformers` (self-parsed XML).
4. **`app.py`** — `gr.Server` custom frontend (idea board, project/whitespace wall, streaming text), called via
   first-party `/api/...` endpoints; concept skin applied.
5. **Well-Tuned LoRA** — small fine-tune on Modal → publish to Hub (→ 6/6 badges).
6. **Voice input** — push-to-talk record and voice-note upload through Nemotron batch ASR in `/api/transcribe`.
7. **Polish + submission** — demo video + social post (Best Demo / Community Choice), publish agent trace (📡),
   write up Field Notes (📓).

**Deferred:** real-time streaming ASR and turn detection. The shipped path stays batch audio → transcript → editable idea.

---

## 15. Sources

**Models:** [nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) ·
[MiniCPM5-1B](https://huggingface.co/openbmb/MiniCPM5-1B) · [MiniCPM5-1B-GGUF](https://huggingface.co/openbmb/MiniCPM5-1B-GGUF) ·
[embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)

**Platforms:** [ZeroGPU docs](https://huggingface.co/docs/hub/spaces-zerogpu) ·
[Introducing gradio.Server](https://huggingface.co/blog/introducing-gradio-server) · [Gradio Server Mode guide](https://www.gradio.app/guides/server-mode) ·
[Modal GPU](https://modal.com/docs/guide/gpu) · [Modal model weights](https://modal.com/docs/guide/model-weights) · [Modal pricing](https://modal.com/pricing) ·
[Build Small Hackathon](https://huggingface.co/build-small-hackathon)

*Verify-before-ship: Nemotron-in-ZeroGPU after dependency changes; MiniCPM5 license on the live card; llama.cpp MiniCPM5
tool-calling remains planned only and is not used by the deployed brain.*