
Build Small Hackathon Advisor: Design & Implementation Notes

A small-model agent with text and voice input that investigates what other people have already built for the Build Small Hackathon and brainstorms an original new design with you. Output = streaming text + live visuals (no TTS). All models small, open-weight, run locally.

The literal "advisor" is the engine; the user-facing experience is The Unwritten Almanac: Mothback, an owl-moth archivist, keeps the Wood's book of fates and divines you a still-unwritten project page (ink bleeds + cites real Spaces if you overlap, blooms gold if it's new). This project is itself a Build Small submission (hack window 2026-06-05 → 2026-06-15).


1. Locked decisions & review corrections (2026-06-07)

A multi-agent adversarial review (5 dimensions, web-verified) set the direction below. This section is the authoritative decision log; the rest of the doc is written to be consistent with it.

Locked decisions (Jacob):

  1. Concept = The Unwritten Almanac (chosen 2026-06-07 from a 12-concept brainstorm). Mothback the owl-moth archivist divines a fate-page; ink bleeds and cites the real Spaces you overlap (page 47, page 112…), or blooms gold + sprouts a leaf when it's unwritten. Engine unchanged underneath (crawl → whitespace/originality → score). The dry "advisor" stays under the hood. Full spec + de-risking grafts in §2.
  2. Text-first with voice input. The core workflow remains typed/editable text. Voice records or uploads a note, transcribes it with batch ASR, and places the transcript in the same idea box. Real-time streaming + in-browser turn detection are deferred.
  3. Add a 🎯 Well-Tuned fine-tune: a small LoRA (MiniCPM5 advisor persona / tool-calling), trained on Modal, published to the Hub → 6/6 badges → strong shot at 🎖️ Bonus Quest Champion ($2,000).
  4. ASR = Nemotron batch. nvidia/nemotron-speech-streaming-en-0.6b runs through NVIDIA NeMo in a ZeroGPU function. Audio is normalized to mono WAV before calling transcribe([wav]).

Verified corrections:

  • Drop SGLang. It needs a persistent GPU process → incompatible with ZeroGPU (same root cause as vLLM). Run MiniCPM5 via plain transformers inside @spaces.GPU and parse its XML tool calls in our own code.
  • gr.Server custom UI streaming IS shipped (the launch blog only deferred the explanation). The deployed browser UI calls our own same-origin /api/agent-turn NDJSON stream with fetch; _engine_turn itself is wrapped in @spaces.GPU, so the real MiniCPM5 + LoRA path still runs on ZeroGPU. The @app.api("/agent_turn") generator stays available for Gradio/Python clients and contract checks, but the visible app no longer depends on the CDN @gradio/client path after real Space testing showed that browser turn could hang while the backend completed.
  • OpenAI Track has NO model requirement ("OpenAI's own podium across all submissions") → auto-entered; a free lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
  • Badges = 6 total (Tiny Titan is a $1.5k special award, not a badge). Decision #3 takes us from 5/6 → 6/6.
  • Tiny Titan = "best ≤4B model"; our largest single model is MiniCPM5 (1.08B), total stack ~1.98B (0.6B + 1.08B + ~0.3B, see §3) → eligible.

New build requirements surfaced by the review (designed into the sections below):

  • Jargon alias layer (§7): a 0.6B ASR mistranscribes our own vocab (Nemotron, MiniCPM5, EmbeddingGemma, ZeroGPU…). Deterministic code-side fuzzy/alias map over our small CLOSED vocab, applied before any tool call and before display. Surface "heard: neutron → Nemotron" as a delightful trust moment. (Active once voice is added.)
  • Tool-call degradation ladder (§8): the 1B brain WILL emit broken tool calls (MiniCPM5-1B has a documented "broken tool calling" report). Wrap parse in try/except, retry once at low temp, validate name+args vs JSON-Schema in code (reject-and-repair), canned lines for empty results, a token watchdog that shows "trying again" instead of dead air (the screen is the only feedback channel; there is no TTS).
  • Latency / optimistic UI (§9/§11): ZeroGPU cold start + 1B generation = seconds of potential dead air. Optimistic UI on submit, pre-animate the project wall, set a latency budget. (The torch.compile cold-start penalty does NOT apply, since we don't use it.)

Day-1 go/no-go spikes (before any feature work):

  • Trivial @spaces.GPU hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
  • gr.Server minimal: static index.html + one same-origin /api/agent-turn NDJSON stream, plus the retained @app.api() generator for external clients, on the real ZeroGPU Space.
  • Nemotron nemo_toolkit[asr] install + one batch transcribe() inside @spaces.GPU (decision #4).

2. Concept: The Unwritten Almanac (text-first)

The engine, regardless of skin:

  1. Investigate the build-small-hackathon HF org (what Spaces exist, which models, what's saturated, and where the whitespace is) using a local EmbeddingGemma index.
  2. Brainstorm with the user: propose ideas, score them against a fixed rubric (originality vs. existing projects, delight, AI-necessity, feasibility, param budget, prize-fit), and maintain an idea board.
  3. Respond as streaming text + live visuals in a custom gr.Server frontend (no TTS; the visual is the "voice").

The skin (chosen): The Unwritten Almanac. Mothback, a dusty owl-moth archivist, keeps the Wood's book of fates. Every project already built in the org is an inked page; she divines you a destined entry on a still-blank page, the ink writing itself live.

The two-beat wow (this IS the engine, rendered):

  • You type one line about yourself / your idea. Inked pages riffle past (each = a real crawled Space).
  • Bleed: if your idea overlaps existing work, the ink seeps blood-red and cites the exact real Spaces: "the Wood already wrote this, on page 47 and page 112" (= get_project overlap on the top retrieval hits). The burn is factual, so it can't fall flat the way a 1B's invented joke can.
  • Bloom: you say "write bolder"; the next entry flows gold and a green leaf sprouts: "this page has never been inked" (= a find_whitespace gold candidate).
  • A wax seal presses in, lighting five quadrants as the idea qualifies (= score_idea: Originality, Delight, AI-Necessity, Feasibility, Prize-Fit).

Engine ↔ skin mapping: search_projects/get_project overlap → the bleed + citations; find_whitespace → the blank/gold pages; score_idea → the wax-seal quadrants; save_idea → the written fate-page; agent persona = Mothback (Layer A system prompt + the 🎯 Well-Tuned LoRA = her voice).

Shareable artifact (Community Choice): the page exports as a PNG that looks torn from an ancient grimoire: aged parchment, a coined fate-name as title, the self-written prophecy, the five-quadrant seal, and a verdict stamp ("UNWRITTEN · 0 echoes" vs "ECHO ×3"). Built-in caption: "Mothback inked my fate page for #BuildSmall: UNWRITTEN." People compile draws into a "chapter" and dare friends to get a page that doesn't bleed.

Grafted de-risking (from runner-up concepts):

  • Tone = dry-but-benevolent (Roastleaf's whiplash): the bleed-citation gently stings, the gold-bloom is sincerely delighted; the burn is true-by-construction (real cited Spaces).
  • Templated structure (key risk-killer): bank entry/roast templates (citation + dry verdict + redemptive branch); the 1B only fills in real Space titles + the idea and never improvises whole comedy.
  • Latin-binomial fate-names (e.g. "Ludus Vocalis Infantium") via templated scaffolds: built-in wit that backstops a 1B that might produce corny names.
  • "You vs the Wood" margin glyph: a tiny cluster-dot thumbnail on the page showing your gold page among the inked crowd. Cheap SVG, visual PROOF the gap is real.
  • Thin-org mitigation (load-bearing): precompute whitespace clusters at Modal build-time and pin several DISTINCT blank-page candidates so "write bolder" always lands on a real, varied gap (the org may be only ~30–60 Spaces). Tune the echo threshold toward more frequent bleed so the demo always has its "low" before the "wow".

Defaults (revisit if time): single-page artifact first (chapter compiler later); page-numbers visible, real titles on hover (keep the burn aimed at the idea, not a named builder); seal animation = safe typewriter + static-stamp floor first, bespoke ink-reveal last. Voice input is batch ASR that fills the same idea box before the user presses Ink.

Input is text-first; the experience is fully delightful with typed input alone.

AI is genuinely load-bearing: embeddings power the whitespace/originality analysis and the LLM drives the investigate → ideate → score loop; the experience collapses without the models (supports 🤖 Best Agent + TTW "AI necessity").


3. Model stack (confirmed exact repo IDs)

| Role | Model | Params | Runtime | License | Prize hook |
| --- | --- | --- | --- | --- | --- |
| STT (batch voice input) | nvidia/nemotron-speech-streaming-en-0.6b | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | 🟩 NVIDIA Nemotron Quest |
| LLM brain | openbmb/MiniCPM5-1B ("OpenCPM5") | 1.08B | transformers (self-parse XML) / llama.cpp | Apache-2.0 | OpenBMB |
| Embedder | ggml-org/embeddinggemma-300m-qat-q8_0-GGUF | ~300M | llama.cpp / llama-cpp-python | Gemma | 🔌 Off the Grid · 🦙 Llama Champion · 🟢 Modal |
| Fine-tune | LoRA on MiniCPM5 → published to Hub | n/a | PEFT (trained on Modal, per §1 decision #3) | n/a | 🎯 Well-Tuned |

Total ≈ 1.98B params → ≤4B → 🐜 Tiny Titan eligible. All open-weight, all runnable locally → 🔌 Off the Grid.
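The budget arithmetic can be checked in a few lines. This is a minimal sketch using the parameter counts from the table above; it assumes the embedder is counted at ~0.3B and that the LoRA adapter is not counted separately, and the helper name is illustrative only.

```python
# Parameter counts in billions, copied from the model-stack table.
# Assumption: the embedder is rounded to 0.3B and the LoRA adapter is not counted.
STACK_B = {
    "nvidia/nemotron-speech-streaming-en-0.6b": 0.6,
    "openbmb/MiniCPM5-1B": 1.08,
    "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF": 0.3,
}
TINY_TITAN_LIMIT_B = 4.0  # "best <=4B model" from the review notes


def stack_total_b(stack: dict[str, float]) -> float:
    """Sum the parameter counts, rounded to 2 decimals to hide float noise."""
    return round(sum(stack.values()), 2)


total = stack_total_b(STACK_B)
largest = max(STACK_B.values())
eligible = largest <= TINY_TITAN_LIMIT_B and total <= TINY_TITAN_LIMIT_B
print(f"total={total}B largest={largest}B eligible={eligible}")
```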

Naming: "OpenCPM5 1B" = openbmb/MiniCPM5-1B (MiniCPM 5.0, ~May 2026). "EmbeddingGemma 270M" = google/embeddinggemma-300m (308M total; 270M = non-embedding transformer params). SGLang dropped (ZeroGPU incompatible). STT is used in batch voice-note mode, not a persistent stream.


4. Deployment & architecture (single path)

With text-first + batch ASR, the old "streaming ASR vs ZeroGPU" Config A/B tension dissolves; there is one path:

  • ZeroGPU Gradio-SDK Space (free). GPU is attached only inside @spaces.GPU calls (default 60s, max ~120s, RTX Pro 6000 Blackwell, large=48 GB). Per-turn inference fits this model exactly.
  • Text-first runtime loop: user types → custom /api/agent-turn NDJSON endpoint → one @spaces.GPU call runs MiniCPM5 (tool loop, in transformers) → streamed text tokens + live visual updates. The @app.api() endpoint remains as the Gradio-client contract for external checks.
  • Voice input: push-to-talk records an utterance or uploads a voice note → /api/transcribe normalizes audio with ffmpeg → one @spaces.GPU call runs Nemotron ASR through NeMo → transcript fills the idea box. No persistent stream, no WebRTC, no TURN server.
  • Modal (build-time only): crawl the org + build the llama.cpp EmbeddingGemma vector index offline; the Space ships with checked-in project vectors. Runtime never calls Modal → 🔌 Off the Grid holds (see §10).

Off the Grid = no proprietary cloud inference APIs. Open weights on an HF GPU Space / local box / Modal all qualify.

Deferred: real-time streaming ASR and turn detection are not part of the shipped app.


5. Per-model implementation notes

5.1 ASR: nvidia/nemotron-speech-streaming-en-0.6b (batch)

  • Primary, batch usage (simple):
    import nemo.collections.asr as nemo_asr
    asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-speech-streaming-en-0.6b")
    text = asr.transcribe(["utterance.wav"])      # 16 kHz mono WAV in; punctuated EN text out
    
    Runtime install: packages.txt provides ffmpeg and libsndfile1; requirements.txt pins nemo_toolkit[asr]==2.7.3 plus Cython and packaging. The app records or uploads audio, normalizes it to mono 16 kHz WAV, runs NeMo in a ZeroGPU function, then returns the transcript to the idea box. Hosted NVIDIA NIM API would break Off the Grid, so it is not used.
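The normalization step before transcribe() might look like the sketch below. It only assumes that ffmpeg is on PATH (as packages.txt provides); the function names and file names are hypothetical, not taken from the app.

```python
import subprocess
from pathlib import Path


def ffmpeg_normalize_cmd(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that converts any input to mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",            # overwrite dst if it already exists
        "-i", src,                 # recorded or uploaded audio, any container
        "-ac", "1",                # downmix to mono
        "-ar", str(sample_rate),   # resample (16 kHz by default)
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM
        dst,
    ]


def normalize_audio(src: str, dst: str) -> Path:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_normalize_cmd(src, dst), check=True, capture_output=True)
    return Path(dst)
```

Keeping the command builder separate from the subprocess call makes the flags easy to check in a unit test without ffmpeg installed.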

5.2 MiniCPM5-1B brain: openbmb/MiniCPM5-1B (transformers, self-parsed XML)

  • Context 128K, bilingual (EN/ZH), Apache-2.0. enable_thinking=False, temperature=0.7, top_p=0.95 for fast tool calls.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM5-1B")
    model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM5-1B", torch_dtype="auto", device_map="auto")
    inputs = tok.apply_chat_template(messages, tools=TOOLS, add_generation_prompt=True, enable_thinking=False,
                                     tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
    
  • Tool calling: pass JSON-Schema tools via the chat template tools= arg; the model emits XML <function name="get_weather">{"city":"New York"}</function>. Parse this ourselves (SGLang dropped). Wrap parse in try/except and validate against the schema; see the degradation ladder (§8).
  • Local / CPU & llama.cpp (Off the Grid · Llama Champion): openbmb/MiniCPM5-1B-GGUF:Q4_K_M (688 MB) via llama.cpp or Ollama (CPU-viable). fp16 ≈ 3–4 GB VRAM. openbmb/MiniCPM5-1B-MLX for Apple Silicon. (llama.cpp MiniCPM5 tool-calling is a pending PR; verify before relying on it for the badge runtime.)
  • 1B discipline: small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
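A self-parsed tool call could be handled roughly as follows. This is a sketch that assumes the model output matches the <function name="...">{json}</function> shape quoted above; the regex and the function name are illustrative, and anything that does not match is treated as "no tool call".

```python
import json
import re
from typing import Optional

# Matches the XML shape quoted in the notes: <function name="...">{json}</function>
_CALL_RE = re.compile(r'<function\s+name="([^"]+)"\s*>(.*?)</function>', re.DOTALL)


def parse_tool_call(text: str) -> Optional[tuple[str, dict]]:
    """Return (tool_name, args) for the first well-formed call, else None."""
    match = _CALL_RE.search(text)
    if match is None:
        return None
    name, raw_args = match.group(1), match.group(2).strip()
    try:
        args = json.loads(raw_args) if raw_args else {}
    except json.JSONDecodeError:
        return None  # broken JSON is handled by the caller's retry logic
    if not isinstance(args, dict):
        return None  # arguments must be a JSON object
    return name, args
```

For example, parse_tool_call('<function name="get_weather">{"city":"New York"}</function>') yields the name "get_weather" and the argument dict {"city": "New York"}.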

5.4 EmbeddingGemma GGUF: ggml-org/embeddinggemma-300m-qat-q8_0-GGUF

  • Active retrieval model: embeddinggemma-300m-qat-Q8_0.gguf, 768-dimensional normalized embeddings.
  • Build-time path: Modal remote function runs llama-cpp-python with mean pooling and writes data/project_index.json.
  • Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search over checked-in project vectors.
  • Evidence is recorded in index metadata: model repo, GGUF filename, runtime, dimensions, build source, builder script, llama-cpp-python version, and Modal app name.
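The runtime search over checked-in vectors reduces to a dot product, because the stored embeddings are normalized. The sketch below uses plain Python lists and a hypothetical {project_id: vector} layout; the real index schema and the embedding call are not shown.

```python
import math
from typing import Mapping, Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(query_vec: Sequence[float],
           index: Mapping[str, Sequence[float]],
           k: int = 5) -> list[tuple[str, float]]:
    """Return the k most similar project ids; index vectors are assumed unit-length."""
    q = l2_normalize(query_vec)
    scored = sorted(((dot(q, vec), pid) for pid, vec in index.items()), reverse=True)
    return [(pid, score) for score, pid in scored[:k]]
```

With around 100 documents of 768 dimensions, a linear scan like this is cheap enough that no vector database is needed.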

5.5 llama.cpp support (🦙 Llama Champion)

The active Llama Champion path is the retrieval model: the project index is built with EmbeddingGemma GGUF through llama.cpp on Modal, and runtime query embeddings use the same llama.cpp path.

| Model | llama.cpp? | Runtime | Notes |
| --- | --- | --- | --- |
| openbmb/MiniCPM5-1B | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
| ggml-org/embeddinggemma-300m-qat-q8_0-GGUF | ✅ active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
| ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |

The checked-in index and runtime query embedder must stay on the same GGUF file.


6. Agent context design (built for a 1B brain)

Core principle: the 1B model is a router + arg-filler. All heavy work (crawl, summarize, score, rank, dedup) lives in code. Keep live context to ~800–1200 tokens of curated view, never raw data.

  • Layer A (System; static, ~250 tok): identity/character; hackathon hard rules (≤32B, Gradio Space, demo video) so it self-filters infeasible ideas; targeted prizes (biases ideation); reply style (short, one question at a time); explicit tool-use instructions + the canonical jargon vocabulary (so it can self-correct, §7).
  • Layer B (Session state; re-rendered each turn by code, ~300 tok): user profile; locked decisions (track, side quests, models); idea board (2–3 candidates, one line + scores); compact "projects already seen" summary.
  • Layer C (Ephemeral; ~300 tok): last 2–3 turns; the most recent tool result as a refined card (not raw JSON).
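The three layers can be assembled by code each turn. The sketch below is one possible shape: the SessionState fields, the line prefixes, and the whitespace-based token estimate are all assumptions (a real build would count tokens with the model tokenizer).

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Hypothetical Layer B container; field names are illustrative."""
    profile: dict = field(default_factory=dict)
    decisions: list = field(default_factory=list)
    ideas: list = field(default_factory=list)      # (title, score) pairs
    seen_summary: str = ""


def rough_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def render_state(state: SessionState, max_ideas: int = 3) -> str:
    """Render Layer B as a few compact lines instead of raw data."""
    lines = []
    if state.profile:
        lines.append("PROFILE: " + "; ".join(f"{k}={v}" for k, v in sorted(state.profile.items())))
    if state.decisions:
        lines.append("LOCKED: " + "; ".join(state.decisions))
    for title, score in state.ideas[:max_ideas]:
        lines.append(f"IDEA: {title} (score {score})")
    if state.seen_summary:
        lines.append("SEEN: " + state.seen_summary)
    return "\n".join(lines)


def build_context(system: str, state: SessionState, turns: list,
                  tool_card: str = "", max_turns: int = 3, budget: int = 1200) -> str:
    """Layer A + Layer B + Layer C, dropping the oldest turns if over budget."""
    recent = list(turns[-max_turns:])

    def assemble() -> str:
        parts = [system, render_state(state), *recent, tool_card]
        return "\n\n".join(p for p in parts if p)

    ctx = assemble()
    while rough_tokens(ctx) > budget and recent:
        recent.pop(0)  # oldest turn goes first; system prompt and state are kept
        ctx = assemble()
    return ctx
```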

7. Agent tool design

Few tools, few params each, short descriptions (1B-friendly). Heavy logic in code.

Jargon alias layer (input normalization). Before any tool call and before display, run ASR/user text through a deterministic fuzzy/alias map over our small CLOSED vocab (model names and goal names), e.g. RapidFuzz token_set_ratio / double-metaphone, mapping "neutron"/"nemo tron" → Nemotron, "mini cpm" → MiniCPM5, "zero gpu" → ZeroGPU. Surface the correction ("heard: neutron → Nemotron") as a trust-building, slightly delightful moment.
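A minimal sketch of such an alias layer follows. It swaps in the standard library's difflib as a stand-in for RapidFuzz so that it runs without extra dependencies; the alias table, the 0.8 cutoff, and the function name are assumptions for illustration.

```python
import difflib
import re

# Closed vocabulary and explicit aliases; the entries mirror the examples in the text.
CANONICAL = ["Nemotron", "MiniCPM5", "EmbeddingGemma", "ZeroGPU"]
ALIASES = {
    "neutron": "Nemotron",
    "nemo tron": "Nemotron",
    "mini cpm": "MiniCPM5",
    "zero gpu": "ZeroGPU",
}
_LOWER = {c.lower(): c for c in CANONICAL}


def normalize_jargon(text: str, cutoff: float = 0.8) -> tuple[str, list[tuple[str, str]]]:
    """Return (corrected_text, [(heard, canonical), ...])."""
    corrections: list[tuple[str, str]] = []

    # Pass 1: exact alias phrases, longest first so multi-word aliases win.
    for alias in sorted(ALIASES, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE)
        if pattern.search(text):
            text = pattern.sub(ALIASES[alias], text)
            corrections.append((alias, ALIASES[alias]))

    # Pass 2: fuzzy-match single tokens against the canonical names.
    def fix(match: re.Match) -> str:
        word = match.group(0)
        if word in CANONICAL:
            return word
        hit = difflib.get_close_matches(word.lower(), list(_LOWER), n=1, cutoff=cutoff)
        if hit:
            corrections.append((word, _LOWER[hit[0]]))
            return _LOWER[hit[0]]
        return word

    return re.sub(r"[A-Za-z0-9]+", fix, text), corrections
```

The returned corrections list is what the UI would render as a "heard: neutron → Nemotron" note.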

Research: investigate existing projects (the core value). Data = build-small-hackathon org Spaces, pre-crawled into a local snapshot + EmbeddingGemma index (keeps Off the Grid at runtime).

| Tool | Signature | Returns (refined) | Heavy work |
| --- | --- | --- | --- |
| list_projects | (track?, sort?) | top-N project cards | HF Hub API + summarize |
| search_projects | (query) | top 5 cards | EmbeddingGemma retrieval |
| get_project | (id) | card + overlap-vs-board verdict | code computes overlap |
| find_whitespace | () | under-explored niches | cluster the index, find gaps |

find_whitespace is the originality engine (TTW judges originality): it names where nobody has built yet.
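One simple way to score whitespace is "distance to the nearest existing project". The sketch below is a simplification rather than the clustering the notes describe: it assumes unit-length vectors, a precomputed set of labeled candidate niches, and hypothetical thresholds (echo_threshold, distinct_below) that would need tuning on the real index.

```python
from typing import Mapping, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def find_whitespace(candidates: Mapping[str, Vector],
                    projects: Sequence[Vector],
                    k: int = 3,
                    echo_threshold: float = 0.75,
                    distinct_below: float = 0.9) -> list[tuple[str, float]]:
    """Pick up to k candidate niches that are far from every project and from each other."""
    scored = []
    for label, vec in candidates.items():
        nearest = max((dot(vec, p) for p in projects), default=0.0)
        if nearest < echo_threshold:               # otherwise it echoes existing work
            scored.append((label, 1.0 - nearest))  # gap = distance to the nearest project
    scored.sort(key=lambda item: (-item[1], item[0]))  # widest gap first, label breaks ties

    picked: list[tuple[str, float]] = []
    for label, gap in scored:
        # Keep candidates DISTINCT: skip near-duplicates of anything already picked.
        if all(dot(candidates[label], candidates[other]) < distinct_below for other, _ in picked):
            picked.append((label, gap))
        if len(picked) == k:
            break
    return picked
```

Raising echo_threshold makes fewer candidates count as echoes, so it trades bleed frequency against how many blank pages are offered.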

Ideation / state.

| Tool | Signature | Purpose |
| --- | --- | --- |
| save_idea | (title, pitch) | add/update a candidate on the idea board |
| score_idea | (id) | fixed (hardcoded) rubric → scores + gaps; the 1B only triggers + verbalizes |
| compare_ideas | () | rank the board, articulate tradeoffs |
| make_plan | (id) | build plan + goals the current direction can support |
| update_profile | (field, value) | record skills/time/prefs → Layer B |
| set_goals | (goals[]) | change selected goals → updates Layer A bias |
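The tool signatures above can be written as JSON-Schema definitions and checked in code before any tool runs. The sketch below covers four of the tools; the argument types, the descriptions, and the hand-rolled validator (used here instead of a full JSON-Schema library) are assumptions.

```python
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
               "boolean": bool, "array": list, "object": dict}


def tool(name: str, description: str, properties: dict, required=None) -> dict:
    """Build a function-style tool definition; every parameter is required by default."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {key: {"type": typ} for key, typ in properties.items()},
                "required": list(properties) if required is None else list(required),
            },
        },
    }


TOOLS = [
    tool("search_projects", "Find existing Spaces similar to a query.", {"query": "string"}),
    tool("get_project", "Show one project card and its overlap verdict.", {"id": "string"}),
    tool("find_whitespace", "List under-explored niches.", {}),
    tool("save_idea", "Add or update an idea on the board.", {"title": "string", "pitch": "string"}),
]


def validate_call(name: str, args, tools=TOOLS) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    spec = next((t["function"] for t in tools if t["function"]["name"] == name), None)
    if spec is None:
        return [f"unknown tool: {name}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    params = spec["parameters"]
    problems = [f"missing argument: {key}" for key in params["required"] if key not in args]
    for key, value in args.items():
        prop = params["properties"].get(key)
        if prop is None:
            problems.append(f"unexpected argument: {key}")
        elif not isinstance(value, _JSON_TYPES[prop["type"]]):
            problems.append(f"wrong type for {key}: expected {prop['type']}")
    return problems
```

The problem strings double as repair hints that can be fed back into a retry prompt.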

8. Agent loop (single-hop + degradation ladder)

on user input (text; or voice → batch ASR → text):
  normalize via jargon alias layer
  ctx = LayerA + render_state(LayerB) + last_turns + last_tool_card
  out = MiniCPM5(ctx, tools=TOOLS, enable_thinking=False, temp=0.7)   # → tool_call | reply
  try: parse XML tool call
  except / invalid name|args (vs JSON-Schema):                         # degradation ladder
      retry once (temp≈0.3, "emit ONLY one valid tool call")
      still bad → run a safe default tool (find_whitespace) so the screen never freezes
  if tool_call: card = run_tool(out); reply = MiniCPM5(ctx + card)     # single follow-up, no long ReAct
  empty/zero result → canned advisor line (never say nothing)
  stream reply tokens → custom UI   |   token watchdog: no token in N s → "trying again" visual (not dead air)
  update_state(LayerB)

Max one tool-call then reply. A 1B can't sustain multi-step ReAct; wrap multi-step flows (search → get_project → score) into one code "research" action the model calls once. The degradation ladder is a first-class UX surface (§11), not an error branch: the screen is the only feedback channel (no TTS).
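The ladder in the pseudocode can be expressed as a small function around an injected generate callable. This is a sketch under assumptions: the generate(ctx, temperature) signature, the reduced KNOWN_TOOLS table, and the rung labels are invented for illustration, and the model call itself is not shown.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical registry: tool name -> exact set of expected argument names.
KNOWN_TOOLS = {"search_projects": {"query"}, "get_project": {"id"}, "find_whitespace": set()}
SAFE_DEFAULT = ("find_whitespace", {})
_CALL_RE = re.compile(r'<function\s+name="([^"]+)"\s*>(.*?)</function>', re.DOTALL)


def parse_valid_call(text: str) -> Optional[tuple[str, dict]]:
    """None for a plain reply, (name, args) for a valid call, ValueError for a broken one."""
    if "<function" not in text:
        return None
    match = _CALL_RE.search(text)
    if match is None:
        raise ValueError("malformed tool call")
    name, raw = match.group(1), match.group(2).strip()
    try:
        args = json.loads(raw) if raw else {}
    except json.JSONDecodeError as exc:
        raise ValueError("arguments are not valid JSON") from exc
    if name not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {name}")
    if not isinstance(args, dict) or set(args) != KNOWN_TOOLS[name]:
        raise ValueError(f"bad arguments for {name}")
    return name, args


def decide(generate: Callable[[str, float], str], ctx: str) -> tuple[str, object, str]:
    """Return (kind, payload, rung), where rung names the ladder step that produced it."""
    out = generate(ctx, 0.7)
    try:
        call = parse_valid_call(out)
        return ("tool", call, "first-try") if call else ("reply", out, "first-try")
    except ValueError:
        pass  # rung 2: retry once at a lower temperature with a stricter instruction
    retry = generate(ctx + "\nEmit ONLY one valid tool call.", 0.3)
    try:
        call = parse_valid_call(retry)
        if call:
            return ("tool", call, "retry")
    except ValueError:
        pass
    # Rung 3: the safe default keeps the screen moving instead of freezing.
    return ("tool", SAFE_DEFAULT, "safe-default")
```

Returning the rung alongside the decision lets the UI show a "trying again" state for the retry and safe-default cases.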


9. ZeroGPU deployment notes

  • import spaces; @spaces.GPU(duration=…). GPU only inside decorated fns; Gradio-SDK Space only (no Docker ZeroGPU).
  • Load models at module level, .to('cuda') once (emulated until first real GPU call); real compute inside the decorator. torch 2.8+; no torch.compile (use AOT). Quota PRO ~40 min/day → never idle-hold the GPU.
  • Frontend → backend via same-origin fetch("/api/agent-turn") reading NDJSON from our FastAPI route. The GPU boundary is _engine_turn, decorated with @spaces.GPU; @app.api() endpoints remain available for Gradio-client tests and external callers.
  • All four stack entries (three models plus the LoRA adapter, ~1.98B params total) fit in large (48 GB). Keep each @spaces.GPU call short for queue priority.

10. Modal: offline pipeline (build-time only → preserves Off the Grid)

Modal = build-time; runtime never calls it. This is how the app claims both 🟢 Modal and 🔌 Off the Grid. The canonical command is:

.venv/bin/modal run scripts/modal_build_project_index.py \
  --projects data/projects.json \
  --out data/project_index.json

The remote function installs llama-cpp-python, downloads ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf, embeds every project card through llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.

Latest successful run: hackathon-advisor-llama-index on Modal, producing a 100-document, 768-dimensional normalized index at 2026-06-07T08:16:19+00:00.


11. Frontend: gr.Server custom UI (🎨 Off-Brand)

No TTS → the visual output is the agent's "voice"; it must carry the delight (this is what earns Off-Brand, and the TTW polish + Best Demo score). The visual world is The Unwritten Almanac (§2): a candlelit tree-hollow with a heavy open grimoire as the hero component.

  • gradio.Server is a FastAPI subclass serving your own frontend while still exposing @app.api(name=...) functions for Gradio/Python clients. The visible app uses first-party @app.post() endpoints for deterministic browser behavior; the GPU boundary stays in the decorated engine function.
    from pathlib import Path

    from fastapi.responses import HTMLResponse
    from gradio import Server

    app = Server()

    @app.api(name="agent_turn", concurrency_limit=2)
    async def agent_turn(message: str):
        # run_agent_stream is the app's own token generator (not shown here).
        for token in run_agent_stream(message):   # generator → SSE
            yield token

    @app.get("/", response_class=HTMLResponse)     # custom UI replaces Gradio's default page
    async def home():
        # read_text() opens and closes the file, unlike a bare open().read()
        return Path("index.html").read_text(encoding="utf-8")

    app.launch()
    
  • Frontend calls via fetch("/api/agent-turn"), parses newline-delimited JSON events, and updates the grimoire as start / token / done messages arrive. Notes and chapter exports use /api/field-notes and /api/chapter.
  • UI surfaces (the grimoire is the canvas): streaming reply = ink writing itself (typewriter on already-streaming tokens); search_projects/overlap → bleed animation + page-number citations (real titles on hover); find_whitespace → gold bloom + sprouting leaf + a one-shaft light-mask ("the page chooses you"); score_idea → wax-seal five-quadrant stamp; the riffling inked pages (fast page-flip of real Spaces) double as the project-wall; export = the torn-grimoire PNG artifact (§2). Jargon-correction toasts (§7) read as Mothback's margin notes; optimistic-UI loading + watchdog states (§8) are her "the page is choosing its words…". Cheap SFX: page-flip, quill scratch, wax-seal thunk.
  • Build the animation floor first: safe typewriter + static stamp ships first (graceful degradation, which the judges credited); upgrade the ink-bleed / gold-bloom / seal-press last.
  • Fallback: the backend (tools.py/agent.py) is UI-agnostic: if gr.Server misbehaves, fall back to gr.Blocks + gr.HTML, losing only the $1.5k Off-Brand badge, never the submission.
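The start / token / done stream can be prototyped without any web framework. In the sketch below the event field names ("type", "text") are assumptions, and parse_ndjson only mimics in Python what the browser-side reader has to do with chunks that split lines arbitrarily.

```python
import json
from typing import Iterable, Iterator


def ndjson_events(tokens: Iterable[str]) -> Iterator[str]:
    """Server side: one JSON object per line, framed by start and done events."""
    yield json.dumps({"type": "start"}) + "\n"
    for token in tokens:
        yield json.dumps({"type": "token", "text": token}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"


def parse_ndjson(chunks: Iterable[str]) -> Iterator[dict]:
    """Client side: buffer chunks and emit an event for each complete line."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
```

Because json.dumps escapes newlines inside strings, a token that contains a line break cannot break the framing.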

12. Prize mapping

| Target | How it's earned |
| --- | --- |
| 🍄 Thousand Token Wood | The Unwritten Almanac (§2): the bleed-citation wow IS the engine rendered; AI load-bearing; original |
| 🐜 Tiny Titan (special, $1.5k) | total ~1.98B, every model ≤4B; largest single = MiniCPM5 1.08B |
| 🔌 Off the Grid (badge) | all open weights run locally; offline index; no cloud inference at runtime |
| 🎯 Well-Tuned (badge) | published LoRA fine-tune of MiniCPM5 on the Hub (§10) → 6/6 badges |
| 🎨 Off-Brand (badge + $1.5k) | gr.Server custom UI is the agent's output surface |
| OpenBMB ($10k) | brain = MiniCPM5-1B ("OpenBMB pick") |
| 🟩 NVIDIA Quest (2× RTX 5080) | ASR = Nemotron (§5.1) |
| 🦙 Llama Champion (badge) | EmbeddingGemma GGUF retrieval index and runtime query embeddings run through llama.cpp (§5.5) |
| 📡 Sharing is Caring (badge) | publish the agent's tool-call trace to the Hub |
| 📓 Field Notes (badge) | this DESIGN.md → a build blog post |
| 🎖️ Bonus Quest Champion ($2k) | 6/6 badges (needs the Well-Tuned fine-tune) |
| 🤖 Best Agent ($1k) | real multi-tool loop: investigate → ideate → score → plan |
| 🟢 Modal ($20k credits) | offline crawl+embed + LoRA training on Modal (build-time, separated from runtime) |
| 🎬 Best Demo ($1k) | the mandatory demo video, made to sing (shared artifact + wow beat) |
| 🌀 OpenAI ($10k) | auto-entered ("across all submissions"); free lottery ticket, not a target |
| ❤️ Community Choice ($2k) | shareable tweetable artifact from the experience |

6 badges = Off the Grid, Well-Tuned, Off-Brand, Llama Champion, Sharing is Caring, Field Notes. Awards stack across categories. For the single-winner awards (Tiny Titan, Best Agent, Off-Brand, Best Demo), eligibility ≠ a win; the shared lever is §11 custom-UI polish.


13. Risks / open items

  1. Deployment smoke tests are mandatory: ZeroGPU Space build, same-origin NDJSON browser streaming, and Nemotron batch ASR in @spaces.GPU must be verified after every runtime dependency change.
  2. EmbeddingGemma is gated: accept Gemma terms + HF_TOKEN before any crawl/build.
  3. MiniCPM5 tool-call reliability at 1B: covered by the degradation ladder (§8); validate name+args in code.
  4. Concept skin (chosen): The Unwritten Almanac (§2). Make-or-break is the bleed/bloom hero animation; build the safe typewriter + static-stamp floor first (graceful degradation), upgrade ink last. Watch the thin-org echo threshold + the dry-but-benevolent tone (real cited Spaces, never punch at a named builder).
  5. Param-budget claim โ€” document the 1.98B total in the README/Space card for Tiny Titan judging.

14. Build order

Text-first vertical slice first; voice input is now part of the app. Always keep a demoable artifact.

  1. Day-1 spikes (§1): get the three go/no-go builds green.
  2. crawler.py + Modal index: crawl the org, embed with EmbeddingGemma, build the local index. You immediately see what everyone's building and where the whitespace is.
  3. tools.py: research + ideation tools + the hardcoded score_idea rubric + the jargon alias layer, over the index.
  4. agent.py: 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via transformers (self-parsed XML).
  5. app.py: gr.Server custom frontend (idea board, project/whitespace wall, streaming text), called via first-party /api/... endpoints; concept skin applied.
  6. Well-Tuned LoRA: small fine-tune on Modal → publish to Hub (→ 6/6 badges).
  7. Voice input: push-to-talk record and voice-note upload through Nemotron batch ASR in /api/transcribe.
  8. Polish + submission: demo video + social post (Best Demo / Community Choice), publish agent trace (📡), write up Field Notes (📓).

Deferred: real-time streaming ASR and turn detection. The shipped path stays batch audio → transcript → editable idea.


15. Sources

Models: nemotron-speech-streaming-en-0.6b · MiniCPM5-1B · MiniCPM5-1B-GGUF · embeddinggemma-300m

Platforms: ZeroGPU docs · Introducing gradio.Server · Gradio Server Mode guide · Modal GPU · Modal model weights · Modal pricing · Build Small Hackathon

Verify-before-ship: Nemotron-in-ZeroGPU after dependency changes; MiniCPM5 license on the live card; llama.cpp MiniCPM5 tool-calling remains planned only and is not used by the deployed brain.