Build Small Hackathon Advisor: Design & Implementation Notes
A small-model agent with text and voice input that investigates what other people have already built for the Build Small Hackathon and brainstorms an original new design with you. Output = streaming text + live visuals (no TTS). All models small, open-weight, run locally.
The literal "advisor" is the engine; the user-facing experience is The Unwritten Almanac: Mothback, an owl-moth archivist, keeps the Wood's book of fates and divines you a still-unwritten project page (ink bleeds + cites real Spaces if you overlap, blooms gold if it's new). This project is itself a Build Small submission (hack window 2026-06-05 to 2026-06-15).
1. Locked decisions & review corrections (2026-06-07)
A multi-agent adversarial review (5 dimensions, web-verified) set the direction below. This section is the authoritative decision log; the rest of the doc is written to be consistent with it.
Locked decisions (Jacob):
- Concept = The Unwritten Almanac (chosen 2026-06-07 from a 12-concept brainstorm). Mothback the owl-moth archivist divines a fate-page; ink bleeds and cites the real Spaces you overlap (page 47, page 112…), or blooms gold + sprouts a leaf when it's unwritten. Engine unchanged underneath (crawl → whitespace/originality → score). The dry "advisor" stays under the hood. Full spec + de-risking grafts in §2.
- Text-first with voice input. The core workflow remains typed/editable text. Voice records or uploads a note, transcribes it with batch ASR, and places the transcript in the same idea box. Real-time streaming + in-browser turn detection are deferred.
- Add a Well-Tuned fine-tune: a small LoRA (MiniCPM5 advisor persona / tool-calling), trained on Modal, published to the Hub → 6/6 badges → strong shot at Bonus Quest Champion ($2,000).
- ASR = Nemotron batch. `nvidia/nemotron-speech-streaming-en-0.6b` runs through NVIDIA NeMo in a ZeroGPU function. Audio is normalized to mono WAV before calling `transcribe([wav])`.
Verified corrections:
- Drop SGLang. It needs a persistent GPU process, which is incompatible with ZeroGPU (same root cause as vLLM). Run MiniCPM5 via plain `transformers` inside `@spaces.GPU` and parse its XML tool calls in our own code.
- gr.Server custom UI streaming IS shipped (the launch blog only deferred the explanation). The deployed browser UI calls our own same-origin `/api/agent-turn` NDJSON stream with `fetch`; `_engine_turn` itself is wrapped in `@spaces.GPU`, so the real MiniCPM5 + LoRA path still runs on ZeroGPU. The `@app.api("/agent_turn")` generator stays available for Gradio/Python clients and contract checks, but the visible app no longer depends on the CDN `@gradio/client` path, because real Space testing showed that the browser turn could hang while the backend completed.
- OpenAI Track has NO model requirement ("OpenAI's own podium across all submissions"), so we are auto-entered: a free lottery ticket. Do NOT add gpt-oss (breaks Tiny Titan, dilutes the small-model thesis). Deliberate non-target.
- Badges = 6 total (Tiny Titan is a $1.5k special award, not a badge). Decision #3 takes us from 5/6 → 6/6.
- Tiny Titan = "best ≤4B model"; our largest single model is MiniCPM5 (1.08B), total stack ~1.98B → eligible.
New build requirements surfaced by the review (designed into the sections below):
- Jargon alias layer (§7): a 0.6B ASR mistranscribes our own vocab (Nemotron, MiniCPM5, EmbeddingGemma, ZeroGPU…). Deterministic code-side fuzzy/alias map over our small CLOSED vocab, applied before any tool call and before display. Surface "heard: neutron → Nemotron" as a delightful trust moment. (Active once voice is added.)
- Tool-call degradation ladder (§8): the 1B brain WILL emit broken tool calls (MiniCPM5-1B has a documented "broken tool calling" report). Wrap parse in try/except, retry once at low temp, validate name+args vs JSON-Schema in code (reject-and-repair), canned lines for empty results, a token watchdog that shows "trying again" instead of dead air (the screen is the only feedback channel; there is no TTS).
- Latency / optimistic UI (§9/§11): ZeroGPU cold start + 1B generation = seconds of potential dead air. Optimistic UI on submit, pre-animate the project wall, set a latency budget. (The torch.compile cold-start penalty does NOT apply because we don't use it.)
Day-1 go/no-go spikes (before any feature work):
- Trivial `@spaces.GPU` hello-cuda build GREEN on torch 2.8+, deps pinned, heavy deps added one at a time.
- `gr.Server` minimal: static `index.html` + one same-origin `/api/agent-turn` NDJSON stream, plus the retained `@app.api()` generator for external clients, on the real ZeroGPU Space.
- Nemotron `nemo_toolkit[asr]` install + one batch `transcribe()` inside `@spaces.GPU` (decision #4).
2. Concept: The Unwritten Almanac (text-first)
The engine, regardless of skin:
- Investigate the `build-small-hackathon` HF org (what Spaces exist, which models, what's saturated, and where the whitespace is) using a local EmbeddingGemma index.
- Brainstorm with the user: propose ideas, score them against a fixed rubric (originality vs. existing projects, delight, AI-necessity, feasibility, param budget, prize-fit), and maintain an idea board.
- Respond as streaming text + live visuals in a custom `gr.Server` frontend (no TTS; the visual is the "voice").
The skin (chosen): The Unwritten Almanac. Mothback, a dusty owl-moth archivist, keeps the Wood's book of fates. Every project already built in the org is an inked page; she divines you a destined entry on a still-blank page, the ink writing itself live.
The two-beat wow (this IS the engine, rendered):
- You type one line about yourself / your idea. Inked pages riffle past (each = a real crawled Space).
- Bleed: if your idea overlaps existing work, the ink seeps blood-red and cites the exact real Spaces: "the Wood already wrote this, on page 47 and page 112" (= `get_project` overlap on the top retrieval hits). The burn is factual, so it can't fall flat the way a 1B's invented joke can.
- Bloom: you say "write bolder"; the next entry flows gold, a green leaf sprouts: "this page has never been inked" (= a `find_whitespace` gold candidate).
- A wax seal presses in, lighting five quadrants as the idea qualifies (= `score_idea`: Originality, Delight, AI-Necessity, Feasibility, Prize-Fit).
Engine → skin mapping: `search_projects`/`get_project` overlap → the bleed + citations; `find_whitespace` → the blank/gold pages; `score_idea` → the wax-seal quadrants; `save_idea` → the written fate-page; agent persona = Mothback (Layer A system prompt + the Well-Tuned LoRA = her voice).
Shareable artifact (Community Choice): the page exports as a PNG that looks torn from an ancient grimoire: aged parchment, a coined fate-name as title, the self-written prophecy, the five-quadrant seal, and a verdict stamp ("UNWRITTEN · 0 echoes" vs "ECHO ×3"). Built-in caption: "Mothback inked my fate page for #BuildSmall - UNWRITTEN." People compile draws into a "chapter" and dare friends to get a page that doesn't bleed.
Grafted de-risking (from runner-up concepts):
- Tone = dry-but-benevolent (Roastleaf's whiplash): the bleed-citation gently stings, the gold-bloom is sincerely delighted; the burn is true-by-construction (real cited Spaces).
- Templated structure (key risk-killer): bank entry/roast templates (citation + dry verdict + redemptive branch); the 1B only fills in real Space titles + the idea and never improvises whole comedy.
- Latin-binomial fate-names (e.g. "Ludus Vocalis Infantium") via templated scaffolds: built-in wit that backstops a 1B that might produce corny names.
- "You vs the Wood" margin glyph: a tiny cluster-dot thumbnail on the page showing your gold page among the inked crowd; cheap SVG, visual PROOF the gap is real.
- Thin-org mitigation (load-bearing): precompute whitespace clusters at Modal build-time and pin several DISTINCT blank-page candidates so "write bolder" always lands on a real, varied gap (the org may be only ~30-60 Spaces). Tune the echo threshold toward more frequent bleed so the demo always has its "low" before the "wow".
Defaults (revisit if time): single-page artifact first (chapter compiler later); page-numbers visible, real titles on hover (keep the burn aimed at the idea, not a named builder); seal animation = safe typewriter + static-stamp floor first, bespoke ink-reveal last. Voice input is batch ASR that fills the same idea box before the user presses Ink.
Input is text-first; the experience is fully delightful with typed input alone.
AI is genuinely load-bearing: embeddings power the whitespace/originality analysis and the LLM drives the investigate → ideate → score loop, so the experience collapses without the models (supports Best Agent + TTW "AI necessity").
3. Model stack (confirmed exact repo IDs)
| Role | Model | Params | Runtime | License | Prize hook |
|---|---|---|---|---|---|
| STT (batch voice input) | `nvidia/nemotron-speech-streaming-en-0.6b` | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | NVIDIA Nemotron Quest |
| LLM brain | `openbmb/MiniCPM5-1B` ("OpenCPM5") | 1.08B | transformers (self-parse XML) / llama.cpp | Apache-2.0 | OpenBMB |
| Embedder | `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | ~300M | llama.cpp / llama-cpp-python | Gemma | Off the Grid · Llama Champion · Modal |
| Fine-tune | LoRA on MiniCPM5 → published to Hub | n/a | PEFT / HF Jobs | n/a | Well-Tuned |
Total ≈ 1.98B params → ≤4B → Tiny Titan eligible. All open-weight, all runnable locally → Off the Grid.
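The eligibility arithmetic can be checked in a few lines. This is a minimal sketch using the parameter counts from the table above (the embedder rounded to 0.3B, the LoRA adapter left uncounted as in the table); the names `STACK_B` and `TINY_TITAN_LIMIT_B` are illustrative and not part of the codebase.

```python
# Parameter counts in billions, copied from the model-stack table.
STACK_B = {
    "nvidia/nemotron-speech-streaming-en-0.6b": 0.6,
    "openbmb/MiniCPM5-1B": 1.08,
    "embeddinggemma-300m (GGUF)": 0.3,  # rounded; the table lists ~300M
}
TINY_TITAN_LIMIT_B = 4.0  # "best <=4B model", per the decision log

total_b = round(sum(STACK_B.values()), 2)
largest_b = max(STACK_B.values())
# Check both readings of the rule: the whole stack and the largest single model.
eligible = total_b <= TINY_TITAN_LIMIT_B and largest_b <= TINY_TITAN_LIMIT_B

print(f"total={total_b}B largest={largest_b}B eligible={eligible}")
```

Checking both the total and the largest single model keeps the claim valid whichever way the judges read "≤4B".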
Naming: "OpenCPM5 1B" = `openbmb/MiniCPM5-1B` (MiniCPM 5.0, ~May 2026). "EmbeddingGemma 270M" = `google/embeddinggemma-300m` (308M total; 270M = non-embedding transformer params). SGLang dropped (ZeroGPU incompatible). STT is used in batch voice-note mode, not a persistent stream.
4. Deployment & architecture (single path)
With text-first + batch ASR, the old "streaming ASR vs ZeroGPU" Config A/B tension dissolves; there is one path:
- ZeroGPU Gradio-SDK Space (free). GPU is attached only inside `@spaces.GPU` calls (default 60s, max ~120s, RTX Pro 6000 Blackwell, `large` = 48 GB). Per-turn inference fits this model exactly.
- Text-first runtime loop: user types → custom `/api/agent-turn` NDJSON endpoint → one `@spaces.GPU` call runs MiniCPM5 (tool loop, in `transformers`) → streamed text tokens + live visual updates. The `@app.api()` endpoint remains as the Gradio-client contract for external checks.
- Voice input: push-to-talk records an utterance or uploads a voice note → `/api/transcribe` normalizes audio with ffmpeg → one `@spaces.GPU` call runs Nemotron ASR through NeMo → transcript fills the idea box. No persistent stream, no WebRTC, no TURN server.
- Modal (build-time only): crawl the org + build the llama.cpp EmbeddingGemma vector index offline; the Space ships with checked-in project vectors. Runtime never calls Modal, so Off the Grid holds (see §10).
Off the Grid = no proprietary cloud inference APIs. Open weights on an HF GPU Space / local box / Modal all qualify.
Deferred: real-time streaming ASR and turn detection are not part of the shipped app.
5. Per-model implementation notes
5.1 ASR: nvidia/nemotron-speech-streaming-en-0.6b (batch)
- Primary, batch usage (simple):

```python
import nemo.collections.asr as nemo_asr

asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-speech-streaming-en-0.6b")
text = asr.transcribe(["utterance.wav"])  # 16 kHz mono WAV in; punctuated EN text out
```

- Runtime install: `packages.txt` provides `ffmpeg` and `libsndfile1`; `requirements.txt` pins `nemo_toolkit[asr]==2.7.3` plus Cython and packaging. The app records or uploads audio, normalizes it to mono 16 kHz WAV, runs NeMo in a ZeroGPU function, then returns the transcript to the idea box. Hosted NVIDIA NIM API would break Off the Grid, so it is not used.
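The normalization step can be sketched as a thin wrapper around the ffmpeg CLI. This is an illustration under assumptions, not the app's actual code: the function names are hypothetical, and only standard ffmpeg options (`-ac`, `-ar`, `-vn`, `-y`) are used.

```python
import subprocess
from pathlib import Path


def ffmpeg_normalize_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that converts any input to 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",   # overwrite dst if it already exists
        "-i", src,
        "-ac", "1",       # one audio channel (mono)
        "-ar", "16000",   # 16 kHz sample rate, as the ASR model expects
        "-vn",            # drop any video stream from uploaded clips
        dst,
    ]


def normalize_audio(src: str, dst: str) -> Path:
    """Run ffmpeg and return the normalized WAV path (raises CalledProcessError on failure)."""
    subprocess.run(ffmpeg_normalize_cmd(src, dst), check=True, capture_output=True)
    return Path(dst)
```

Keeping the command builder separate from the subprocess call makes the flags testable without ffmpeg installed; the resulting path is what would be handed to `transcribe([...])`.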
5.2 MiniCPM5-1B brain: openbmb/MiniCPM5-1B (transformers, self-parsed XML)
- Context 128K, bilingual (EN/ZH), Apache-2.0. `enable_thinking=False`, `temperature=0.7, top_p=0.95` for fast tool calls.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM5-1B")
model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM5-1B", torch_dtype="auto", device_map="auto")
inputs = tok.apply_chat_template(
    messages, tools=TOOLS, add_generation_prompt=True, enable_thinking=False,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device)
```

- Tool calling: pass JSON-Schema tools via the chat template `tools=` arg; the model emits XML `<function name="get_weather">{"city":"New York"}</function>`. Parse this ourselves (SGLang dropped). Wrap parse in try/except and validate against the schema; see the degradation ladder (§8).
- Local / CPU & llama.cpp (Off the Grid · Llama Champion): `openbmb/MiniCPM5-1B-GGUF:Q4_K_M` (688 MB) via llama.cpp or Ollama (CPU-viable). fp16 ≈ 3-4 GB VRAM. `openbmb/MiniCPM5-1B-MLX` for Apple Silicon. (llama.cpp MiniCPM5 tool-calling is a pending PR; verify before relying on it for the badge runtime.)
- 1B discipline: small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
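Self-parsing the XML tool call can be as small as one regular expression plus a JSON decode. The sketch below assumes the output shape quoted above (`<function name="...">{json}</function>`); `parse_tool_call` is a hypothetical helper name, and real model output may need more tolerance than this.

```python
import json
import re

# Matches <function name="...">{...json...}</function>, tolerating newlines inside.
_CALL_RE = re.compile(r'<function\s+name="([^"]+)"\s*>(.*?)</function>', re.DOTALL)


def parse_tool_call(text: str):
    """Return (name, args) for the first well-formed call, or None if there is none."""
    match = _CALL_RE.search(text)
    if match is None:
        return None
    name, raw_args = match.group(1), match.group(2).strip()
    try:
        args = json.loads(raw_args) if raw_args else {}
    except json.JSONDecodeError:
        return None  # broken JSON: the caller decides whether to retry or fall back
    if not isinstance(args, dict):
        return None  # arguments must be a JSON object
    return name, args


print(parse_tool_call('<function name="get_weather">{"city":"New York"}</function>'))
```

Returning `None` for every failure mode keeps the caller's retry logic to a single branch.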
5.4 EmbeddingGemma GGUF: ggml-org/embeddinggemma-300m-qat-q8_0-GGUF
- Active retrieval model: `embeddinggemma-300m-qat-Q8_0.gguf`, 768-dimensional normalized embeddings.
- Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
- Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search over checked-in project vectors.
- Evidence is recorded in index metadata: model repo, GGUF filename, runtime, dimensions, build source, builder script, llama-cpp-python version, and Modal app name.
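Because the stored vectors are normalized, cosine similarity reduces to a dot product once the query is normalized too. The following is a dependency-free sketch of that local search; the `(project_id, vector)` index layout and the toy 3-dimensional vectors are assumptions standing in for the real 768-dimensional index.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, index, k=5):
    """index: list of (project_id, unit_vector). Returns [(score, project_id)], best first."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, vec)), pid) for pid, vec in index]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional stand-ins for the 768-dimensional project vectors.
toy_index = [
    ("space-a", normalize([1.0, 0.0, 0.0])),
    ("space-b", normalize([0.0, 1.0, 0.0])),
    ("space-c", normalize([1.0, 1.0, 0.0])),
]
hits = top_k([0.9, 0.1, 0.0], toy_index, k=2)
print(hits)
```

A brute-force scan like this is plausible at the stated index size (about 100 documents); a vectorized or approximate search would only matter for a much larger corpus.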
5.5 llama.cpp support (Llama Champion)
The active Llama Champion path is the retrieval model: the project index is built with EmbeddingGemma GGUF through llama.cpp on Modal, and runtime query embeddings use the same llama.cpp path.
| Model | llama.cpp? | Runtime | Notes |
|---|---|---|---|
| `openbmb/MiniCPM5-1B` | planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
| `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | yes (active) | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
| ASR (Nemotron) | no | NeMo | FastConformer-RNNT |
The checked-in index and runtime query embedder must stay on the same GGUF file.
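One way to enforce that rule is a startup check that compares the index metadata against what the runtime embedder is about to load. This is a sketch only: the metadata key names (`model_repo`, `gguf_filename`, `dimensions`) are hypothetical, since the actual schema-v2 field names are not given here.

```python
# What the runtime embedder expects to load (values taken from section 5.4).
EXPECTED = {
    "model_repo": "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF",
    "gguf_filename": "embeddinggemma-300m-qat-Q8_0.gguf",
    "dimensions": 768,
}


def index_mismatches(metadata: dict, expected: dict = EXPECTED) -> list[str]:
    """Return one human-readable line per field where index and runtime disagree."""
    return [
        f"{key}: index has {metadata.get(key)!r}, runtime expects {want!r}"
        for key, want in expected.items()
        if metadata.get(key) != want
    ]


stale = {"model_repo": EXPECTED["model_repo"], "gguf_filename": "other.gguf", "dimensions": 768}
print(index_mismatches(stale))
```

An empty list means the index and the embedder agree; anything else is a reason to refuse to serve retrieval results until the index is rebuilt.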
6. Agent context design (built for a 1B brain)
Core principle: the 1B model is a router + arg-filler. All heavy work (crawl, summarize, score, rank, dedup) lives in code. Keep live context to ~800-1200 tokens of curated view, never raw data.
- Layer A - System (static, ~250 tok): identity/character; hackathon hard rules (≤32B, Gradio Space, demo video) so it self-filters infeasible ideas; targeted prizes (biases ideation); reply style (short, one question at a time); explicit tool-use instructions + the canonical jargon vocabulary (so it can self-correct, §7).
- Layer B - Session state (re-rendered each turn by code, ~300 tok): user profile; locked decisions (track, side quests, models); idea board (2-3 candidates, one line + scores); compact "projects already seen" summary.
- Layer C - Ephemeral (~300 tok): last 2-3 turns; the most recent tool result as a refined card (not raw JSON).
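The three layers above can be assembled with a hard per-layer budget so the context never grows past what a 1B model handles well. A minimal sketch, assuming a crude 4-characters-per-token estimate in place of the real tokenizer; all function names are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption, not the real tokenizer."""
    return max(1, len(text) // 4)


def clip(text: str, budget_tokens: int) -> str:
    """Trim text to the budget, keeping the start and marking the cut with an ellipsis."""
    max_chars = budget_tokens * 4
    return text if len(text) <= max_chars else text[: max_chars - 1].rstrip() + "…"


def build_context(layer_a: str, state_view: str, turns: list[str], tool_card: str) -> str:
    """Join Layer A (static), Layer B (session state) and Layer C (ephemeral) under budget."""
    layer_b = clip(state_view, 300)
    recent = turns[-3:] + ([tool_card] if tool_card else [])  # last turns + latest card
    layer_c = clip("\n".join(recent), 300)
    return "\n\n".join([clip(layer_a, 250), layer_b, layer_c])


ctx = build_context("A" * 5000, "B" * 5000, ["t1", "t2", "t3", "t4"], "card")
print(approx_tokens(ctx))
```

Clipping in code, rather than asking the model to be brief, is what keeps the budget deterministic from turn to turn.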
7. Agent tool design
Few tools, few params each, short descriptions (1B-friendly). Heavy logic in code.
Jargon alias layer (input normalization). Before any tool call and before display, run ASR/user text through a deterministic fuzzy/alias map over our small CLOSED vocab (model names and goal names), e.g. RapidFuzz token_set_ratio / double-metaphone, mapping "neutron"/"nemo tron" → Nemotron, "mini cpm" → MiniCPM5, "zero gpu" → ZeroGPU. Surface the correction ("heard: neutron → Nemotron") as a trust-building, slightly delightful moment.
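A closed-vocabulary alias pass might look like the sketch below. To stay dependency-free it swaps in the standard library's `difflib` for RapidFuzz, so the matching behaviour is an approximation of what is described above; the alias list and the 0.88 cutoff are illustrative assumptions.

```python
import difflib
import re

# Closed vocabulary: canonical term -> spoken or misheard variants (illustrative list).
ALIASES = {
    "Nemotron": ["neutron", "nemo tron", "nemotron"],
    "MiniCPM5": ["mini cpm", "mini cpm five", "minicpm5"],
    "ZeroGPU": ["zero gpu", "zerogpu"],
}
_LOOKUP = {variant: canon for canon, variants in ALIASES.items() for variant in variants}


def normalize_jargon(text: str, cutoff: float = 0.88):
    """Return (corrected_text, corrections) where corrections is [(heard, canonical)]."""
    words = text.split()
    out, corrections, i = [], [], 0
    while i < len(words):
        for span in (3, 2, 1):  # prefer the longest matching phrase
            chunk = words[i : i + span]
            if len(chunk) < span:
                continue
            heard = " ".join(chunk)
            key = re.sub(r"[^a-z0-9 ]", "", heard.lower())
            match = difflib.get_close_matches(key, list(_LOOKUP), n=1, cutoff=cutoff)
            if match:
                canon = _LOOKUP[match[0]]
                if heard != canon:
                    corrections.append((heard, canon))
                out.append(canon)
                i += span
                break
        else:  # no alias matched at this position: keep the word as typed
            out.append(words[i])
            i += 1
    return " ".join(out), corrections


print(normalize_jargon("run neutron on zero gpu"))
```

The `corrections` list is what would feed the "heard: neutron → Nemotron" toast; an empty list means nothing needs to be surfaced.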
Research โ investigate existing projects (the core value). Data = build-small-hackathon org Spaces, pre-crawled
into a local snapshot + EmbeddingGemma index (keeps Off the Grid at runtime).
| Tool | Signature | Returns (refined) | Heavy work |
|---|---|---|---|
| `list_projects` | `(track?, sort?)` | top-N project cards | HF Hub API + summarize |
| `search_projects` | `(query)` | top 5 cards | EmbeddingGemma retrieval |
| `get_project` | `(id)` | card + overlap-vs-board verdict | code computes overlap |
| `find_whitespace` | `()` | under-explored niches | cluster the index, find gaps |
`find_whitespace` is the originality engine (TTW judges originality): it names where nobody has built yet.
Ideation / state.
| Tool | Signature | Purpose |
|---|---|---|
| `save_idea` | `(title, pitch)` | add/update a candidate on the idea board |
| `score_idea` | `(id)` | fixed (hardcoded) rubric → scores + gaps; the 1B only triggers + verbalizes |
| `compare_ideas` | `()` | rank the board, articulate tradeoffs |
| `make_plan` | `(id)` | build plan + goals the current direction can support |
| `update_profile` | `(field, value)` | record skills/time/prefs → Layer B |
| `set_goals` | `(goals[])` | change selected goals → updates Layer A bias |
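Validating a parsed call against these signatures in code can be sketched as follows. This is a simplified hand-rolled check rather than full JSON-Schema validation, and the argument types (everything a string) are assumptions, since the tables above give only parameter names; only a subset of tools is shown.

```python
# Tool specs mirroring the signature tables; types are assumed, not confirmed.
TOOLS = {
    "list_projects": {"required": {}, "optional": {"track": str, "sort": str}},
    "search_projects": {"required": {"query": str}, "optional": {}},
    "get_project": {"required": {"id": str}, "optional": {}},
    "find_whitespace": {"required": {}, "optional": {}},
    "save_idea": {"required": {"title": str, "pitch": str}, "optional": {}},
}


def validate_call(name: str, args: dict):
    """Return (ok, repaired_args_or_None, problems) for a parsed tool call."""
    spec = TOOLS.get(name)
    if spec is None:
        return False, None, [f"unknown tool {name!r}"]
    allowed = {**spec["required"], **spec["optional"]}
    problems = [f"missing {key!r}" for key in spec["required"] if key not in args]
    repaired = {}
    for key, value in args.items():
        if key not in allowed:
            continue  # repair: drop arguments the tool does not accept
        if not isinstance(value, allowed[key]):
            problems.append(f"{key!r} should be {allowed[key].__name__}")
            continue
        repaired[key] = value
    return (not problems), (repaired if not problems else None), problems


print(validate_call("search_projects", {"query": "voice games", "junk": 1}))
```

Dropping unknown arguments is the "repair" half of reject-and-repair; missing or mistyped required arguments are rejected so the loop can retry.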
8. Agent loop (single-hop + degradation ladder)
```text
on user input (text; or voice → batch ASR → text):
  normalize via jargon alias layer
  ctx = LayerA + render_state(LayerB) + last_turns + last_tool_card
  out = MiniCPM5(ctx, tools=TOOLS, enable_thinking=False, temp=0.7)   # → tool_call | reply
  try: parse XML tool call
  except / invalid name|args (vs JSON-Schema):                        # degradation ladder
    retry once (temp→0.3, "emit ONLY one valid tool call")
    still bad → run a safe default tool (find_whitespace) so the screen never freezes
  if tool_call: card = run_tool(out); reply = MiniCPM5(ctx + card)    # single follow-up, no long ReAct
  empty/zero result → canned advisor line (never say nothing)
  stream reply tokens → custom UI | token watchdog: no token in N s → "trying again" visual (not dead air)
  update_state(LayerB)
```
Max one tool-call then reply. A 1B can't sustain multi-step ReAct; wrap multi-step flows (search → get_project → score) into one code "research" action the model calls once. The degradation ladder is a first-class UX surface (§11), not an error branch, because the screen is the only feedback channel (no TTS).
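The ladder can be expressed as ordinary control flow around two injected callables. The sketch below is an assumption-laden illustration, not the shipped `agent.py`: `generate` and `run_tool` stand in for the model call and the tool dispatcher, the canned line is a placeholder, and streaming and the token watchdog are left out.

```python
import json
import re

CALL_RE = re.compile(r'<function\s+name="([^"]+)"\s*>(.*?)</function>', re.DOTALL)
KNOWN_TOOLS = {"search_projects", "get_project", "find_whitespace", "score_idea", "save_idea"}
SAFE_DEFAULT = ("find_whitespace", {})
CANNED_EMPTY = "The page stays blank for now. Tell me one more detail."  # placeholder line


def parse_call(text):
    """Return (name, args), the string "reply" for plain text, or None for a broken call."""
    if "<function" not in text:
        return "reply"
    m = CALL_RE.search(text)
    if not m or m.group(1) not in KNOWN_TOOLS:
        return None
    try:
        args = json.loads(m.group(2).strip() or "{}")
    except json.JSONDecodeError:
        return None
    return (m.group(1), args) if isinstance(args, dict) else None


def agent_turn(ctx, generate, run_tool):
    """generate(ctx, temp, hint) -> str and run_tool(name, args) -> str are injected."""
    out = generate(ctx, 0.7, "")
    call = parse_call(out)
    if call is None:  # rung 1: retry once, colder and stricter
        out = generate(ctx, 0.3, "emit ONLY one valid tool call")
        call = parse_call(out)
    if call is None:  # rung 2: safe default tool so the screen never freezes
        call = SAFE_DEFAULT
    if call == "reply":
        return out
    card = run_tool(*call)
    if not card:  # rung 3: canned line for empty results
        return CANNED_EMPTY
    return generate(ctx + "\n" + card, 0.7, "")  # single follow-up, no long ReAct
```

Injecting `generate` and `run_tool` lets each rung be exercised with scripted outputs, without a GPU or a model download.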
9. ZeroGPU deployment notes
- `import spaces; @spaces.GPU(duration=…)`. GPU only inside decorated fns; Gradio-SDK Space only (no Docker ZeroGPU).
- Load models at module level, `.to('cuda')` once (emulated until first real GPU call); real compute inside the decorator. torch 2.8+; no `torch.compile` (use AOT). Quota PRO ~40 min/day, so never idle-hold the GPU.
- Frontend → backend via same-origin `fetch("/api/agent-turn")` reading NDJSON from our FastAPI route. The GPU boundary is `_engine_turn`, decorated with `@spaces.GPU`; `@app.api()` endpoints remain available for Gradio-client tests and external callers.
- All four models fit in `large` (48 GB). Keep each `@spaces.GPU` call short for queue priority.
10. Modal: offline pipeline (build-time only; preserves Off the Grid)
Modal = build-time; runtime never calls it. This is how the app claims both Modal and Off the Grid. The canonical command is:
.venv/bin/modal run scripts/modal_build_project_index.py \
--projects data/projects.json \
--out data/project_index.json
The remote function installs llama-cpp-python, downloads
ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf, embeds every project card through
llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
Latest successful run: hackathon-advisor-llama-index on Modal, producing a 100-document, 768-dimensional normalized
index at 2026-06-07T08:16:19+00:00.
11. Frontend: gr.Server custom UI (Off-Brand)
No TTS: the visual output is the agent's "voice", so it must carry the delight (this is what earns Off-Brand, and the TTW polish + Best Demo score). The visual world is The Unwritten Almanac (§2): a candlelit tree-hollow with a heavy open grimoire as the hero component.
- `gradio.Server` is a FastAPI subclass serving your own frontend while still exposing `@app.api(name=...)` functions for Gradio/Python clients. The visible app uses first-party `@app.post()` endpoints for deterministic browser behavior; the GPU boundary stays in the decorated engine function.

```python
from gradio import Server
from fastapi.responses import HTMLResponse

app = Server()

@app.api(name="agent_turn", concurrency_limit=2)
async def agent_turn(message: str):
    for token in run_agent_stream(message):  # generator → SSE
        yield token

@app.get("/", response_class=HTMLResponse)  # custom UI replaces Gradio's default page
async def home():
    return open("index.html").read()

app.launch()
```

- Frontend calls via `fetch("/api/agent-turn")`, parses newline-delimited JSON events, and updates the grimoire as `start`/`token`/`done` messages arrive. Notes and chapter exports use `/api/field-notes` and `/api/chapter`.
- UI surfaces (the grimoire is the canvas): streaming reply = ink writing itself (typewriter on already-streaming tokens); `search_projects`/overlap → bleed animation + page-number citations (real titles on hover); `find_whitespace` → gold bloom + sprouting leaf + a one-shaft light-mask ("the page chooses you"); `score_idea` → wax-seal five-quadrant stamp; the riffling inked pages (fast page-flip of real Spaces) double as the project-wall; export = the torn-grimoire PNG artifact (§2). Jargon-correction toasts (§7) read as Mothback's margin notes; optimistic-UI loading + watchdog states (§8) are her "the page is choosing its words…". Cheap SFX: page-flip, quill scratch, wax-seal thunk.
- Build the animation floor first: safe typewriter + static stamp ships first (graceful degradation, which the judges credited); upgrade the ink-bleed / gold-bloom / seal-press last.
- Fallback: the backend (`tools.py`/`agent.py`) is UI-agnostic. If gr.Server misbehaves, fall back to `gr.Blocks` + `gr.HTML`, losing only the $1500 Off-Brand badge, never the submission.
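The NDJSON contract is easy to exercise from Python as well as from the browser. The sketch below shows one way a contract check might reassemble events from arbitrary byte chunks; the event field names (`type`, `text`) are assumptions, since only the `start`/`token`/`done` message kinds are named above.

```python
import json


def iter_ndjson(chunks):
    """Yield one dict per complete line from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:  # a chunk may carry several events, or half of one
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # tolerate a final event without a trailing newline
        yield json.loads(buffer)


def collect_reply(chunks) -> str:
    """Concatenate token events into the reply text, stopping at the done event."""
    parts = []
    for event in iter_ndjson(chunks):
        if event.get("type") == "token":
            parts.append(event.get("text", ""))
        elif event.get("type") == "done":
            break
    return "".join(parts)


stream = [
    b'{"type":"start"}\n{"type":"tok',
    b'en","text":"Hel"}\n',
    b'{"type":"token","text":"lo"}\n{"type":"done"}\n',
]
print(collect_reply(stream))
```

Buffering until a newline arrives is the important part: network chunk boundaries do not line up with event boundaries, which is the same rule the browser-side reader has to follow.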
12. Prize mapping
| Target | How it's earned |
|---|---|
| Thousand Token Wood | The Unwritten Almanac (§2): the bleed-citation wow IS the engine rendered; AI load-bearing; original |
| Tiny Titan (special, $1.5k) | total ~1.98B, every model ≤4B; largest single = MiniCPM5 1.08B |
| Off the Grid (badge) | all open weights run locally; offline index; no cloud inference at runtime |
| Well-Tuned (badge) | published LoRA fine-tune of MiniCPM5 on the Hub (§10) → 6/6 badges |
| Off-Brand (badge + $1.5k) | gr.Server custom UI is the agent's output surface |
| OpenBMB ($10k) | brain = MiniCPM5-1B ("OpenBMB pick") |
| NVIDIA Quest (2× RTX 5080) | ASR = Nemotron (§5.1) |
| Llama Champion (badge) | EmbeddingGemma GGUF retrieval index and runtime query embeddings run through llama.cpp (§5.5) |
| Sharing is Caring (badge) | publish the agent's tool-call trace to the Hub |
| Field Notes (badge) | this DESIGN.md → a build blog post |
| Bonus Quest Champion ($2k) | 6/6 badges (needs the Well-Tuned fine-tune) |
| Best Agent ($1k) | real multi-tool loop: investigate → ideate → score → plan |
| Modal ($20k credits) | offline crawl+embed + LoRA training on Modal (build-time, separated from runtime) |
| Best Demo ($1k) | the mandatory demo video, made to sing (shared artifact + wow beat) |
| OpenAI ($10k) | auto-entered ("across all submissions"); free lottery ticket, not a target |
| Community Choice ($2k) | shareable tweetable artifact from the experience |

6 badges = Off the Grid, Well-Tuned, Off-Brand, Llama Champion, Sharing is Caring, Field Notes. Awards stack across categories. For single-winner awards (Tiny Titan, Best Agent, Off-Brand, Best Demo), eligibility ≠ win; the shared lever is §11 custom-UI polish.
13. Risks / open items
- Deployment smoke tests are mandatory: ZeroGPU Space build, same-origin NDJSON browser streaming, and Nemotron batch ASR in `@spaces.GPU` must be verified after every runtime dependency change.
- EmbeddingGemma is gated: accept Gemma terms + `HF_TOKEN` before any crawl/build.
- MiniCPM5 tool-call reliability at 1B: covered by the degradation ladder (§8); validate name+args in code.
- Concept skin, chosen: The Unwritten Almanac (§2). Make-or-break is the bleed/bloom hero animation; build the safe typewriter + static-stamp floor first (graceful degradation), upgrade ink last. Watch the thin-org echo threshold + the dry-but-benevolent tone (real cited Spaces, never punch at a named builder).
- Param-budget claim: document the 1.98B total in the README/Space card for Tiny Titan judging.
14. Build order
Text-first vertical slice first; voice input is now part of the app. Always keep a demoable artifact.
- Day-1 spikes (§1): get the three go/no-go builds green.
- `crawler.py` + Modal index: crawl the org, embed with EmbeddingGemma, build the local index. You immediately see what everyone's building and where the whitespace is.
- `tools.py`: research + ideation tools + the hardcoded `score_idea` rubric + the jargon alias layer, over the index.
- `agent.py`: 3-layer context + single-hop loop + degradation ladder, MiniCPM5 via `transformers` (self-parsed XML).
- `app.py`: `gr.Server` custom frontend (idea board, project/whitespace wall, streaming text), called via first-party `/api/...` endpoints; concept skin applied.
- Well-Tuned LoRA: small fine-tune on Modal → publish to Hub (→ 6/6 badges).
- Voice input: push-to-talk record and voice-note upload through Nemotron batch ASR in `/api/transcribe`.
- Polish + submission: demo video + social post (Best Demo / Community Choice), publish agent trace (Sharing is Caring), write up Field Notes.
Deferred: real-time streaming ASR and turn detection. The shipped path stays batch audio → transcript → editable idea.
15. Sources
Models: nemotron-speech-streaming-en-0.6b · MiniCPM5-1B · MiniCPM5-1B-GGUF · embeddinggemma-300m
Platforms: ZeroGPU docs · Introducing gradio.Server · Gradio Server Mode guide · Modal GPU · Modal model weights · Modal pricing · Build Small Hackathon
Verify-before-ship: Nemotron-in-ZeroGPU after dependency changes; MiniCPM5 license on the live card; llama.cpp MiniCPM5 tool-calling remains planned only and is not used by the deployed brain.