Spaces:
Running
Running
| # Build Small Hackathon β Ideation & Decisions | |
| > Working notes from ideation. Project: **AI Prof** (primary submission). | |
| > Hackathon: https://huggingface.co/build-small-hackathon Β· Field guide: https://build-small-hackathon-field-guide.hf.space/ | |
| ## Hackathon constraints (the rules that shape everything) | |
| - **Model size: β€32B parameters PER MODEL** (not aggregate). You can freely combine multiple | |
| small models as long as each one individually stays under the cap. ("not just active params") | |
| - **No sponsor exclusivity** β a sponsor's model just has to be *a core part of the experience*; | |
| you may mix in other providers' models. | |
| - **Platform:** must be built in **Gradio**, hosted as a **Hugging Face Space**. | |
| - **Deliverables:** working Space + **demo video** + **social media post**. | |
| - **Timeline:** hack window June 5β15, 2026. | |
| - **Credits:** OpenAI $100 is for **Codex (their coding agent)**, *not* API/inference. Modal $250, HF $20. | |
| ## Tracks | |
| - **Backyard AI** β solve a genuine problem for *someone you personally know*. Judged on specificity, | |
| **actual user adoption**, and appropriate model fit. β our track. | |
| - **Thousand Token Wood (TTW)** β delightful/unconventional, AI must be load-bearing (game, story, art). | |
| ## Prize surface (it stacks β one app can hit many) | |
| Main tracks ($18k): 1stβ4th per track + Community Choice ($2k). | |
| Sponsor awards: | |
| - **OpenBMB $10k** β build with MiniCPM (incl. vision MiniCPM-V / omni MiniCPM-o). | |
| - **OpenAI $10k** β won via **Codex-attributed commits** in the repo/Space (build-tool track, model-agnostic). | |
| - **NVIDIA** β two RTX 5080 GPUs; build with **Nemotron**. One for "best space", one for community engagement (likes). | |
| - **Modal $20k credits** β use Modal for dev/runtime, note in README. | |
| Special awards ($8k): Bonus Quest Champion $2k Β· Off-Brand (best custom UI via `gr.Server`) $1.5k Β· | |
| Tiny Titan (best app on β€4B model) $1.5k Β· Best Demo $1k Β· Best Agent $1k Β· Judges' Wildcard $1k. | |
| Six merit badges: Off the Grid (no cloud APIs) Β· Well-Tuned (publish fine-tune) Β· Off-Brand (custom UI) Β· | |
| Llama Champion (llama.cpp runtime) Β· Sharing is Caring (publish agent trace) Β· Field Notes (build blog). | |
| ## Decision: build the AI Prof (Backyard AI) | |
| **Problem (real):** a specific classmate finds that having the slides isn't enough β they're static and | |
| lack the in-class explanation. Test with **real, anonymized lecture slides** from our class. | |
| **Core loop:** upload lecture PDF β model reads each slide *as an image* β explains it like a TA, | |
| streamed in real time β classmate can **interject** with a question at any moment. | |
| ### Two-model architecture (stacks OpenBMB + NVIDIA, cleanly justified) | |
| - **MiniCPM-V (~4.1B, GGUF)** = *the eyes.* Reads each slide as an image (diagrams, equations, layout β | |
| not just scraped text). Core β satisfies **OpenBMB**. Runs via **llama.cpp** β Llama Champion badge. | |
| - **Nemotron 3 Nano (9B, or 30B-A3B MoE = only ~3.6B active β fast decode)** = *the brain.* Turns the | |
| slide reading into the spoken explanation and answers interjections. Reasoning/agentic β **NVIDIA** fit. | |
| - Per-model cap means no param-budget tension between the two. Division of labor (see/explain) survives | |
| the "core part of the experience" test β not sponsor-stuffing. | |
| ### Prize map for this one app | |
| Backyard placement Β· OpenBMB $10k Β· NVIDIA RTX 5080 Β· OpenAI $10k (Codex commits) Β· Modal $20k (self-host) Β· | |
| Off-Brand $1.5k+badge (custom whiteboard UI) Β· Llama Champion Β· Off the Grid Β· Sharing is Caring (publish a | |
| teaching-session trace) Β· Field Notes (blog) Β· Best Demo. β realistically 6+ surfaces. | |
| ### Scope discipline (vertical slice order) | |
| 1. Slide upload β MiniCPM-V reads β Nemotron explains, **streamed** in a simple custom UI. *(Submittable alone.)* | |
| 2. Text interjection (pause, ask, resume). | |
| 3. Whiteboard β model emits **Mermaid / Excalidraw JSON** (structured; small models do this reliably). | |
| Avoid freeform tldraw generation β that's the day-eating trap. | |
| 4. Voice / TTS β only with slack. | |
| ## Real-time architecture | |
| Goal: feel like a *live* lecture β explanation streams as if the prof is talking through each slide, | |
| smooth slide-to-slide, interruptible mid-sentence. | |
| - **Token streaming:** stream Nemotron output to the UI (Gradio generators) so it appears as it's generated. | |
| - **Complete index before teaching:** process the full deck before starting the lecture. The professor needs a | |
| global map to choose slides intelligently and answer interjections. Show preparation progress in Gradio. | |
| - **Two-stage per slide, cached:** (a) MiniCPM-V β a structured "slide reading" (text + diagram desc + equations), | |
| computed once and cached; (b) Nemotron β the explanation. Interjections reuse cached (a), never re-run vision. | |
| - **Interjection = interrupt + branch:** need a *cancellable* generation (async cancel token / threading.Event). | |
| On input: stop current stream, answer using the cached slide reading + history as context, then resume. | |
| - **Streaming TTS (optional):** chunk explanation into sentences, TTS each as generated, play sequentially. | |
| Barge-in (interrupt-on-speech) is hard mode β defer. | |
| - **Session state:** { slides[], current_index, cached_slide_readings{}, conversation_history }. | |
| - **Hosting tradeoff for low latency:** free HF Space hardware likely too slow for 2 models in real time. | |
| Lean: **Modal GPU as inference backend** (uses the Modal track) + Gradio Space as frontend. Self-hosting | |
| (not an external inference API) still supports the *Off the Grid* badge. | |
| ## Voice (STT + TTS) β makes it feel like *class*, but it's the deepest rabbit hole | |
| Two more tiny, on-HF models (free under the β€32B-*per-model* cap; FAQ uses "a 7B speech model" as its example). | |
| Keep both **open + self-hosted** (not ElevenLabs/Deepgram/OpenAI *API*) β protects the **Off the Grid** badge. | |
| **Pipeline:** TTS (out) speaks Nemotron's streamed explanation; STT (in) transcribes the spoken interjection | |
| into text for Nemotron; **VAD** is the referee that detects *when* the classmate starts talking β triggers barge-in. | |
| - **STT (interjections) β pick: Whisper, or Moonshine for speed.** | |
| - `faster-whisper` / `distil-whisper` (large-v3 ~1.5B, or base/small for latency) β accurate, OpenAI open weights. | |
| - **Moonshine** (~tiny) β built for real-time/on-device, faster time-to-text on short clips. | |
| - Interjections are short β small model is fine; *time-to-text*, not accuracy, is the bottleneck. Start with | |
| `faster-whisper-base`, switch to Moonshine only if laggy. | |
| - **TTS (narration) β pick: Kokoro, fallback Piper.** | |
| - **Kokoro-82M** (`hexgrad/Kokoro-82M`) β 82M, good quality, fast time-to-first-audio, streamable. Sweet spot. | |
| - **Piper** β even lower latency, CPU-friendly, slightly more robotic. Use if speech layer isn't on GPU. | |
| - Stream sentence-by-sentence: synthesize + play each sentence as Nemotron emits it (audio ~1 sentence behind gen). | |
| - **Barge-in / turn-taking β use FastRTC** (`fastrtc`, the Gradio/HF WebRTC stack). Gives low-latency mic+playback | |
| over WebRTC, built-in **VAD turn detection** (`ReplyOnPause`), and the hook to cancel playback + generation the | |
| instant the user speaks. Avoids hand-rolling silence detection; reuses our cancellable-generation design. | |
| - Loop: narrate (TTS) β Silero VAD hears speech β kill TTS + cancel Nemotron β buffer to end-of-speech β STT β | |
| answer using the **cached slide reading** as context β TTS answer β resume narration. | |
| - **Latency budget (what "feels live" needs):** minimize time-to-first-audio. STT small (~100β300ms) + Nemotron | |
| MoE fast prefill + Kokoro first chunk (~tens of ms). Run STT/TTS/VAD on the **same Modal GPU** as Nemotron | |
| (or CPU for Piper+Silero) to avoid network hops. | |
| - **Scope ladder (don't let voice eat the week):** | |
| 1. **TTS narration only** β Prof talks, classmate *types* to interject. Low risk, already feels real-time. | |
| 2. **Push-to-talk interject** β hold key / tap to ask aloud β STT. No VAD/barge-in. ~90% of magic, ~30% of work. | |
| 3. **Full-duplex barge-in** via FastRTC + VAD β only once 1β2 are solid. The 2β3 jump is where time goes. | |
| - **Model count after voice:** MiniCPM-V (vision) + Nemotron (brain) + Whisper/Moonshine (STT) + Kokoro/Piper (TTS) | |
| + Silero VAD. Multi-model is explicitly blessed; more "appropriate model fit" surface, but more integration. | |
| ## Agent loop, tools & slide grounding | |
| The brain (Nemotron) runs as a **tool-using agent** driving the lecture β professor-esque: decides which slide | |
| to be on, when to draw vs. just talk, when to jump back. (Strengthens the **Best Agent** award; the tool-call | |
| sequence *is* the publishable agent trace β **Sharing is Caring** badge.) | |
| **Tools (mutate UI / session state):** | |
| - `goto_slide(i)` / `next_slide()` / `prev_slide()` β navigation; lets it jump back to a referenced slide. | |
| - `look_closer(question)` β on-demand **real-time** MiniCPM-V call on the *current slide image* for detail. | |
| - `draw(mermaid | excalidraw_json)` β render on the whiteboard surface. | |
| - `clear_whiteboard()` | |
| - `highlight_region(bbox)` β optional, later. | |
| - narration = the free-text part of the response β streamed to TTS. | |
| **Control flow:** orchestrator loop. Each turn the agent gets `{ current slide reading, deck outline, recent | |
| history, trigger }` where trigger = "continue lecture" | "user asked: <q>". It returns narration + optional | |
| tool calls; the orchestrator executes tools (swap displayed slide, render whiteboard) and streams narration. | |
| **Reserve heavy reasoning for decision points (between slides), not during narration**, or latency balloons. | |
| **Grounding β preprocess (breadth) + real-time (depth), both:** | |
| - **On upload (once; complete before lecture begins):** | |
| - render each slide β image (serves *both* display and vision), | |
| - **MiniCPM-V** β cached structured slide reading (title, bullets, equations, diagram desc, key concepts), | |
| - **PDF text-layer** extraction (exact text ground-truth; vision misreads text) to complement vision, | |
| - build a **deck outline / index** (slide β title/concepts) β this is what lets the agent *plan* and *pick* a | |
| slide; it can't navigate to a slide it has never seen. | |
| - **During lecture (`look_closer`):** targeted MiniCPM-V look at the actual slide *with the specific question in | |
| context* β handles the long tail (specific visual questions, detail the cached summary glossed over). | |
| - Why both: preprocess = global map + fast/cheap narration + navigation; real-time = accuracy on specific asks. | |
| **UI: show the REAL slide, never a summary.** | |
| - Main surface = the actual rendered slide **image / PDF page**, synced to the agent's `current_slide`. | |
| - **Whiteboard = a separate adjacent canvas** (Mermaid/Excalidraw render) so drawings read as the prof's | |
| annotations, not edits to the original slide. (Region-highlight overlay on the slide can come later.) | |
| - Plus: live caption / transcript + mic / interject control. | |
| The detailed deployment, deck-cache, teaching-beat, interruption, speech, and whiteboard decisions are in | |
| [`ARCHITECTURE.md`](ARCHITECTURE.md). | |
| ## Fine-tuning (after core pipeline β not before) | |
| Chosen direction: **teaching-style SFT / guided learning** β tune the brain (Nemotron) to explain like a good | |
| TA: analogies, checks for understanding, concise, no preamble. Bootstrap dataset = (slide reading β ideal | |
| explanation) pairs generated with a strong model + a little hand-curation; **QLoRA via TRL / Unsloth on Modal**; | |
| publish the adapter to HF β **Well-Tuned** badge (+ feeds Bonus Quest Champion). Put a before/after note in the | |
| model card. Strictly a step-2 enhancement β only once the live pipeline works and there's a clear objective. | |
| ## Backburner ideas (TTW β only if a 2nd submission is feasible) | |
| - **Emotion-driven TRPG / choose-your-own-adventure:** free text β LLM updates structured NPC emotional | |
| state (trust/fear/affectionβ¦) β state steers the story. Best Agent + Sharing is Caring (emotion-delta traces). | |
| - **AI Garlic Phone:** message/drawing passed down a chain of model personas, mutating; needs a reveal payoff. | |
| If a drawing is passed, pulls in vision = OpenBMB again. | |
| - **Sketch-to-story adventure:** you draw your action, MiniCPM-V interprets the doodle, the world reacts. | |
| Fuses vision + emotion mechanic. Strong demo, only-AI-can-do-this. | |
| - **Model-vs-model battle β traces for training** (browser-brawl-style, different domain): most technically | |
| impressive (Best Agent + Well-Tuned + Sharing is Caring) but research-shaped; the training loop can eat the week. | |
| ## Key model references | |
| - MiniCPM-V-4 (4.1B, multimodal): https://huggingface.co/openbmb/MiniCPM-V-4 | |
| - MiniCPM-V-4 GGUF (llama.cpp): https://huggingface.co/openbmb/MiniCPM-V-4-gguf | |
| - Nemotron 3 Nano collection: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3 | |
| - Nemotron 3 Nano 4B: https://huggingface.co/blog/nvidia/nemotron-3-nano-4b | |
| - Nemotron 3 Nano 30B-A3B: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | |