ai-prof / IDEATION.md
pranavkarthik10's picture
Deploy AI Prof hackathon submission
81e3ca2 verified
|
Raw
History Blame Contribute Delete
13.1 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

Build Small Hackathon β€” Ideation & Decisions

Working notes from ideation. Project: AI Prof (primary submission). Hackathon: https://huggingface.co/build-small-hackathon Β· Field guide: https://build-small-hackathon-field-guide.hf.space/

Hackathon constraints (the rules that shape everything)

  • Model size: ≀32B parameters PER MODEL (not aggregate). You can freely combine multiple small models as long as each one individually stays under the cap. ("not just active params")
  • No sponsor exclusivity β€” a sponsor's model just has to be a core part of the experience; you may mix in other providers' models.
  • Platform: must be built in Gradio, hosted as a Hugging Face Space.
  • Deliverables: working Space + demo video + social media post.
  • Timeline: hack window June 5–15, 2026.
  • Credits: OpenAI $100 is for Codex (their coding agent), not API/inference. Modal $250, HF $20.

Tracks

  • Backyard AI β€” solve a genuine problem for someone you personally know. Judged on specificity, actual user adoption, and appropriate model fit. ← our track.
  • Thousand Token Wood (TTW) β€” delightful/unconventional, AI must be load-bearing (game, story, art).

Prize surface (it stacks β€” one app can hit many)

Main tracks ($18k): 1st–4th per track + Community Choice ($2k). Sponsor awards:

  • OpenBMB $10k β€” build with MiniCPM (incl. vision MiniCPM-V / omni MiniCPM-o).
  • OpenAI $10k β€” won via Codex-attributed commits in the repo/Space (build-tool track, model-agnostic).
  • NVIDIA β€” two RTX 5080 GPUs; build with Nemotron. One for "best space", one for community engagement (likes).
  • Modal $20k credits β€” use Modal for dev/runtime, note in README. Special awards ($8k): Bonus Quest Champion $2k Β· Off-Brand (best custom UI via gr.Server) $1.5k Β· Tiny Titan (best app on ≀4B model) $1.5k Β· Best Demo $1k Β· Best Agent $1k Β· Judges' Wildcard $1k. Six merit badges: Off the Grid (no cloud APIs) Β· Well-Tuned (publish fine-tune) Β· Off-Brand (custom UI) Β· Llama Champion (llama.cpp runtime) Β· Sharing is Caring (publish agent trace) Β· Field Notes (build blog).

Decision: build the AI Prof (Backyard AI)

Problem (real): a specific classmate finds that having the slides isn't enough β€” they're static and lack the in-class explanation. Test with real, anonymized lecture slides from our class.

Core loop: upload lecture PDF β†’ model reads each slide as an image β†’ explains it like a TA, streamed in real time β†’ classmate can interject with a question at any moment.

Two-model architecture (stacks OpenBMB + NVIDIA, cleanly justified)

  • MiniCPM-V (~4.1B, GGUF) = the eyes. Reads each slide as an image (diagrams, equations, layout β€” not just scraped text). Core β†’ satisfies OpenBMB. Runs via llama.cpp β†’ Llama Champion badge.
  • Nemotron 3 Nano (9B, or 30B-A3B MoE = only ~3.6B active β†’ fast decode) = the brain. Turns the slide reading into the spoken explanation and answers interjections. Reasoning/agentic β†’ NVIDIA fit.
  • Per-model cap means no param-budget tension between the two. Division of labor (see/explain) survives the "core part of the experience" test β†’ not sponsor-stuffing.

Prize map for this one app

Backyard placement Β· OpenBMB $10k Β· NVIDIA RTX 5080 Β· OpenAI $10k (Codex commits) Β· Modal $20k (self-host) Β· Off-Brand $1.5k+badge (custom whiteboard UI) Β· Llama Champion Β· Off the Grid Β· Sharing is Caring (publish a teaching-session trace) Β· Field Notes (blog) Β· Best Demo. β†’ realistically 6+ surfaces.

Scope discipline (vertical slice order)

  1. Slide upload β†’ MiniCPM-V reads β†’ Nemotron explains, streamed in a simple custom UI. (Submittable alone.)
  2. Text interjection (pause, ask, resume).
  3. Whiteboard β€” model emits Mermaid / Excalidraw JSON (structured; small models do this reliably). Avoid freeform tldraw generation β€” that's the day-eating trap.
  4. Voice / TTS β€” only with slack.

Real-time architecture

Goal: feel like a live lecture β€” explanation streams as if the prof is talking through each slide, smooth slide-to-slide, interruptible mid-sentence.

  • Token streaming: stream Nemotron output to the UI (Gradio generators) so it appears as it's generated.
  • Complete index before teaching: process the full deck before starting the lecture. The professor needs a global map to choose slides intelligently and answer interjections. Show preparation progress in Gradio.
  • Two-stage per slide, cached: (a) MiniCPM-V β†’ a structured "slide reading" (text + diagram desc + equations), computed once and cached; (b) Nemotron β†’ the explanation. Interjections reuse cached (a), never re-run vision.
  • Interjection = interrupt + branch: need a cancellable generation (async cancel token / threading.Event). On input: stop current stream, answer using the cached slide reading + history as context, then resume.
  • Streaming TTS (optional): chunk explanation into sentences, TTS each as generated, play sequentially. Barge-in (interrupt-on-speech) is hard mode β€” defer.
  • Session state: { slides[], current_index, cached_slide_readings{}, conversation_history }.
  • Hosting tradeoff for low latency: free HF Space hardware likely too slow for 2 models in real time. Lean: Modal GPU as inference backend (uses the Modal track) + Gradio Space as frontend. Self-hosting (not an external inference API) still supports the Off the Grid badge.

Voice (STT + TTS) β€” makes it feel like class, but it's the deepest rabbit hole

Two more tiny, on-HF models (free under the ≀32B-per-model cap; FAQ uses "a 7B speech model" as its example). Keep both open + self-hosted (not ElevenLabs/Deepgram/OpenAI API) β†’ protects the Off the Grid badge.

Pipeline: TTS (out) speaks Nemotron's streamed explanation; STT (in) transcribes the spoken interjection into text for Nemotron; VAD is the referee that detects when the classmate starts talking β†’ triggers barge-in.

  • STT (interjections) β€” pick: Whisper, or Moonshine for speed.
    • faster-whisper / distil-whisper (large-v3 ~1.5B, or base/small for latency) β€” accurate, OpenAI open weights.
    • Moonshine (~tiny) β€” built for real-time/on-device, faster time-to-text on short clips.
    • Interjections are short β†’ small model is fine; time-to-text, not accuracy, is the bottleneck. Start with faster-whisper-base, switch to Moonshine only if laggy.
  • TTS (narration) β€” pick: Kokoro, fallback Piper.
    • Kokoro-82M (hexgrad/Kokoro-82M) β€” 82M, good quality, fast time-to-first-audio, streamable. Sweet spot.
    • Piper β€” even lower latency, CPU-friendly, slightly more robotic. Use if speech layer isn't on GPU.
    • Stream sentence-by-sentence: synthesize + play each sentence as Nemotron emits it (audio ~1 sentence behind gen).
  • Barge-in / turn-taking β€” use FastRTC (fastrtc, the Gradio/HF WebRTC stack). Gives low-latency mic+playback over WebRTC, built-in VAD turn detection (ReplyOnPause), and the hook to cancel playback + generation the instant the user speaks. Avoids hand-rolling silence detection; reuses our cancellable-generation design.
    • Loop: narrate (TTS) β†’ Silero VAD hears speech β†’ kill TTS + cancel Nemotron β†’ buffer to end-of-speech β†’ STT β†’ answer using the cached slide reading as context β†’ TTS answer β†’ resume narration.
  • Latency budget (what "feels live" needs): minimize time-to-first-audio. STT small (100–300ms) + Nemotron MoE fast prefill + Kokoro first chunk (tens of ms). Run STT/TTS/VAD on the same Modal GPU as Nemotron (or CPU for Piper+Silero) to avoid network hops.
  • Scope ladder (don't let voice eat the week):
    1. TTS narration only β€” Prof talks, classmate types to interject. Low risk, already feels real-time.
    2. Push-to-talk interject β€” hold key / tap to ask aloud β†’ STT. No VAD/barge-in. ~90% of magic, ~30% of work.
    3. Full-duplex barge-in via FastRTC + VAD β€” only once 1–2 are solid. The 2β†’3 jump is where time goes.
  • Model count after voice: MiniCPM-V (vision) + Nemotron (brain) + Whisper/Moonshine (STT) + Kokoro/Piper (TTS)
    • Silero VAD. Multi-model is explicitly blessed; more "appropriate model fit" surface, but more integration.

Agent loop, tools & slide grounding

The brain (Nemotron) runs as a tool-using agent driving the lecture β€” professor-esque: decides which slide to be on, when to draw vs. just talk, when to jump back. (Strengthens the Best Agent award; the tool-call sequence is the publishable agent trace β†’ Sharing is Caring badge.)

Tools (mutate UI / session state):

  • goto_slide(i) / next_slide() / prev_slide() β€” navigation; lets it jump back to a referenced slide.
  • look_closer(question) β€” on-demand real-time MiniCPM-V call on the current slide image for detail.
  • draw(mermaid | excalidraw_json) β€” render on the whiteboard surface.
  • clear_whiteboard()
  • highlight_region(bbox) β€” optional, later.
  • narration = the free-text part of the response β†’ streamed to TTS.

Control flow: orchestrator loop. Each turn the agent gets { current slide reading, deck outline, recent history, trigger } where trigger = "continue lecture" | "user asked: ". It returns narration + optional tool calls; the orchestrator executes tools (swap displayed slide, render whiteboard) and streams narration. Reserve heavy reasoning for decision points (between slides), not during narration, or latency balloons.

Grounding β€” preprocess (breadth) + real-time (depth), both:

  • On upload (once; complete before lecture begins):
    • render each slide β†’ image (serves both display and vision),
    • MiniCPM-V β†’ cached structured slide reading (title, bullets, equations, diagram desc, key concepts),
    • PDF text-layer extraction (exact text ground-truth; vision misreads text) to complement vision,
    • build a deck outline / index (slide β†’ title/concepts) β€” this is what lets the agent plan and pick a slide; it can't navigate to a slide it has never seen.
  • During lecture (look_closer): targeted MiniCPM-V look at the actual slide with the specific question in context β€” handles the long tail (specific visual questions, detail the cached summary glossed over).
  • Why both: preprocess = global map + fast/cheap narration + navigation; real-time = accuracy on specific asks.

UI: show the REAL slide, never a summary.

  • Main surface = the actual rendered slide image / PDF page, synced to the agent's current_slide.
  • Whiteboard = a separate adjacent canvas (Mermaid/Excalidraw render) so drawings read as the prof's annotations, not edits to the original slide. (Region-highlight overlay on the slide can come later.)
  • Plus: live caption / transcript + mic / interject control.

The detailed deployment, deck-cache, teaching-beat, interruption, speech, and whiteboard decisions are in ARCHITECTURE.md.

Fine-tuning (after core pipeline β€” not before)

Chosen direction: teaching-style SFT / guided learning β€” tune the brain (Nemotron) to explain like a good TA: analogies, checks for understanding, concise, no preamble. Bootstrap dataset = (slide reading β†’ ideal explanation) pairs generated with a strong model + a little hand-curation; QLoRA via TRL / Unsloth on Modal; publish the adapter to HF β†’ Well-Tuned badge (+ feeds Bonus Quest Champion). Put a before/after note in the model card. Strictly a step-2 enhancement β€” only once the live pipeline works and there's a clear objective.

Backburner ideas (TTW β€” only if a 2nd submission is feasible)

  • Emotion-driven TRPG / choose-your-own-adventure: free text β†’ LLM updates structured NPC emotional state (trust/fear/affection…) β†’ state steers the story. Best Agent + Sharing is Caring (emotion-delta traces).
  • AI Garlic Phone: message/drawing passed down a chain of model personas, mutating; needs a reveal payoff. If a drawing is passed, pulls in vision = OpenBMB again.
  • Sketch-to-story adventure: you draw your action, MiniCPM-V interprets the doodle, the world reacts. Fuses vision + emotion mechanic. Strong demo, only-AI-can-do-this.
  • Model-vs-model battle β†’ traces for training (browser-brawl-style, different domain): most technically impressive (Best Agent + Well-Tuned + Sharing is Caring) but research-shaped; the training loop can eat the week.

Key model references