Spaces:

build-small-hackathon
/

ai-prof

Running

App Files Files Community

ai-prof / IDEATION.md

pranavkarthik10

Deploy AI Prof hackathon submission

81e3ca2 verified 15 days ago

preview code

Raw

History Blame Contribute Delete

13.1 kB

	# Build Small Hackathon — Ideation & Decisions

	> Working notes from ideation. Project: AI Prof (primary submission).
	> Hackathon: https://huggingface.co/build-small-hackathon · Field guide: https://build-small-hackathon-field-guide.hf.space/

	## Hackathon constraints (the rules that shape everything)
	- Model size: ≤32B parameters PER MODEL (not aggregate). You can freely combine multiple
	small models as long as each one individually stays under the cap. ("not just active params")
	- No sponsor exclusivity — a sponsor's model just has to be a core part of the experience;
	you may mix in other providers' models.
	- Platform: must be built in Gradio, hosted as a Hugging Face Space.
	- Deliverables: working Space + demo video + social media post.
	- Timeline: hack window June 5–15, 2026.
	- Credits: OpenAI $100 is for Codex (their coding agent), not API/inference. Modal $250, HF $20.

	## Tracks
	- Backyard AI — solve a genuine problem for someone you personally know. Judged on specificity,
	actual user adoption, and appropriate model fit. ← our track.
	- Thousand Token Wood (TTW) — delightful/unconventional, AI must be load-bearing (game, story, art).

	## Prize surface (it stacks — one app can hit many)
	Main tracks ($18k): 1st–4th per track + Community Choice ($2k).
	Sponsor awards:
	- OpenBMB $10k — build with MiniCPM (incl. vision MiniCPM-V / omni MiniCPM-o).
	- OpenAI $10k — won via Codex-attributed commits in the repo/Space (build-tool track, model-agnostic).
	- NVIDIA — two RTX 5080 GPUs; build with Nemotron. One for "best space", one for community engagement (likes).
	- Modal $20k credits — use Modal for dev/runtime, note in README.
	Special awards ($8k): Bonus Quest Champion $2k · Off-Brand (best custom UI via `gr.Server`) $1.5k ·
	Tiny Titan (best app on ≤4B model) $1.5k · Best Demo $1k · Best Agent $1k · Judges' Wildcard $1k.
	Six merit badges: Off the Grid (no cloud APIs) · Well-Tuned (publish fine-tune) · Off-Brand (custom UI) ·
	Llama Champion (llama.cpp runtime) · Sharing is Caring (publish agent trace) · Field Notes (build blog).

	## Decision: build the AI Prof (Backyard AI)
	Problem (real): a specific classmate finds that having the slides isn't enough — they're static and
	lack the in-class explanation. Test with real, anonymized lecture slides from our class.

	Core loop: upload lecture PDF → model reads each slide as an image → explains it like a TA,
	streamed in real time → classmate can interject with a question at any moment.

	### Two-model architecture (stacks OpenBMB + NVIDIA, cleanly justified)
	- MiniCPM-V (~4.1B, GGUF) = the eyes. Reads each slide as an image (diagrams, equations, layout —
	not just scraped text). Core → satisfies OpenBMB. Runs via llama.cpp → Llama Champion badge.
	- Nemotron 3 Nano (9B, or 30B-A3B MoE = only ~3.6B active → fast decode) = the brain. Turns the
	slide reading into the spoken explanation and answers interjections. Reasoning/agentic → NVIDIA fit.
	- Per-model cap means no param-budget tension between the two. Division of labor (see/explain) survives
	the "core part of the experience" test → not sponsor-stuffing.

	### Prize map for this one app
	Backyard placement · OpenBMB $10k · NVIDIA RTX 5080 · OpenAI $10k (Codex commits) · Modal $20k (self-host) ·
	Off-Brand $1.5k+badge (custom whiteboard UI) · Llama Champion · Off the Grid · Sharing is Caring (publish a
	teaching-session trace) · Field Notes (blog) · Best Demo. → realistically 6+ surfaces.

	### Scope discipline (vertical slice order)
	1. Slide upload → MiniCPM-V reads → Nemotron explains, streamed in a simple custom UI. (Submittable alone.)
	2. Text interjection (pause, ask, resume).
	3. Whiteboard — model emits Mermaid / Excalidraw JSON (structured; small models do this reliably).
	Avoid freeform tldraw generation — that's the day-eating trap.
	4. Voice / TTS — only with slack.

	## Real-time architecture
	Goal: feel like a live lecture — explanation streams as if the prof is talking through each slide,
	smooth slide-to-slide, interruptible mid-sentence.
	- Token streaming: stream Nemotron output to the UI (Gradio generators) so it appears as it's generated.
	- Complete index before teaching: process the full deck before starting the lecture. The professor needs a
	global map to choose slides intelligently and answer interjections. Show preparation progress in Gradio.
	- Two-stage per slide, cached: (a) MiniCPM-V → a structured "slide reading" (text + diagram desc + equations),
	computed once and cached; (b) Nemotron → the explanation. Interjections reuse cached (a), never re-run vision.
	- Interjection = interrupt + branch: need a cancellable generation (async cancel token / threading.Event).
	On input: stop current stream, answer using the cached slide reading + history as context, then resume.
	- Streaming TTS (optional): chunk explanation into sentences, TTS each as generated, play sequentially.
	Barge-in (interrupt-on-speech) is hard mode — defer.
	- Session state: { slides[], current_index, cached_slide_readings{}, conversation_history }.
	- Hosting tradeoff for low latency: free HF Space hardware likely too slow for 2 models in real time.
	Lean: Modal GPU as inference backend (uses the Modal track) + Gradio Space as frontend. Self-hosting
	(not an external inference API) still supports the Off the Grid badge.

	## Voice (STT + TTS) — makes it feel like class, but it's the deepest rabbit hole
	Two more tiny, on-HF models (free under the ≤32B-per-model cap; FAQ uses "a 7B speech model" as its example).
	Keep both open + self-hosted (not ElevenLabs/Deepgram/OpenAI API) → protects the Off the Grid badge.

	Pipeline: TTS (out) speaks Nemotron's streamed explanation; STT (in) transcribes the spoken interjection
	into text for Nemotron; VAD is the referee that detects when the classmate starts talking → triggers barge-in.

	- STT (interjections) — pick: Whisper, or Moonshine for speed.
	- `faster-whisper` / `distil-whisper` (large-v3 ~1.5B, or base/small for latency) — accurate, OpenAI open weights.
	- Moonshine (~tiny) — built for real-time/on-device, faster time-to-text on short clips.
	- Interjections are short → small model is fine; time-to-text, not accuracy, is the bottleneck. Start with
	`faster-whisper-base`, switch to Moonshine only if laggy.
	- TTS (narration) — pick: Kokoro, fallback Piper.
	- Kokoro-82M (`hexgrad/Kokoro-82M`) — 82M, good quality, fast time-to-first-audio, streamable. Sweet spot.
	- Piper — even lower latency, CPU-friendly, slightly more robotic. Use if speech layer isn't on GPU.
	- Stream sentence-by-sentence: synthesize + play each sentence as Nemotron emits it (audio ~1 sentence behind gen).
	- Barge-in / turn-taking — use FastRTC (`fastrtc`, the Gradio/HF WebRTC stack). Gives low-latency mic+playback
	over WebRTC, built-in VAD turn detection (`ReplyOnPause`), and the hook to cancel playback + generation the
	instant the user speaks. Avoids hand-rolling silence detection; reuses our cancellable-generation design.
	- Loop: narrate (TTS) → Silero VAD hears speech → kill TTS + cancel Nemotron → buffer to end-of-speech → STT →
	answer using the cached slide reading as context → TTS answer → resume narration.
	- Latency budget (what "feels live" needs): minimize time-to-first-audio. STT small (~100–300ms) + Nemotron
	MoE fast prefill + Kokoro first chunk (~tens of ms). Run STT/TTS/VAD on the same Modal GPU as Nemotron
	(or CPU for Piper+Silero) to avoid network hops.
	- Scope ladder (don't let voice eat the week):
	1. TTS narration only — Prof talks, classmate types to interject. Low risk, already feels real-time.
	2. Push-to-talk interject — hold key / tap to ask aloud → STT. No VAD/barge-in. ~90% of magic, ~30% of work.
	3. Full-duplex barge-in via FastRTC + VAD — only once 1–2 are solid. The 2→3 jump is where time goes.
	- Model count after voice: MiniCPM-V (vision) + Nemotron (brain) + Whisper/Moonshine (STT) + Kokoro/Piper (TTS)
	+ Silero VAD. Multi-model is explicitly blessed; more "appropriate model fit" surface, but more integration.

	## Agent loop, tools & slide grounding
	The brain (Nemotron) runs as a tool-using agent driving the lecture — professor-esque: decides which slide
	to be on, when to draw vs. just talk, when to jump back. (Strengthens the Best Agent award; the tool-call
	sequence is the publishable agent trace → Sharing is Caring badge.)

	Tools (mutate UI / session state):
	- `goto_slide(i)` / `next_slide()` / `prev_slide()` — navigation; lets it jump back to a referenced slide.
	- `look_closer(question)` — on-demand real-time MiniCPM-V call on the current slide image for detail.
	- `draw(mermaid \| excalidraw_json)` — render on the whiteboard surface.
	- `clear_whiteboard()`
	- `highlight_region(bbox)` — optional, later.
	- narration = the free-text part of the response → streamed to TTS.

	Control flow: orchestrator loop. Each turn the agent gets `{ current slide reading, deck outline, recent
	history, trigger }` where trigger = "continue lecture" \| "user asked: <q>". It returns narration + optional
	tool calls; the orchestrator executes tools (swap displayed slide, render whiteboard) and streams narration.
	Reserve heavy reasoning for decision points (between slides), not during narration, or latency balloons.

	Grounding — preprocess (breadth) + real-time (depth), both:
	- On upload (once; complete before lecture begins):
	- render each slide → image (serves both display and vision),
	- MiniCPM-V → cached structured slide reading (title, bullets, equations, diagram desc, key concepts),
	- PDF text-layer extraction (exact text ground-truth; vision misreads text) to complement vision,
	- build a deck outline / index (slide → title/concepts) — this is what lets the agent plan and pick a
	slide; it can't navigate to a slide it has never seen.
	- During lecture (`look_closer`): targeted MiniCPM-V look at the actual slide *with the specific question in
	context* — handles the long tail (specific visual questions, detail the cached summary glossed over).
	- Why both: preprocess = global map + fast/cheap narration + navigation; real-time = accuracy on specific asks.

	UI: show the REAL slide, never a summary.
	- Main surface = the actual rendered slide image / PDF page, synced to the agent's `current_slide`.
	- Whiteboard = a separate adjacent canvas (Mermaid/Excalidraw render) so drawings read as the prof's
	annotations, not edits to the original slide. (Region-highlight overlay on the slide can come later.)
	- Plus: live caption / transcript + mic / interject control.

	The detailed deployment, deck-cache, teaching-beat, interruption, speech, and whiteboard decisions are in
	[`ARCHITECTURE.md`](ARCHITECTURE.md).

	## Fine-tuning (after core pipeline — not before)
	Chosen direction: teaching-style SFT / guided learning — tune the brain (Nemotron) to explain like a good
	TA: analogies, checks for understanding, concise, no preamble. Bootstrap dataset = (slide reading → ideal
	explanation) pairs generated with a strong model + a little hand-curation; QLoRA via TRL / Unsloth on Modal;
	publish the adapter to HF → Well-Tuned badge (+ feeds Bonus Quest Champion). Put a before/after note in the
	model card. Strictly a step-2 enhancement — only once the live pipeline works and there's a clear objective.

	## Backburner ideas (TTW — only if a 2nd submission is feasible)
	- Emotion-driven TRPG / choose-your-own-adventure: free text → LLM updates structured NPC emotional
	state (trust/fear/affection…) → state steers the story. Best Agent + Sharing is Caring (emotion-delta traces).
	- AI Garlic Phone: message/drawing passed down a chain of model personas, mutating; needs a reveal payoff.
	If a drawing is passed, pulls in vision = OpenBMB again.
	- Sketch-to-story adventure: you draw your action, MiniCPM-V interprets the doodle, the world reacts.
	Fuses vision + emotion mechanic. Strong demo, only-AI-can-do-this.
	- Model-vs-model battle → traces for training (browser-brawl-style, different domain): most technically
	impressive (Best Agent + Well-Tuned + Sharing is Caring) but research-shaped; the training loop can eat the week.

	## Key model references
	- MiniCPM-V-4 (4.1B, multimodal): https://huggingface.co/openbmb/MiniCPM-V-4
	- MiniCPM-V-4 GGUF (llama.cpp): https://huggingface.co/openbmb/MiniCPM-V-4-gguf
	- Nemotron 3 Nano collection: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3
	- Nemotron 3 Nano 4B: https://huggingface.co/blog/nvidia/nemotron-3-nano-4b
	- Nemotron 3 Nano 30B-A3B: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

	# Build Small Hackathon — Ideation & Decisions

	> Working notes from ideation. Project: AI Prof (primary submission).
	> Hackathon: https://huggingface.co/build-small-hackathon · Field guide: https://build-small-hackathon-field-guide.hf.space/

	## Hackathon constraints (the rules that shape everything)
	- Model size: ≤32B parameters PER MODEL (not aggregate). You can freely combine multiple
	small models as long as each one individually stays under the cap. ("not just active params")
	- No sponsor exclusivity — a sponsor's model just has to be a core part of the experience;
	you may mix in other providers' models.
	- Platform: must be built in Gradio, hosted as a Hugging Face Space.
	- Deliverables: working Space + demo video + social media post.
	- Timeline: hack window June 5–15, 2026.
	- Credits: OpenAI $100 is for Codex (their coding agent), not API/inference. Modal $250, HF $20.

	## Tracks
	- Backyard AI — solve a genuine problem for someone you personally know. Judged on specificity,
	actual user adoption, and appropriate model fit. ← our track.
	- Thousand Token Wood (TTW) — delightful/unconventional, AI must be load-bearing (game, story, art).

	## Prize surface (it stacks — one app can hit many)
	Main tracks ($18k): 1st–4th per track + Community Choice ($2k).
	Sponsor awards:
	- OpenBMB $10k — build with MiniCPM (incl. vision MiniCPM-V / omni MiniCPM-o).
	- OpenAI $10k — won via Codex-attributed commits in the repo/Space (build-tool track, model-agnostic).
	- NVIDIA — two RTX 5080 GPUs; build with Nemotron. One for "best space", one for community engagement (likes).
	- Modal $20k credits — use Modal for dev/runtime, note in README.
	Special awards ($8k): Bonus Quest Champion $2k · Off-Brand (best custom UI via `gr.Server`) $1.5k ·
	Tiny Titan (best app on ≤4B model) $1.5k · Best Demo $1k · Best Agent $1k · Judges' Wildcard $1k.
	Six merit badges: Off the Grid (no cloud APIs) · Well-Tuned (publish fine-tune) · Off-Brand (custom UI) ·
	Llama Champion (llama.cpp runtime) · Sharing is Caring (publish agent trace) · Field Notes (build blog).

	## Decision: build the AI Prof (Backyard AI)
	Problem (real): a specific classmate finds that having the slides isn't enough — they're static and
	lack the in-class explanation. Test with real, anonymized lecture slides from our class.

	Core loop: upload lecture PDF → model reads each slide as an image → explains it like a TA,
	streamed in real time → classmate can interject with a question at any moment.

	### Two-model architecture (stacks OpenBMB + NVIDIA, cleanly justified)
	- MiniCPM-V (~4.1B, GGUF) = the eyes. Reads each slide as an image (diagrams, equations, layout —
	not just scraped text). Core → satisfies OpenBMB. Runs via llama.cpp → Llama Champion badge.
	- Nemotron 3 Nano (9B, or 30B-A3B MoE = only ~3.6B active → fast decode) = the brain. Turns the
	slide reading into the spoken explanation and answers interjections. Reasoning/agentic → NVIDIA fit.
	- Per-model cap means no param-budget tension between the two. Division of labor (see/explain) survives
	the "core part of the experience" test → not sponsor-stuffing.

	### Prize map for this one app
	Backyard placement · OpenBMB $10k · NVIDIA RTX 5080 · OpenAI $10k (Codex commits) · Modal $20k (self-host) ·
	Off-Brand $1.5k+badge (custom whiteboard UI) · Llama Champion · Off the Grid · Sharing is Caring (publish a
	teaching-session trace) · Field Notes (blog) · Best Demo. → realistically 6+ surfaces.

	### Scope discipline (vertical slice order)
	1. Slide upload → MiniCPM-V reads → Nemotron explains, streamed in a simple custom UI. (Submittable alone.)
	2. Text interjection (pause, ask, resume).
	3. Whiteboard — model emits Mermaid / Excalidraw JSON (structured; small models do this reliably).
	Avoid freeform tldraw generation — that's the day-eating trap.
	4. Voice / TTS — only with slack.

	## Real-time architecture
	Goal: feel like a live lecture — explanation streams as if the prof is talking through each slide,
	smooth slide-to-slide, interruptible mid-sentence.
	- Token streaming: stream Nemotron output to the UI (Gradio generators) so it appears as it's generated.
	- Complete index before teaching: process the full deck before starting the lecture. The professor needs a
	global map to choose slides intelligently and answer interjections. Show preparation progress in Gradio.
	- Two-stage per slide, cached: (a) MiniCPM-V → a structured "slide reading" (text + diagram desc + equations),
	computed once and cached; (b) Nemotron → the explanation. Interjections reuse cached (a), never re-run vision.
	- Interjection = interrupt + branch: need a cancellable generation (async cancel token / threading.Event).
	On input: stop current stream, answer using the cached slide reading + history as context, then resume.
	- Streaming TTS (optional): chunk explanation into sentences, TTS each as generated, play sequentially.
	Barge-in (interrupt-on-speech) is hard mode — defer.
	- Session state: { slides[], current_index, cached_slide_readings{}, conversation_history }.
	- Hosting tradeoff for low latency: free HF Space hardware likely too slow for 2 models in real time.
	Lean: Modal GPU as inference backend (uses the Modal track) + Gradio Space as frontend. Self-hosting
	(not an external inference API) still supports the Off the Grid badge.

	## Voice (STT + TTS) — makes it feel like class, but it's the deepest rabbit hole
	Two more tiny, on-HF models (free under the ≤32B-per-model cap; FAQ uses "a 7B speech model" as its example).
	Keep both open + self-hosted (not ElevenLabs/Deepgram/OpenAI API) → protects the Off the Grid badge.

	Pipeline: TTS (out) speaks Nemotron's streamed explanation; STT (in) transcribes the spoken interjection
	into text for Nemotron; VAD is the referee that detects when the classmate starts talking → triggers barge-in.

	- STT (interjections) — pick: Whisper, or Moonshine for speed.
	- `faster-whisper` / `distil-whisper` (large-v3 ~1.5B, or base/small for latency) — accurate, OpenAI open weights.
	- Moonshine (~tiny) — built for real-time/on-device, faster time-to-text on short clips.
	- Interjections are short → small model is fine; time-to-text, not accuracy, is the bottleneck. Start with
	`faster-whisper-base`, switch to Moonshine only if laggy.
	- TTS (narration) — pick: Kokoro, fallback Piper.
	- Kokoro-82M (`hexgrad/Kokoro-82M`) — 82M, good quality, fast time-to-first-audio, streamable. Sweet spot.
	- Piper — even lower latency, CPU-friendly, slightly more robotic. Use if speech layer isn't on GPU.
	- Stream sentence-by-sentence: synthesize + play each sentence as Nemotron emits it (audio ~1 sentence behind gen).
	- Barge-in / turn-taking — use FastRTC (`fastrtc`, the Gradio/HF WebRTC stack). Gives low-latency mic+playback
	over WebRTC, built-in VAD turn detection (`ReplyOnPause`), and the hook to cancel playback + generation the
	instant the user speaks. Avoids hand-rolling silence detection; reuses our cancellable-generation design.
	- Loop: narrate (TTS) → Silero VAD hears speech → kill TTS + cancel Nemotron → buffer to end-of-speech → STT →
	answer using the cached slide reading as context → TTS answer → resume narration.
	- Latency budget (what "feels live" needs): minimize time-to-first-audio. STT small (~100–300ms) + Nemotron
	MoE fast prefill + Kokoro first chunk (~tens of ms). Run STT/TTS/VAD on the same Modal GPU as Nemotron
	(or CPU for Piper+Silero) to avoid network hops.
	- Scope ladder (don't let voice eat the week):
	1. TTS narration only — Prof talks, classmate types to interject. Low risk, already feels real-time.
	2. Push-to-talk interject — hold key / tap to ask aloud → STT. No VAD/barge-in. ~90% of magic, ~30% of work.
	3. Full-duplex barge-in via FastRTC + VAD — only once 1–2 are solid. The 2→3 jump is where time goes.
	- Model count after voice: MiniCPM-V (vision) + Nemotron (brain) + Whisper/Moonshine (STT) + Kokoro/Piper (TTS)
	+ Silero VAD. Multi-model is explicitly blessed; more "appropriate model fit" surface, but more integration.

	## Agent loop, tools & slide grounding
	The brain (Nemotron) runs as a tool-using agent driving the lecture — professor-esque: decides which slide
	to be on, when to draw vs. just talk, when to jump back. (Strengthens the Best Agent award; the tool-call
	sequence is the publishable agent trace → Sharing is Caring badge.)

	Tools (mutate UI / session state):
	- `goto_slide(i)` / `next_slide()` / `prev_slide()` — navigation; lets it jump back to a referenced slide.
	- `look_closer(question)` — on-demand real-time MiniCPM-V call on the current slide image for detail.
	- `draw(mermaid \| excalidraw_json)` — render on the whiteboard surface.
	- `clear_whiteboard()`
	- `highlight_region(bbox)` — optional, later.
	- narration = the free-text part of the response → streamed to TTS.

	Control flow: orchestrator loop. Each turn the agent gets `{ current slide reading, deck outline, recent
	history, trigger }` where trigger = "continue lecture" \| "user asked: <q>". It returns narration + optional
	tool calls; the orchestrator executes tools (swap displayed slide, render whiteboard) and streams narration.
	Reserve heavy reasoning for decision points (between slides), not during narration, or latency balloons.

	Grounding — preprocess (breadth) + real-time (depth), both:
	- On upload (once; complete before lecture begins):
	- render each slide → image (serves both display and vision),
	- MiniCPM-V → cached structured slide reading (title, bullets, equations, diagram desc, key concepts),
	- PDF text-layer extraction (exact text ground-truth; vision misreads text) to complement vision,
	- build a deck outline / index (slide → title/concepts) — this is what lets the agent plan and pick a
	slide; it can't navigate to a slide it has never seen.
	- During lecture (`look_closer`): targeted MiniCPM-V look at the actual slide *with the specific question in
	context* — handles the long tail (specific visual questions, detail the cached summary glossed over).
	- Why both: preprocess = global map + fast/cheap narration + navigation; real-time = accuracy on specific asks.

	UI: show the REAL slide, never a summary.
	- Main surface = the actual rendered slide image / PDF page, synced to the agent's `current_slide`.
	- Whiteboard = a separate adjacent canvas (Mermaid/Excalidraw render) so drawings read as the prof's
	annotations, not edits to the original slide. (Region-highlight overlay on the slide can come later.)
	- Plus: live caption / transcript + mic / interject control.

	The detailed deployment, deck-cache, teaching-beat, interruption, speech, and whiteboard decisions are in
	[`ARCHITECTURE.md`](ARCHITECTURE.md).

	## Fine-tuning (after core pipeline — not before)
	Chosen direction: teaching-style SFT / guided learning — tune the brain (Nemotron) to explain like a good
	TA: analogies, checks for understanding, concise, no preamble. Bootstrap dataset = (slide reading → ideal
	explanation) pairs generated with a strong model + a little hand-curation; QLoRA via TRL / Unsloth on Modal;
	publish the adapter to HF → Well-Tuned badge (+ feeds Bonus Quest Champion). Put a before/after note in the
	model card. Strictly a step-2 enhancement — only once the live pipeline works and there's a clear objective.

	## Backburner ideas (TTW — only if a 2nd submission is feasible)
	- Emotion-driven TRPG / choose-your-own-adventure: free text → LLM updates structured NPC emotional
	state (trust/fear/affection…) → state steers the story. Best Agent + Sharing is Caring (emotion-delta traces).
	- AI Garlic Phone: message/drawing passed down a chain of model personas, mutating; needs a reveal payoff.
	If a drawing is passed, pulls in vision = OpenBMB again.
	- Sketch-to-story adventure: you draw your action, MiniCPM-V interprets the doodle, the world reacts.
	Fuses vision + emotion mechanic. Strong demo, only-AI-can-do-this.
	- Model-vs-model battle → traces for training (browser-brawl-style, different domain): most technically
	impressive (Best Agent + Well-Tuned + Sharing is Caring) but research-shaped; the training loop can eat the week.

	## Key model references
	- MiniCPM-V-4 (4.1B, multimodal): https://huggingface.co/openbmb/MiniCPM-V-4
	- MiniCPM-V-4 GGUF (llama.cpp): https://huggingface.co/openbmb/MiniCPM-V-4-gguf
	- Nemotron 3 Nano collection: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3
	- Nemotron 3 Nano 4B: https://huggingface.co/blog/nvidia/nemotron-3-nano-4b
	- Nemotron 3 Nano 30B-A3B: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16