A1: "Third Eye" · Master Win Plan
HF Build Small Hackathon 2026 · Backyard AI Track
This is the strategy and handoff document. The build itself is driven by
AGENT_PROMPT.md (written for a smaller model). This file is for you (the human) and serves as context for any agent.
0. One-line pitch
Third Eye: point a webcam at anything, speak a question in your own language, and hear the answer back. A fully voice-driven "second sight" for blind and low-vision people, running on sub-3B sponsor models small enough to live on an edge device.
Why it wins: it is the rare hackathon entry that is emotionally undeniable (a blind person hearing a menu read aloud), technically tidy (tiny models, clean pipeline), and visually striking (a futuristic voice-first UI that doubles as a WCAG-AA accessibility layer). That one project legitimately reaches across 6+ award tracks at once.
1. Track strategy: how ONE build collects MANY prizes
The whole point of this plan: do not chase tracks separately. Build one coherent app where each feature automatically satisfies a track. The map is below; every row must have a concrete artifact.
| Track / Award | What unlocks it | Concrete artifact we ship |
|---|---|---|
| Backyard AI podium | Underserved real-world use-case, real user impact | Accessibility app + a real first-person demo (you, eyes closed, using it) |
| OpenBMB award | Use OpenBMB models | MiniCPM-V (vision/OCR) + VoxCPM (TTS) = two OpenBMB models |
| Cohere award | Use a Cohere model | Cohere Transcribe = the STT stage; multilingual = Cohere's sweet spot |
| Tiny Titan ($1,500) | Primary model ≤ ~4B params | MiniCPM-V-2 primary at 2.8B, under the cap |
| Best Demo ($1,000) | A 30-60s demo that lands | Scripted video: blind-POV menu read aloud in Hindi |
| Off-Brand ($1,500) | Custom UI, not stock Gradio | Custom CSS + "Iris" voice-first design system (see §4) |
| Field Notes (badge) | Public write-up | HF blog post: "What VLM quality really feels like at 2.8B" |
| Edge / on-device angle | Runs small enough for hardware | GGUF int4 roadmap + on-device claim (see §6) |
Rule of thumb for the builder: if a feature doesn't move a track forward or keep the core pipeline alive, it is out of scope for the hackathon window.
2. Scope discipline (what to build, in order)
The single biggest risk in a hackathon is a half-working everything. We build a vertical slice first (one path that works end-to-end), then widen.
MUST-HAVE (submission is invalid without these):
- Webcam image capture in the browser.
- Vision pipeline: image + question → text answer (MiniCPM-V on Modal GPU).
- TTS: answer text → spoken audio that auto-plays (VoxCPM on Modal GPU).
- The "Describe" path working end-to-end without typing.
- 3 bundled example images so judges with no webcam can still test.
- Graceful failures everywhere (never show a raw traceback).
SHOULD-HAVE (these win the extra tracks):
7. "Ask" path: mic → Cohere Transcribe → vision → TTS (the zero-typing loop).
8. "Read Text" path: fixed OCR prompt.
9. Custom "Iris" UI + custom CSS (Off-Brand).
10. Language selector → multilingual TTS (Cohere multilingual story).
NICE-TO-HAVE (only if time remains):
11. Bounding-box "zoom & read" tab (gr.ImageEditor).
12. Model-weight caching on a Modal Volume for fast cold starts.
13. On-device GGUF proof-of-concept note + benchmark table.
Ship MUST-HAVE before touching SHOULD-HAVE. Ship SHOULD-HAVE before NICE-TO-HAVE. No exceptions.
3. Architecture
┌─────────────────────────── HF Space (Gradio 5.x) ────────────────────────────┐
│ Browser: webcam frame + mic audio + language choice                          │
│    │                                                                         │
│    ▼                                                                         │
│ app.py ──(audio bytes)──► Cohere Transcribe (STT, 2B) ── runs on Modal       │
│    │   ◄── question text ──                                                  │
│    │                                                                         │
│    ├──(image bytes + question)──► MiniCPM-V (2.8B vision/OCR) ── Modal A10G  │
│    │   ◄── answer text ──                                                    │
│    │                                                                         │
│    ├──(answer text + lang)──► VoxCPM (2B TTS) ── Modal A10G                  │
│    │   ◄── WAV bytes ──                                                      │
│    ▼                                                                         │
│ auto-playing audio + large-text transcript + ARIA live announcements         │
└──────────────────────────────────────────────────────────────────────────────┘
- Frontend / orchestration: Gradio 5.x gr.Blocks on a Hugging Face Space.
- Heavy compute: Modal serverless GPU (A10G). Vision, STT, and TTS all run as Modal functions.
- Why split this way: HF Spaces free tier can't comfortably hold 3 models hot; Modal gives on-demand GPU and keeps the Space light. The Space holds the token to call Modal.
- No other cloud APIs. Only sponsor models. This keeps "off-grid-ish" eligibility intact (note in README that Modal is infra, not a model provider).
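The data flow above can be sketched as two plain functions that chain the stages. This is a minimal sketch, not the project's actual app.py: the stage callables (transcribe, describe_scene, speak) stand in for the Modal functions, and their signatures, the PipelineResult type, and the DESCRIBE_PROMPT wording are all assumptions until the real model APIs are verified against the model cards.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical fixed prompt for the "Describe" path.
DESCRIBE_PROMPT = "Describe what is in front of me."


@dataclass
class PipelineResult:
    question: str     # what was asked (spoken question or the fixed prompt)
    answer: str       # text answer from the vision model
    audio_wav: bytes  # spoken answer as WAV bytes


def describe(
    image_bytes: bytes,
    lang: str,
    describe_scene: Callable[[bytes, str], str],
    speak: Callable[[str, str], bytes],
) -> PipelineResult:
    """'Describe' path: image -> vision -> TTS, no mic and no typing."""
    answer = describe_scene(image_bytes, DESCRIBE_PROMPT)
    return PipelineResult(DESCRIBE_PROMPT, answer, speak(answer, lang))


def ask(
    image_bytes: bytes,
    audio_bytes: bytes,
    lang: str,
    transcribe: Callable[[bytes], str],
    describe_scene: Callable[[bytes, str], str],
    speak: Callable[[str, str], bytes],
) -> PipelineResult:
    """'Ask' path: mic -> STT -> vision -> TTS."""
    question = transcribe(audio_bytes)
    answer = describe_scene(image_bytes, question)
    return PipelineResult(question, answer, speak(answer, lang))


# Usage with fake stages, in the spirit of stub functions that return fake data.
fake_vision = lambda img, q: f"stub answer to: {q}"
fake_tts = lambda text, lang: b"RIFF" + text.encode("utf-8")
fake_stt = lambda audio: "what does this sign say?"

described = describe(b"fake-jpeg", "hi", fake_vision, fake_tts)
asked = ask(b"fake-jpeg", b"fake-audio", "hi", fake_stt, fake_vision, fake_tts)
print(described.answer)
print(asked.answer)
```

Passing the stages in as arguments keeps the orchestration testable without a GPU, which suits the vertical-slice approach: the same functions run against stubs first and against Modal later.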
Files
third-eye/
├── app.py              # Gradio UI + pipeline orchestration
├── modal_backend.py    # Modal app: vision + TTS (+ STT) functions, GPU
├── cohere_stt.py       # Cohere Transcribe wrapper (called via Modal)
├── utils.py            # image<->bytes, bytes<->wav helpers, safe-call wrapper
├── requirements.txt
├── .env.example        # MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, HF_TOKEN
├── README.md           # HF Space frontmatter + story + edge-device section
├── BLOG.md             # Field Notes draft (publish to HF blog)
├── DEMO_SCRIPT.md      # 45-sec shot list for Best Demo
└── assets/
    ├── custom.css      # "Iris" design system, WCAG-AA, prefers-reduced-motion
    ├── sample_menu.jpg
    ├── sample_label.jpg
    └── sample_sign.jpg
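The file list says utils.py holds bytes<->wav helpers. A minimal sketch of what those might look like follows, using only the standard library. The function names are hypothetical, and the code assumes the TTS stage yields raw 16-bit PCM at a known sample rate; the real VoxCPM output format has to be checked before relying on this.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio (assumed format)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def wav_bytes_to_pcm(wav: bytes) -> tuple[bytes, int]:
    """Return (raw PCM frames, sample rate) from in-memory WAV bytes."""
    with wave.open(io.BytesIO(wav), "rb") as wf:
        return wf.readframes(wf.getnframes()), wf.getframerate()


# Round trip: one second of 16-bit mono silence at 16 kHz.
silence = b"\x00\x00" * 16000
wav = pcm_to_wav_bytes(silence)
pcm, rate = wav_bytes_to_pcm(wav)
print(len(wav), rate)
```

Working with bytes rather than temp files keeps the payload easy to pass between the Space and a remote function.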
4. Futuristic UI: the "Iris" design system (Off-Brand track)
The narrative that wins Off-Brand: "We designed for people who can't see the screen, and the visual layer still looks like it's from 2035." Accessibility constraints (high contrast, huge targets, clear focus) are turned into a deliberate, futuristic aesthetic, not an afterthought.
Concept: the app is an eye. A single glowing iris orb sits center-stage and changes state as the pipeline runs. The user never hunts for controls; there is essentially one big action, voice-first.
State machine (the orb visibly + audibly reflects each):
IDLE (slow breathing glow) → LISTENING (reactive ring / waveform) → SEEING (scan-line sweeps
the captured frame) → THINKING (orb tightens, pulses faster) → SPEAKING (waveform from the
orb, audio auto-plays) → back to IDLE. Each transition fires an ARIA live announcement and a
short audio cue, so a blind user tracks state by ear.
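The state cycle above can be sketched as a small lookup-driven state machine. This is an illustration under assumptions: the CSS class names (orb, orb--listening, and so on) and the announcement wording are hypothetical, and the real app may skip LISTENING on paths that use no mic.

```python
from enum import Enum


class OrbState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Forward transitions in pipeline order; SPEAKING loops back to IDLE.
NEXT_STATE = {
    OrbState.IDLE: OrbState.LISTENING,
    OrbState.LISTENING: OrbState.SEEING,
    OrbState.SEEING: OrbState.THINKING,
    OrbState.THINKING: OrbState.SPEAKING,
    OrbState.SPEAKING: OrbState.IDLE,
}

# Hypothetical wording for the ARIA live status line.
ANNOUNCEMENTS = {
    OrbState.IDLE: "Ready. Tap or speak.",
    OrbState.LISTENING: "Listening.",
    OrbState.SEEING: "Looking at the image.",
    OrbState.THINKING: "Thinking.",
    OrbState.SPEAKING: "Speaking the answer.",
}


def advance(state: OrbState) -> tuple[OrbState, str, str]:
    """Return (next state, CSS classes for the orb, ARIA live text)."""
    nxt = NEXT_STATE[state]
    return nxt, f"orb orb--{nxt.value}", ANNOUNCEMENTS[nxt]


# Walk one full cycle starting from IDLE.
state = OrbState.IDLE
for _ in range(len(OrbState)):
    state, css_class, announcement = advance(state)
    print(f"{css_class:20s} {announcement}")
```

Keeping the visual class and the spoken announcement in the same lookup makes it hard for the orb and the status line to drift apart.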
Visual language:
- Void background #06070A with a faint radial vignette + subtle film grain.
- Accent: indigo-to-cyan gradient (#5B7CFA → #3DE0FF). Glow via layered box-shadow.
- Text: #F5F7FA, base 20px, outputs 24px+, line-height 1.7. Contrast ≥ WCAG AA.
- Surfaces: glassmorphism panels (backdrop-filter: blur, 1px hairline border, soft inner glow).
- Primary control: one large circular "Tap / Speak" button, min 96px target, neon focus ring.
- Motion: breathing/pulse on the orb; scan-line on image; ALL motion switched off under @media (prefers-reduced-motion: reduce).
- Focus rings: thick cyan glow, which serves keyboard users and the futuristic look at once.
- Typography: Inter / system stack; tabular, calm, generous spacing.
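The contrast claim in the palette above can be checked with the WCAG 2.x contrast-ratio formula, where AA requires at least 4.5:1 for normal text. The sketch below computes it for the text color on the void background; the helper names are illustrative.

```python
def _linear(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a #RRGGBB color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


TEXT, VOID = "#F5F7FA", "#06070A"
ratio = contrast_ratio(TEXT, VOID)
print(f"{ratio:.1f}:1", "AA pass" if ratio >= 4.5 else "AA fail")
```

The same function can be pointed at the accent colors or the focus ring to confirm each pairing before the CSS is finalized.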
Voice-first interaction rules:
- Audio answer auto-plays the moment it's ready (no "press play").
- Big targets, full-width controls, logical tab order, every control labeled for screen readers.
- A persistent ARIA live=polite status line mirrors the orb state in words.
- Mic is the primary input; the typed gr.Textbox exists only as a never-block fallback.
Implementation reality: Gradio theming + a hand-written assets/custom.css. The orb + states are CSS (gradients, keyframes, box-shadow) driven by a CSS class that app.py swaps via Gradio elem_classes/updates. Keep it pure-CSS so a smaller model can build it reliably. Detailed CSS targets are spelled out in AGENT_PROMPT.md.
5. Phased build plan (what the agent does, milestone by milestone)
Each phase ends with a verifiable checkpoint. The agent must not advance until the checkpoint
passes. (Full instructions live in AGENT_PROMPT.md; this is the human-readable map.)
- Phase 0: Verify reality (CRITICAL, do first). Before writing inference code, open each model card (MiniCPM-V-2, VoxCPM, Cohere Transcribe) and confirm the exact load + call API and the exact model IDs. These APIs differ by version. Use the model card, not assumptions. If an ID doesn't resolve, stop and report; do not silently substitute.
- Phase 1: Scaffold. Create files, requirements.txt, .env.example, stub functions that return fake data. Get the Gradio UI rendering with the Iris layout and the 3 tabs. Checkpoint: python app.py launches locally and the UI loads with placeholder responses.
- Phase 2: Modal vision. Implement describe_scene on Modal; deploy; call it from a tiny test script with a sample image. Checkpoint: a real text description comes back for sample_menu.jpg.
- Phase 3: Modal TTS. Implement speak; return WAV bytes; play locally. Checkpoint: a WAV file plays intelligible speech of a known sentence.
- Phase 4: Wire "Describe" end-to-end. Image → vision → TTS → auto-play in the UI. Checkpoint: pick a sample image in the Space, hear it described. This is the minimum valid submission.
- Phase 5: STT + "Ask". Cohere Transcribe → vision → TTS. Checkpoint: speak a question, hear an answer, zero typing.
- Phase 6: "Read Text" + language selector + custom CSS polish. Checkpoint: OCR path works; Hindi TTS works; UI matches the Iris spec.
- Phase 7: Hardening. Cold-start progress, every exception surfaced as gr.Warning, mic/TTS fallbacks, bundle 3 examples. Checkpoint: kill the mic, kill TTS; the app degrades gracefully and never crashes.
- Phase 8: Submission assets. README frontmatter, BLOG.md, DEMO_SCRIPT.md, push to Space.
- Phase 9 (if time): NICE-TO-HAVE. Bounding-box tab, Modal Volume weight cache, GGUF note.
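The Phase 0 rule (stop and report when a model ID does not resolve, never substitute) can be sketched as a small gate. This is a sketch under assumptions: verify_model_ids and hub_resolves are hypothetical helper names, the IDs shown are placeholders to be replaced with the ones on the model cards, and the Hub check needs network access plus the huggingface_hub package.

```python
from typing import Callable, Iterable


def verify_model_ids(model_ids: Iterable[str], resolves: Callable[[str], bool]) -> None:
    """Raise if any ID fails to resolve; never fall back to another model."""
    missing = [m for m in model_ids if not resolves(m)]
    if missing:
        raise RuntimeError("Unresolved model IDs, stop and report: " + ", ".join(missing))


def hub_resolves(model_id: str) -> bool:
    """Check an ID against the Hugging Face Hub (requires network access)."""
    # Imported lazily so the gate itself has no hard dependency.
    from huggingface_hub import HfApi

    try:
        HfApi().model_info(model_id)
        return True
    except Exception:
        # Note: this also treats network or auth errors as "does not resolve".
        return False


# Placeholder IDs: fill these in from the model cards, do not guess them.
CANDIDATE_IDS = ["<vision-model-id>", "<tts-model-id>", "<stt-model-id>"]

# Offline demonstration with a fake resolver that rejects the TTS placeholder.
try:
    verify_model_ids(CANDIDATE_IDS, lambda m: m != "<tts-model-id>")
except RuntimeError as err:
    report = str(err)
    print(report)
```

In a real run the gate would be called as verify_model_ids(ids, hub_resolves) before any inference code is written.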
6. Edge-device / on-device story (a differentiator, used carefully)
This is a claim + roadmap, not a required runtime path. Frame it honestly so it strengthens the entry instead of inviting "it doesn't actually run on a phone" pushback.
- The chosen models are deliberately tiny: MiniCPM-V-2 (2.8B), VoxCPM (~2B), Cohere Transcribe (2B). MiniCPM-V is well known to run on-device when quantized to int4 (GGUF / llama.cpp), including on phones.
- Narrative: "We run on Modal GPU for the hackathon demo, but the entire model stack is small enough to ship offline, on-device, which is exactly what a blind user wants: private, no-connectivity assistance that never sends their medicine labels to a server."
- Artifact (NICE-TO-HAVE): a short README section + a benchmark-style table (model, params, int4 size, target device) and a one-paragraph "On-device roadmap." If time allows, a tiny llama.cpp/GGUF screenshot of MiniCPM-V running locally. If not, keep it as a stated roadmap; do not claim a working phone build you didn't produce.
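For the benchmark-style table, the int4 column can be estimated from parameter count alone: bytes ≈ params × bits per weight / 8. The sketch below applies that to the parameter counts stated in this plan. It is a back-of-envelope estimate, not a measurement: real 4-bit GGUF files also store quantization metadata and are typically somewhat larger, and whether the TTS and STT models quantize well to int4 is an open assumption.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); ignores runtime memory."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# Parameter counts as stated in this plan (VoxCPM is listed as roughly 2B).
MODELS = {"MiniCPM-V-2": 2.8, "VoxCPM": 2.0, "Cohere Transcribe": 2.0}

for name, params in MODELS.items():
    fp16 = weight_size_gb(params, 16)
    int4 = weight_size_gb(params, 4)
    print(f"{name:18s} {params:.1f}B  fp16 ~{fp16:.1f} GB  int4 ~{int4:.1f} GB")

total_int4 = sum(weight_size_gb(p, 4) for p in MODELS.values())
print(f"Whole stack at int4: ~{total_int4:.1f} GB of weights")
```

Presenting these as estimates, clearly labeled, keeps the on-device section consistent with the "claim + roadmap" framing.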
7. Risk register + reality checks (read before building)
| Risk | Likelihood | Mitigation |
|---|---|---|
| Model IDs / APIs differ from the prompt's guesses | High | Phase 0: verify every model card first; never assume the .chat()/.synthesize() signature |
| model.chat(image=None, msgs=[{content:[image, prompt]}]) vs chat(image=image, msgs=[{content:prompt}]) differs by MiniCPM-V version | High | Follow the exact model card for the chosen version; test with one image before wiring UI |
| VoxCPM synthesize API guessed (model.synthesize) | High | Read VoxCPM card; it may need a separate generate call / reference audio. Verify before Phase 3 |
| Cohere Transcribe not a plain transformers ASR pipeline | Med | Verify; if it needs a custom call, wrap it. Keep a typed-question fallback so STT failure never blocks |
| Modal cold start downloads GBs of weights → slow first call | High | Cache weights on a modal.Volume (Phase 9) + a gr.Progress indicator with a "first run ~30s" message |
| Space can't reach Modal (no token) | Med | Put MODAL_TOKEN_ID/SECRET in Space secrets; document in README + .env.example |
| 2.8B VLM description quality too weak | Med | Documented fallback to MiniCPM-V-4_5 (8B); note "Tiny Titan badge forfeited" in README; never silent-swap |
| Trying to build everything → nothing works | High | Strict MUST→SHOULD→NICE ordering (§2); vertical slice first |
| Showing tracebacks to judges | Med | Wrap every stage; gr.Warning only; safe-call helper in utils |
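The "wrap every stage" mitigation can be sketched as one small helper. This is an assumption about how the safe-call helper in utils might look, not its actual implementation: the name safe_call, its parameters, and the message wording are hypothetical. The warn callback is injected so the helper runs without Gradio; in app.py it could be gr.Warning, which displays a message when called inside an event handler.

```python
import logging
from typing import Any, Callable, Optional

log = logging.getLogger("third_eye")


def safe_call(
    stage: str,
    fn: Callable[..., Any],
    *args: Any,
    warn: Optional[Callable[[str], Any]] = None,
    fallback: Any = None,
    **kwargs: Any,
) -> Any:
    """Run one pipeline stage; on any error, log it, warn the user, return fallback."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        # The full traceback goes to the server log, never to the UI.
        log.exception("%s failed", stage)
        if warn is not None:
            warn(f"{stage} is unavailable right now. Please try again.")
        return fallback


# Usage with a stage that always fails and a list standing in for the UI.
shown: list[str] = []


def broken_tts(text: str) -> bytes:
    raise TimeoutError("GPU worker did not respond")


audio = safe_call("Speech", broken_tts, "hello", warn=shown.append, fallback=None)
ok = safe_call("Vision", lambda img: "a menu", b"img", warn=shown.append)
print(audio, ok, shown)
```

Returning a fallback value instead of raising lets the caller decide how to degrade, for example by showing the text answer when no audio comes back.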
8. Master checklist (tick before you submit)
Pipeline
- Webcam capture works in-browser
- Image → MiniCPM-V → text answer works
- Text → VoxCPM → WAV, auto-plays
- Mic → Cohere Transcribe → question text works
- "Describe" / "Ask" / "Read Text" all function end-to-end
- Language selector drives multilingual TTS (≥ English + Hindi)
Resilience
- Cold-start progress indicator shown
- Mic failure → typed gr.Textbox fallback
- TTS failure → large-text output fallback
- Every exception surfaces as gr.Warning, never a raw traceback
- 3 example images bundled and selectable without a webcam
- Full pipeline tested image→STT→VLM→TTS→playback on a fresh load
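The two fallback items in the resilience list (mic failure to typed text, TTS failure to large text) amount to two small decisions. A minimal sketch follows; the function names, the default prompt, and the shape of the returned dict are hypothetical, chosen only to illustrate the rule.

```python
from typing import Optional


def pick_question(
    transcript: Optional[str],
    typed: Optional[str],
    default: str = "Describe what is in front of me.",
) -> str:
    """Prefer the spoken question, then the typed one, then a fixed prompt."""
    for candidate in (transcript, typed):
        if candidate and candidate.strip():
            return candidate.strip()
    return default


def present_answer(answer: str, audio_wav: Optional[bytes]) -> dict:
    """Always return the text; flag large-text mode when TTS produced no audio."""
    return {
        "text": answer,
        "audio": audio_wav,
        "large_text_only": not audio_wav,
    }


# Mic failed (no transcript), so the typed question is used.
question = pick_question(None, "  what is on this label? ")
# TTS failed (no audio), so the UI should fall back to large text.
view = present_answer("Paracetamol 500 mg", None)
print(question, view["large_text_only"])
```

Because neither function raises, the pipeline always has a question to ask and always has something to show.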
Tracks / submission
- README frontmatter copied verbatim (tags include both OpenBMB models + Cohere + tiny-titan + off-brand)
- Custom CSS / Iris UI live (Off-Brand)
- 45-sec demo video recorded (Best Demo)
- Field Notes blog drafted/published (badge)
- Edge/on-device section in README (claim + roadmap, honest)
- Space builds clean and loads on a cold visit
- Param budget noted: 2.8B primary ≤ 4B cap (Tiny Titan)
9. README content plan
Sections, in order:
- Frontmatter: copy verbatim from AGENT_PROMPT.md (title, emoji, sdk, tags).
- What it is: 2 sentences, the pitch.
- Who it's for: blind / low-vision; the zero-typing promise.
- How to use: pick image or webcam → speak → listen. With the 3 examples.
- Models & sizes: table (role, model, params, sponsor); call out 2.8B Tiny Titan.
- Architecture: the diagram from §3, one paragraph.
- On-device / edge: the honest claim + roadmap (§6).
- Accessibility & design: Iris design system, WCAG-AA, voice-first.
- Run it yourself: env vars, Modal deploy, Space secrets.
- Credits / sponsors: OpenBMB, Cohere, Modal, HF.
10. Best-Demo video script (45-60s)
- 0-5s: black screen, eye opens (the orb). Text: "What if your phone could see for you?"
- 5-20s: first-person, eyes closed / blindfold. Hold phone/webcam to a restaurant menu. Tap.
- 20-35s: orb cycles Listening→Seeing→Thinking→Speaking; audio reads the menu aloud, in Hindi.
- 35-50s: quick cuts: medicine label read aloud, street sign read aloud.
- 50-60s: "Third Eye. Built on 2.8B params. Small enough to live on your phone." Logo + orb.
Keep it real: one continuous-feeling take, minimal text, and let the audio answer be the hero moment.
11. How to use these two files
- Read this PLAN.md to hold the strategy in your head.
- Open AGENT_PROMPT.md, copy the whole thing, paste into Codex / OpenCode as the task.
- That prompt makes the smaller model do Phase 0 verification first, then build phase-by-phase with checkpoints, so it can't run ahead and hallucinate model APIs.
- When it reports a model-card mismatch, that's expected; let it adapt to the real API and continue.