# A1 — "Third Eye" · Master Win Plan

### HF Build Small Hackathon 2026 · Backyard AI Track

> This is the strategy + handoff document. The build itself is driven by `AGENT_PROMPT.md`
> (written for a smaller model). This file is for **you** (the human) + as context for any agent.

---
## 0. One-line pitch

**Third Eye** — point a webcam at anything, speak a question in your own language, and hear the
answer back. A fully voice-driven "second sight" for blind and low-vision people, running on
sub-3B sponsor models small enough to live on an edge device.

Why it wins: it is the rare hackathon entry that is **emotionally undeniable** (a blind person
hearing a menu read aloud), **technically tidy** (tiny models, clean pipeline), and **visually
striking** (a futuristic voice-first UI that doubles as a WCAG-AA accessibility layer). That one
project legitimately reaches across **6+ award tracks** at once.

---
## 1. Track strategy — how ONE build collects MANY prizes

The whole point of this plan: do not chase tracks separately. Build one coherent app where each
feature *automatically* satisfies a track. Map below — every row must have a concrete artifact.

| Track / Award | What unlocks it | Concrete artifact we ship |
|---|---|---|
| **Backyard AI podium** | Underserved real-world use-case, real user impact | Accessibility app + a real first-person demo (you, eyes closed, using it) |
| **OpenBMB award** | Use OpenBMB models | MiniCPM-V (vision/OCR) **+** VoxCPM (TTS) = two OpenBMB models |
| **Cohere award** | Use a Cohere model | Cohere Transcribe = the STT stage; multilingual = Cohere's sweet spot |
| **Tiny Titan ($1,500)** | Primary model ≤ ~4B params | MiniCPM-V-2 primary at 2.8B → under the cap |
| **Best Demo ($1,000)** | A 30–60s demo that lands | Scripted video: blind-POV menu read aloud in Hindi |
| **Off-Brand ($1,500)** | Custom UI, not stock Gradio | Custom CSS + "Iris" voice-first design system (see §4) |
| **Field Notes (badge)** | Public write-up | HF blog post: "What VLM quality really feels like at 2.8B" |
| **Edge / on-device angle** | Runs small enough for hardware | GGUF int4 roadmap + on-device claim (see §6) |

**Rule of thumb for the builder:** if a feature doesn't move a track forward or keep the core
pipeline alive, it is out of scope for the hackathon window.

---
## 2. Scope discipline (what to build, in order)

The single biggest risk in a hackathon is a half-working everything. We build a **vertical slice
first** (one path that works end-to-end), then widen.

**MUST-HAVE (submission is invalid without these):**

1. Webcam image capture in the browser.
2. Vision pipeline: image + question → text answer (MiniCPM-V on Modal GPU).
3. TTS: answer text → spoken audio that auto-plays (VoxCPM on Modal GPU).
4. The "Describe" path working end-to-end without typing.
5. 3 bundled example images so judges with no webcam can still test.
6. Graceful failures everywhere (never show a raw traceback).

**SHOULD-HAVE (these win the extra tracks):**

7. "Ask" path: mic → Cohere Transcribe → vision → TTS (the zero-typing loop).
8. "Read Text" path: fixed OCR prompt.
9. Custom "Iris" UI + custom CSS (Off-Brand).
10. Language selector → multilingual TTS (Cohere multilingual story).

**NICE-TO-HAVE (only if time remains):**

11. Bounding-box "zoom & read" tab (`gr.ImageEditor`).
12. Model-weight caching on a Modal Volume for fast cold starts.
13. On-device GGUF proof-of-concept note + benchmark table.

> Ship MUST-HAVE before touching SHOULD-HAVE. Ship SHOULD-HAVE before NICE-TO-HAVE. No exceptions.

---
## 3. Architecture

```
┌────────────────────────────── HF Space (Gradio 5.x) ───────────────────────────────┐
│  Browser: webcam frame + mic audio + language choice                               │
│    │                                                                               │
│    ▼                                                                               │
│  app.py ──(audio bytes)──► Cohere Transcribe                                       │
│    │                       (STT, 2B) ─── runs on Modal ──┐                         │
│    │ ◄───────── question text ───────────────────────────┘                         │
│    │                                                                               │
│    ├──(image bytes + question)──► MiniCPM-V (2.8B vision/OCR) ── Modal A10G ──┐    │
│    │ ◄───────────────── answer text ──────────────────────────────────────────┘    │
│    │                                                                               │
│    ├──(answer text + lang)──► VoxCPM (2B TTS) ── Modal A10G ──┐                    │
│    │ ◄───────────────── WAV bytes ────────────────────────────┘                    │
│    ▼                                                                               │
│  auto-playing audio + large-text transcript + ARIA live announcements              │
└────────────────────────────────────────────────────────────────────────────────────┘
```

- **Frontend / orchestration:** Gradio 5.x `gr.Blocks` on a Hugging Face Space.
- **Heavy compute:** Modal serverless GPU (A10G). Vision, STT, and TTS all run as Modal functions.
- **Why split this way:** HF Spaces free tier can't comfortably hold 3 models hot; Modal gives
  on-demand GPU and keeps the Space light. The Space holds the token to call Modal.
- **No other cloud APIs.** Only sponsor models. This keeps "off-grid-ish" eligibility intact
  (note in README that Modal is infra, not a model provider).

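
The flow in the diagram can be sketched as one plain function that receives the three stages as injected callables. This is a minimal sketch under stated assumptions, not the project's actual `app.py`: the names `run_pipeline`, `PipelineResult`, `transcribe`, `describe`, and `speak` are hypothetical stand-ins for the Modal calls, and the default question string is invented for illustration. The real call signatures have to come from the Phase 0 model-card check. A vision failure is deliberately left to propagate so the caller can turn it into a warning.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    question: str           # text actually sent to the vision model
    answer: str             # text answer from the vision model
    audio: Optional[bytes]  # WAV bytes, or None if TTS failed


def run_pipeline(
    image: bytes,
    transcribe: Callable[[bytes], str],     # hypothetical STT stage
    describe: Callable[[bytes, str], str],  # hypothetical vision stage
    speak: Callable[[str, str], bytes],     # hypothetical TTS stage
    audio_question: Optional[bytes] = None,
    typed_question: str = "",
    lang: str = "en",
) -> PipelineResult:
    """Run STT -> vision -> TTS, degrading where the plan allows it."""
    # Assumed default prompt for the "Describe" path (not from the plan).
    question = typed_question or "Describe what is in front of me."
    if audio_question is not None:
        try:
            question = transcribe(audio_question) or question
        except Exception:
            pass  # STT failure never blocks: keep the typed/default question

    # No fallback exists for vision, so errors propagate to the caller.
    answer = describe(image, question)

    try:
        audio = speak(answer, lang)
    except Exception:
        audio = None  # TTS failure: the UI falls back to large text
    return PipelineResult(question, answer, audio)
```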
### Files

```
third-eye/
├── app.py              # Gradio UI + pipeline orchestration
├── modal_backend.py    # Modal app: vision + TTS (+ STT) functions, GPU
├── cohere_stt.py       # Cohere Transcribe wrapper (called via Modal)
├── utils.py            # image<->bytes, bytes<->wav helpers, safe-call wrapper
├── requirements.txt
├── .env.example        # MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, HF_TOKEN
├── README.md           # HF Space frontmatter + story + edge-device section
├── BLOG.md             # Field Notes draft (publish to HF blog)
├── DEMO_SCRIPT.md      # 45-sec shot list for Best Demo
└── assets/
    ├── custom.css      # "Iris" design system, WCAG-AA, prefers-reduced-motion
    ├── sample_menu.jpg
    ├── sample_label.jpg
    └── sample_sign.jpg
```

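
The "bytes<->wav helpers" listed for `utils.py` could look roughly like the sketch below, which uses only the standard-library `wave` module. The function names and the 16 kHz, mono, 16-bit defaults are assumptions for illustration. The real sample rate and sample format depend on what the TTS model actually returns, which the plan leaves to Phase 0.

```python
import io
import wave
from typing import Tuple


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample (2 = 16-bit)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def wav_bytes_to_pcm(data: bytes) -> Tuple[bytes, int]:
    """Return (raw PCM frames, sample rate) read from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.readframes(wav.getnframes()), wav.getframerate()
```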

---

## 4. Futuristic UI — the "Iris" design system (Off-Brand track)

The narrative that wins Off-Brand: *"We designed for people who can't see the screen — and the
visual layer still looks like it's from 2035."* Accessibility constraints (high contrast, huge
targets, clear focus) are turned into a deliberate, futuristic aesthetic, not an afterthought.

**Concept:** the app *is* an eye. A single glowing **iris orb** sits center-stage and changes
state as the pipeline runs. The user never hunts for controls — there is essentially one big
action, voice-first.

**State machine (the orb visibly + audibly reflects each):**
`IDLE` (slow breathing glow) → `LISTENING` (reactive ring / waveform) → `SEEING` (scan-line sweeps
the captured frame) → `THINKING` (orb tightens, pulses faster) → `SPEAKING` (waveform from the
orb, audio auto-plays) → back to `IDLE`. Each transition fires an ARIA live announcement **and** a
short audio cue, so a blind user tracks state by ear.

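
The cycle above can be written down as a small lookup-table state machine. The sketch below is illustrative only: `OrbState`, `advance`, the `orb--<state>` CSS class naming, and the announcement wording are all hypothetical, and it models just the linear cycle (a path such as "Describe" may well skip `LISTENING`).

```python
from enum import Enum
from typing import Tuple


class OrbState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Forward transitions, following the cycle described in the plan.
NEXT_STATE = {
    OrbState.IDLE: OrbState.LISTENING,
    OrbState.LISTENING: OrbState.SEEING,
    OrbState.SEEING: OrbState.THINKING,
    OrbState.THINKING: OrbState.SPEAKING,
    OrbState.SPEAKING: OrbState.IDLE,
}

# Hypothetical wording for the ARIA live status line.
ANNOUNCEMENTS = {
    OrbState.IDLE: "Ready.",
    OrbState.LISTENING: "Listening.",
    OrbState.SEEING: "Looking at the image.",
    OrbState.THINKING: "Thinking.",
    OrbState.SPEAKING: "Speaking the answer.",
}


def advance(state: OrbState) -> Tuple[OrbState, str, str]:
    """Return (next state, CSS class for the orb, status text to announce)."""
    nxt = NEXT_STATE[state]
    return nxt, f"orb--{nxt.value}", ANNOUNCEMENTS[nxt]
```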

**Visual language:**

- **Void background** `#06070A` with a faint radial vignette + subtle film grain.
- **Accent:** indigo→cyan gradient (`#5B7CFA → #3DE0FF`). Glow via layered `box-shadow`.
- **Text:** `#F5F7FA`, base **20px**, outputs **24px+**, line-height 1.7. Contrast ≥ WCAG AA.
- **Surfaces:** glassmorphism panels — `backdrop-filter: blur`, 1px hairline border, soft inner glow.
- **Primary control:** one large circular "👁 Tap / Speak" button, min 96px target, neon focus ring.
- **Motion:** breathing/pulse on the orb; scan-line on image; ALL motion is switched off under
  `@media (prefers-reduced-motion: reduce)`.
- **Focus rings:** thick cyan glow — serves keyboard users *and* the futuristic look at once.
- **Typography:** Inter / system stack; tabular, calm, generous spacing.

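
The "Contrast ≥ WCAG AA" target can be checked numerically rather than by eye. The sketch below implements the WCAG 2.x relative-luminance and contrast-ratio formulas. The helper names are my own, and the AA thresholds used are the standard 4.5:1 for normal text and 3:1 for large text.

```python
def _linear(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg: str, bg: str, large_text: bool = False) -> bool:
    """AA needs 4.5:1 for normal text and 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)


# Body text color on the void background from the list above.
print(round(contrast_ratio("#F5F7FA", "#06070A"), 1), passes_aa("#F5F7FA", "#06070A"))
```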

**Voice-first interaction rules:**

- Audio answer **auto-plays** the moment it's ready (no "press play").
- Big targets, full-width controls, logical tab order, every control labeled for screen readers.
- A persistent `aria-live="polite"` status line mirrors the orb state in words.
- Mic is the primary input; the typed `gr.Textbox` exists only as a never-block fallback.

> Implementation reality: Gradio theming + a hand-written `assets/custom.css`. The orb + states are
> CSS (gradients, keyframes, `box-shadow`) driven by a CSS class that `app.py` swaps via Gradio
> `elem_classes`/updates. Keep it pure-CSS so a smaller model can build it reliably. Detailed CSS
> targets are spelled out in `AGENT_PROMPT.md`.

---
## 5. Phased build plan (what the agent does, milestone by milestone)

Each phase ends with a **verifiable** checkpoint. The agent must not advance until the checkpoint
passes. (Full instructions live in `AGENT_PROMPT.md`; this is the human-readable map.)

- **Phase 0 — Verify reality (CRITICAL, do first).** Before writing inference code, open each model
  card (MiniCPM-V-2, VoxCPM, Cohere Transcribe) and confirm the **exact** load + call API and the
  exact model IDs. These APIs differ by version. Use the model card, not assumptions. If an ID
  doesn't resolve, stop and report — do not silently substitute.
- **Phase 1 — Scaffold.** Create files, `requirements.txt`, `.env.example`, stub functions that
  return fake data. Get the Gradio UI rendering with the Iris layout and the 3 tabs
  ("Describe", "Ask", "Read Text"). Checkpoint: `python app.py` launches locally and the UI loads
  with placeholder responses.
- **Phase 2 — Modal vision.** Implement `describe_scene` on Modal; deploy; call it from a tiny test
  script with a sample image. Checkpoint: a real text description comes back for `sample_menu.jpg`.
- **Phase 3 — Modal TTS.** Implement `speak`; return WAV bytes; play locally. Checkpoint: a WAV
  file plays intelligible speech of a known sentence.
- **Phase 4 — Wire "Describe" end-to-end.** Image → vision → TTS → auto-play in the UI. Checkpoint:
  pick a sample image in the Space, hear it described. **This is the minimum valid submission.**
- **Phase 5 — STT + "Ask".** Cohere Transcribe → vision → TTS. Checkpoint: speak a question, hear
  an answer, zero typing.
- **Phase 6 — "Read Text" + language selector + custom CSS polish.** Checkpoint: OCR path works;
  Hindi TTS works; UI matches the Iris spec.
- **Phase 7 — Hardening.** Cold-start progress, every-exception `gr.Warning`, mic/TTS fallbacks,
  bundle 3 examples. Checkpoint: kill the mic, kill TTS — app degrades gracefully, never crashes.
- **Phase 8 — Submission assets.** README frontmatter, BLOG.md, DEMO_SCRIPT.md, push to Space.
- **Phase 9 (if time) — NICE-TO-HAVE.** Bounding-box tab, Modal Volume weight cache, GGUF note.

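
The "do not advance until the checkpoint passes" rule amounts to a gate that stops at the first failing check. A minimal sketch, assuming each checkpoint can be expressed as a function returning `True` or `False`; `run_phases` and the phase tuples are hypothetical and not part of `AGENT_PROMPT.md`:

```python
from typing import Callable, List, Tuple

Phase = Tuple[str, Callable[[], bool]]  # (phase name, checkpoint check)


def run_phases(phases: List[Phase]) -> List[str]:
    """Run checkpoints in order and stop at the first one that fails."""
    passed: List[str] = []
    for name, checkpoint in phases:
        try:
            ok = checkpoint()
        except Exception:
            ok = False  # a crashing checkpoint counts as a failed one
        if not ok:
            break       # later phases are never attempted
        passed.append(name)
    return passed


# Example: Phase 2 fails, so Phase 3 is never reached.
print(run_phases([
    ("Phase 1: scaffold", lambda: True),
    ("Phase 2: Modal vision", lambda: False),
    ("Phase 3: Modal TTS", lambda: True),
]))
```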

---

## 6. Edge-device / on-device story (a differentiator, used carefully)

This is a **claim + roadmap**, not a required runtime path — frame it honestly so it strengthens
the entry instead of inviting "it doesn't actually run on a phone" pushback.

- The chosen models are deliberately tiny: MiniCPM-V-2 (2.8B), VoxCPM (~2B), Cohere Transcribe (2B).
  MiniCPM-V is well known to run **on-device** when quantized to int4 (GGUF / llama.cpp), including
  on phones.
- **Narrative:** "We run on Modal GPU for the hackathon demo, but the entire model stack is small
  enough to ship **offline, on-device** — which is exactly what a blind user wants: private,
  no-connectivity assistance that never sends their medicine labels to a server."
- **Artifact (NICE-TO-HAVE):** a short README section + a benchmark-style table (model, params,
  int4 size, target device) and a one-paragraph "On-device roadmap." If time allows, a tiny
  llama.cpp/GGUF screenshot of MiniCPM-V running locally. If not, keep it as a stated roadmap —
  do **not** claim a working phone build you didn't produce.

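
For the "int4 size" column of that table, a back-of-envelope figure is parameters × 4 bits ÷ 8. The sketch below does only that arithmetic, using the parameter counts quoted in this plan. It is a weights-only lower bound and an estimate, not a measurement: real GGUF files are larger because of quantization scales and tensors kept at higher precision, and whether int4 builds exist for the TTS and STT models is not something this plan establishes.

```python
# Parameter counts as quoted in this plan (VoxCPM is given as "~2B").
MODELS = {
    "MiniCPM-V-2": 2.8e9,
    "VoxCPM": 2.0e9,
    "Cohere Transcribe": 2.0e9,
}


def quantized_size_gb(params: float, bits: int = 4) -> float:
    """Weights-only size in GB (10^9 bytes) at the given bit width."""
    return params * bits / 8 / 1e9


for name, params in MODELS.items():
    print(f"{name:<18} {params / 1e9:>4.1f}B  ~{quantized_size_gb(params):.1f} GB at int4")

total = sum(quantized_size_gb(p) for p in MODELS.values())
print(f"{'Whole stack':<18}        ~{total:.1f} GB at int4 (lower bound)")
```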

---

## 7. Risk register + reality checks (read before building)

| Risk | Likelihood | Mitigation |
|---|---|---|
| Model IDs / APIs differ from the prompt's guesses | **High** | Phase 0: verify every model card first; never assume the `.chat()`/`.synthesize()` signature |
| `model.chat(image=None, msgs=[{content:[image, prompt]}])` vs `chat(image=image, msgs=[{content:prompt}])` differs by MiniCPM-V version | High | Follow the exact model card for the chosen version; test with one image before wiring UI |
| VoxCPM synthesize API guessed (`model.synthesize`) | High | Read VoxCPM card; it may need a separate generate call / reference audio. Verify before Phase 3 |
| Cohere Transcribe not a plain `transformers` ASR pipeline | Med | Verify; if it needs a custom call, wrap it. Keep a typed-question fallback so STT failure never blocks |
| Modal cold start downloads GBs of weights → slow first call | High | Cache weights on a `modal.Volume` (Phase 9) + a `gr.Progress` message such as "first run ~30s" |
| Space can't reach Modal (no token) | Med | Put `MODAL_TOKEN_ID/SECRET` in Space secrets; document in README + `.env.example` |
| 2.8B VLM description quality too weak | Med | Documented fallback to MiniCPM-V-4_5 (8B) — note "Tiny Titan badge forfeited" in README; never silent-swap |
| Trying to build everything → nothing works | High | Strict MUST→SHOULD→NICE ordering (§2); vertical slice first |
| Showing tracebacks to judges | Med | Wrap every stage; `gr.Warning` only; safe-call helper in utils |

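
The "safe-call helper in utils" mentioned in the last row might look like the sketch below. The name `safe_call`, its return shape, and the message wording are assumptions. The idea is that the full traceback goes to the logs while the UI only ever receives a short, friendly string, which `app.py` could then pass to `gr.Warning`.

```python
import logging
from typing import Any, Callable, Optional, Tuple

log = logging.getLogger("third_eye")


def safe_call(stage: str, fn: Callable[..., Any],
              *args: Any, **kwargs: Any) -> Tuple[Optional[Any], Optional[str]]:
    """Run one pipeline stage.

    Returns (result, None) on success or (None, friendly_message) on failure,
    so callers never see, and never display, a raw traceback.
    """
    try:
        return fn(*args, **kwargs), None
    except Exception:
        log.exception("stage %r failed", stage)  # traceback goes to logs only
        return None, f"{stage} is unavailable right now. Please try again."


# Usage: a failing stage yields a message instead of raising.
result, warning = safe_call("Speech", lambda text: 1 / 0, "hello")
print(result, warning)
```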

---

## 8. Master checklist (tick before you submit)

**Pipeline**

- [ ] Webcam capture works in-browser
- [ ] Image → MiniCPM-V → text answer works
- [ ] Text → VoxCPM → WAV, auto-plays
- [ ] Mic → Cohere Transcribe → question text works
- [ ] "Describe" / "Ask" / "Read Text" all function end-to-end
- [ ] Language selector drives multilingual TTS (≥ English + Hindi)

**Resilience**

- [ ] Cold-start progress indicator shown
- [ ] Mic failure → typed `gr.Textbox` fallback
- [ ] TTS failure → large-text output fallback
- [ ] Every exception surfaces as `gr.Warning`, never a raw traceback
- [ ] 3 example images bundled and selectable without a webcam
- [ ] Full pipeline tested image→STT→VLM→TTS→playback on a fresh load

**Tracks / submission**

- [ ] README frontmatter copied verbatim (tags include both OpenBMB models + Cohere + tiny-titan + off-brand)
- [ ] Custom CSS / Iris UI live (Off-Brand)
- [ ] 45-sec demo video recorded (Best Demo)
- [ ] Field Notes blog drafted/published (badge)
- [ ] Edge/on-device section in README (claim + roadmap, honest)
- [ ] Space builds clean and loads on a cold visit
- [ ] Param budget noted: 2.8B primary ≤ 4B cap (Tiny Titan)

---
## 9. README content plan

Sections, in order:

1. **Frontmatter** — copy verbatim from `AGENT_PROMPT.md` (title, emoji, sdk, tags).
2. **What it is** — 2 sentences, the pitch.
3. **Who it's for** — blind / low-vision; the zero-typing promise.
4. **How to use** — pick image or webcam → speak → listen. With the 3 examples.
5. **Models & sizes** — table (role, model, params, sponsor); call out 2.8B Tiny Titan.
6. **Architecture** — the diagram from §3, one paragraph.
7. **On-device / edge** — the honest claim + roadmap (§6).
8. **Accessibility & design** — Iris design system, WCAG-AA, voice-first.
9. **Run it yourself** — env vars, Modal deploy, Space secrets.
10. **Credits / sponsors** — OpenBMB, Cohere, Modal, HF.

---
## 10. Best-Demo video script (45–60s)

1. **0–5s:** black screen, eye opens (the orb). Text: "What if your phone could see for you?"
2. **5–20s:** first-person, eyes closed / blindfold. Hold phone/webcam to a restaurant menu. Tap.
3. **20–35s:** orb cycles Listening→Seeing→Thinking→Speaking; audio reads the menu aloud — **in Hindi**.
4. **35–50s:** quick cuts — medicine label read aloud, street sign read aloud.
5. **50–60s:** "Third Eye. Built on 2.8B params. Small enough to live on your phone." Logo + orb.

Keep it real, one continuous-feeling take, minimal text, let the audio answer be the hero moment.

---
## 11. How to use these two files

1. Read this `PLAN.md` to hold the strategy in your head.
2. Open `AGENT_PROMPT.md`, copy the whole thing, paste into Codex / OpenCode as the task.
3. That prompt makes the smaller model do Phase 0 verification first, then build phase-by-phase
   with checkpoints — so it can't run ahead and hallucinate model APIs.
4. When it reports a model-card mismatch, that's expected — let it adapt to the real API and continue.
| ``` | |