	# A1 — "Third Eye" · Master Win Plan
	### HF Build Small Hackathon 2026 · Backyard AI Track
	> This is the strategy and handoff document. The build itself is driven by `AGENT_PROMPT.md`
	> (written for a smaller model). This file is for you (the human) and doubles as context for any agent.

	---

	## 0. One-line pitch
	Third Eye — point a webcam at anything, speak a question in your own language, and hear the
	answer back. A fully voice-driven "second sight" for blind and low-vision people, running on
	sub-3B sponsor models small enough to live on an edge device.

	Why it wins: it is the rare hackathon entry that is emotionally undeniable (a blind person
	hearing a menu read aloud), technically tidy (tiny models, clean pipeline), and **visually
	striking** (a futuristic voice-first UI that doubles as a WCAG-AA accessibility layer). One
	project can therefore credibly compete in 6+ award tracks at once.

	---

	## 1. Track strategy — how ONE build collects MANY prizes
	The whole point of this plan: do not chase tracks separately. Build one coherent app where each
	feature automatically satisfies a track. Map below — every row must have a concrete artifact.

	| Track / Award | What unlocks it | Concrete artifact we ship |
	|---|---|---|
	| Backyard AI podium | Underserved real-world use-case, real user impact | Accessibility app + a real first-person demo (you, eyes closed, using it) |
	| OpenBMB award | Use OpenBMB models | MiniCPM-V (vision/OCR) + VoxCPM (TTS) = two OpenBMB models |
	| Cohere award | Use a Cohere model | Cohere Transcribe = the STT stage; multilingual = Cohere's sweet spot |
	| Tiny Titan ($1,500) | Primary model ≤ ~4B params | MiniCPM-V-2 primary at 2.8B → under the cap |
	| Best Demo ($1,000) | A 30–60s demo that lands | Scripted video: blind-POV menu read aloud in Hindi |
	| Off-Brand ($1,500) | Custom UI, not stock Gradio | Custom CSS + "Iris" voice-first design system (see §4) |
	| Field Notes (badge) | Public write-up | HF blog post: "What VLM quality really feels like at 2.8B" |
	| Edge / on-device angle | Runs small enough for hardware | GGUF int4 roadmap + on-device claim (see §6) |

	Rule of thumb for the builder: if a feature doesn't move a track forward or keep the core
	pipeline alive, it is out of scope for the hackathon window.

	---

	## 2. Scope discipline (what to build, in order)
	The single biggest risk in a hackathon is a half-working everything. We build a **vertical slice
	first** (one path that works end-to-end), then widen.

	MUST-HAVE (submission is invalid without these):
	1. Webcam image capture in the browser.
	2. Vision pipeline: image + question → text answer (MiniCPM-V on Modal GPU).
	3. TTS: answer text → spoken audio that auto-plays (VoxCPM on Modal GPU).
	4. The "Describe" path working end-to-end without typing.
	5. 3 bundled example images so judges with no webcam can still test.
	6. Graceful failures everywhere (never show a raw traceback).

	SHOULD-HAVE (these win the extra tracks):
	7. "Ask" path: mic → Cohere Transcribe → vision → TTS (the zero-typing loop).
	8. "Read Text" path: fixed OCR prompt.
	9. Custom "Iris" UI + custom CSS (Off-Brand).
	10. Language selector → multilingual TTS (Cohere multilingual story).

	NICE-TO-HAVE (only if time remains):
	11. Bounding-box "zoom & read" tab (`gr.ImageEditor`).
	12. Model-weight caching on a Modal Volume for fast cold starts.
	13. On-device GGUF proof-of-concept note + benchmark table.

	> Ship MUST-HAVE before touching SHOULD-HAVE. Ship SHOULD-HAVE before NICE-TO-HAVE. No exceptions.
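The three paths ("Describe", "Ask", "Read Text") can share one vision call and differ only in the prompt they send. A minimal sketch of that idea follows. The `Mode` enum, `build_prompt`, and both prompt strings are hypothetical names and wording chosen for illustration; the real prompts belong in `AGENT_PROMPT.md`.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    DESCRIBE = "describe"
    ASK = "ask"
    READ_TEXT = "read_text"


# Hypothetical prompt wording, for illustration only.
DESCRIBE_PROMPT = "Describe this scene for a blind person in two or three short sentences."
OCR_PROMPT = "Read out all visible text in this image, in natural reading order."


def build_prompt(mode: Mode, question: Optional[str] = None) -> str:
    """Return the text prompt sent to the vision model for a given path."""
    if mode is Mode.DESCRIBE:
        return DESCRIBE_PROMPT
    if mode is Mode.READ_TEXT:
        return OCR_PROMPT  # "Read Text" always uses the same fixed prompt
    # ASK: the question comes from STT or from the typed fallback.
    cleaned = (question or "").strip()
    # An empty question degrades to a plain description instead of blocking the user.
    return cleaned or DESCRIBE_PROMPT
```

Falling back to the description prompt on an empty question is one possible design choice; raising a user-facing warning instead would be equally consistent with the plan.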

	---

	## 3. Architecture
	```
	┌───────────────────────────── HF Space (Gradio 5.x) ──────────────────────────────┐
	│ Browser: webcam frame + mic audio + language choice                              │
	│    │                                                                             │
	│    ▼                                                                             │
	│ app.py ──(audio bytes)──► Cohere Transcribe (STT, 2B) ── runs on Modal ───────┐  │
	│    │ ◄──────────────────── question text ─────────────────────────────────────┘  │
	│    │                                                                             │
	│    ├──(image bytes + question)──► MiniCPM-V (2.8B vision/OCR) ── Modal A10G ──┐  │
	│    │ ◄──────────────────── answer text ───────────────────────────────────────┘  │
	│    │                                                                             │
	│    ├──(answer text + lang)──► VoxCPM (2B TTS) ── Modal A10G ──────────────────┐  │
	│    │ ◄──────────────────── WAV bytes ─────────────────────────────────────────┘  │
	│    ▼                                                                             │
	│ auto-playing audio + large-text transcript + ARIA live announcements             │
	└──────────────────────────────────────────────────────────────────────────────────┘
	```

	- Frontend / orchestration: Gradio 5.x `gr.Blocks` on a Hugging Face Space.
	- Heavy compute: Modal serverless GPU (A10G). Vision, STT, and TTS all run as Modal functions.
	- Why split this way: HF Spaces free tier can't comfortably hold 3 models hot; Modal gives
	on-demand GPU and keeps the Space light. The Space holds the token to call Modal.
	- No other cloud APIs. Only sponsor models. This keeps "off-grid-ish" eligibility intact
	(note in README that Modal is infra, not a model provider).
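The orchestration that `app.py` performs for the "Ask" path can be sketched as a plain function with the three remote stages passed in as callables. This is a sketch under assumptions, not the project's actual code: `run_ask_pipeline`, `PipelineResult`, and the parameter names are hypothetical, and in the real app the three callables would wrap the Modal functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    question: str
    answer: str
    audio_wav: Optional[bytes]  # None when TTS failed; the UI then shows large text only


def run_ask_pipeline(
    image: bytes,
    mic_audio: Optional[bytes],
    typed_question: str,
    language: str,
    transcribe: Callable[[bytes], str],
    answer_question: Callable[[bytes, str], str],
    speak: Callable[[str, str], bytes],
) -> PipelineResult:
    """Run STT -> vision -> TTS, degrading instead of raising where the plan allows it."""
    question = typed_question.strip()
    if mic_audio:
        try:
            # Prefer the spoken question; keep the typed one if STT returns nothing.
            question = transcribe(mic_audio).strip() or question
        except Exception:
            pass  # STT failure: fall back to the typed question
    if not question:
        raise ValueError("no question: speak or type one")

    # Vision is the one stage with no fallback, so its errors propagate to the caller.
    answer = answer_question(image, question)

    try:
        audio = speak(answer, language)
    except Exception:
        audio = None  # TTS failure: fall back to the large-text transcript
    return PipelineResult(question, answer, audio)
```

Passing the stages in as callables keeps the function testable with stubs before any Modal deployment exists, which matches the scaffold-first ordering of the build.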

	### Files
	```
	third-eye/
	├── app.py # Gradio UI + pipeline orchestration
	├── modal_backend.py # Modal app: vision + TTS (+ STT) functions, GPU
	├── cohere_stt.py # Cohere Transcribe wrapper (called via Modal)
	├── utils.py # image<->bytes, bytes<->wav helpers, safe-call wrapper
	├── requirements.txt
	├── .env.example # MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, HF_TOKEN
	├── README.md # HF Space frontmatter + story + edge-device section
	├── BLOG.md # Field Notes draft (publish to HF blog)
	├── DEMO_SCRIPT.md # 45-sec shot list for Best Demo
	└── assets/
	    ├── custom.css # "Iris" design system, WCAG-AA, prefers-reduced-motion
	    ├── sample_menu.jpg
	    ├── sample_label.jpg
	    └── sample_sign.jpg
	```
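The "bytes<->wav helpers" listed for `utils.py` could look roughly like the following, using only the standard library. This sketch assumes mono 16-bit PCM audio; the function names and the 16 kHz default are hypothetical, and the sample rate VoxCPM actually produces must be taken from its model card.

```python
import io
import struct
import wave


def pcm16_to_wav_bytes(samples: list, sample_rate: int = 16000) -> bytes:
    """Pack mono 16-bit PCM samples into an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def wav_bytes_to_pcm16(data: bytes) -> tuple:
    """Return (sample_rate, samples) from mono 16-bit WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit WAV")
        frames = wav.readframes(wav.getnframes())
        rate = wav.getframerate()
    return rate, list(struct.unpack(f"<{len(frames) // 2}h", frames))
```

Keeping audio as plain `bytes` across the Space and Modal boundary avoids shipping array types between processes; decoding happens only where playback or inspection needs it.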

	---

	## 4. Futuristic UI — the "Iris" design system (Off-Brand track)
	The narrative that wins Off-Brand: *"We designed for people who can't see the screen — and the
	visual layer still looks like it's from 2035."* Accessibility constraints (high contrast, huge
	targets, clear focus) are turned into a deliberate, futuristic aesthetic, not an afterthought.

	Concept: the app is an eye. A single glowing iris orb sits center-stage and changes
	state as the pipeline runs. The user never hunts for controls — there is essentially one big
	action, voice-first.

	State machine (the orb visibly + audibly reflects each):
	`IDLE` (slow breathing glow) → `LISTENING` (reactive ring / waveform) → `SEEING` (scan-line sweeps
	the captured frame) → `THINKING` (orb tightens, pulses faster) → `SPEAKING` (waveform from the
	orb, audio auto-plays) → back to `IDLE`. Each transition fires an ARIA live announcement and a
	short audio cue, so a blind user tracks state by ear.
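The state cycle above can be sketched as a small transition table. Two things here are assumptions rather than part of the plan: the announcement wording is hypothetical, and the sketch allows `IDLE → SEEING` directly (for the "Describe" path, which has no mic step) and lets any state fall back to `IDLE` on error.

```python
from enum import Enum


class OrbState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Forward transitions in the order the plan lists them.
# IDLE -> SEEING is an assumption for paths that skip the microphone.
ALLOWED = {
    OrbState.IDLE: {OrbState.LISTENING, OrbState.SEEING},
    OrbState.LISTENING: {OrbState.SEEING},
    OrbState.SEEING: {OrbState.THINKING},
    OrbState.THINKING: {OrbState.SPEAKING},
    OrbState.SPEAKING: set(),
}

# Hypothetical wording for the ARIA live region.
ANNOUNCEMENTS = {
    OrbState.IDLE: "Ready.",
    OrbState.LISTENING: "Listening.",
    OrbState.SEEING: "Looking at the image.",
    OrbState.THINKING: "Thinking.",
    OrbState.SPEAKING: "Speaking the answer.",
}


def transition(current: OrbState, target: OrbState) -> tuple:
    """Return (new_state, announcement); any state may return to IDLE."""
    if target is not OrbState.IDLE and target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target, ANNOUNCEMENTS[target]
```

Rejecting illegal jumps in one place makes it harder for the UI and the spoken status to drift apart, which matters most for a user who tracks state by ear.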

	Visual language:
	- Void background `#06070A` with a faint radial vignette + subtle film grain.
	- Accent: indigo→cyan gradient (`#5B7CFA → #3DE0FF`). Glow via layered `box-shadow`.
	- Text: `#F5F7FA`, base 20px, outputs 24px+, line-height 1.7. Contrast ≥ WCAG AA.
	- Surfaces: glassmorphism panels — `backdrop-filter: blur`, 1px hairline border, soft inner glow.
	- Primary control: one large circular "👁 Tap / Speak" button, min 96px target, neon focus ring.
	- Motion: breathing/pulse on the orb; scan-line on image; ALL motion is switched off under
	`@media (prefers-reduced-motion: reduce)`.
	- Focus rings: thick cyan glow — serves keyboard users and the futuristic look at once.
	- Typography: Inter / system stack; tabular, calm, generous spacing.
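The "Contrast ≥ WCAG AA" requirement can be checked numerically instead of by eye. The sketch below implements the WCAG 2.x relative-luminance and contrast-ratio formulas and applies them to the text and background colors listed above. The function names are my own; AA asks for at least 4.5:1 for normal-size text.

```python
def _channel_to_linear(value_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light, per the WCAG 2.x definition."""
    c = value_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return (
        0.2126 * _channel_to_linear(r)
        + 0.7152 * _channel_to_linear(g)
        + 0.0722 * _channel_to_linear(b)
    )


def contrast_ratio(fg: str, bg: str) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) up to 21.0 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


AA_NORMAL_TEXT = 4.5  # WCAG AA threshold for normal-size text

ratio = contrast_ratio("#F5F7FA", "#06070A")
print(f"text on void background: {ratio:.1f}:1, AA pass: {ratio >= AA_NORMAL_TEXT}")
```

The same function can be run over the accent colors and focus ring before the CSS is finalized, since glow effects and gradients are where contrast tends to slip.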

	Voice-first interaction rules:
	- Audio answer auto-plays the moment it's ready (no "press play").
	- Big targets, full-width controls, logical tab order, every control labeled for screen readers.
	- A persistent ARIA live region (`aria-live="polite"`) mirrors the orb state in words.
	- Mic is the primary input; the typed `gr.Textbox` exists only as a never-block fallback.

	> Implementation reality: Gradio theming + a hand-written `assets/custom.css`. The orb + states are
	> CSS (gradients, keyframes, `box-shadow`) driven by a CSS class that `app.py` swaps via Gradio
	> `elem_classes`/updates. Keep it pure-CSS so a smaller model can build it reliably. Detailed CSS
	> targets are spelled out in `AGENT_PROMPT.md`.
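One way to keep the orb pure-CSS is for `app.py` to emit a small HTML fragment whose class name encodes the current state, leaving all visuals to `assets/custom.css`. The sketch below is an assumption about how that could look: the class names `iris-orb`, `iris-orb--<state>`, and `iris-status` are hypothetical, and the returned string would be used as the value of an HTML component.

```python
from html import escape

ORB_STATES = ("idle", "listening", "seeing", "thinking", "speaking")


def orb_html(state: str, status_text: str) -> str:
    """Build the orb markup for one state; CSS keyed on the class does the animation."""
    if state not in ORB_STATES:
        raise ValueError(f"unknown orb state: {state!r}")
    return (
        # The orb itself is decorative, so it is hidden from screen readers.
        f'<div class="iris-orb iris-orb--{state}" aria-hidden="true"></div>'
        # The status line carries the same information in words.
        f'<p class="iris-status" role="status" aria-live="polite">'
        f"{escape(status_text)}</p>"
    )
```

Whether a screen reader announces the change when the whole fragment is replaced, rather than only its text, depends on the browser and reader, so this is worth testing early with a real screen reader.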

	---

	## 5. Phased build plan (what the agent does, milestone by milestone)
	Each phase ends with a verifiable checkpoint. The agent must not advance until the checkpoint
	passes. (Full instructions live in `AGENT_PROMPT.md`; this is the human-readable map.)

	- Phase 0 — Verify reality (CRITICAL, do first). Before writing inference code, open each model
	card (MiniCPM-V-2, VoxCPM, Cohere Transcribe) and confirm the exact load + call API and the
	exact model IDs. These APIs differ by version. Use the model card, not assumptions. If an ID
	doesn't resolve, stop and report — do not silently substitute.
	- Phase 1 — Scaffold. Create files, `requirements.txt`, `.env.example`, stub functions that
	return fake data. Get the Gradio UI rendering with the Iris layout and the 3 tabs. Checkpoint:
	`python app.py` launches locally and the UI loads with placeholder responses.
	- Phase 2 — Modal vision. Implement `describe_scene` on Modal; deploy; call it from a tiny test
	script with a sample image. Checkpoint: a real text description comes back for `sample_menu.jpg`.
	- Phase 3 — Modal TTS. Implement `speak`; return WAV bytes; play locally. Checkpoint: a WAV
	file plays intelligible speech of a known sentence.
	- Phase 4 — Wire "Describe" end-to-end. Image → vision → TTS → auto-play in the UI. Checkpoint:
	pick a sample image in the Space, hear it described. This is the minimum valid submission.
	- Phase 5 — STT + "Ask". Cohere Transcribe → vision → TTS. Checkpoint: speak a question, hear
	an answer, zero typing.
	- Phase 6 — "Read Text" + language selector + custom CSS polish. Checkpoint: OCR path works;
	Hindi TTS works; UI matches the Iris spec.
	- Phase 7 — Hardening. Cold-start progress, every-exception `gr.Warning`, mic/TTS fallbacks,
	bundle 3 examples. Checkpoint: kill the mic, kill TTS — app degrades gracefully, never crashes.
	- Phase 8 — Submission assets. README frontmatter, BLOG.md, DEMO_SCRIPT.md, push to Space.
	- Phase 9 (if time) — NICE-TO-HAVE. Bounding-box tab, Modal Volume weight cache, GGUF note.
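The "do not advance until the checkpoint passes" rule can be sketched as a tiny gate runner. This is an illustration only: `run_phases` and the idea of expressing each checkpoint as a zero-argument function returning `True` or `False` are assumptions, not something `AGENT_PROMPT.md` is known to contain.

```python
from typing import Callable, List, Sequence, Tuple


def run_phases(phases: Sequence[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run checkpoints in order and stop at the first one that fails or raises.

    Returns the names of the phases whose checkpoints passed.
    """
    passed = []
    for name, checkpoint in phases:
        try:
            ok = bool(checkpoint())
        except Exception:
            ok = False  # a crashing checkpoint counts as a failed checkpoint
        if not ok:
            break
        passed.append(name)
    return passed
```

For example, a Phase 3 checkpoint could be a function that calls `speak` with a known sentence and returns whether the result starts with a WAV header; later phases are never attempted while that returns `False`.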

	---

	## 6. Edge-device / on-device story (a differentiator, used carefully)
	This is a claim + roadmap, not a required runtime path — frame it honestly so it strengthens
	the entry instead of inviting "it doesn't actually run on a phone" pushback.

	- The chosen models are deliberately tiny: MiniCPM-V-2 (2.8B), VoxCPM (~2B), Cohere Transcribe (2B).
	MiniCPM-V is well known to run on-device when quantized to int4 (GGUF / llama.cpp), including
	on phones.
	- Narrative: "We run on Modal GPU for the hackathon demo, but the entire model stack is small
	enough to ship offline, on-device — which is exactly what a blind user wants: private,
	no-connectivity assistance that never sends their medicine labels to a server."
	- Artifact (NICE-TO-HAVE): a short README section + a benchmark-style table (model, params,
	int4 size, target device) and a one-paragraph "On-device roadmap." If time allows, a tiny
	llama.cpp/GGUF screenshot of MiniCPM-V running locally. If not, keep it as a stated roadmap —
	do not claim a working phone build you didn't produce.
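For the benchmark-style table, a back-of-the-envelope weight size follows from parameters × bits per weight ÷ 8. The sketch below uses the parameter counts stated in this plan (VoxCPM is given only as "~2B") and treats the result as a lower bound: real GGUF files add quantization scales and often keep some layers at higher precision, so measured sizes will differ. The names `STACK`, `weight_size_gb`, and `size_table` are hypothetical.

```python
# (model, role, parameters in billions) as stated in this plan.
STACK = [
    ("MiniCPM-V-2", "vision/OCR", 2.8),
    ("VoxCPM", "TTS", 2.0),  # the plan gives this only as "~2B"
    ("Cohere Transcribe", "STT", 2.0),
]
TINY_TITAN_CAP_B = 4.0  # primary-model cap from the track table


def weight_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB (10^9 bytes): params * bits / 8."""
    return params_billions * bits_per_weight / 8


def size_table(stack=STACK):
    """Rows of (model, role, params_B, fp16_GB, int4_GB)."""
    return [
        (name, role, params, weight_size_gb(params, 16), weight_size_gb(params, 4))
        for name, role, params in stack
    ]


for name, role, params, fp16_gb, int4_gb in size_table():
    print(f"{name:<18} {role:<11} {params:>4.1f}B  fp16 ~{fp16_gb:.1f} GB  int4 ~{int4_gb:.1f} GB")
```

The estimate covers weights only; activation memory and any runtime overhead come on top, which is one more reason to present the on-device section as a roadmap until a measured build exists.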

	---

	## 7. Risk register + reality checks (read before building)
	| Risk | Likelihood | Mitigation |
	|---|---|---|
	| Model IDs / APIs differ from the prompt's guesses | High | Phase 0: verify every model card first; never assume the `.chat()`/`.synthesize()` signature |
	| MiniCPM-V call shape differs by version: `model.chat(image=None, msgs=[{content:[image, prompt]}])` vs `chat(image=image, msgs=[{content:prompt}])` | High | Follow the exact model card for the chosen version; test with one image before wiring UI |
	| VoxCPM synthesize API guessed (`model.synthesize`) | High | Read VoxCPM card; it may need a separate generate call / reference audio. Verify before Phase 3 |
	| Cohere Transcribe not a plain `transformers` ASR pipeline | Med | Verify; if it needs a custom call, wrap it. Keep a typed-question fallback so STT failure never blocks |
	| Modal cold start downloads GBs of weights → slow first call | High | Cache weights on a `modal.Volume` (Phase 9) + a `gr.Progress` indicator with a description such as "first run ~30s" |
	| Space can't reach Modal (no token) | Med | Put `MODAL_TOKEN_ID/SECRET` in Space secrets; document in README + `.env.example` |
	| 2.8B VLM description quality too weak | Med | Documented fallback to MiniCPM-V-4_5 (8B); note "Tiny Titan badge forfeited" in README; never silent-swap |
	| Trying to build everything → nothing works | High | Strict MUST→SHOULD→NICE ordering (§2); vertical slice first |
	| Showing tracebacks to judges | Med | Wrap every stage; `gr.Warning` only; safe-call helper in utils |
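The "safe-call helper in utils" mentioned in the last row could be a decorator along these lines. It is a sketch under assumptions: `safe_call` and its parameters are hypothetical, and the `warn` callback defaults to the logger so the example runs without Gradio; in the app, `gr.Warning` would be passed as `warn`.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger("third_eye")


def safe_call(fallback: Any, user_message: str,
              warn: Callable[[str], None] = logger.warning):
    """Decorator: on any exception, log it, show a friendly message, return `fallback`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # The full traceback goes to the server log, never to the user.
                logger.exception("stage %s failed", fn.__name__)
                warn(user_message)
                return fallback
        return wrapper
    return decorator
```

A TTS stage wrapped with `fallback=None` lets the UI detect the failure and switch to the large-text transcript, while the user only ever hears or sees the friendly message.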

	---

	## 8. Master checklist (tick before you submit)
	Pipeline
	- [ ] Webcam capture works in-browser
	- [ ] Image → MiniCPM-V → text answer works
	- [ ] Text → VoxCPM → WAV, auto-plays
	- [ ] Mic → Cohere Transcribe → question text works
	- [ ] "Describe" / "Ask" / "Read Text" all function end-to-end
	- [ ] Language selector drives multilingual TTS (≥ English + Hindi)

	Resilience
	- [ ] Cold-start progress indicator shown
	- [ ] Mic failure → typed `gr.Textbox` fallback
	- [ ] TTS failure → large-text output fallback
	- [ ] Every exception surfaces as `gr.Warning`, never a raw traceback
	- [ ] 3 example images bundled and selectable without a webcam
	- [ ] Full pipeline tested image→STT→VLM→TTS→playback on a fresh load

	Tracks / submission
	- [ ] README frontmatter copied verbatim (tags include both OpenBMB models + Cohere + tiny-titan + off-brand)
	- [ ] Custom CSS / Iris UI live (Off-Brand)
	- [ ] 45-sec demo video recorded (Best Demo)
	- [ ] Field Notes blog drafted/published (badge)
	- [ ] Edge/on-device section in README (claim + roadmap, honest)
	- [ ] Space builds clean and loads on a cold visit
	- [ ] Param budget noted: 2.8B primary ≤ 4B cap (Tiny Titan)

	---

	## 9. README content plan
	Sections, in order:
	1. Frontmatter — copy verbatim from `AGENT_PROMPT.md` (title, emoji, sdk, tags).
	2. What it is — 2 sentences, the pitch.
	3. Who it's for — blind / low-vision; the zero-typing promise.
	4. How to use — pick image or webcam → speak → listen. With the 3 examples.
	5. Models & sizes — table (role, model, params, sponsor); call out 2.8B Tiny Titan.
	6. Architecture — the diagram from §3, one paragraph.
	7. On-device / edge — the honest claim + roadmap (§6).
	8. Accessibility & design — Iris design system, WCAG-AA, voice-first.
	9. Run it yourself — env vars, Modal deploy, Space secrets.
	10. Credits / sponsors — OpenBMB, Cohere, Modal, HF.

	---

	## 10. Best-Demo video script (45–60s)
	1. 0–5s: black screen, eye opens (the orb). Text: "What if your phone could see for you?"
	2. 5–20s: first-person, eyes closed / blindfold. Hold phone/webcam to a restaurant menu. Tap.
	3. 20–35s: orb cycles Listening→Seeing→Thinking→Speaking; audio reads the menu aloud — in Hindi.
	4. 35–50s: quick cuts — medicine label read aloud, street sign read aloud.
	5. 50–60s: "Third Eye. Built on a 2.8B-param vision model. Small enough to live on your phone." Logo + orb.

	Keep it real: one continuous-feeling take, minimal text, and let the audio answer be the hero moment.

	---

	## 11. How to use these two files
	1. Read this `PLAN.md` to hold the strategy in your head.
	2. Open `AGENT_PROMPT.md`, copy the whole thing, paste into Codex / OpenCode as the task.
	3. That prompt makes the smaller model do Phase 0 verification first, then build phase-by-phase
	with checkpoints — so it can't run ahead and hallucinate model APIs.
	4. When it reports a model-card mismatch, that's expected — let it adapt to the real API and continue.