A1 — "Third Eye" · Master Win Plan

HF Build Small Hackathon 2026 · Backyard AI Track

This is the strategy and handoff document. The build itself is driven by AGENT_PROMPT.md (written for a smaller model). This file is for you (the human) and doubles as context for any agent.

0. One-line pitch

Third Eye — point a webcam at anything, speak a question in your own language, and hear the answer back. A fully voice-driven "second sight" for blind and low-vision people, running on sub-3B sponsor models small enough to live on an edge device.

Why it wins: it is the rare hackathon entry that is emotionally undeniable (a blind person hearing a menu read aloud), technically tidy (tiny models, clean pipeline), and visually striking (a futuristic voice-first UI that doubles as a WCAG-AA accessibility layer). One project credibly reaches six or more award tracks at once.

1. Track strategy — how ONE build collects MANY prizes

The whole point of this plan: do not chase tracks separately. Build one coherent app where each feature automatically satisfies a track. Map below — every row must have a concrete artifact.

| Track / Award | What unlocks it | Concrete artifact we ship |
| --- | --- | --- |
| Backyard AI podium | Underserved real-world use-case, real user impact | Accessibility app + a real first-person demo (you, eyes closed, using it) |
| OpenBMB award | Use OpenBMB models | MiniCPM-V (vision/OCR) + VoxCPM (TTS) = two OpenBMB models |
| Cohere award | Use a Cohere model | Cohere Transcribe = the STT stage; multilingual = Cohere's sweet spot |
| Tiny Titan ($1,500) | Primary model ≤ ~4B params | MiniCPM-V-2 primary at 2.8B → under the cap |
| Best Demo ($1,000) | A 30–60s demo that lands | Scripted video: blind-POV menu read aloud in Hindi |
| Off-Brand ($1,500) | Custom UI, not stock Gradio | Custom CSS + "Iris" voice-first design system (see §4) |
| Field Notes (badge) | Public write-up | HF blog post: "What VLM quality really feels like at 2.8B" |
| Edge / on-device angle | Runs small enough for hardware | GGUF int4 roadmap + on-device claim (see §6) |

Rule of thumb for the builder: if a feature doesn't move a track forward or keep the core pipeline alive, it is out of scope for the hackathon window.

2. Scope discipline (what to build, in order)

The single biggest risk in a hackathon is a half-working everything. We build a vertical slice first (one path that works end-to-end), then widen.

MUST-HAVE (submission is invalid without these):

1. Webcam image capture in the browser.
2. Vision pipeline: image + question → text answer (MiniCPM-V on Modal GPU).
3. TTS: answer text → spoken audio that auto-plays (VoxCPM on Modal GPU).
4. The "Describe" path working end-to-end without typing.
5. 3 bundled example images so judges with no webcam can still test.
6. Graceful failures everywhere (never show a raw traceback).

SHOULD-HAVE (these win the extra tracks):

7. "Ask" path: mic → Cohere Transcribe → vision → TTS (the zero-typing loop).
8. "Read Text" path: fixed OCR prompt.
9. Custom "Iris" UI + custom CSS (Off-Brand).
10. Language selector → multilingual TTS (Cohere multilingual story).

NICE-TO-HAVE (only if time remains):

11. Bounding-box "zoom & read" tab (gr.ImageEditor).
12. Model-weight caching on a Modal Volume for fast cold starts.
13. On-device GGUF proof-of-concept note + benchmark table.

Ship MUST-HAVE before touching SHOULD-HAVE. Ship SHOULD-HAVE before NICE-TO-HAVE. No exceptions.

3. Architecture

            ┌──────────────────────────── HF Space (Gradio 5.x) ────────────────────────────┐
            │  Browser: webcam frame + mic audio + language choice                            │
            │      │                                                                          │
            │      ▼                                                                           │
            │  app.py  ──(audio bytes)──►  Cohere Transcribe                                   │
            │      │                         (STT, 2B)  ─── runs on Modal ──┐                  │
            │      │  ◄───────── question text ─────────────────────────────┘                  │
            │      │                                                                          │
            │      ├──(image bytes + question)──►  MiniCPM-V (2.8B vision/OCR) ── Modal A10G ──┐ │
            │      │  ◄───────────────── answer text ─────────────────────────────────────────┘ │
            │      │                                                                          │
            │      ├──(answer text + lang)──►  VoxCPM (2B TTS) ── Modal A10G ──┐                │
            │      │  ◄───────────────── WAV bytes ───────────────────────────┘                │
            │      ▼                                                                          │
            │  auto-playing audio  +  large-text transcript  +  ARIA live announcements       │
            └─────────────────────────────────────────────────────────────────────────────────┘

Frontend / orchestration: Gradio 5.x gr.Blocks on a Hugging Face Space.
Heavy compute: Modal serverless GPU (A10G). Vision, STT, and TTS all run as Modal functions.
Why split this way: HF Spaces free tier can't comfortably hold 3 models hot; Modal gives on-demand GPU and keeps the Space light. The Space holds the token to call Modal.
No other cloud APIs. Only sponsor models. This keeps "off-grid-ish" eligibility intact (note in README that Modal is infra, not a model provider).
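
The data flow in the diagram can be sketched as a single orchestration function. This is a minimal sketch, not the project's actual app.py: the stage signatures (`Transcribe`, `Describe`, `Speak`), the `PipelineResult` type, and the default question are all hypothetical, and the stages are passed in as plain callables so the flow can be exercised with fakes before any Modal function exists.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stage signatures; the real Modal functions may differ.
Transcribe = Callable[[bytes], str]      # audio bytes -> question text
Describe = Callable[[bytes, str], str]   # image bytes + question -> answer text
Speak = Callable[[str, str], bytes]      # answer text + language code -> WAV bytes

# Assumed prompt for the "Describe" path, where no question is spoken.
DEFAULT_QUESTION = "Describe what is in front of me."


@dataclass
class PipelineResult:
    question: str
    answer: str
    wav: bytes


def run_pipeline(
    image: bytes,
    lang: str,
    describe: Describe,
    speak: Speak,
    transcribe: Optional[Transcribe] = None,
    audio: Optional[bytes] = None,
) -> PipelineResult:
    """Run optional STT, then vision, then TTS, in the order of the diagram."""
    if transcribe is not None and audio:
        question = transcribe(audio)      # "Ask" path
    else:
        question = DEFAULT_QUESTION       # "Describe" path
    answer = describe(image, question)
    wav = speak(answer, lang)
    return PipelineResult(question=question, answer=answer, wav=wav)
```

Because the stages are injected, Phase 1 stubs and the later Modal-backed functions could share this one entry point, which keeps the "Describe" and "Ask" paths from drifting apart.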

Files

third-eye/
├── app.py              # Gradio UI + pipeline orchestration
├── modal_backend.py    # Modal app: vision + TTS (+ STT) functions, GPU
├── cohere_stt.py       # Cohere Transcribe wrapper (called via Modal)
├── utils.py            # image<->bytes, bytes<->wav helpers, safe-call wrapper
├── requirements.txt
├── .env.example        # MODAL_TOKEN_ID, MODAL_TOKEN_SECRET, HF_TOKEN
├── README.md           # HF Space frontmatter + story + edge-device section
├── BLOG.md             # Field Notes draft (publish to HF blog)
├── DEMO_SCRIPT.md      # 45-sec shot list for Best Demo
└── assets/
    ├── custom.css      # "Iris" design system, WCAG-AA, prefers-reduced-motion
    ├── sample_menu.jpg
    ├── sample_label.jpg
    └── sample_sign.jpg
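
The bytes<->wav helpers listed for utils.py might look roughly like the following. This is a sketch using only the standard library; the function names are hypothetical, and it assumes mono 16-bit PCM, which may not match what the TTS model actually returns.

```python
import io
import struct
import wave
from typing import List, Tuple


def pcm16_to_wav_bytes(samples: List[int], sample_rate: int = 16000) -> bytes:
    """Pack mono 16-bit PCM samples (ints in [-32768, 32767]) into WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)          # mono (assumption)
        w.setsampwidth(2)          # 2 bytes per sample = 16-bit
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def wav_bytes_to_pcm16(data: bytes) -> Tuple[List[int], int]:
    """Unpack mono 16-bit WAV bytes into (samples, sample_rate)."""
    with wave.open(io.BytesIO(data), "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM WAV")
        n = w.getnframes()
        frames = w.readframes(n)
        return list(struct.unpack(f"<{n}h", frames)), w.getframerate()
```

A round trip through both helpers is a cheap way to check the Phase 3 checkpoint plumbing before real speech is involved.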

4. Futuristic UI — the "Iris" design system (Off-Brand track)

The narrative that wins Off-Brand: "We designed for people who can't see the screen — and the visual layer still looks like it's from 2035." Accessibility constraints (high contrast, huge targets, clear focus) are turned into a deliberate, futuristic aesthetic, not an afterthought.

Concept: the app is an eye. A single glowing iris orb sits center-stage and changes state as the pipeline runs. The user never hunts for controls — there is essentially one big action, voice-first.

State machine (the orb visibly + audibly reflects each): IDLE (slow breathing glow) → LISTENING (reactive ring / waveform) → SEEING (scan-line sweeps the captured frame) → THINKING (orb tightens, pulses faster) → SPEAKING (waveform from the orb, audio auto-plays) → back to IDLE. Each transition fires an ARIA live announcement and a short audio cue, so a blind user tracks state by ear.
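
The orb cycle above can be modelled as a small enum-based state machine. This is an illustrative sketch only: the announcement wording and the `orb--<state>` class naming are hypothetical, and it treats the cycle as strictly linear, whereas a path with no spoken question would presumably skip LISTENING.

```python
from enum import Enum


class OrbState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Linear cycle from the plan; SPEAKING wraps back to IDLE.
CYCLE = list(OrbState)

# Hypothetical wording for the ARIA live status line.
ANNOUNCEMENTS = {
    OrbState.IDLE: "Ready.",
    OrbState.LISTENING: "Listening.",
    OrbState.SEEING: "Looking at the image.",
    OrbState.THINKING: "Thinking.",
    OrbState.SPEAKING: "Speaking the answer.",
}


def next_state(state: OrbState) -> OrbState:
    """Advance one step around the cycle."""
    i = CYCLE.index(state)
    return CYCLE[(i + 1) % len(CYCLE)]


def css_class(state: OrbState) -> str:
    """CSS classes the UI would put on the orb element (hypothetical naming)."""
    return f"orb orb--{state.value}"
```

Keeping the announcement text and the CSS class keyed off the same enum means the visual state and the spoken state cannot disagree.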

Visual language:

Void background #06070A with a faint radial vignette + subtle film grain.
Accent: indigo→cyan gradient (#5B7CFA → #3DE0FF). Glow via layered box-shadow.
Text: #F5F7FA, base 20px, outputs 24px+, line-height 1.7. Contrast ≥ WCAG AA.
Surfaces: glassmorphism panels — backdrop-filter: blur, 1px hairline border, soft inner glow.
Primary control: one large circular "👁 Tap / Speak" button, min 96px target, neon focus ring.
Motion: breathing/pulse on the orb; scan-line on image; ALL motion is switched off under @media (prefers-reduced-motion: reduce).
Focus rings: thick cyan glow — serves keyboard users and the futuristic look at once.
Typography: Inter / system stack; tabular, calm, generous spacing.
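
The "Contrast ≥ WCAG AA" requirement can be checked numerically rather than by eye. The sketch below implements the WCAG 2.x relative-luminance and contrast-ratio formulas; the function names are illustrative, and the thresholds used are the AA values of 4.5:1 for normal text and 3:1 for large text.

```python
def _channel(c8: int) -> float:
    """Linearize one 8-bit sRGB channel per the WCAG 2.x definition."""
    c = c8 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' colour."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _channel(r) + 0.7152 * _channel(g) + 0.0722 * _channel(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio, always >= 1 regardless of argument order."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg: str, bg: str, large_text: bool = False) -> bool:
    """AA needs 4.5:1 for normal text and 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)


if __name__ == "__main__":
    # Palette values taken from the visual language list above.
    for name, fg in [("text", "#F5F7FA"), ("indigo", "#5B7CFA"), ("cyan", "#3DE0FF")]:
        print(name, round(contrast_ratio(fg, "#06070A"), 2), passes_aa(fg, "#06070A"))
```

This only checks flat colours against the flat void background; text sitting on glass panels, gradients, or glow would need to be checked against the worst-case colour behind it.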

Voice-first interaction rules:

Audio answer auto-plays the moment it's ready (no "press play").
Big targets, full-width controls, logical tab order, every control labeled for screen readers.
A persistent aria-live="polite" status line mirrors the orb state in words.
Mic is the primary input; the typed gr.Textbox exists only as a never-block fallback.

Implementation reality: Gradio theming + a hand-written assets/custom.css. The orb + states are CSS (gradients, keyframes, box-shadow) driven by a CSS class that app.py swaps via Gradio elem_classes/updates. Keep it pure-CSS so a smaller model can build it reliably. Detailed CSS targets are spelled out in AGENT_PROMPT.md.

5. Phased build plan (what the agent does, milestone by milestone)

Each phase ends with a verifiable checkpoint. The agent must not advance until the checkpoint passes. (Full instructions live in AGENT_PROMPT.md; this is the human-readable map.)

Phase 0 — Verify reality (CRITICAL, do first). Before writing inference code, open each model card (MiniCPM-V-2, VoxCPM, Cohere Transcribe) and confirm the exact load + call API and the exact model IDs. These APIs differ by version. Use the model card, not assumptions. If an ID doesn't resolve, stop and report — do not silently substitute.
Phase 1 — Scaffold. Create files, requirements.txt, .env.example, stub functions that return fake data. Get the Gradio UI rendering with the Iris layout and the 3 tabs. Checkpoint: python app.py launches locally and the UI loads with placeholder responses.
Phase 2 — Modal vision. Implement describe_scene on Modal; deploy; call it from a tiny test script with a sample image. Checkpoint: a real text description comes back for sample_menu.jpg.
Phase 3 — Modal TTS. Implement speak; return WAV bytes; play locally. Checkpoint: a WAV file plays intelligible speech of a known sentence.
Phase 4 — Wire "Describe" end-to-end. Image → vision → TTS → auto-play in the UI. Checkpoint: pick a sample image in the Space, hear it described. This is the minimum valid submission.
Phase 5 — STT + "Ask". Cohere Transcribe → vision → TTS. Checkpoint: speak a question, hear an answer, zero typing.
Phase 6 — "Read Text" + language selector + custom CSS polish. Checkpoint: OCR path works; Hindi TTS works; UI matches the Iris spec.
Phase 7 — Hardening. Cold-start progress, every-exception gr.Warning, mic/TTS fallbacks, bundle 3 examples. Checkpoint: kill the mic, kill TTS — app degrades gracefully, never crashes.
Phase 8 — Submission assets. README frontmatter, BLOG.md, DEMO_SCRIPT.md, push to Space.
Phase 9 (if time) — NICE-TO-HAVE. Bounding-box tab, Modal Volume weight cache, GGUF note.
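
The "do not advance until the checkpoint passes" rule can be made mechanical. The following is a hypothetical sketch, not something AGENT_PROMPT.md is known to contain: each phase is paired with a checkpoint callable, and the runner stops at the first one that fails or raises.

```python
from typing import Callable, List, Tuple

Checkpoint = Callable[[], bool]


def run_phases(phases: List[Tuple[str, Checkpoint]]) -> List[str]:
    """Return the names of phases whose checkpoints passed, in order.

    Stops at the first checkpoint that returns False or raises, so later
    phases are never attempted on top of a broken earlier one.
    """
    passed: List[str] = []
    for name, check in phases:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            break
        passed.append(name)
    return passed
```

In practice each checkpoint would wrap the test named in the phase, for example "a real text description comes back for sample_menu.jpg" for Phase 2.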

6. Edge-device / on-device story (a differentiator, used carefully)

This is a claim + roadmap, not a required runtime path — frame it honestly so it strengthens the entry instead of inviting "it doesn't actually run on a phone" pushback.

The chosen models are deliberately tiny: MiniCPM-V-2 (2.8B), VoxCPM (~2B), Cohere Transcribe (2B). MiniCPM-V is well known to run on-device when quantized to int4 (GGUF / llama.cpp), including on phones.
Narrative: "We run on Modal GPU for the hackathon demo, but the entire model stack is small enough to ship offline, on-device — which is exactly what a blind user wants: private, no-connectivity assistance that never sends their medicine labels to a server."
Artifact (NICE-TO-HAVE): a short README section + a benchmark-style table (model, params, int4 size, target device) and a one-paragraph "On-device roadmap." If time allows, a tiny llama.cpp/GGUF screenshot of MiniCPM-V running locally. If not, keep it as a stated roadmap — do not claim a working phone build you didn't produce.
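
For the benchmark-style table, a first-pass int4 size column can be estimated from parameter counts alone. This is back-of-envelope arithmetic, not a measurement: it assumes a flat 4 bits per weight, uses the parameter counts quoted in this plan, and ignores quantization scales, tensors kept at higher precision, and runtime memory, so real GGUF files will be larger.

```python
def int4_size_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Lower-bound weight size in GB (1 GB = 1e9 bytes) at a flat bit width."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Parameter counts as stated in this plan (VoxCPM is given as approximate).
MODELS = [
    ("MiniCPM-V-2", 2.8),
    ("VoxCPM", 2.0),
    ("Cohere Transcribe", 2.0),
]

if __name__ == "__main__":
    for name, params in MODELS:
        print(f"{name:<18} {params:>4.1f}B  ~{int4_size_gb(params):.1f} GB at int4")
    total = sum(int4_size_gb(p) for _, p in MODELS)
    print(f"{'total':<18}        ~{total:.1f} GB at int4")
```

Treat these numbers as a floor for the roadmap table and replace them with measured file sizes if a real quantized build is produced.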

7. Risk register + reality checks (read before building)

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Model IDs / APIs differ from the prompt's guesses | High | Phase 0: verify every model card first; never assume the `.chat()`/`.synthesize()` signature |
| Call shape differs by MiniCPM-V version: `model.chat(image=None, msgs=[{content:[image, prompt]}])` vs `chat(image=image, msgs=[{content:prompt}])` | High | Follow the exact model card for the chosen version; test with one image before wiring UI |
| VoxCPM synthesize API is a guess (`model.synthesize`) | High | Read VoxCPM card; it may need a separate generate call / reference audio. Verify before Phase 3 |
| Cohere Transcribe may not be a plain `transformers` ASR pipeline | Med | Verify; if it needs a custom call, wrap it. Keep a typed-question fallback so STT failure never blocks |
| Modal cold start downloads GBs of weights → slow first call | High | Cache weights on a `modal.Volume` (Phase 9) and show a `gr.Progress` update with a description such as "first run ~30s" |
| Space can't reach Modal (no token) | Med | Put `MODAL_TOKEN_ID/SECRET` in Space secrets; document in README + `.env.example` |
| 2.8B VLM description quality too weak | Med | Documented fallback to MiniCPM-V-4_5 (8B); note "Tiny Titan badge forfeited" in README; never silent-swap |
| Trying to build everything → nothing works | High | Strict MUST→SHOULD→NICE ordering (§2); vertical slice first |
| Showing tracebacks to judges | Med | Wrap every stage; `gr.Warning` only; safe-call helper in utils |
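
The "safe-call helper in utils" mitigation could take a shape like the one below. This is a sketch under assumptions: the name `safe_call`, its parameters, and the user-facing message are hypothetical. The `notify` callback is kept generic so the helper runs without Gradio installed; in the app it would presumably be `gr.Warning`.

```python
import logging
from typing import Any, Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("third_eye")


def safe_call(
    stage: str,
    fn: Callable[..., T],
    *args: Any,
    fallback: Optional[T] = None,
    notify: Callable[[str], Any] = lambda message: None,
    **kwargs: Any,
) -> Optional[T]:
    """Run one pipeline stage; on failure log the traceback and return a fallback.

    The full traceback goes to the server log only. The user sees a short,
    friendly message through `notify`, never the exception text.
    """
    try:
        return fn(*args, **kwargs)
    except Exception:
        log.exception("stage %r failed", stage)
        notify(f"{stage} is unavailable right now. Please try again.")
        return fallback
```

For example, `safe_call("Speech", speak, answer, lang, fallback=None, notify=gr.Warning)` would let the UI fall back to large-text output when TTS fails, with `speak`, `answer`, and `lang` standing in for whatever the real app uses.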

8. Master checklist (tick before you submit)

Pipeline

Webcam capture works in-browser
Image → MiniCPM-V → text answer works
Text → VoxCPM → WAV, auto-plays
Mic → Cohere Transcribe → question text works
"Describe" / "Ask" / "Read Text" all function end-to-end
Language selector drives multilingual TTS (≥ English + Hindi)

Resilience

Cold-start progress indicator shown
Mic failure → typed gr.Textbox fallback
TTS failure → large-text output fallback
Every exception surfaces as gr.Warning, never a raw traceback
3 example images bundled and selectable without a webcam
Full pipeline tested image→STT→VLM→TTS→playback on a fresh load
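
The two fallback rules in this list (mic failure falls back to typed text, TTS failure falls back to large text) reduce to a pair of small decisions. The sketch below is illustrative only; the function names, the default prompt, and the returned dictionary shape are hypothetical.

```python
from typing import Dict, Optional


def resolve_question(
    transcript: Optional[str],
    typed: Optional[str],
    default: str = "Describe what is in front of me.",
) -> str:
    """Prefer the spoken question, then the typed fallback, then a default."""
    for candidate in (transcript, typed):
        if candidate and candidate.strip():
            return candidate.strip()
    return default


def present_answer(answer: str, wav: Optional[bytes]) -> Dict[str, object]:
    """Decide what the UI shows: audio when TTS worked, large text always."""
    return {
        "text": answer,                 # transcript is shown in every case
        "audio": wav if wav else None,  # None means no audio player update
        "large_text_only": not wav,     # flag for the TTS-failure fallback
    }
```

Neither function can raise on empty or missing input, which is the property the "kill the mic, kill TTS" checkpoint is testing for.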

Tracks / submission

README frontmatter copied verbatim (tags include both OpenBMB models + Cohere + tiny-titan + off-brand)
Custom CSS / Iris UI live (Off-Brand)
45-sec demo video recorded (Best Demo)
Field Notes blog drafted/published (badge)
Edge/on-device section in README (claim + roadmap, honest)
Space builds clean and loads on a cold visit
Param budget noted: 2.8B primary ≤ 4B cap (Tiny Titan)

9. README content plan

Sections, in order:

Frontmatter — copy verbatim from AGENT_PROMPT.md (title, emoji, sdk, tags).
What it is — 2 sentences, the pitch.
Who it's for — blind / low-vision; the zero-typing promise.
How to use — pick image or webcam → speak → listen. With the 3 examples.
Models & sizes — table (role, model, params, sponsor); call out 2.8B Tiny Titan.
Architecture — the diagram from §3, one paragraph.
On-device / edge — the honest claim + roadmap (§6).
Accessibility & design — Iris design system, WCAG-AA, voice-first.
Run it yourself — env vars, Modal deploy, Space secrets.
Credits / sponsors — OpenBMB, Cohere, Modal, HF.

10. Best-Demo video script (45–60s)

0–5s: black screen, eye opens (the orb). Text: "What if your phone could see for you?"
5–20s: first-person, eyes closed / blindfold. Hold phone/webcam to a restaurant menu. Tap.
20–35s: orb cycles Listening→Seeing→Thinking→Speaking; audio reads the menu aloud — in Hindi.
35–50s: quick cuts — medicine label read aloud, street sign read aloud.
50–60s: "Third Eye. Built on 2.8B params. Small enough to live on your phone." Logo + orb.

Keep it real, one continuous-feeling take, minimal text, let the audio answer be the hero moment.

11. How to use these two files

Read this PLAN.md to hold the strategy in your head.
Open AGENT_PROMPT.md, copy the whole thing, paste into Codex / OpenCode as the task.
That prompt makes the smaller model do Phase 0 verification first, then build phase-by-phase with checkpoints — so it can't run ahead and hallucinate model APIs.
When it reports a model-card mismatch, that's expected — let it adapt to the real API and continue.