---
title: Third Eye
emoji: "\U0001F441"
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
tags:
  - hackathon
  - build-small
  - backyard-ai
  - accessibility
  - blind
  - qwen2-vl
  - openbmb/VoxCPM2
  - CohereLabs/cohere-transcribe-03-2026
  - multimodal
  - voice-assistant
---

# Third Eye

Third Eye is a voice-first visual assistant for blind and low-vision people. Point a camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer without typing.

## How to use

1. Open **Describe**, **Ask**, or **Read Text**.
2. Capture a webcam image, upload one, or select a bundled example.
3. Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable.
4. Choose English or Chinese, then listen to the answer and read the high-contrast transcript.

The app starts in mock mode when no GPU backend is detected. Mock mode exercises the complete user interface without uploading images. Real inference activates automatically on a ZeroGPU Space, or when `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` are configured.

## Models

| Stage | Model | Parameters |
|---|---|---:|
| Vision and OCR | `Qwen/Qwen2.5-VL-3B-Instruct` | 3B |
| Speech recognition | `CohereLabs/cohere-transcribe-03-2026` | 2.07B |
| Speech synthesis | `openbmb/VoxCPM2` | 2.29B |

The vision model has 3B parameters, which keeps it below the 4B limit. It is bilingual in English and Chinese and has strong document/OCR performance for menus, labels, and signs.

`Qwen2.5-VL` replaced the earlier `openbmb/MiniCPM-V-2`. MiniCPM-V-2 pins a legacy Transformers stack, which cannot coexist with Cohere Transcribe (Transformers 5.4+) in a single environment. Qwen2.5-VL runs on the same modern Transformers as Cohere, so all three models share one runtime, which the single-environment ZeroGPU deployment requires.

## Architecture

The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration.
Inference is routed through a small backend abstraction (`app.infer`) with three interchangeable backends, auto-selected at runtime:

- **ZeroGPU** (`zerogpu_backend.py`): all three models run in-process on a Hugging Face ZeroGPU slice via `@spaces.GPU`. One environment, modern Transformers throughout.
- **Modal** (`modal_backend.py`): three separately versioned Modal A10G functions with a shared weight cache. Selected when `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` are present.
- **Preview (mock)**: runs the full interface with no GPU and never uploads the image. Active locally when no GPU backend is detected.

## Accessibility and Iris

Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings, high-contrast glass panels, large targets, reduced-motion support, and a persistent textual status. Its visual state moves through listening, seeing, thinking, and speaking while the same state is exposed as text for screen-reader users.

## On-device roadmap

The app runs on a hosted GPU (ZeroGPU or Modal). It is not a phone build. Qwen2.5-VL ships official GGUF and quantized variants, making an offline visual path technically credible, but VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work. The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured memory, latency, battery, and quality results for the full stack. No on-device runtime is claimed here.

## Run locally

The commands below use Windows Command Prompt syntax. On macOS or Linux, activate the environment with `source .venv/bin/activate` and set variables with `export` instead of `set`.

```bat
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python app.py
```

Mock mode is automatic without credentials. To force it:

```bat
set THIRD_EYE_MOCK=true
python app.py
```

On Windows, the canonical launcher is:

```powershell
.\start.ps1
```

It defaults to `0.0.0.0:7860`, and you can override the bind address with `THIRD_EYE_HOST` or the port with `THIRD_EYE_PORT` / `PORT`.
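
The bind settings described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's launcher code: `resolve_bind` is a hypothetical helper, and giving `THIRD_EYE_PORT` precedence over `PORT` is an assumption, since the order is not stated here.

```python
import os
from typing import Mapping, Optional, Tuple

# Defaults taken from the README text above.
DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 7860


def resolve_bind(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Return (host, port) from environment variables, falling back to defaults.

    Hypothetical helper. Assumption: THIRD_EYE_PORT wins over PORT
    when both are set; empty values are treated as unset.
    """
    env = os.environ if env is None else env
    host = env.get("THIRD_EYE_HOST") or DEFAULT_HOST
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or str(DEFAULT_PORT)
    port = int(raw_port)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


if __name__ == "__main__":
    host, port = resolve_bind()
    print(f"Would bind to {host}:{port}")
```

In a Gradio app, values like these would typically be passed to `launch(server_name=host, server_port=port)`.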
## Run on Hugging Face ZeroGPU

This Space is built to run all inference in-process on ZeroGPU, with no external GPU service.

1. Create a Gradio Space and set its hardware to **ZeroGPU** in the Space settings.
2. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
3. Add an `HF_TOKEN` Space secret with access to that gated model.
4. Push this repo. `requirements.txt` installs the full model stack; the app auto-detects the `spaces` runtime and serves live inference (`THIRD_EYE_BACKEND=auto`).

Models lazy-load on first use, so the first request of each kind is slower while weights download and warm up. Use the **Diagnostics → Pre-load models** button to warm them ahead of a demo.

Force a backend explicitly with `THIRD_EYE_BACKEND=zerogpu|modal|mock`.

## Deploy the Modal backend

1. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
2. Create a Modal secret named `third-eye-hf` containing `HF_TOKEN`.
3. Authenticate Modal locally.
4. Deploy the backend:

   ```bash
   modal deploy modal_backend.py
   ```

5. Add `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` as Hugging Face Space secrets.

Run the remote smoke test after deployment:

```bash
modal run modal_backend.py --image-path assets/sample_menu.jpg
```

This creates `out.wav` after a real vision and TTS pass.

## Verification status

- Local mock UI and utility tests can run without cloud credentials.
- Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal).
- Cohere STT additionally requires gated-model access and `HF_TOKEN`.
- No training is required; all three stages use pretrained weights.
- Exact model calls and constraints are recorded in `MODEL_VERIFICATION.md`.

## Credits

Built with OpenAI Codex.