| title: Third Eye | |
| emoji: "\U0001F441" | |
| colorFrom: indigo | |
| colorTo: blue | |
| sdk: gradio | |
| sdk_version: "5.50.0" | |
| app_file: app.py | |
| pinned: false | |
| tags: | |
| - hackathon | |
| - build-small | |
| - backyard-ai | |
| - accessibility | |
| - blind | |
| - qwen2-vl | |
| - openbmb/VoxCPM2 | |
| - CohereLabs/cohere-transcribe-03-2026 | |
| - multimodal | |
| - voice-assistant | |
| # Third Eye | |
| Third Eye is a voice-first visual assistant for blind and low-vision people. Point a | |
| camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer | |
| without typing. | |
| ## How to use | |
| 1. Open **Describe**, **Ask**, or **Read Text**. | |
| 2. Capture a webcam image, upload one, or select a bundled example. | |
| 3. Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable. | |
| 4. Choose English or Chinese, then listen to the answer and read the high-contrast transcript. | |
| The Space starts in mock mode when no GPU backend is available (no ZeroGPU runtime and no | |
| Modal credentials). Mock mode validates the complete user interface without uploading images. | |
| The Modal backend activates automatically when `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` | |
| are configured. | |
| ## Models | |
| | Stage | Model | Parameters | | |
| |---|---|---:| | |
| | Vision and OCR | `Qwen/Qwen2.5-VL-3B-Instruct` | 3B | | |
| | Speech recognition | `CohereLabs/cohere-transcribe-03-2026` | 2.07B | | |
| | Speech synthesis | `openbmb/VoxCPM2` | 2.29B | | |
| The vision model is 3B parameters and stays below the 4B limit. It is bilingual in | |
| English and Chinese and has strong document/OCR performance for menus, labels, and signs. | |
| `Qwen2.5-VL` replaced the earlier `openbmb/MiniCPM-V-2`. MiniCPM-V-2 pins a legacy | |
| Transformers stack, which cannot coexist with Cohere Transcribe (Transformers 5.4+) in a | |
| single environment. Qwen2.5-VL runs on the same modern Transformers as Cohere, so all | |
| three models share one runtime, which the single-environment ZeroGPU deployment requires. | |
| ## Architecture | |
| The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration. | |
| Inference is routed through a small backend abstraction (`app.infer`) with three | |
| interchangeable backends, auto-selected at runtime: | |
| - **ZeroGPU** (`zerogpu_backend.py`): all three models run in-process on a Hugging Face | |
| ZeroGPU slice via `@spaces.GPU`. One environment, modern Transformers throughout. | |
| - **Modal** (`modal_backend.py`): three separately versioned Modal A10G functions with a | |
| shared weight cache. Selected when `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` are present. | |
| - **Preview (mock)**: runs the full interface with no GPU and never uploads the image. | |
| Active locally when no GPU backend is detected. | |
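The auto-selection described above could look roughly like the sketch below. It is not the code in `app.infer`: the function name `select_backend`, the `spaces_runtime` flag, and the precedence order (explicit mock flag, then explicit `THIRD_EYE_BACKEND`, then ZeroGPU, then Modal, then mock) are assumptions made for illustration. Only the environment variable names come from this README.

```python
from typing import Mapping

VALID_BACKENDS = {"auto", "zerogpu", "modal", "mock"}


def select_backend(env: Mapping[str, str], spaces_runtime: bool) -> str:
    """Return 'zerogpu', 'modal', or 'mock' for the given environment.

    `spaces_runtime` is a hypothetical flag: the caller decides how to detect
    the Hugging Face `spaces` runtime and passes the result in.
    """
    # Assumed precedence: an explicit mock request wins over everything else.
    if env.get("THIRD_EYE_MOCK", "").strip().lower() in {"1", "true", "yes"}:
        return "mock"

    requested = env.get("THIRD_EYE_BACKEND", "auto").strip().lower()
    if requested not in VALID_BACKENDS:
        raise ValueError(f"unknown THIRD_EYE_BACKEND: {requested!r}")
    if requested != "auto":
        return requested  # the user forced a specific backend

    # Auto mode: prefer in-process ZeroGPU, then Modal, then the mock preview.
    if spaces_runtime:
        return "zerogpu"
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return "modal"
    return "mock"
```

Passing the environment in as a mapping, for example `select_backend(os.environ, spaces_runtime=False)`, keeps the decision testable without touching real credentials.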
| ## Accessibility and Iris | |
| Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings, | |
| high-contrast glass panels, large targets, reduced-motion support, and a persistent textual | |
| status. Its visual state moves through listening, seeing, thinking, and speaking while the | |
| same state is exposed as text for screen-reader users. | |
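One way to keep the visual state and the screen-reader text in sync is to derive both from a single enum. The sketch below is illustrative only: `IrisState`, `STATUS_TEXT`, `status_html`, and the status wording are hypothetical, not taken from the app. The four state names come from the paragraph above, and `role="status"` with `aria-live="polite"` is the standard ARIA pattern for announcing status updates.

```python
from enum import Enum
from html import escape


class IrisState(Enum):
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Hypothetical wording; every visual state has a textual equivalent.
STATUS_TEXT = {
    IrisState.LISTENING: "Listening for your question.",
    IrisState.SEEING: "Looking at the image.",
    IrisState.THINKING: "Working out the answer.",
    IrisState.SPEAKING: "Speaking the answer.",
}


def status_html(state: IrisState) -> str:
    """Render the persistent status line as a polite ARIA live region."""
    text = escape(STATUS_TEXT[state])
    # data-state lets CSS drive the animation from the same value.
    return (
        f'<div role="status" aria-live="polite" '
        f'data-state="{state.value}">{text}</div>'
    )
```

Because the markup and the animation hook read the same enum member, the textual status cannot drift from what sighted users see.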
| ## On-device roadmap | |
| The app runs on hosted GPU (ZeroGPU or Modal). It is not a phone build. Qwen2.5-VL ships | |
| official GGUF and quantized variants, making an offline visual path technically credible, but | |
| VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work. | |
| The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured | |
| memory, latency, battery, and quality results for the full stack. No on-device runtime is | |
| claimed here. | |
| ## Run locally | |
| ```bash | |
| python -m venv .venv | |
| source .venv/bin/activate  # Windows: .venv\Scripts\activate | |
| pip install -r requirements.txt | |
| python app.py | |
| ``` | |
| Mock mode is automatic without credentials. To force it: | |
| ```bash | |
| export THIRD_EYE_MOCK=true  # Windows cmd: set THIRD_EYE_MOCK=true | |
| python app.py | |
| ``` | |
| On Windows, the canonical launcher is: | |
| ```powershell | |
| .\start.ps1 | |
| ``` | |
| It defaults to `0.0.0.0:7860`, and you can override the bind address with | |
| `THIRD_EYE_HOST` or the port with `THIRD_EYE_PORT` / `PORT`. | |
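For readers not using `start.ps1`, the same override logic can be sketched in Python. The helper name `resolve_bind` is hypothetical, and the precedence of `THIRD_EYE_PORT` over `PORT` is an assumption. The README only states that either variable can override the port.

```python
from typing import Mapping, Tuple

DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 7860


def resolve_bind(env: Mapping[str, str]) -> Tuple[str, int]:
    """Resolve the bind address and port from environment overrides."""
    host = env.get("THIRD_EYE_HOST") or DEFAULT_HOST
    # Assumed precedence: the app-specific variable wins over the generic PORT.
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or str(DEFAULT_PORT)
    port = int(raw_port)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port
```

The result can be passed to Gradio's `launch(server_name=host, server_port=port)`.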
| ## Run on Hugging Face ZeroGPU | |
| This Space is built to run all inference in-process on ZeroGPU, with no external GPU service. | |
| 1. Create a Gradio Space and set its hardware to **ZeroGPU** in the Space settings. | |
| 2. Accept access to `CohereLabs/cohere-transcribe-03-2026`. | |
| 3. Add an `HF_TOKEN` Space secret with access to that gated model. | |
| 4. Push this repo. `requirements.txt` installs the full model stack; the app | |
| auto-detects the `spaces` runtime and serves live inference (`THIRD_EYE_BACKEND=auto`). | |
| Models lazy-load on first use, so the first request of each kind is slower while weights | |
| download and warm up. Use the **Diagnostics → Pre-load models** button to warm them ahead | |
| of a demo. Force a backend explicitly with `THIRD_EYE_BACKEND=zerogpu|modal|mock`. | |
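The lazy-load and pre-load behavior can be pictured with a small cache like the one below. This is a sketch of the general pattern, not the app's implementation: the class `LazyModels` and its `get` and `preload` methods are hypothetical names, and the loaders stand in for whatever actually builds each model.

```python
import threading
from typing import Any, Callable, Dict, List


class LazyModels:
    """Load each model on first use and reuse it afterwards."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders
        self._cache: Dict[str, Any] = {}
        self._lock = threading.Lock()  # avoid loading the same model twice

    def get(self, name: str) -> Any:
        with self._lock:
            if name not in self._cache:
                # First request of this kind pays the download/warm-up cost.
                self._cache[name] = self._loaders[name]()
            return self._cache[name]

    def preload(self) -> List[str]:
        """Warm every model ahead of time; returns the loaded names, sorted."""
        for name in self._loaders:
            self.get(name)
        return sorted(self._cache)
```

A pre-load button would then only need to call `preload()`, after which every later `get()` is a cache hit.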
| ## Deploy the Modal backend | |
| 1. Accept access to `CohereLabs/cohere-transcribe-03-2026`. | |
| 2. Create a Modal secret named `third-eye-hf` containing `HF_TOKEN`. | |
| 3. Authenticate Modal locally. | |
| 4. Deploy the backend: | |
| ```bash | |
| modal deploy modal_backend.py | |
| ``` | |
| 5. Add `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` as Hugging Face Space secrets. | |
| Run the remote smoke test after deployment: | |
| ```bash | |
| modal run modal_backend.py --image-path assets/sample_menu.jpg | |
| ``` | |
| This creates `out.wav` after a real vision and TTS pass. | |
| ## Verification status | |
| - Local mock UI and utility tests can run without cloud credentials. | |
| - Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal). | |
| - Cohere STT additionally requires gated-model access and `HF_TOKEN`. | |
| - No training is required; all three stages use pretrained weights. | |
| - Exact model calls and constraints are recorded in `MODEL_VERIFICATION.md`. | |
| ## Credits | |
| Built with OpenAI Codex. | |