title: Third Eye
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
tags:
- hackathon
- build-small
- backyard-ai
- accessibility
- blind
- qwen2-vl
- openbmb/VoxCPM2
- CohereLabs/cohere-transcribe-03-2026
- multimodal
- voice-assistant
Third Eye
Third Eye is a voice-first visual assistant for blind and low-vision people. Point a camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer without typing.
How to use
- Open Describe, Ask, or Read Text.
- Capture a webcam image, upload one, or select a bundled example.
- Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable.
- Choose English or Chinese, then listen to the answer and read the high-contrast transcript.
The Space starts in mock mode when Modal credentials are absent. Mock mode validates the
complete user interface without uploading images. Real inference activates automatically
when MODAL_TOKEN_ID and MODAL_TOKEN_SECRET are configured.
Models
| Stage | Model | Parameters |
|---|---|---|
| Vision and OCR | Qwen/Qwen2.5-VL-3B-Instruct | 3B |
| Speech recognition | CohereLabs/cohere-transcribe-03-2026 | 2.07B |
| Speech synthesis | openbmb/VoxCPM2 | 2.29B |
The vision model is 3B parameters and stays below the 4B limit. It is bilingual in English and Chinese and has strong document/OCR performance for menus, labels, and signs.
Qwen2.5-VL replaced the earlier openbmb/MiniCPM-V-2. MiniCPM-V-2 pins a legacy
Transformers stack, which cannot coexist with Cohere Transcribe (Transformers 5.4+) in a
single environment. Qwen2.5-VL runs on the same modern Transformers as Cohere, so all
three models share one runtime, which the single-environment ZeroGPU deployment requires.
Architecture
The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration.
Inference is routed through a small backend abstraction (app.infer) with three
interchangeable backends, auto-selected at runtime:
- ZeroGPU (zerogpu_backend.py): all three models run in-process on a Hugging Face ZeroGPU slice via @spaces.GPU. One environment, modern Transformers throughout.
- Modal (modal_backend.py): three separately versioned Modal A10G functions with a shared weight cache. Selected when MODAL_TOKEN_ID / MODAL_TOKEN_SECRET are present.
- Preview (mock): runs the full interface with no GPU and never uploads the image. Active locally when no GPU backend is detected.
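The auto-selection described above could look roughly like the following. This is a minimal sketch, not the code in app.infer: the function names, the use of an importable `spaces` package as the ZeroGPU signal, and the priority order (ZeroGPU, then Modal, then mock) are all assumptions made for illustration.

```python
import importlib.util
import os
from typing import Mapping, Optional

VALID_BACKENDS = ("zerogpu", "modal", "mock")


def has_modal_credentials(env: Mapping[str, str]) -> bool:
    """True when both Modal token variables are set to non-empty values."""
    return bool(env.get("MODAL_TOKEN_ID")) and bool(env.get("MODAL_TOKEN_SECRET"))


def spaces_runtime_available() -> bool:
    """Assumption: an importable `spaces` package signals the ZeroGPU runtime."""
    return importlib.util.find_spec("spaces") is not None


def resolve_backend(env: Optional[Mapping[str, str]] = None,
                    has_spaces: Optional[bool] = None) -> str:
    """Pick a backend name (hypothetical helper, not the project's API)."""
    env = os.environ if env is None else env
    requested = (env.get("THIRD_EYE_BACKEND") or "auto").strip().lower()

    # An explicit choice always wins over detection.
    if requested in VALID_BACKENDS:
        return requested
    if requested != "auto":
        raise ValueError(f"Unknown THIRD_EYE_BACKEND: {requested!r}")

    # Assumed priority for "auto": ZeroGPU, then Modal, then the mock preview.
    if has_spaces is None:
        has_spaces = spaces_runtime_available()
    if has_spaces:
        return "zerogpu"
    if has_modal_credentials(env):
        return "modal"
    return "mock"
```

Passing `env` and `has_spaces` explicitly keeps the decision testable without touching the real environment.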
Accessibility and Iris
Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings, high-contrast glass panels, large targets, reduced-motion support, and a persistent textual status. Its visual state moves through listening, seeing, thinking, and speaking while the same state is exposed as text for screen-reader users.
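One way to keep the visual state and the screen-reader text in step is to derive both from a single enum. The sketch below assumes that approach; the class name, the status wording, and the helper functions are hypothetical and are not taken from the app.

```python
from enum import Enum
from typing import List


class IrisState(Enum):
    """The four states named above, in the order the pipeline visits them."""
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Hypothetical status wording; the real strings may differ.
STATUS_TEXT = {
    IrisState.LISTENING: "Listening to your question",
    IrisState.SEEING: "Looking at the image",
    IrisState.THINKING: "Working out the answer",
    IrisState.SPEAKING: "Speaking the answer",
}


def status_text(state: IrisState) -> str:
    """Text shown in the persistent status line for a given state."""
    return STATUS_TEXT[state]


def state_sequence() -> List[IrisState]:
    """States in pipeline order (Enum iteration follows definition order)."""
    return list(IrisState)
```

Because the status line and the animation both read from `IrisState`, a new state cannot be added visually without also getting a textual label: the lookup would raise `KeyError` otherwise.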
On-device roadmap
The app runs on hosted GPU (ZeroGPU or Modal). It is not a phone build. Qwen2.5-VL ships official GGUF and quantized variants, making an offline visual path technically credible, but VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work. The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured memory, latency, battery, and quality results for the full stack. No on-device runtime is claimed here.
Run locally
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python app.py
Mock mode is automatic without credentials. To force it:
set THIRD_EYE_MOCK=true
python app.py
On Windows, the canonical launcher is:
.\start.ps1
It defaults to 0.0.0.0:7860, and you can override the bind address with
THIRD_EYE_HOST or the port with THIRD_EYE_PORT / PORT.
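On the Python side, the same overrides could be resolved as in the sketch below. It assumes that THIRD_EYE_PORT takes precedence over PORT when both are set; the source does not state the order, and `resolve_bind` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 7860


def resolve_bind(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Return (host, port) from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    host = env.get("THIRD_EYE_HOST") or DEFAULT_HOST

    # Assumed precedence: THIRD_EYE_PORT first, then PORT, then the default.
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or str(DEFAULT_PORT)
    port = int(raw_port)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"Port out of range: {port}")
    return host, port
```

The result can be passed to Gradio's `launch(server_name=host, server_port=port)`.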
Run on Hugging Face ZeroGPU
This Space is built to run all inference in-process on ZeroGPU, with no external GPU service.
- Create a Gradio Space and set its hardware to ZeroGPU in the Space settings.
- Accept access to CohereLabs/cohere-transcribe-03-2026.
- Add an HF_TOKEN Space secret with access to that gated model.
- Push this repo. requirements.txt installs the full model stack; the app auto-detects the spaces runtime and serves live inference (THIRD_EYE_BACKEND=auto).
Models lazy-load on first use, so the first request of each kind is slower while weights
download and warm up. Use the Diagnostics → Pre-load models button to warm them ahead
of a demo. Force a backend explicitly with THIRD_EYE_BACKEND=zerogpu|modal|mock.
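Lazy loading with an explicit warm-up can be sketched with a cached accessor. This is an illustration only: `_load_weights` stands in for the real, slow model download, and `get_model` and `preload_models` are hypothetical names rather than the functions behind the Pre-load models button.

```python
import functools

# Counts how often each stand-in loader runs, to show the cache at work.
LOAD_COUNTS = {"vision": 0, "stt": 0, "tts": 0}


def _load_weights(kind: str) -> str:
    """Stand-in for the real weight download and warm-up (assumption)."""
    LOAD_COUNTS[kind] += 1
    return f"<{kind} model>"


@functools.lru_cache(maxsize=None)
def get_model(kind: str) -> str:
    """Load a model on first use; later calls return the cached object."""
    if kind not in LOAD_COUNTS:
        raise KeyError(f"Unknown model kind: {kind!r}")
    return _load_weights(kind)


def preload_models() -> None:
    """Warm every model ahead of time so no request pays the load cost."""
    for kind in LOAD_COUNTS:
        get_model(kind)
```

Only the first `get_model("vision")` call runs the loader; calling `preload_models()` before a demo moves that cost out of the first user request.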
Deploy the Modal backend
- Accept access to CohereLabs/cohere-transcribe-03-2026.
- Create a Modal secret named third-eye-hf containing HF_TOKEN.
- Authenticate Modal locally.
- Deploy the backend:
modal deploy modal_backend.py
- Add MODAL_TOKEN_ID and MODAL_TOKEN_SECRET as Hugging Face Space secrets.
Run the remote smoke test after deployment:
modal run modal_backend.py --image-path assets/sample_menu.jpg
This creates out.wav after a real vision and TTS pass.
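A quick way to confirm the smoke test produced usable audio is to inspect the file with the standard-library `wave` module. The sketch below assumes out.wav is PCM-encoded, which is what `wave` reads; other encodings would raise `wave.Error`. The helper names and the 0.2 second threshold are illustrative choices.

```python
import wave
from typing import Dict, Union


def describe_wav(path: str) -> Dict[str, Union[int, float]]:
    """Read basic header facts from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "frames": frames,
            "seconds": frames / rate if rate else 0.0,
        }


def has_audio(path: str, min_seconds: float = 0.2) -> bool:
    """True when the file holds at least `min_seconds` of audio (arbitrary threshold)."""
    return describe_wav(path)["seconds"] >= min_seconds


if __name__ == "__main__":
    print(describe_wav("out.wav"))
```

This checks only the duration and format of the file, not whether the speech is intelligible.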
Verification status
- Local mock UI and utility tests can run without cloud credentials.
- Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal).
- Cohere STT additionally requires gated-model access and HF_TOKEN.
- No training is required; all three stages use pretrained weights.
- Exact model calls and constraints are recorded in MODEL_VERIFICATION.md.
Credits
Built with OpenAI Codex.