---
title: Third Eye
emoji: "\U0001F441"
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
tags:
- hackathon
- build-small
- backyard-ai
- accessibility
- blind
- qwen2-vl
- openbmb/VoxCPM2
- CohereLabs/cohere-transcribe-03-2026
- multimodal
- voice-assistant
---
# Third Eye
Third Eye is a voice-first visual assistant for blind and low-vision people. Point a
camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer
without typing.
## How to use
1. Open **Describe**, **Ask**, or **Read Text**.
2. Capture a webcam image, upload one, or select a bundled example.
3. Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable.
4. Choose English or Chinese, then listen to the answer and read the high-contrast transcript.
The Space falls back to mock mode when no GPU backend is available, that is, when it is
not running on ZeroGPU and Modal credentials are absent. Mock mode exercises the complete
user interface without uploading images. Modal inference activates automatically
when `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` are configured.
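
As a rough illustration of what mock mode exercises, the sketch below runs the Ask flow (speech to text, then vision, then text to speech) against a stand-in backend that never reads the image bytes. The names `MockBackend`, `run_ask`, and `Answer`, along with the canned strings, are hypothetical and are not taken from `app.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Answer:
    question: str
    text: str
    audio: bytes


class MockBackend:
    """Hypothetical stand-in: returns canned results and sends nothing anywhere."""

    def transcribe(self, audio: bytes) -> str:
        return "What is on this menu?"

    def describe(self, image: bytes, question: str, language: str) -> str:
        # The image is accepted but never read or uploaded.
        return f"[mock:{language}] Answer to: {question}"

    def speak(self, text: str, language: str) -> bytes:
        return b"MOCK-AUDIO"


def run_ask(backend, image: bytes, audio: Optional[bytes],
            typed: Optional[str], language: str = "en") -> Answer:
    """Ask mode: use the spoken question if present, else the typed fallback."""
    question = backend.transcribe(audio) if audio else (typed or "").strip()
    if not question:
        raise ValueError("Ask mode needs a spoken or typed question")
    text = backend.describe(image, question, language)
    return Answer(question, text, backend.speak(text, language))
```

Because the backend is passed in as an argument, the same `run_ask` function could in principle drive a real GPU backend that exposes the same three methods.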
## Models
| Stage | Model | Parameters |
|---|---|---:|
| Vision and OCR | `Qwen/Qwen2.5-VL-3B-Instruct` | 3B |
| Speech recognition | `CohereLabs/cohere-transcribe-03-2026` | 2.07B |
| Speech synthesis | `openbmb/VoxCPM2` | 2.29B |
The vision model is 3B parameters and stays below the 4B limit. It is bilingual in
English and Chinese and has strong document/OCR performance for menus, labels, and signs.
`Qwen2.5-VL` replaced the earlier `openbmb/MiniCPM-V-2`. MiniCPM-V-2 pins a legacy
Transformers stack, which cannot coexist with Cohere Transcribe (Transformers 5.4+) in a
single environment. Qwen2.5-VL runs on the same modern Transformers as Cohere, so all
three models share one runtime, which is required for the single-environment ZeroGPU deployment.
## Architecture
The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration.
Inference is routed through a small backend abstraction (`app.infer`) with three
interchangeable backends, auto-selected at runtime:
- **ZeroGPU** (`zerogpu_backend.py`): all three models run in-process on a Hugging Face
ZeroGPU slice via `@spaces.GPU`. One environment, modern Transformers throughout.
- **Modal** (`modal_backend.py`): three separately versioned Modal A10G functions with a
shared weight cache. Selected when `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` are present.
- **Preview (mock)**: runs the full interface with no GPU and never uploads the image.
Active locally when no GPU backend is detected.
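
The auto-selection described above could look roughly like the following. This is a sketch under assumptions: the precedence order (explicit `THIRD_EYE_BACKEND`, then `THIRD_EYE_MOCK`, then ZeroGPU, then Modal, then mock), the `select_backend` name, and the `on_zerogpu` flag are illustrative. The actual logic lives in `app.infer` and may differ.

```python
import os
from typing import Mapping, Optional

VALID_BACKENDS = {"zerogpu", "modal", "mock"}
TRUTHY = {"1", "true", "yes", "on"}


def select_backend(env: Optional[Mapping[str, str]] = None,
                   on_zerogpu: bool = False) -> str:
    """Pick a backend name. `on_zerogpu` stands in for runtime detection."""
    env = os.environ if env is None else env

    # 1. An explicit choice wins (assumed precedence).
    forced = (env.get("THIRD_EYE_BACKEND") or "auto").strip().lower()
    if forced in VALID_BACKENDS:
        return forced
    if forced != "auto":
        raise ValueError(f"Unknown THIRD_EYE_BACKEND: {forced!r}")

    # 2. THIRD_EYE_MOCK forces the preview backend.
    if (env.get("THIRD_EYE_MOCK") or "").strip().lower() in TRUTHY:
        return "mock"

    # 3. Prefer in-process ZeroGPU, then Modal when both tokens are set.
    if on_zerogpu:
        return "zerogpu"
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return "modal"

    # 4. No GPU backend detected: fall back to the preview.
    return "mock"
```

Passing the environment in as a mapping keeps the function easy to test without touching real credentials.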
## Accessibility and Iris
Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings,
high-contrast glass panels, large targets, reduced-motion support, and a persistent textual
status. Its visual state moves through listening, seeing, thinking, and speaking while the
same state is exposed as text for screen-reader users.
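
One way to keep the visual state and the screen-reader text in lockstep is to derive both from a single enum. The sketch below is illustrative only: `IrisState`, `render_state`, the status strings, and the `iris-` CSS class prefix are assumptions, not the app's actual code.

```python
from enum import Enum


class IrisState(Enum):
    # The four states named in this README, in pipeline order.
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Hypothetical wording for the persistent textual status.
STATUS_TEXT = {
    IrisState.LISTENING: "Listening to your question.",
    IrisState.SEEING: "Looking at the image.",
    IrisState.THINKING: "Working out the answer.",
    IrisState.SPEAKING: "Speaking the answer.",
}


def render_state(state: IrisState) -> dict:
    """Return both representations from one source of truth."""
    return {
        "css_class": f"iris-{state.value}",  # drives the visual animation
        "status_text": STATUS_TEXT[state],   # announced to screen readers
    }
```

Since both outputs come from the same `state` value, the animation and the announced text cannot drift apart.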
## On-device roadmap
The app runs on a hosted GPU (ZeroGPU or Modal); it is not a phone build. Qwen2.5-VL ships
official GGUF and quantized variants, making an offline visual path technically credible, but
VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work.
The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured
memory, latency, battery, and quality results for the full stack. No on-device runtime is
claimed here.
## Run locally
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
python app.py
```
Mock mode is automatic without credentials. To force it:
```bash
export THIRD_EYE_MOCK=true  # Windows cmd: set THIRD_EYE_MOCK=true
python app.py
```
On Windows, the canonical launcher is:
```powershell
.\start.ps1
```
It defaults to `0.0.0.0:7860`, and you can override the bind address with
`THIRD_EYE_HOST` or the port with `THIRD_EYE_PORT` / `PORT`.
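
For reference, the override behavior can be sketched in Python as below. The function name `resolve_bind` is hypothetical, and the assumption that `THIRD_EYE_PORT` takes precedence over `PORT` is a guess. Check `start.ps1` and `app.py` for the real order.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_bind(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Resolve the host and port, defaulting to 0.0.0.0:7860."""
    env = os.environ if env is None else env
    host = env.get("THIRD_EYE_HOST") or "0.0.0.0"
    # Assumed precedence: THIRD_EYE_PORT first, then PORT, then the default.
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or "7860"
    port = int(raw_port)
    if not 0 < port < 65536:
        raise ValueError(f"Port out of range: {port}")
    return host, port
```

The resulting pair maps onto Gradio's `launch(server_name=host, server_port=port)` arguments.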
## Run on Hugging Face ZeroGPU
This Space is built to run all inference in-process on ZeroGPU, with no external GPU service.
1. Create a Gradio Space and set its hardware to **ZeroGPU** in the Space settings.
2. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
3. Add an `HF_TOKEN` Space secret with access to that gated model.
4. Push this repo. `requirements.txt` installs the full model stack; the app
auto-detects the `spaces` runtime and serves live inference (`THIRD_EYE_BACKEND=auto`).
Models lazy-load on first use, so the first request of each kind is slower while weights
download and warm up. Use the **Diagnostics → Pre-load models** button to warm them ahead
of a demo. Force a backend explicitly with `THIRD_EYE_BACKEND=zerogpu|modal|mock`.
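
The lazy-load and pre-load behavior can be pictured with a small cache. This is a minimal sketch, not the app's code: `get_model`, `preload_all`, and the placeholder `_load_weights` are hypothetical, and a real loader would call into Transformers rather than return a dict.

```python
import time
from functools import lru_cache

# Model IDs from the Models table; the short keys are illustrative.
MODEL_IDS = {
    "vision": "Qwen/Qwen2.5-VL-3B-Instruct",
    "stt": "CohereLabs/cohere-transcribe-03-2026",
    "tts": "openbmb/VoxCPM2",
}

LOAD_CALLS = []  # records each real load, for demonstration only


def _load_weights(model_id: str) -> dict:
    """Placeholder for the expensive download-and-initialize step."""
    LOAD_CALLS.append(model_id)
    return {"model_id": model_id}


@lru_cache(maxsize=None)
def get_model(kind: str) -> dict:
    """Load a model on first use; later calls return the cached object."""
    if kind not in MODEL_IDS:
        raise KeyError(f"Unknown model kind: {kind!r}")
    return _load_weights(MODEL_IDS[kind])


def preload_all() -> dict:
    """Warm every model ahead of a demo and report seconds spent on each."""
    timings = {}
    for kind in MODEL_IDS:
        start = time.perf_counter()
        get_model(kind)
        timings[kind] = time.perf_counter() - start
    return timings
```

Calling `preload_all()` a second time is cheap, because `lru_cache` returns the objects already held in memory.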
## Deploy the Modal backend
1. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
2. Create a Modal secret named `third-eye-hf` containing `HF_TOKEN`.
3. Authenticate Modal locally.
4. Deploy the backend:
```bash
modal deploy modal_backend.py
```
5. Add `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` as Hugging Face Space secrets.
Run the remote smoke test after deployment:
```bash
modal run modal_backend.py --image-path assets/sample_menu.jpg
```
This creates `out.wav` after a real vision and TTS pass.
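
To confirm the smoke test produced playable audio, a quick check with Python's standard `wave` module might look like this. The helper name `wav_summary` is hypothetical, and the check assumes `out.wav` is PCM-encoded. The `wave` module raises `wave.Error` on other encodings such as 32-bit float.

```python
import wave


def wav_summary(path: str) -> dict:
    """Open a PCM WAV file and report basic properties, failing if empty."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        if frames == 0:
            raise ValueError(f"{path} contains no audio frames")
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }
```

Running `wav_summary("out.wav")` after the smoke test should return a non-zero duration if the TTS pass wrote audio.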
## Verification status
- Local mock UI and utility tests can run without cloud credentials.
- Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal).
- Cohere STT additionally requires gated-model access and `HF_TOKEN`.
- No training is required; all three stages use pretrained weights.
- Exact model calls and constraints are recorded in `MODEL_VERIFICATION.md`.
## Credits
Built with OpenAI Codex.