---

title: Third Eye
emoji: "\U0001F441"
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
tags:
  - hackathon
  - build-small
  - backyard-ai
  - accessibility
  - blind
  - qwen2-vl
  - openbmb/VoxCPM2
  - CohereLabs/cohere-transcribe-03-2026
  - multimodal
  - voice-assistant
---

# Third Eye

Third Eye is a voice-first visual assistant for blind and low-vision people. Point a
camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer
without typing.

## How to use

1. Open **Describe**, **Ask**, or **Read Text**.
2. Capture a webcam image, upload one, or select a bundled example.
3. Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable.
4. Choose English or Chinese, then listen to the answer and read the high-contrast transcript.

The Space starts in mock mode when Modal credentials are absent. Mock mode validates the
complete user interface without uploading images. Real inference activates automatically
when `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` are configured.

## Models

| Stage | Model | Parameters |
|---|---|---:|
| Vision and OCR | `Qwen/Qwen2.5-VL-3B-Instruct` | 3B |
| Speech recognition | `CohereLabs/cohere-transcribe-03-2026` | 2.07B |
| Speech synthesis | `openbmb/VoxCPM2` | 2.29B |

The vision model is 3B parameters and stays below the 4B limit. It is bilingual in
English and Chinese and has strong document/OCR performance for menus, labels, and signs.

`Qwen2.5-VL` replaced the earlier `openbmb/MiniCPM-V-2`. MiniCPM-V-2 pins a legacy
Transformers stack, which cannot coexist with Cohere Transcribe (Transformers 5.4+) in a
single environment. Qwen2.5-VL runs on the same modern Transformers as Cohere, so all
three models share one runtime, which the single-environment ZeroGPU deployment requires.

## Architecture

The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration.
Inference is routed through a small backend abstraction (`app.infer`) with three
interchangeable backends, auto-selected at runtime:

- **ZeroGPU** (`zerogpu_backend.py`): all three models run in-process on a Hugging Face
  ZeroGPU slice via `@spaces.GPU`. One environment, modern Transformers throughout.
- **Modal** (`modal_backend.py`): three separately versioned Modal A10G functions with a
  shared weight cache. Selected when `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` are present.
- **Preview (mock)**: runs the full interface with no GPU and never uploads the image.
  Active locally when no GPU backend is detected.
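
The selection rules above can be sketched as a small pure function. This is a minimal
illustration, not the code in `app.infer`: the function name `select_backend`, the
`zerogpu_available` flag, and the precedence order (explicit override, then ZeroGPU, then
Modal, then mock) are assumptions made for the example.

```python
import os
from typing import Mapping

VALID_BACKENDS = ("zerogpu", "modal", "mock")


def select_backend(env: Mapping[str, str], zerogpu_available: bool) -> str:
    """Return a backend name from environment settings (hypothetical helper)."""
    # An explicit THIRD_EYE_MOCK=true forces the preview backend.
    if env.get("THIRD_EYE_MOCK", "").strip().lower() == "true":
        return "mock"

    # THIRD_EYE_BACKEND=zerogpu|modal|mock overrides auto-detection.
    forced = env.get("THIRD_EYE_BACKEND", "auto").strip().lower()
    if forced in VALID_BACKENDS:
        return forced

    # Assumed auto order: in-process ZeroGPU, then Modal, then mock.
    if zerogpu_available:
        return "zerogpu"
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return "modal"
    return "mock"


if __name__ == "__main__":
    # The caller decides how to detect ZeroGPU; False here is a placeholder.
    print(select_backend(os.environ, zerogpu_available=False))
```

Keeping the decision in one function that takes the environment as an argument makes it
testable without a GPU or cloud credentials.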

## Accessibility and Iris

Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings,
high-contrast glass panels, large targets, reduced-motion support, and a persistent textual
status. Its visual state moves through listening, seeing, thinking, and speaking while the
same state is exposed as text for screen-reader users.
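
One way to keep the visual state and the screen-reader text in step is to derive both from
a single enumeration. The sketch below is illustrative only: the names `IrisState` and
`status_text` and the status wording are hypothetical, not taken from the app.

```python
from enum import Enum


class IrisState(Enum):
    """The four pipeline states named above (hypothetical type)."""

    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Illustrative wording; the real status strings may differ.
STATUS_TEXT = {
    IrisState.LISTENING: "Listening to your question",
    IrisState.SEEING: "Looking at the image",
    IrisState.THINKING: "Working out the answer",
    IrisState.SPEAKING: "Speaking the answer",
}


def status_text(state: IrisState) -> str:
    """Return the persistent textual status for a visual state."""
    return STATUS_TEXT[state]


if __name__ == "__main__":
    for state in IrisState:
        print(f"{state.value}: {status_text(state)}")
```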

## On-device roadmap

The app runs on hosted GPU (ZeroGPU or Modal). It is not a phone build. Qwen2.5-VL ships
official GGUF and quantized variants, making an offline visual path technically credible, but
VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work.
The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured
memory, latency, battery, and quality results for the full stack. No on-device runtime is
claimed here.

## Run locally

```bash
python -m venv .venv
# Windows activation; on macOS/Linux use: source .venv/bin/activate
.venv\Scripts\activate
pip install -r requirements.txt
python app.py
```

Mock mode is automatic without credentials. To force it:

```bash
# Windows cmd syntax; on macOS/Linux use: export THIRD_EYE_MOCK=true
set THIRD_EYE_MOCK=true
python app.py
```

On Windows, the canonical launcher is:

```powershell
.\start.ps1
```

It defaults to `0.0.0.0:7860`, and you can override the bind address with
`THIRD_EYE_HOST` or the port with `THIRD_EYE_PORT` / `PORT`.
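
The override behaviour can be sketched in Python as follows. This is a hedged example:
`resolve_bind` is a hypothetical helper, and the assumption that `THIRD_EYE_PORT` takes
precedence over `PORT` is not confirmed by the launcher itself.

```python
import os
from typing import Mapping, Tuple

DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 7860


def resolve_bind(env: Mapping[str, str]) -> Tuple[str, int]:
    """Resolve the bind address and port from environment variables."""
    host = env.get("THIRD_EYE_HOST") or DEFAULT_HOST
    # Assumed precedence: THIRD_EYE_PORT first, then PORT, then the default.
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or str(DEFAULT_PORT)
    return host, int(raw_port)


if __name__ == "__main__":
    host, port = resolve_bind(os.environ)
    print(f"Binding to {host}:{port}")
```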

## Run on Hugging Face ZeroGPU

This Space is built to run all inference in-process on ZeroGPU, with no external GPU service.

1. Create a Gradio Space and set its hardware to **ZeroGPU** in the Space settings.
2. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
3. Add an `HF_TOKEN` Space secret with access to that gated model.
4. Push this repo. `requirements.txt` installs the full model stack; the app
   auto-detects the `spaces` runtime and serves live inference (`THIRD_EYE_BACKEND=auto`).

Models lazy-load on first use, so the first request of each kind is slower while weights
download and warm up. Use the **Diagnostics → Pre-load models** button to warm them ahead
of a demo. Force a backend explicitly with `THIRD_EYE_BACKEND=zerogpu|modal|mock`.
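
Lazy loading with an optional warm-up can be sketched as a small cache around loader
callables. This is a minimal illustration under stated assumptions: the `LazyModels`
class, its method names, and the stand-in loaders are hypothetical, and the real app may
structure loading differently.

```python
import threading
from typing import Any, Callable, Dict, Mapping


class LazyModels:
    """Load each model on first use and reuse it afterwards (hypothetical)."""

    def __init__(self, loaders: Mapping[str, Callable[[], Any]]) -> None:
        self._loaders = dict(loaders)
        self._cache: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, name: str) -> Any:
        # The lock keeps concurrent first requests from loading a model twice.
        with self._lock:
            if name not in self._cache:
                self._cache[name] = self._loaders[name]()
            return self._cache[name]

    def preload(self) -> None:
        """Warm every model ahead of time, as a pre-load button might."""
        for name in self._loaders:
            self.get(name)


if __name__ == "__main__":
    # Stand-in loaders; real ones would download and initialise weights.
    models = LazyModels({
        "vision": lambda: "vision-model",
        "stt": lambda: "stt-model",
        "tts": lambda: "tts-model",
    })
    models.preload()
    print(models.get("vision"))
```

The first `get` for a name pays the load cost; later calls return the cached object.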

## Deploy the Modal backend

1. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
2. Create a Modal secret named `third-eye-hf` containing `HF_TOKEN`.
3. Authenticate Modal locally.
4. Deploy the backend:

```bash
modal deploy modal_backend.py
```

5. Add `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` as Hugging Face Space secrets.

Run the remote smoke test after deployment:

```bash
modal run modal_backend.py --image-path assets/sample_menu.jpg
```

This creates `out.wav` after a real vision and TTS pass.

## Verification status

- Local mock UI and utility tests can run without cloud credentials.
- Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal).
- Cohere STT additionally requires gated-model access and `HF_TOKEN`.
- No training is required; all three stages use pretrained weights.
- Exact model calls and constraints are recorded in `MODEL_VERIFICATION.md`.

## Credits

Built with OpenAI Codex.