---
title: Third Eye
emoji: "\U0001F441"
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
tags:
  - hackathon
  - build-small
  - backyard-ai
  - accessibility
  - blind
  - qwen2-vl
  - openbmb/VoxCPM2
  - CohereLabs/cohere-transcribe-03-2026
  - multimodal
  - voice-assistant
---

# Third Eye

Third Eye is a voice-first visual assistant for blind and low-vision people. Point a
camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer
without typing.

## How to use

1. Open **Describe**, **Ask**, or **Read Text**.
2. Capture a webcam image, upload one, or select a bundled example.
3. Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable.
4. Choose English or Chinese, then listen to the answer and read the high-contrast transcript.

The Space starts in mock mode when no GPU backend is available, that is, when there is no
ZeroGPU runtime and no Modal credentials. Mock mode exercises the complete user interface
without uploading images. Real inference activates automatically on a ZeroGPU Space, or when
`MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` are configured.

## Models

| Stage | Model | Parameters |
|---|---|---:|
| Vision and OCR | `Qwen/Qwen2.5-VL-3B-Instruct` | 3B |
| Speech recognition | `CohereLabs/cohere-transcribe-03-2026` | 2.07B |
| Speech synthesis | `openbmb/VoxCPM2` | 2.29B |

The vision model is 3B parameters and stays below the 4B limit. It is bilingual in
English and Chinese and has strong document/OCR performance for menus, labels, and signs.

`Qwen2.5-VL` replaced the earlier `openbmb/MiniCPM-V-2`. MiniCPM-V-2 pins a legacy
Transformers stack, which cannot coexist with Cohere Transcribe (Transformers 5.4+) in a
single environment. Qwen2.5-VL runs on the same modern Transformers as Cohere Transcribe, so
all three models share one runtime, which the single-environment ZeroGPU deployment requires.

## Architecture

The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration.
Inference is routed through a small backend abstraction (`app.infer`) with three
interchangeable backends, auto-selected at runtime:

- **ZeroGPU** (`zerogpu_backend.py`) — all three models run in-process on a Hugging Face
  ZeroGPU slice via `@spaces.GPU`. One environment, modern Transformers throughout.
- **Modal** (`modal_backend.py`) — three separately versioned Modal A10G functions with a
  shared weight cache. Selected when `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` are present.
- **Preview (mock)** — runs the full interface with no GPU and never uploads the image.
  Active locally when no GPU backend is detected.
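
The auto-selection described above could look roughly like the sketch below. It is an illustration, not the code in `app.infer`: the function names, the use of an importable `spaces` package as the ZeroGPU signal, the precedence of ZeroGPU over Modal, and the fallback to auto for unrecognized `THIRD_EYE_BACKEND` values are all assumptions.

```python
import importlib.util
import os
from typing import Mapping, Optional

VALID_BACKENDS = ("zerogpu", "modal", "mock")


def spaces_runtime_available() -> bool:
    """Assumption: the ZeroGPU runtime is detected by `spaces` being importable."""
    return importlib.util.find_spec("spaces") is not None


def select_backend(env: Mapping[str, str], has_spaces: Optional[bool] = None) -> str:
    """Pick a backend name from environment variables (hypothetical helper)."""
    if has_spaces is None:
        has_spaces = spaces_runtime_available()

    # An explicit, recognized choice always wins.
    forced = env.get("THIRD_EYE_BACKEND", "auto").strip().lower()
    if forced in VALID_BACKENDS:
        return forced

    # THIRD_EYE_MOCK=true forces the preview backend.
    if env.get("THIRD_EYE_MOCK", "").strip().lower() == "true":
        return "mock"

    # Assumed precedence: in-process ZeroGPU first, then Modal, then mock.
    if has_spaces:
        return "zerogpu"
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return "modal"
    return "mock"


if __name__ == "__main__":
    print(select_backend(os.environ))
```

Passing the environment as a mapping keeps the decision easy to test without touching real credentials.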

## Accessibility and Iris

Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings,
high-contrast glass panels, large targets, reduced-motion support, and a persistent textual
status. Its visual state moves through listening, seeing, thinking, and speaking while the
same state is exposed as text for screen-reader users.
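
One way to keep the visual state and the textual status in sync is to derive both from a single enum. The sketch below is illustrative only: `IrisState`, the status wording, and the use of an ARIA `role="status"` live region are assumptions, not the app's actual implementation.

```python
from enum import Enum
from html import escape
from typing import Optional


class IrisState(Enum):
    # The four states named above, in the order the pipeline visits them.
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Hypothetical status wording; the app's real strings may differ.
STATUS_TEXT = {
    IrisState.LISTENING: "Listening for your question.",
    IrisState.SEEING: "Looking at the image.",
    IrisState.THINKING: "Working out the answer.",
    IrisState.SPEAKING: "Speaking the answer.",
}


def next_state(state: IrisState) -> Optional[IrisState]:
    """Return the following state, or None once speaking has finished."""
    order = list(IrisState)  # Enum iteration follows definition order
    position = order.index(state)
    return order[position + 1] if position + 1 < len(order) else None


def status_html(state: IrisState) -> str:
    """Render the status as a polite live region that screen readers announce."""
    return f'<div role="status" aria-live="polite">{escape(STATUS_TEXT[state])}</div>'
```

Because the animation and the text come from the same value, the persistent status cannot drift from what sighted users see.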

## On-device roadmap

The app runs on hosted GPU (ZeroGPU or Modal). It is not a phone build. Qwen2.5-VL ships
official GGUF and quantized variants, making an offline visual path technically credible, but
VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work.
The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured
memory, latency, battery, and quality results for the full stack. No on-device runtime is
claimed here.

## Run locally

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python app.py
```

Mock mode is automatic when no GPU backend is detected. To force it:

```bash
export THIRD_EYE_MOCK=true   # Windows cmd: set THIRD_EYE_MOCK=true
python app.py
```

On Windows, the canonical launcher is:

```powershell
.\start.ps1
```

It defaults to `0.0.0.0:7860`, and you can override the bind address with
`THIRD_EYE_HOST` or the port with `THIRD_EYE_PORT` / `PORT`.
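
For readers launching `app.py` directly, the same overrides could be resolved in Python along these lines. This is a sketch under assumptions: `resolve_bind` is a hypothetical helper, and giving `THIRD_EYE_PORT` precedence over `PORT` is a guess at the intended order.

```python
import os
from typing import Mapping, Tuple

DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 7860


def resolve_bind(env: Mapping[str, str]) -> Tuple[str, int]:
    """Return (host, port) from the environment, falling back to the defaults."""
    host = env.get("THIRD_EYE_HOST") or DEFAULT_HOST
    # Assumed precedence: THIRD_EYE_PORT first, then the generic PORT.
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or str(DEFAULT_PORT)
    port = int(raw_port)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port


if __name__ == "__main__":
    host, port = resolve_bind(os.environ)
    print(f"binding to {host}:{port}")
```

The resulting pair maps onto Gradio's `launch(server_name=host, server_port=port)` arguments.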

## Run on Hugging Face ZeroGPU

This Space is built to run all inference in-process on ZeroGPU — no external GPU service.

1. Create a Gradio Space and set its hardware to **ZeroGPU** in the Space settings.
2. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
3. Add an `HF_TOKEN` Space secret with access to that gated model.
4. Push this repo. `requirements.txt` installs the full model stack; the app
   auto-detects the `spaces` runtime and serves live inference (`THIRD_EYE_BACKEND=auto`).

Models lazy-load on first use, so the first request of each kind is slower while weights
download and warm up. Use the **Diagnostics → Pre-load models** button to warm them ahead
of a demo. Force a backend explicitly with `THIRD_EYE_BACKEND=zerogpu|modal|mock`.
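
The lazy-load and pre-load behaviour can be pictured as a small cache of loader functions. The sketch below is a generic pattern, not the code in `zerogpu_backend.py`; `LazyModels` and the loader callables are hypothetical stand-ins for the real model constructors.

```python
import threading
from typing import Any, Callable, Dict, List


class LazyModels:
    """Load each model on first use and reuse it afterwards (illustrative only)."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders
        self._loaded: Dict[str, Any] = {}
        self._lock = threading.Lock()  # avoid loading twice under concurrent requests

    def get(self, name: str) -> Any:
        with self._lock:
            if name not in self._loaded:
                # The first request of each kind pays the download and warm-up cost.
                self._loaded[name] = self._loaders[name]()
            return self._loaded[name]

    def preload(self) -> List[str]:
        """Warm every model ahead of time; returns the names now loaded."""
        for name in self._loaders:
            self.get(name)
        return list(self._loaded)


if __name__ == "__main__":
    # Placeholder loaders; real ones would construct the vision, STT, and TTS models.
    models = LazyModels({"vision": lambda: "vision-model", "stt": lambda: "stt-model"})
    print(models.preload())
```

A pre-load button would simply call `preload()`, so later requests skip the slow path.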

## Deploy the Modal backend

1. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
2. Create a Modal secret named `third-eye-hf` containing `HF_TOKEN`.
3. Authenticate Modal locally.
4. Deploy the backend:

   ```bash
   modal deploy modal_backend.py
   ```

5. Add `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` as Hugging Face Space secrets.

Run the remote smoke test after deployment:

```bash
modal run modal_backend.py --image-path assets/sample_menu.jpg
```

This creates `out.wav` after a real vision and TTS pass.

## Verification status

- Local mock UI and utility tests can run without cloud credentials.
- Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal).
- Cohere STT additionally requires gated-model access and `HF_TOKEN`.
- No training is required; all three stages use pretrained weights.
- Exact model calls and constraints are recorded in `MODEL_VERIFICATION.md`.

## Credits

Built with OpenAI Codex.