---
title: Third Eye
emoji: "\U0001F441"
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
tags:
- hackathon
- build-small
- backyard-ai
- accessibility
- blind
- qwen2-vl
- openbmb/VoxCPM2
- CohereLabs/cohere-transcribe-03-2026
- multimodal
- voice-assistant
---
# Third Eye
Third Eye is a voice-first visual assistant for blind and low-vision people. Point a
camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer
without typing.
## How to use
1. Open **Describe**, **Ask**, or **Read Text**.
2. Capture a webcam image, upload one, or select a bundled example.
3. Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable.
4. Choose English or Chinese, then listen to the answer and read the high-contrast transcript.
The Space starts in mock mode when no GPU backend is available, that is, when neither the
ZeroGPU runtime nor Modal credentials are detected. Mock mode exercises the complete user
interface without uploading images. With the Modal backend, real inference activates
automatically when `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` are configured.
## Models
| Stage | Model | Parameters |
|---|---|---:|
| Vision and OCR | `Qwen/Qwen2.5-VL-3B-Instruct` | 3B |
| Speech recognition | `CohereLabs/cohere-transcribe-03-2026` | 2.07B |
| Speech synthesis | `openbmb/VoxCPM2` | 2.29B |
The vision model has 3B parameters, which keeps it below the 4B limit. It is bilingual in
English and Chinese and performs well on document and OCR tasks such as menus, labels, and signs.
`Qwen2.5-VL` replaced the earlier `openbmb/MiniCPM-V-2`. MiniCPM-V-2 pins a legacy
Transformers stack, which cannot coexist with Cohere Transcribe (Transformers 5.4+) in a
single environment. Qwen2.5-VL runs on the same modern Transformers as Cohere, so all
three models share one runtime, which the single-environment ZeroGPU deployment requires.
## Architecture
The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration.
Inference is routed through a small backend abstraction (`app.infer`) with three
interchangeable backends, auto-selected at runtime:
- **ZeroGPU** (`zerogpu_backend.py`): all three models run in-process on a Hugging Face
ZeroGPU slice via `@spaces.GPU`. One environment, modern Transformers throughout.
- **Modal** (`modal_backend.py`): three separately versioned Modal A10G functions with a
shared weight cache. Selected when `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` are present.
- **Preview (mock)**: runs the full interface with no GPU and never uploads the image.
Active locally when no GPU backend is detected.
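The auto-selection described above could be sketched roughly as follows. This is an illustration, not the code in `app.infer`: `select_backend` and its `spaces_available` parameter are hypothetical names, and the precedence shown (explicit `THIRD_EYE_BACKEND`, then `THIRD_EYE_MOCK`, then ZeroGPU, then Modal, then mock) is an assumption, since the README does not state the actual order.

```python
import os
from typing import Mapping, Optional

VALID_BACKENDS = {"zerogpu", "modal", "mock"}


def select_backend(
    env: Optional[Mapping[str, str]] = None,
    spaces_available: bool = False,
) -> str:
    """Return "zerogpu", "modal", or "mock" (hypothetical helper)."""
    env = os.environ if env is None else env

    # An explicit choice wins; "auto" (or unset) falls through to detection.
    forced = (env.get("THIRD_EYE_BACKEND") or "auto").strip().lower()
    if forced in VALID_BACKENDS:
        return forced
    if forced != "auto":
        raise ValueError(f"Unknown THIRD_EYE_BACKEND: {forced!r}")

    # THIRD_EYE_MOCK=true forces the preview backend.
    if env.get("THIRD_EYE_MOCK", "").strip().lower() == "true":
        return "mock"

    # Assumed order: in-process ZeroGPU first, then Modal, then mock.
    if spaces_available:
        return "zerogpu"
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return "modal"
    return "mock"
```

Passing the environment as a mapping keeps the function easy to test. A caller would supply `spaces_available` from whatever check the app uses to detect the `spaces` runtime.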
## Accessibility and Iris
Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings,
high-contrast glass panels, large targets, reduced-motion support, and a persistent textual
status. Its visual state moves through listening, seeing, thinking, and speaking while the
same state is exposed as text for screen-reader users.
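One way to keep the visual state and its textual counterpart in sync is to derive both from a single enumeration. The sketch below is illustrative only: `IrisState`, `STATUS_TEXT`, and the message wording are hypothetical and not taken from the app.

```python
from enum import Enum


class IrisState(Enum):
    """The four states named above (hypothetical type)."""

    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Hypothetical wording; the real status strings may differ.
STATUS_TEXT = {
    IrisState.LISTENING: "Listening for your question.",
    IrisState.SEEING: "Looking at the image.",
    IrisState.THINKING: "Working out the answer.",
    IrisState.SPEAKING: "Speaking the answer.",
}


def status_text(state: IrisState) -> str:
    """Return the screen-reader text for a visual state."""
    return STATUS_TEXT[state]
```

Because the animation and the status line both read from the same `IrisState` value, they cannot drift apart.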
## On-device roadmap
The app runs on hosted GPU (ZeroGPU or Modal). It is not a phone build. Qwen2.5-VL ships
official GGUF and quantized variants, making an offline visual path technically credible, but
VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work.
The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured
memory, latency, battery, and quality results for the full stack. No on-device runtime is
claimed here.
## Run locally
```bash
python -m venv .venv
# macOS/Linux activation; on Windows run .venv\Scripts\activate instead
source .venv/bin/activate
pip install -r requirements.txt
python app.py
```
Mock mode is automatic without credentials. To force it:
```bash
# macOS/Linux; in Windows cmd use: set THIRD_EYE_MOCK=true
export THIRD_EYE_MOCK=true
python app.py
```
On Windows, the canonical launcher is:
```powershell
.\start.ps1
```
It defaults to `0.0.0.0:7860`, and you can override the bind address with
`THIRD_EYE_HOST` or the port with `THIRD_EYE_PORT` / `PORT`.
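A minimal sketch of how these overrides might be resolved in Python. `resolve_bind` is a hypothetical helper, and the precedence of `THIRD_EYE_PORT` over `PORT` is an assumption; the README only says that either variable can set the port.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_bind(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Return (host, port) from the environment, falling back to 0.0.0.0:7860."""
    env = os.environ if env is None else env
    host = env.get("THIRD_EYE_HOST") or "0.0.0.0"

    # Assumed precedence: THIRD_EYE_PORT, then PORT, then the default.
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or "7860"
    port = int(raw_port)
    if not 0 < port < 65536:
        raise ValueError(f"Port out of range: {port}")
    return host, port
```

The result can be passed to Gradio's `launch(server_name=host, server_port=port)`.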
## Run on Hugging Face ZeroGPU
This Space is built to run all inference in-process on ZeroGPU, with no external GPU service.
1. Create a Gradio Space and set its hardware to **ZeroGPU** in the Space settings.
2. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
3. Add an `HF_TOKEN` Space secret with access to that gated model.
4. Push this repo. `requirements.txt` installs the full model stack; the app
auto-detects the `spaces` runtime and serves live inference (`THIRD_EYE_BACKEND=auto`).
Models lazy-load on first use, so the first request of each kind is slower while weights
download and warm up. Use the **Diagnostics → Pre-load models** button to warm them ahead
of a demo. Force a backend explicitly with `THIRD_EYE_BACKEND=zerogpu|modal|mock`.
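The lazy-load and pre-load behaviour can be illustrated with a cached loader. This is a generic sketch of the pattern, not the app's implementation: `register_loader`, `get_model`, and `preload_all` are hypothetical names, and real loaders would wrap each model's own loading call.

```python
from functools import lru_cache
from typing import Callable, Dict, List

# Hypothetical registry: stage name -> zero-argument loader.
_LOADERS: Dict[str, Callable[[], object]] = {}


def register_loader(stage: str, loader: Callable[[], object]) -> None:
    """Associate a stage (for example "vision") with its loader."""
    _LOADERS[stage] = loader


@lru_cache(maxsize=None)
def get_model(stage: str) -> object:
    """Run the loader on the first call for a stage, then reuse the result."""
    return _LOADERS[stage]()


def preload_all() -> List[str]:
    """Warm every registered stage, as a pre-load button might."""
    for stage in _LOADERS:
        get_model(stage)
    return sorted(_LOADERS)
```

The first `get_model("vision")` pays the load cost; later calls return the cached object, which is why only the first request of each kind is slow.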
## Deploy the Modal backend
1. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
2. Create a Modal secret named `third-eye-hf` containing `HF_TOKEN`.
3. Authenticate Modal locally.
4. Deploy the backend:
```bash
modal deploy modal_backend.py
```
5. Add `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` as Hugging Face Space secrets.
Run the remote smoke test after deployment:
```bash
modal run modal_backend.py --image-path assets/sample_menu.jpg
```
This creates `out.wav` after a real vision and TTS pass.
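To confirm that the smoke test produced playable audio, a quick check with Python's standard `wave` module might look like the sketch below. `wav_duration_seconds` is a hypothetical helper, and it assumes `out.wav` is uncompressed PCM, which is the only encoding `wave` reads.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    if rate <= 0:
        raise ValueError(f"Invalid sample rate in {path}: {rate}")
    return frames / rate
```

A duration of zero, or a `wave.Error` on open, would suggest the TTS pass did not write valid audio.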
## Verification status
- Local mock UI and utility tests can run without cloud credentials.
- Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal).
- Cohere STT additionally requires gated-model access and `HF_TOKEN`.
- No training is required; all three stages use pretrained weights.
- Exact model calls and constraints are recorded in `MODEL_VERIFICATION.md`.
## Credits
Built with OpenAI Codex.