metadata
title: Third Eye
emoji: 👁
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
tags:
  - hackathon
  - build-small
  - backyard-ai
  - accessibility
  - blind
  - qwen2-vl
  - openbmb/VoxCPM2
  - CohereLabs/cohere-transcribe-03-2026
  - multimodal
  - voice-assistant

Third Eye

Third Eye is a voice-first visual assistant for blind and low-vision people. Point a camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer without typing.

How to use

  1. Open Describe, Ask, or Read Text.
  2. Capture a webcam image, upload one, or select a bundled example.
  3. Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable.
  4. Choose English or Chinese, then listen to the answer and read the high-contrast transcript.

The Space starts in mock mode when no GPU backend is available, for example when it runs outside ZeroGPU and Modal credentials are absent. Mock mode exercises the complete user interface without uploading images. On the Modal path, real inference activates automatically when MODAL_TOKEN_ID and MODAL_TOKEN_SECRET are configured.

Models

| Stage | Model | Parameters |
| --- | --- | --- |
| Vision and OCR | Qwen/Qwen2.5-VL-3B-Instruct | 3B |
| Speech recognition | CohereLabs/cohere-transcribe-03-2026 | 2.07B |
| Speech synthesis | openbmb/VoxCPM2 | 2.29B |

The vision model is 3B parameters and stays below the 4B limit. It is bilingual in English and Chinese and has strong document/OCR performance for menus, labels, and signs.
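
As a rough illustration of how the three stages in the table chain together in Ask mode, here is a minimal sketch. The names `Backend`, `MockBackend`, `Result`, and `run_ask` are hypothetical and are not taken from app.py; the mock returns canned strings rather than calling any model.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class Backend(Protocol):
    """Hypothetical interface: one method per pipeline stage."""

    def transcribe(self, audio: bytes) -> str: ...
    def answer(self, image: bytes, question: str) -> str: ...
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class Result:
    question: str
    answer: str
    speech: bytes


class MockBackend:
    """Canned responses only; the image is never read or sent anywhere."""

    def transcribe(self, audio: bytes) -> str:
        return "What is on the menu?"

    def answer(self, image: bytes, question: str) -> str:
        return f"[mock] Answer to: {question}"

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # placeholder bytes, not real audio


def run_ask(backend: Backend, image: bytes,
            audio: Optional[bytes] = None,
            typed: Optional[str] = None) -> Result:
    # Stage 1: speech recognition, with the typed question as a fallback.
    if audio:
        question = backend.transcribe(audio)
    elif typed and typed.strip():
        question = typed.strip()
    else:
        raise ValueError("a spoken or typed question is required")
    # Stage 2: vision and OCR answer the question about the image.
    answer = backend.answer(image, question)
    # Stage 3: speech synthesis turns the answer into audio.
    return Result(question, answer, backend.synthesize(answer))
```

Keeping each stage behind one method means a mock and a GPU-backed implementation can be swapped without touching the orchestration code.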

Qwen2.5-VL replaced the earlier openbmb/MiniCPM-V-2. MiniCPM-V-2 pins a legacy Transformers stack, which cannot coexist with Cohere Transcribe (Transformers 5.4+) in a single environment. Qwen2.5-VL runs on the same modern Transformers release as Cohere Transcribe, so all three models share one runtime, which the single-environment ZeroGPU deployment requires.

Architecture

The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration. Inference is routed through a small backend abstraction (app.infer) with three interchangeable backends, auto-selected at runtime:

  • ZeroGPU (zerogpu_backend.py): all three models run in-process on a Hugging Face ZeroGPU slice via @spaces.GPU. One environment, modern Transformers throughout.
  • Modal (modal_backend.py): three separately versioned Modal A10G functions with a shared weight cache. Selected when MODAL_TOKEN_ID and MODAL_TOKEN_SECRET are present.
  • Preview (mock): runs the full interface with no GPU and never uploads the image. Active locally when no GPU backend is detected.
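
The auto-selection described above could look roughly like the sketch below. It uses the environment variables named in this README (THIRD_EYE_BACKEND, THIRD_EYE_MOCK, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET), but the function `select_backend`, the `zerogpu_available` flag, and the precedence order are assumptions for illustration; the real logic lives in the app and may differ.

```python
import os
from typing import Mapping, Optional

VALID_BACKENDS = {"auto", "zerogpu", "modal", "mock"}


def select_backend(env: Optional[Mapping[str, str]] = None,
                   zerogpu_available: bool = False) -> str:
    """Return 'zerogpu', 'modal', or 'mock' (hypothetical helper)."""
    env = os.environ if env is None else env
    choice = env.get("THIRD_EYE_BACKEND", "auto").strip().lower()
    if choice not in VALID_BACKENDS:
        raise ValueError(f"unknown THIRD_EYE_BACKEND: {choice!r}")

    # Assumption: THIRD_EYE_MOCK=true wins over every other setting.
    if env.get("THIRD_EYE_MOCK", "").strip().lower() == "true":
        return "mock"
    # An explicit backend choice is returned unchanged.
    if choice != "auto":
        return choice
    # Auto mode. Assumption: ZeroGPU is preferred over Modal when both exist.
    if zerogpu_available:
        return "zerogpu"
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return "modal"
    # No GPU backend detected: fall back to the preview interface.
    return "mock"
```

Passing the environment as a mapping keeps the function easy to test without touching real credentials.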

Accessibility and Iris

Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings, high-contrast glass panels, large targets, reduced-motion support, and a persistent textual status. Its visual state moves through listening, seeing, thinking, and speaking while the same state is exposed as text for screen-reader users.
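
One way to keep the visual state and the screen-reader text in step is to derive both from a single enum. The sketch below is illustrative only: `IrisState`, `STATUS_TEXT`, `next_state`, and the status wording are hypothetical, not the app's actual identifiers or strings.

```python
from enum import Enum
from typing import Optional


class IrisState(Enum):
    LISTENING = "listening"
    SEEING = "seeing"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Hypothetical wording for the persistent textual status.
STATUS_TEXT = {
    IrisState.LISTENING: "Listening to your question",
    IrisState.SEEING: "Looking at the image",
    IrisState.THINKING: "Working out the answer",
    IrisState.SPEAKING: "Speaking the answer",
}

# Enum members iterate in definition order, which matches the flow above.
_ORDER = list(IrisState)


def next_state(state: IrisState) -> Optional[IrisState]:
    """Return the state that follows, or None once speaking has finished."""
    index = _ORDER.index(state)
    return _ORDER[index + 1] if index + 1 < len(_ORDER) else None


def status_text(state: IrisState) -> str:
    """Text exposed to screen readers for the current visual state."""
    return STATUS_TEXT[state]
```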

On-device roadmap

The app runs on hosted GPU (ZeroGPU or Modal). It is not a phone build. Qwen2.5-VL ships official GGUF and quantized variants, making an offline visual path technically credible, but VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work. The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured memory, latency, battery, and quality results for the full stack. No on-device runtime is claimed here.

Run locally

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python app.py

Mock mode is automatic without credentials. To force it from Windows Command Prompt (in PowerShell, use $env:THIRD_EYE_MOCK = "true" instead):

set THIRD_EYE_MOCK=true
python app.py

On Windows, the canonical launcher is:

.\start.ps1

It defaults to 0.0.0.0:7860, and you can override the bind address with THIRD_EYE_HOST or the port with THIRD_EYE_PORT / PORT.
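
For readers who launch app.py directly, the same override rules can be sketched in Python. The helper `resolve_bind` is hypothetical, and the precedence of THIRD_EYE_PORT over PORT is an assumption; start.ps1 may resolve them differently.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_HOST = "0.0.0.0"
DEFAULT_PORT = 7860


def resolve_bind(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Return (host, port) from the environment, falling back to defaults."""
    env = os.environ if env is None else env
    host = env.get("THIRD_EYE_HOST") or DEFAULT_HOST
    # Assumed precedence: THIRD_EYE_PORT, then PORT, then the default.
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or str(DEFAULT_PORT)
    port = int(raw_port)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port
```

The resulting pair maps onto Gradio's `launch(server_name=host, server_port=port)` arguments.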

Run on Hugging Face ZeroGPU

This Space is built to run all inference in-process on ZeroGPU, with no external GPU service.

  1. Create a Gradio Space and set its hardware to ZeroGPU in the Space settings.
  2. Accept access to CohereLabs/cohere-transcribe-03-2026.
  3. Add an HF_TOKEN Space secret with access to that gated model.
  4. Push this repo. requirements.txt installs the full model stack; the app auto-detects the spaces runtime and serves live inference (THIRD_EYE_BACKEND=auto).

Models lazy-load on first use, so the first request of each kind is slower while weights download and warm up. Use the Diagnostics → Pre-load models button to warm them ahead of a demo. Force a backend explicitly with THIRD_EYE_BACKEND=zerogpu|modal|mock.
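
The lazy-load and pre-load behaviour can be pictured with a small registry like the one below. This is a sketch under assumptions: `LazyModels` is a hypothetical class, and the loaders are stand-in callables rather than the real model constructors.

```python
import threading
from typing import Any, Callable, Dict


class LazyModels:
    """Load each model on first use and cache it (hypothetical helper)."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]):
        self._loaders = loaders
        self._loaded: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, name: str) -> Any:
        # The lock stops two concurrent requests from loading the same model.
        with self._lock:
            if name not in self._loaded:
                self._loaded[name] = self._loaders[name]()
            return self._loaded[name]

    def preload(self) -> None:
        # What a "Pre-load models" button could call before a demo.
        for name in self._loaders:
            self.get(name)
```

The first `get` for each name pays the load cost; later calls return the cached object immediately.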

Deploy the Modal backend

  1. Accept access to CohereLabs/cohere-transcribe-03-2026.
  2. Create a Modal secret named third-eye-hf containing HF_TOKEN.
  3. Authenticate Modal locally.
  4. Deploy the backend:
     modal deploy modal_backend.py
  5. Add MODAL_TOKEN_ID and MODAL_TOKEN_SECRET as Hugging Face Space secrets.

Run the remote smoke test after deployment:

modal run modal_backend.py --image-path assets/sample_menu.jpg

This creates out.wav after a real vision and TTS pass.
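
To confirm that the smoke test produced playable audio, a quick check with Python's standard `wave` module may help. The helper `wav_duration_seconds` is hypothetical, and the sketch assumes out.wav is PCM-encoded; the `wave` module raises `wave.Error` on formats it does not support, such as float WAV.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file, rejecting empty audio."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    if frames == 0 or rate == 0:
        raise ValueError(f"{path} contains no audio frames")
    return frames / rate


if __name__ == "__main__":
    print(f"out.wav duration: {wav_duration_seconds('out.wav'):.2f} s")
```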

Verification status

  • Local mock UI and utility tests can run without cloud credentials.
  • Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal).
  • Cohere STT additionally requires gated-model access and HF_TOKEN.
  • No training is required; all three stages use pretrained weights.
  • Exact model calls and constraints are recorded in MODEL_VERIFICATION.md.

Credits

Built with OpenAI Codex.