third-eye / README.md

mitvho09

Deploy accessible futurist refresh

2d469d4 verified 17 days ago


	---
	title: Third Eye
	emoji: "\U0001F441"
	colorFrom: indigo
	colorTo: blue
	sdk: gradio
	sdk_version: "5.50.0"
	app_file: app.py
	pinned: false
	tags:
	- hackathon
	- build-small
	- backyard-ai
	- accessibility
	- blind
	- qwen2-vl
	- openbmb/VoxCPM2
	- CohereLabs/cohere-transcribe-03-2026
	- multimodal
	- voice-assistant
	---

	# Third Eye

	Third Eye is a voice-first visual assistant for blind and low-vision people. Point a
	camera at a menu, medicine label, sign, or scene, ask a question, and hear the answer
	without typing.

	## How to use

	1. Open Describe, Ask, or Read Text.
	2. Capture a webcam image, upload one, or select a bundled example.
	3. Speak a question in Ask mode, or use the typed fallback if the microphone is unavailable.
	4. Choose English or Chinese, then listen to the answer and read the high-contrast transcript.

	The Space starts in mock mode when no GPU backend is available. Mock mode exercises the
	complete user interface without uploading images. Real inference activates automatically
	on a ZeroGPU runtime, or when `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` are configured.
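
As a rough illustration of how Ask mode can run end to end with no GPU, the sketch below wires three stage callables (speech recognition, vision, speech synthesis) into one function and supplies canned mock stages. The `Backend` class, `mock_backend`, and `run_ask` are hypothetical names for this example, not the app's actual API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Backend:
    """Three stage callables; names are illustrative, not the app's real API."""
    transcribe: Callable[[bytes], str]       # audio bytes -> question text
    describe: Callable[[bytes, str], str]    # image bytes + prompt -> answer text
    synthesize: Callable[[str, str], bytes]  # answer text + language -> audio bytes


def mock_backend() -> Backend:
    """Canned responses only: the image and audio never leave the process."""
    return Backend(
        transcribe=lambda audio: "What is on this menu?",
        describe=lambda image, prompt: f"[mock] answer to: {prompt}",
        synthesize=lambda text, language: b"",
    )


def run_ask(backend, image, audio=None, typed_question="", language="en"):
    # Prefer the spoken question; fall back to typed text when there is no audio.
    question = backend.transcribe(audio) if audio else typed_question.strip()
    if not question:
        raise ValueError("Ask mode needs a spoken or typed question")
    answer = backend.describe(image, question)
    speech = backend.synthesize(answer, language)
    return answer, speech
```

Because the stages are plain callables, a real backend could be swapped in without changing `run_ask`, which is one way to keep the interface testable without credentials.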

	## Models

	| Stage | Model | Parameters |
	|---|---|---:|
	| Vision and OCR | `Qwen/Qwen2.5-VL-3B-Instruct` | 3B |
	| Speech recognition | `CohereLabs/cohere-transcribe-03-2026` | 2.07B |
	| Speech synthesis | `openbmb/VoxCPM2` | 2.29B |

	At 3B parameters, the vision model stays below the 4B limit. It handles both English and
	Chinese and performs well on document and OCR tasks such as menus, labels, and signs.
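
The size check can be written down as a few lines of Python. This sketch copies the parameter counts from the table and assumes the 4B limit applies to each model individually; that reading is an assumption, since the README only states it for the vision model.

```python
# Parameter counts in billions, copied from the Models table.
MODEL_PARAMS_B = {
    "Qwen/Qwen2.5-VL-3B-Instruct": 3.0,
    "CohereLabs/cohere-transcribe-03-2026": 2.07,
    "openbmb/VoxCPM2": 2.29,
}

LIMIT_B = 4.0  # assumed here to apply per model, not to the combined stack


def over_limit(params_b, limit_b=LIMIT_B):
    """Return the models whose parameter count is at or above the limit."""
    return [name for name, size in params_b.items() if size >= limit_b]


print(over_limit(MODEL_PARAMS_B))  # prints []
```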

	`Qwen2.5-VL` replaced the earlier `openbmb/MiniCPM-V-2`. MiniCPM-V-2 pins a legacy
	Transformers stack that cannot coexist with Cohere Transcribe (Transformers 5.4+) in a
	single environment. Qwen2.5-VL runs on the same modern Transformers release as Cohere
	Transcribe, so all three models share one runtime, which the single-environment ZeroGPU
	deployment requires.

	## Architecture

	The Gradio app handles webcam, microphone, accessibility state, and pipeline orchestration.
	Inference is routed through a small backend abstraction (`app.infer`) with three
	interchangeable backends, auto-selected at runtime:

	- ZeroGPU (`zerogpu_backend.py`) — all three models run in-process on a Hugging Face
	ZeroGPU slice via `@spaces.GPU`. One environment, modern Transformers throughout.
	- Modal (`modal_backend.py`) — three separately versioned Modal A10G functions with a
	shared weight cache. Selected when `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` are present.
	- Preview (mock) — runs the full interface with no GPU and never uploads the image.
	Active locally when no GPU backend is detected.
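
The auto-selection described above could look roughly like the following. This is a minimal sketch, not the code in `app.infer`: the function name `select_backend`, the precedence order (explicit override, then forced mock, then ZeroGPU, then Modal), and the use of an importable `spaces` package as the ZeroGPU signal are all assumptions.

```python
import importlib.util
import os


def select_backend(env=None, spaces_available=None):
    """Pick "zerogpu", "modal", or "mock". The precedence order is an assumption."""
    env = os.environ if env is None else env
    if spaces_available is None:
        # Treat an importable `spaces` package as the sign of a ZeroGPU runtime.
        spaces_available = importlib.util.find_spec("spaces") is not None

    forced = env.get("THIRD_EYE_BACKEND", "auto").strip().lower()
    if forced in {"zerogpu", "modal", "mock"}:
        return forced
    if env.get("THIRD_EYE_MOCK", "").strip().lower() in {"1", "true", "yes"}:
        return "mock"
    if spaces_available:
        return "zerogpu"
    if env.get("MODAL_TOKEN_ID") and env.get("MODAL_TOKEN_SECRET"):
        return "modal"
    return "mock"
```

Passing `env` and `spaces_available` explicitly keeps the function easy to test; called with no arguments it reads the real environment.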

	## Accessibility and Iris

	Iris presents one large action per task, 20px base text, 24px answer text, strong focus rings,
	high-contrast glass panels, large targets, reduced-motion support, and a persistent textual
	status. Its visual state moves through listening, seeing, thinking, and speaking while the
	same state is exposed as text for screen-reader users.
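
One way to keep the visual state and the screen-reader text in sync is to derive both from a single enum. The sketch below is illustrative only: `IrisState`, `ASK_SEQUENCE`, `status_text`, and the status wording are hypothetical and do not come from the app's source.

```python
from enum import Enum


class IrisState(Enum):
    # The four states named above; the status strings are placeholder wording.
    LISTENING = "Listening to your question"
    SEEING = "Looking at the image"
    THINKING = "Working out the answer"
    SPEAKING = "Speaking the answer"


# Order of the states for one Ask request.
ASK_SEQUENCE = [
    IrisState.LISTENING,
    IrisState.SEEING,
    IrisState.THINKING,
    IrisState.SPEAKING,
]


def status_text(state):
    """Plain-text status shown alongside the animation for the same state."""
    return f"Status: {state.value}"


for state in ASK_SEQUENCE:
    print(status_text(state))
```

Driving the animation and the text from the same value means the two cannot drift apart when a state is added or renamed.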

	## On-device roadmap

	The app runs on hosted GPU (ZeroGPU or Modal). It is not a phone build. Qwen2.5-VL ships
	official GGUF and quantized variants, making an offline visual path technically credible, but
	VoxCPM2 and Cohere Transcribe still require device-specific profiling and conversion work.
	The next milestone is an int4 Qwen2.5-VL proof on a recent Android device, followed by measured
	memory, latency, battery, and quality results for the full stack. No on-device runtime is
	claimed here.
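
For a sense of scale before any profiling, weight storage can be estimated as parameters times bits per weight divided by eight. The sketch below applies that arithmetic to the parameter counts in the Models table. It is a lower bound on weights only, not a measurement: activations, caches, runtime overhead, and layers kept at higher precision are ignored, and the helper name `weight_gib` is made up for this example.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate weight storage in GiB: parameters x bits / 8, nothing else."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


MODELS = [
    ("Qwen2.5-VL-3B", 3.0),
    ("Cohere Transcribe", 2.07),
    ("VoxCPM2", 2.29),
]

for name, params in MODELS:
    fp16 = weight_gib(params, 16)
    int4 = weight_gib(params, 4)
    print(f"{name}: {fp16:.2f} GiB at 16-bit, {int4:.2f} GiB at 4-bit")
```

For the 3B vision model this works out to roughly 1.4 GiB of weights at 4-bit, which is why the int4 proof is a plausible first step; the measured results named in the roadmap are still needed to confirm it.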

	## Run locally

	```bash
	python -m venv .venv
	source .venv/bin/activate      # macOS/Linux
	# Windows: .venv\Scripts\activate
	pip install -r requirements.txt
	python app.py
	```

	Mock mode is automatic without credentials. To force it:

	```bash
	# macOS/Linux shells; in Windows cmd use: set THIRD_EYE_MOCK=true
	export THIRD_EYE_MOCK=true
	python app.py
	```

	On Windows, the canonical launcher is:

	```powershell
	.\start.ps1
	```

	It defaults to `0.0.0.0:7860`, and you can override the bind address with
	`THIRD_EYE_HOST` or the port with `THIRD_EYE_PORT` / `PORT`.
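
The same resolution can be expressed in Python, for example if `app.py` reads these variables itself. This is a sketch under assumptions: the helper name `resolve_bind` is hypothetical, and letting `THIRD_EYE_PORT` take precedence over `PORT` is a guess based on the order in which they are listed.

```python
import os


def resolve_bind(env=None):
    """Return (host, port); THIRD_EYE_PORT is assumed to win over PORT."""
    env = os.environ if env is None else env
    host = env.get("THIRD_EYE_HOST") or "0.0.0.0"
    raw_port = env.get("THIRD_EYE_PORT") or env.get("PORT") or "7860"
    port = int(raw_port)  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port
```

The result maps directly onto Gradio's `launch(server_name=host, server_port=port)` arguments.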

	## Run on Hugging Face ZeroGPU

	This Space is built to run all inference in-process on ZeroGPU — no external GPU service.

	1. Create a Gradio Space and set its hardware to ZeroGPU in the Space settings.
	2. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
	3. Add an `HF_TOKEN` Space secret with access to that gated model.
	4. Push this repo. `requirements.txt` installs the full model stack; the app
	auto-detects the `spaces` runtime and serves live inference (`THIRD_EYE_BACKEND=auto`).

	Models lazy-load on first use, so the first request of each kind is slower while weights
	download and warm up. Use the Diagnostics → Pre-load models button to warm them ahead
	of a demo. To force a backend explicitly, set `THIRD_EYE_BACKEND` to `zerogpu`, `modal`, or `mock`.
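
Lazy loading with an optional warm-up pass can be sketched as a small cache keyed by model kind. The names `get_model` and `preload_all` and the `loaders` mapping are hypothetical; the real loaders in `zerogpu_backend.py` may be organized differently.

```python
import threading

_MODELS = {}
_LOCK = threading.Lock()


def get_model(kind, loaders):
    """Load the model for `kind` on first use and reuse it afterwards."""
    with _LOCK:
        if kind not in _MODELS:
            _MODELS[kind] = loaders[kind]()  # slow path: download and warm up
        return _MODELS[kind]


def preload_all(loaders):
    """What a pre-load button could call ahead of a demo."""
    for kind in loaders:
        get_model(kind, loaders)
    return sorted(_MODELS)
```

Only the first call for each kind pays the loading cost; the lock keeps two simultaneous first requests from loading the same weights twice.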

	## Deploy the Modal backend

	1. Accept access to `CohereLabs/cohere-transcribe-03-2026`.
	2. Create a Modal secret named `third-eye-hf` containing `HF_TOKEN`.
	3. Authenticate Modal locally.
	4. Deploy the backend:

	```bash
	modal deploy modal_backend.py
	```

	5. Add `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` as Hugging Face Space secrets.

	Run the remote smoke test after deployment:

	```bash
	modal run modal_backend.py --image-path assets/sample_menu.jpg
	```

	The smoke test runs a real vision and TTS pass and writes the result to `out.wav`.
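
A quick sanity check on the generated file can be done with Python's standard `wave` module. This sketch assumes `out.wav` is PCM-encoded; `wave` raises an error for other encodings such as floating-point WAV, in which case a different reader would be needed. The helper name `wav_summary` is made up for this example.

```python
import wave


def wav_summary(path="out.wav"):
    """Return basic facts about a PCM WAV file, or raise if it has no audio."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        if frames == 0:
            raise ValueError(f"{path} contains no audio frames")
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": round(frames / rate, 2),
        }
```

Calling `wav_summary()` after the smoke test confirms the file is readable and non-empty; it says nothing about whether the speech is intelligible.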

	## Verification status

	- Local mock UI and utility tests can run without cloud credentials.
	- Real vision, TTS, and STT require a GPU backend (ZeroGPU or Modal).
	- Cohere STT additionally requires gated-model access and `HF_TOKEN`.
	- No training is required; all three stages use pretrained weights.
	- Exact model calls and constraints are recorded in `MODEL_VERIFICATION.md`.

	## Credits

	Built with OpenAI Codex.