iris / README.md

nextmarte

docs: claim best-agent with an explicit perception-action architecture section

7ea2a91 about 12 hours ago


	---
	title: Iris
	emoji: 👁️
	colorFrom: indigo
	colorTo: yellow
	sdk: gradio
	sdk_version: 6.17.3
	app_file: app.py
	pinned: true
	license: apache-2.0
	short_description: Your father's eyes, by voice. Reads bills & money aloud.
	tags:
	- backyard-ai
	- tiny-titan
	- off-brand
	- off-the-grid
	- best-demo
	- best-agent
	- sharing-is-caring
	- community-choice
	---

	# 👁️ Iris: your father's eyes, by voice

	> Built for the Build Small Hackathon · Backyard AI track · for my father, who is blind.

	- Try it live: https://huggingface.co/spaces/build-small-hackathon/iris (open on a phone)
	- How it was built (agent trace): https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace
	- Demo video: _‹add link›_ · Social post: _‹add link›_

	Iris is a voice-first assistant for blind and low-vision people. Open it on a phone,
	point the camera, and it tells you what's around you, out loud, in your language.
	The whole screen is the button, so there's nothing small to aim for. In live
	mode it just listens and answers.

	## What it does
	- 👁️ Describe: tap anywhere. "A table ahead with a mug on the right."
	- 🎤 Ask, hands-free: in live mode, just speak. "What color is this shirt?", "read this label", "is anyone here?"
	- 💵 Read money and bills: "how much do I have?" counts the banknotes. Point at an electricity bill and it reads the amount and due date.
	- 💊 Read medicine: reads the dose and instructions on a box, exactly as written.
	- 📡 Live mode: double-tap, or say "live mode". Iris describes the scene once, then speaks up only when something new comes into view.

	## How to use it
	- Tap anywhere → describe what's in front of you.
	- Hold → ask a question (release to send).
	- Double-tap → toggle live mode (hands-free listening + new-thing alerts). Say "stop" to turn it off.
	- First run: choose your language by voice ("say your language"). Language & accessibility toggles sit in the top corners.

	## Built for a blind user first
	Accessibility shaped the whole interface, because the person it was made for asked for it:
	- The whole screen is one button. Tap to describe, hold to ask, double-tap for live mode. Nothing small to find, no menus.
	- It talks first. A spoken welcome on the first tap, and you choose your language by voice.
	- Hands-free. In live mode it listens continuously, so there are no buttons to press.
	- For low vision too: large buttons with clear labels and real SVG icons, plus a high-contrast and larger-text mode.
	- Standards: keyboard focus rings, ARIA live regions, and haptic feedback. It also honours the system's reduced-motion and contrast settings.

	## How it works: small models only, ≤ 32B total
	| Stage | Model | Params |
	|---|---|---|
	| Speech-to-text | Whisper small (faster-whisper) | ~0.24B |
	| Vision-language | Qwen3-VL-2B-Instruct | ~2B |
	| Text-to-speech | Piper (pt_BR / en_US) | <1B |

	That is about 2.5B parameters in total, well under the 32B cap, and every model
	is ≤ 4B (Tiny Titan). The voice-first frontend is custom, built on `gr.Server`
	(Off-Brand). Inference runs in the Space on ZeroGPU, with no third-party model APIs.
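The budget arithmetic can be checked with a few lines of Python. This is a minimal sketch, not part of the app: the table lists Piper only as "<1B", so the code assumes 1.0B as a conservative upper bound, and the names `PARAMS_B` and `check_budget` are hypothetical. With that bound the worst-case total comes out above the README's own estimate of about 2.5B but still far below both limits.

```python
# Parameter counts in billions, taken from the table above.
# Piper is listed only as "<1B"; 1.0 is an assumed upper bound, not a measured size.
PARAMS_B = {
    "Whisper small": 0.24,
    "Qwen3-VL-2B-Instruct": 2.0,
    "Piper (upper bound)": 1.0,
}

PER_MODEL_LIMIT_B = 4.0   # Tiny Titan: every model <= 4B
TOTAL_LIMIT_B = 32.0      # overall cap from the section heading


def check_budget(params_b, per_model_limit, total_limit):
    """Return (total, ok); ok is True when the per-model and total limits both hold."""
    total = sum(params_b.values())
    ok = total <= total_limit and all(p <= per_model_limit for p in params_b.values())
    return total, ok


total, ok = check_budget(PARAMS_B, PER_MODEL_LIMIT_B, TOTAL_LIMIT_B)
print(f"worst-case total: {total:.2f}B, within limits: {ok}")
```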

	## Architecture: a small perception-action agent
	Iris is more than one model call. It orchestrates four tools and runs a control loop:
	- Role prompts define what each model does: read money and bills, describe a scene for a blind person, report only what is new.
	- Intent routing turns a spoken phrase into an action (describe, answer a question, or toggle live mode) and is forgiving of transcription errors.
	- Tools it drives: Whisper to hear, Qwen3-VL to see and read, Piper to speak, and an on-device detector (COCO-SSD) to watch for change.
	- A live loop that perceives (camera + detector), decides whether something new is worth saying, acts (calls the vision model and speaks), and remembers what it already said so it doesn't repeat.
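The routing and the live loop described in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `app.py`: the command phrases, the `route_intent` function, the `LiveLoop` class, and the 0.75 similarity cutoff are all hypothetical, fuzzy matching with the standard library's `difflib` stands in for whatever the app actually does, and the model calls are replaced by plain callables. Memory here is a set of detector labels, which may differ from what Iris actually remembers.

```python
import difflib

# Hypothetical command phrases per intent; the real app's phrases may differ.
INTENT_PHRASES = {
    "live_on": ["live mode"],
    "live_off": ["stop"],
    "describe": ["describe", "what is around me"],
}


def route_intent(transcript, cutoff=0.75):
    """Map a transcript to an intent, tolerating small transcription errors.

    Anything that does not resemble a known command is treated as a question.
    """
    text = transcript.lower().strip(" .!?")
    for intent, phrases in INTENT_PHRASES.items():
        if difflib.get_close_matches(text, phrases, n=1, cutoff=cutoff):
            return intent
    return "question"


class LiveLoop:
    """Perceive -> decide -> act -> remember, with stand-in tools."""

    def __init__(self, describe, speak):
        self.describe = describe   # stand-in for the vision model call
        self.speak = speak         # stand-in for text-to-speech
        self.announced = set()     # memory: labels already spoken about

    def step(self, frame, detected_labels):
        new = set(detected_labels) - self.announced   # decide: anything new?
        if not new:
            return None                               # nothing new: stay quiet
        message = self.describe(frame, sorted(new))   # act: look
        self.speak(message)                           # act: speak
        self.announced |= new                         # remember
        return message


# Usage with fake tools: the second frame adds nothing new, so nothing is spoken.
spoken = []
loop = LiveLoop(
    describe=lambda frame, labels: "New: " + ", ".join(labels),
    speak=spoken.append,
)
loop.step(None, ["cup", "chair"])
loop.step(None, ["cup", "chair"])
loop.step(None, ["cup", "person"])
print(route_intent("Life mode."), spoken)
```

Keeping the detector check cheap and calling the vision model only when the set of labels changes is what lets a loop like this stay quiet between changes.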

	## Safety
	Iris describes surroundings and reads text. Don't rely on it to get around or
	avoid obstacles: it can't judge distance reliably, so it isn't safe to navigate by.

	## Run locally
	```bash
	pip install -r requirements.txt
	IRIS_WARMUP=1 python app.py # http://localhost:7860 (warmup preloads the models)
	```

	## Credits
	Built by Marcus Ramalho for his father Marcos, with Claude Code (Claude Opus 4.8).
	The build is documented as an open [agent trace](https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace).
	STT: OpenAI Whisper (via faster-whisper) · Vision: Qwen3-VL · TTS: Piper · UI: Gradio (`gr.Server`).