---
title: Iris
emoji: 👁️
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: true
license: apache-2.0
short_description: Your father's eyes, by voice. Reads bills & money aloud.
tags:
- backyard-ai
- tiny-titan
- off-brand
- off-the-grid
- best-demo
- best-agent
- sharing-is-caring
- community-choice
---
# 👁️ Iris: your father's eyes, by voice
> Built for the **Build Small Hackathon** · **Backyard AI** track · for my father, who is blind.

**Try it live:** https://huggingface.co/spaces/build-small-hackathon/iris (open on a phone)

**How it was built (agent trace):** https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace

**Demo video:** _‹add link›_ · **Social post:** _‹add link›_

Iris is a voice-first assistant for blind and low-vision people. Open it on a phone,
point the camera, and it tells you what's around you, out loud, in your language.
**The whole screen is the button**, so there's nothing small to aim for. In live
mode it just listens and answers.
## What it does
- 👁️ **Describe**: tap anywhere. *"A table ahead with a mug on the right."*
- 🎤 **Ask, hands-free**: in live mode, just speak. *"What color is this shirt?"*, *"read this label"*, *"is anyone here?"*
- 💵 **Read money and bills**: *"how much do I have?"* counts the banknotes. Point at an electricity bill and it reads the **amount and due date**.
- 💊 **Read medicine**: reads the dose and instructions on a box, exactly as written.
- 📡 **Live mode**: double-tap, or say *"live mode"*. Iris describes the scene once, then speaks up only when something new comes into view.
## How to use it
- **Tap** anywhere → describe what's in front of you.
- **Hold** → ask a question (release to send).
- **Double-tap** → toggle live mode (hands-free listening + new-thing alerts). Say *"stop"* to turn it off.
- First run: **choose your language by voice** ("say your language"). Language & accessibility toggles sit in the top corners.
## Built for a blind user first
Accessibility shaped the whole interface, because the person it was made for asked for it:
- **The whole screen is one button.** Tap to describe, hold to ask, double-tap for live mode. Nothing small to find, no menus.
- **It talks first.** A spoken welcome on the first tap, and you **choose your language by voice**.
- **Hands-free.** In live mode it listens continuously, so there are no buttons to press.
- **For low vision too:** large buttons with clear labels and real SVG icons, plus a **high-contrast and larger-text** mode.
- **Standards:** keyboard focus rings, ARIA live regions, haptic feedback, and it honours the system's reduced-motion and contrast settings.
## How it works: small models only, ≤ 32B total
| Stage | Model | Params |
|---|---|---|
| Speech-to-text | Whisper small (faster-whisper) | ~0.24B |
| Vision-language | **Qwen3-VL-2B-Instruct** | ~2B |
| Text-to-speech | Piper (pt_BR / en_US) | <1B |

**About 2.5B parameters in total**, and **every model is ≤ 4B** (Tiny Titan). The
voice-first frontend is custom, built on **`gr.Server`** (Off-Brand). Inference runs
in the Space on **ZeroGPU**, with no third-party model APIs.
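
The three stages in the table chain together: speech becomes a question, the question and a camera frame become an answer, and the answer becomes audio. The sketch below only illustrates that flow and is not the code in `app.py`. The `Pipeline` class, its field names, and the stand-in lambdas are hypothetical; in the real app each stage would presumably wrap Whisper, Qwen3-VL, and Piper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Hypothetical wiring of the three stages; each stage is an injected callable."""
    transcribe: Callable[[bytes], str]   # speech-to-text: audio -> question
    look: Callable[[bytes, str], str]    # vision-language: (image, question) -> answer
    speak: Callable[[str], bytes]        # text-to-speech: answer -> audio

    def answer(self, audio: bytes, image: bytes) -> bytes:
        question = self.transcribe(audio)
        reply = self.look(image, question)
        return self.speak(reply)


# Stand-in stages (assumptions) so the sketch runs without any model installed.
demo = Pipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    look=lambda image, prompt: f"Answer to '{prompt}' from a {len(image)}-byte image",
    speak=lambda text: text.encode("utf-8"),
)

print(demo.answer(b"what color is this shirt?", b"\x00\x01"))
```

Passing the stages in as callables keeps the flow testable without loading any model, which is why the sketch uses that shape.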
## Architecture: a small perception-action agent
Iris is more than one model call. It orchestrates four tools and runs a control loop:
- **Role prompts** define what each model does: read money and bills, describe a scene for a blind person, report only what is new.
- **Intent routing** turns a spoken phrase into an action (describe, answer a question, or toggle live mode) and tolerates transcription errors.
- **Tools it drives:** Whisper to hear, Qwen3-VL to see and read, Piper to speak, and an on-device detector (COCO-SSD) to watch for change.
- **A live loop** that perceives (camera + detector), decides whether something new is worth saying, acts (calls the vision model and speaks), and remembers what it already said so it doesn't repeat.
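
The intent routing and the live loop's memory described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Iris's actual code: the `COMMANDS` phrases, the `route_intent` and `LiveMemory` names, the 0.75 similarity threshold, and the use of `difflib` fuzzy matching are all hypothetical, and the real app may match intents and track novelty differently.

```python
from difflib import SequenceMatcher

# Hypothetical command phrases (assumption): the app's real vocabulary is not documented here.
COMMANDS = {
    "live_on": ["live mode", "modo ao vivo"],
    "live_off": ["stop", "parar"],
    "describe": ["describe", "descreva"],
}


def route_intent(transcript, threshold=0.75):
    """Map a possibly misheard transcript to an action; anything else is a question."""
    text = " ".join(transcript.lower().split()).strip(" .!?")
    best_action, best_score = "ask", 0.0
    for action, phrases in COMMANDS.items():
        for phrase in phrases:
            score = SequenceMatcher(None, text, phrase).ratio()
            if score > best_score:
                best_action, best_score = action, score
    return best_action if best_score >= threshold else "ask"


class LiveMemory:
    """Remembers which detector labels were already announced (assumed never to expire)."""

    def __init__(self):
        self.announced = set()

    def new_labels(self, detected):
        """Return labels not announced before, and mark them as announced."""
        fresh = sorted(set(detected) - self.announced)
        self.announced.update(fresh)
        return fresh


memory = LiveMemory()
for frame in (["cup", "person"], ["cup", "person"], ["person", "dog"]):
    fresh = memory.new_labels(frame)
    if fresh:  # only speak when something new came into view
        print("describe:", ", ".join(fresh))
```

With these assumed phrases, a misheard `"Life mode."` still routes to `live_on`, while a full question such as `"What color is this shirt?"` falls through to `ask`. The second, unchanged frame produces no announcement, which is the "don't repeat" behavior.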
## Safety
Iris describes surroundings and reads text. It is not a navigation aid: don't rely
on it to get around or to avoid obstacles, because it can't judge distance reliably.
## Run locally
```bash
pip install -r requirements.txt
IRIS_WARMUP=1 python app.py # http://localhost:7860 (warmup preloads the models)
```
## Credits
Built by **Marcus Ramalho** for his father Marcos, with **Claude Code (Claude Opus 4.8)**.
The build is documented as an open [agent trace](https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace).
STT: OpenAI Whisper (via faster-whisper) · Vision: **Qwen3-VL** · TTS: **Piper** · UI: Gradio (`gr.Server`).