---
title: Iris
emoji: 👁️
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: true
license: apache-2.0
short_description: Your father's eyes, by voice. Reads bills & money aloud.
tags:
  - backyard-ai
  - tiny-titan
  - off-brand
  - off-the-grid
  - best-demo
  - best-agent
  - sharing-is-caring
  - community-choice
---

# 👁️ Iris: your father's eyes, by voice

> Built for the **Build Small Hackathon** · **Backyard AI** track · for my father, who is blind.

**Try it live:** https://huggingface.co/spaces/build-small-hackathon/iris (open on a phone)

**How it was built (agent trace):** https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace

**Demo video:** _‹add link›_ · **Social post:** _‹add link›_

Iris is a voice-first assistant for blind and low-vision people. Open it on a phone, point the camera, and it tells you what's around you, out loud, in your language. **The whole screen is the button**, so there's nothing small to aim for. In live mode it just listens and answers.

## What it does

- 👁️ **Describe**: tap anywhere. *"A table ahead with a mug on the right."*
- 🎤 **Ask, hands-free**: in live mode, just speak. *"What color is this shirt?"*, *"read this label"*, *"is anyone here?"*
- 💵 **Read money and bills**: *"how much do I have?"* counts the banknotes. Point at an electricity bill and it reads the **amount and due date**.
- 💊 **Read medicine**: reads the dose and instructions on a box, exactly as written.
- 📡 **Live mode**: double-tap, or say *"live mode"*. Iris describes the scene once, then speaks up only when something new comes into view.

## How to use it

- **Tap** anywhere → describe what's in front of you.
- **Hold** → ask a question (release to send).
- **Double-tap** → toggle live mode (hands-free listening + new-thing alerts). Say *"stop"* to turn it off.
- First run: **choose your language by voice** ("say your language"). Language and accessibility toggles sit in the top corners.
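
Voice commands such as *"live mode"* and *"stop"* share the microphone with free-form questions, so the app has to tell them apart using a transcript that Whisper may have slightly misheard. Here is a minimal sketch of such a forgiving router. It assumes fuzzy matching with Python's standard `difflib`. The intent names, the English-only phrase list, the 0.8 threshold, the `slack` rule and the `route` helper are all hypothetical and not taken from Iris's code.

```python
import difflib
import re
import unicodedata

# Hypothetical trigger phrases (English only). The app's real vocabulary and
# its pt_BR phrases are not documented in this README.
INTENTS: dict[str, list[str]] = {
    "live_on": ["live mode"],
    "live_off": ["stop", "stop live mode"],
    "describe": ["describe", "what is around me", "what's in front of me"],
}


def normalize(text: str) -> str:
    """Lowercase, drop accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def window_score(words: list[str], target: list[str]) -> float:
    """Best fuzzy similarity between `target` and any same-length window of `words`."""
    goal = " ".join(target)
    n = len(target)
    if len(words) <= n:
        return difflib.SequenceMatcher(None, " ".join(words), goal).ratio()
    return max(
        difflib.SequenceMatcher(None, " ".join(words[i:i + n]), goal).ratio()
        for i in range(len(words) - n + 1)
    )


def route(transcript: str, threshold: float = 0.8, slack: int = 2) -> str:
    """Map a transcript to an intent; anything unmatched is treated as a question."""
    words = normalize(transcript).split()
    best_intent, best_key = "ask", (0.0, 0)
    for intent, phrases in INTENTS.items():
        for phrase in phrases:
            target = normalize(phrase).split()
            # A command must be (almost) the whole utterance, so "bus stop"
            # inside a longer question does not switch live mode off.
            if len(words) > len(target) + slack:
                continue
            # Ties go to the longer phrase ("stop live mode" beats "live mode").
            key = (window_score(words, target), len(target))
            if key > best_key:
                best_intent, best_key = intent, key
    return best_intent if best_key[0] >= threshold else "ask"


for said in ["Live mode, please!", "life mode", "Stop!", "stop live mode",
             "What's around me?", "is there a bus stop here?", "read this label"]:
    print(f"{said!r:30} -> {route(said)}")
```

With fuzzy matching, a misheard *"life mode"* still toggles live mode. Anything longer than a short command falls through to the vision model as a question. A real router would also need phrase lists for each supported language (the voices cover pt_BR and en_US).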

## Built for a blind user first

Accessibility shaped the whole interface, because the person it was made for asked for it:

- **The whole screen is one button.** Tap to describe, hold to ask, double-tap for live mode. Nothing small to find, no menus.
- **It talks first.** A spoken welcome on the first tap, and you **choose your language by voice**.
- **Hands-free.** In live mode it listens continuously, so there are no buttons to press.
- **For low vision too:** large buttons with clear labels and real SVG icons, plus a **high-contrast and larger-text** mode.
- **Standards:** keyboard focus rings, ARIA live regions, haptic feedback, and it honours the system's reduced-motion and contrast settings.

## How it works: small models only, ≤ 32B total

| Stage | Model | Params |
|---|---|---|
| Speech-to-text | Whisper small (faster-whisper) | ~0.24B |
| Vision-language | **Qwen3-VL-2B-Instruct** | ~2B |
| Text-to-speech | Piper (pt_BR / en_US) | <1B |

**About 2.5B parameters in total**, and **every model is ≤ 4B** (Tiny Titan). The voice-first frontend is custom, built on **`gr.Server`** (Off-Brand). Inference runs in the Space on **ZeroGPU**, with no third-party model APIs.

## Architecture: a small perception-action agent

Iris is more than a single model call: it orchestrates four tools and runs a control loop.

- **Role prompts** define what each model does: read money and bills, describe a scene for a blind person, report only what is new.
- **Intent routing** turns a spoken phrase into an action (describe, answer a question, or toggle live mode) and tolerates transcription errors.
- **Tools it drives:** Whisper to hear, Qwen3-VL to see and read, Piper to speak, and an on-device detector (COCO-SSD) to watch for change.
- **A live loop** perceives (camera + detector), decides whether something new is worth saying, acts (calls the vision model and speaks), and remembers what it already said so it doesn't repeat itself.

## Safety

Iris describes surroundings and reads text.
Don't use it to get around or avoid obstacles. It can't judge distance reliably and isn't safe to walk by.

## Run locally

```bash
pip install -r requirements.txt
IRIS_WARMUP=1 python app.py   # http://localhost:7860 (warmup preloads the models)
```

## Credits

Built by **Marcus Ramalho** for his father Marcos, with **Claude Code (Claude Opus 4.8)**. The build is documented as an open [agent trace](https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace).

STT: OpenAI Whisper (via faster-whisper) · Vision: **Qwen3-VL** · TTS: **Piper** · UI: Gradio (`gr.Server`).