| title: Iris | |
| emoji: 👁️ | |
| colorFrom: indigo | |
| colorTo: yellow | |
| sdk: gradio | |
| sdk_version: 6.17.3 | |
| app_file: app.py | |
| pinned: true | |
| license: apache-2.0 | |
| short_description: Your father's eyes, by voice. Reads bills & money aloud. | |
| tags: | |
| - backyard-ai | |
| - tiny-titan | |
| - off-brand | |
| - off-the-grid | |
| - best-demo | |
| - best-agent | |
| - sharing-is-caring | |
| - community-choice | |
| # 👁️ Iris: your father's eyes, by voice | |
| > Built for the **Build Small Hackathon** · **Backyard AI** track · for my father, who is blind. | |
| **Try it live:** https://huggingface.co/spaces/build-small-hackathon/iris (open on a phone) | |
| **How it was built (agent trace):** https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace | |
| **Demo video:** _‹add link›_ · **Social post:** _‹add link›_ | |
| Iris is a voice-first assistant for blind and low-vision people. Open it on a phone, | |
| point the camera, and it tells you what's around you, out loud, in your language. | |
| **The whole screen is the button**, so there's nothing small to aim for. In live | |
| mode it just listens and answers. | |
| ## What it does | |
| - 👁️ **Describe**: tap anywhere. *"A table ahead with a mug on the right."* | |
| - 🎤 **Ask, hands-free**: in live mode, just speak. *"What color is this shirt?"*, *"read this label"*, *"is anyone here?"* | |
| - 💵 **Read money and bills**: *"how much do I have?"* counts the banknotes. Point at an electricity bill and it reads the **amount and due date**. | |
| - 💊 **Read medicine**: reads the dose and instructions on a box, exactly as written. | |
| - 📡 **Live mode**: double-tap, or say *"live mode"*. Iris describes the scene once, then speaks up only when something new comes into view. | |
| ## How to use it | |
| - **Tap** anywhere → describe what's in front of you. | |
| - **Hold** → ask a question (release to send). | |
| - **Double-tap** → toggle live mode (hands-free listening + new-thing alerts). Say *"stop"* to turn it off. | |
| - First run: **choose your language by voice** ("say your language"). Language & accessibility toggles sit in the top corners. | |
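The three gestures above can be told apart from press timing alone. The following is a minimal Python sketch of that idea, not the app's actual frontend code: the `Press` type, the `classify` function, and both thresholds (`HOLD_MS`, `DOUBLE_TAP_MS`) are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real values used by Iris are not documented here.
HOLD_MS = 500        # a press at least this long counts as a hold
DOUBLE_TAP_MS = 300  # maximum gap between the two taps of a double-tap


@dataclass
class Press:
    down_ms: int  # timestamp when the finger touched the screen
    up_ms: int    # timestamp when the finger lifted


def is_hold(press: Press) -> bool:
    return press.up_ms - press.down_ms >= HOLD_MS


def classify(presses: list[Press]) -> list[str]:
    """Turn a time-ordered list of presses into a list of actions."""
    actions = []
    i = 0
    while i < len(presses):
        current = presses[i]
        if is_hold(current):
            actions.append("ask")  # hold -> ask a question
            i += 1
            continue
        following = presses[i + 1] if i + 1 < len(presses) else None
        if (
            following is not None
            and not is_hold(following)
            and following.down_ms - current.up_ms <= DOUBLE_TAP_MS
        ):
            actions.append("toggle_live")  # two quick taps -> toggle live mode
            i += 2
        else:
            actions.append("describe")  # single tap -> describe the scene
            i += 1
    return actions
```

One consequence of this design is that a live UI has to wait `DOUBLE_TAP_MS` after a short tap before it can be sure the tap is not the first half of a double-tap.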
| ## Built for a blind user first | |
| Accessibility shaped the whole interface, because the person it was made for asked for it: | |
| - **The whole screen is one button.** Tap to describe, hold to ask, double-tap for live mode. Nothing small to find, no menus. | |
| - **It talks first.** A spoken welcome on the first tap, and you **choose your language by voice**. | |
| - **Hands-free.** In live mode it listens continuously, so there are no buttons to press. | |
| - **For low vision too:** large buttons with clear labels and real SVG icons, plus a **high-contrast and larger-text** mode. | |
| - **Standards:** keyboard focus rings, ARIA live regions, haptic feedback, and it honours the system's reduced-motion and contrast settings. | |
| ## How it works: small models only, ≤ 32B total | |
| | Stage | Model | Params | | |
| |---|---|---| | |
| | Speech-to-text | Whisper small (faster-whisper) | ~0.24B | | |
| | Vision-language | **Qwen3-VL-2B-Instruct** | ~2B | | |
| | Text-to-speech | Piper (pt_BR / en_US) | <1B | | |
| **About 2.5B parameters in total**, and **every model is ≤ 4B** (Tiny Titan). The | |
| voice-first frontend is custom, built on **`gr.Server`** (Off-Brand). Inference runs | |
| in the Space on **ZeroGPU**, with no third-party model APIs. | |
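As a quick sanity check on the parameter budget, the figures from the table can be summed in a few lines of Python. This is a sketch rather than part of the app: the dictionary keys are illustrative labels, and because Piper is listed only as "<1B", 1.0 is used as a deliberately conservative upper bound. That is why the total here comes out higher than the "about 2.5B" quoted above.

```python
# Parameter counts in billions, taken from the table above.
# Piper is listed only as "<1B", so 1.0 is a conservative upper bound (assumption).
MODELS_B = {
    "whisper-small": 0.24,
    "qwen3-vl-2b-instruct": 2.0,
    "piper": 1.0,
}

TOTAL_LIMIT_B = 32.0     # "≤ 32B total"
PER_MODEL_LIMIT_B = 4.0  # "every model is ≤ 4B"


def check_budget(models, total_limit, per_model_limit):
    """Return (total, ok): ok is True when both limits are respected."""
    total = sum(models.values())
    oversized = [name for name, size in models.items() if size > per_model_limit]
    return total, total <= total_limit and not oversized


total_b, within_budget = check_budget(MODELS_B, TOTAL_LIMIT_B, PER_MODEL_LIMIT_B)
print(f"upper bound: {total_b:.2f}B, within budget: {within_budget}")
```

Even with the pessimistic figure for Piper, the stack stays far below both limits.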
| ## Architecture: a small perception-action agent | |
| Iris is more than one model call. It orchestrates four tools and runs a control loop: | |
| - **Role prompts** define what each model does: read money and bills, describe a scene for a blind person, report only what is new. | |
| - **Intent routing** turns a spoken phrase into an action (describe, answer a question, or toggle live mode) and is forgiving of transcription errors. | |
| - **Tools it drives:** Whisper to hear, Qwen3-VL to see and read, Piper to speak, and an on-device detector (COCO-SSD) to watch for change. | |
| - **A live loop** that perceives (camera + detector), decides whether something new is worth saying, acts (calls the vision model and speaks), and remembers what it already said so it doesn't repeat. | |
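The intent routing and live loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `app.py`: the command phrases, the `0.8` similarity threshold, and the names `route_intent`, `LiveMemory`, and `live_step` are all hypothetical, and fuzzy matching with the standard library's `difflib` stands in for whatever matching Iris actually uses. The models are passed in as plain callables so the sketch runs without any of them installed.

```python
from difflib import SequenceMatcher

# Hypothetical command phrases; the app's real vocabulary is not shown in this README.
COMMANDS = {
    "live_on": ["live mode", "start live mode"],
    "live_off": ["stop", "stop live mode"],
    "describe": ["describe", "what do you see", "what is in front of me"],
}


def route_intent(transcript: str, threshold: float = 0.8) -> str:
    """Map a transcript to an action, tolerating small transcription errors."""
    text = transcript.lower().strip(" .!?")
    best_action, best_score = "ask", 0.0
    for action, phrases in COMMANDS.items():
        for phrase in phrases:
            score = SequenceMatcher(None, text, phrase).ratio()
            if score > best_score:
                best_action, best_score = action, score
    # Anything not close to a known command is treated as a free-form question.
    return best_action if best_score >= threshold else "ask"


class LiveMemory:
    """Remembers which detector labels have already been announced."""

    def __init__(self) -> None:
        self.announced: set[str] = set()

    def new_labels(self, detected: list[str]) -> list[str]:
        # dict.fromkeys removes duplicates while keeping detection order.
        fresh = [label for label in dict.fromkeys(detected)
                 if label not in self.announced]
        self.announced.update(fresh)
        return fresh


def live_step(detected, memory, describe, speak) -> bool:
    """One pass of the loop: perceive, decide, act, remember."""
    fresh = memory.new_labels(detected)  # decide (and remember) what is new
    if not fresh:
        return False                     # nothing new: stay silent
    speak(describe(fresh))               # act: describe, then speak
    return True
```

In this sketch `detected` would come from the on-device detector, `describe` would wrap the vision model, and `speak` would wrap text-to-speech. The memory here never forgets, so an object that leaves and later returns is not announced again; a real loop would likely need some expiry policy.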
| ## Safety | |
| Iris describes surroundings and reads text. Don't rely on it for navigation or | |
| obstacle avoidance: it can't judge distance reliably, so it isn't safe to walk by. | |
| ## Run locally | |
| ```bash | |
| pip install -r requirements.txt | |
| IRIS_WARMUP=1 python app.py # http://localhost:7860 (warmup preloads the models) | |
| ``` | |
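How `app.py` reads `IRIS_WARMUP` is not shown in this README. One plausible way to handle such a flag, given purely as an assumption, is a small helper like the one below; the name `warmup_enabled` and the accepted values are hypothetical.

```python
import os


def warmup_enabled(env=None) -> bool:
    """Return True when IRIS_WARMUP is set to a truthy value.

    The accepted spellings are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("IRIS_WARMUP", "0").strip().lower() in {"1", "true", "yes"}


if warmup_enabled():
    print("warmup requested: models would be preloaded before serving")
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.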
| ## Credits | |
| Built by **Marcus Ramalho** for his father Marcos, with **Claude Code (Claude Opus 4.8)**. | |
| The build is documented as an open [agent trace](https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace). | |
| STT: OpenAI Whisper (via faster-whisper) · Vision: **Qwen3-VL** · TTS: **Piper** · UI: Gradio (`gr.Server`). | |