---
title: Iris
emoji: 👁️
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: true
license: apache-2.0
short_description: Your father's eyes, by voice. Reads bills & money aloud.
tags:
  - backyard-ai
  - tiny-titan
  - off-brand
  - off-the-grid
  - best-demo
  - best-agent
  - sharing-is-caring
  - community-choice
---

# 👁️ Iris: your father's eyes, by voice

> Built for the **Build Small Hackathon** · **Backyard AI** track · for my father, who is blind.

- **Try it live:** https://huggingface.co/spaces/build-small-hackathon/iris (open on a phone)
- **How it was built (agent trace):** https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace
- **Demo video:** _‹add link›_ · **Social post:** _‹add link›_

Iris is a voice-first assistant for blind and low-vision people. Open it on a phone,
point the camera, and it tells you what's around you, out loud, in your language.
**The whole screen is the button**, so there's nothing small to aim for. In live
mode it just listens and answers.

## What it does
- 👁️ **Describe**: tap anywhere. *"A table ahead with a mug on the right."*
- 🎤 **Ask, hands-free**: in live mode, just speak. *"What color is this shirt?"*, *"read this label"*, *"is anyone here?"*
- 💵 **Read money and bills**: *"how much do I have?"* counts the banknotes. Point at an electricity bill and it reads the **amount and due date**.
- 💊 **Read medicine**: reads the dose and instructions on a box, exactly as written.
- 📡 **Live mode**: double-tap, or say *"live mode"*. Iris describes the scene once, then speaks up only when something new comes into view.
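
The "speaks up only when something new comes into view" behaviour can be sketched as a small memory of labels that were already announced. This is only an illustration: the `NoveltyFilter` class and its method names are hypothetical, and the README does not show how Iris actually tracks what it has said.

```python
from typing import Iterable, List, Set


class NoveltyFilter:
    """Remembers labels already announced and returns only unseen ones.

    Hypothetical helper for illustration; not taken from the Iris source.
    """

    def __init__(self) -> None:
        self._announced: Set[str] = set()

    def new_labels(self, detected: Iterable[str]) -> List[str]:
        """Return labels not announced before, in detection order."""
        fresh: List[str] = []
        for label in detected:
            if label not in self._announced:
                self._announced.add(label)
                fresh.append(label)
        return fresh

    def reset(self) -> None:
        """Forget everything, e.g. when live mode is switched off."""
        self._announced.clear()


seen = NoveltyFilter()
first = seen.new_labels(["table", "cup"])    # ["table", "cup"]: first look, describe everything
second = seen.new_labels(["table", "cup"])   # []: nothing new, stay silent
third = seen.new_labels(["cup", "person"])   # ["person"]: only the newcomer is announced
```

Only a non-empty result would be worth a call to the vision model and a spoken update, which keeps a live session quiet while the scene is unchanged.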

## How to use it
- **Tap** anywhere → describe what's in front of you.
- **Hold** → ask a question (release to send).
- **Double-tap** → toggle live mode (hands-free listening + new-thing alerts). Say *"stop"* to turn it off.
- First run: **choose your language by voice** ("say your language"). Language & accessibility toggles sit in the top corners.
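
One way to tell the three gestures apart is by press duration and the gap between presses. The sketch below is written in Python for readability and makes several assumptions: the `Press` type, the `classify` function, and both thresholds are hypothetical, and the real frontend may use different timings or logic.

```python
from dataclasses import dataclass
from typing import List

HOLD_MS = 500        # assumed: a press at least this long counts as a hold
DOUBLE_TAP_MS = 300  # assumed: maximum gap between the two taps of a double-tap


@dataclass
class Press:
    down_ms: int  # time the finger touched the screen
    up_ms: int    # time the finger lifted


def classify(presses: List[Press]) -> List[str]:
    """Map a recorded sequence of presses to describe / ask / toggle_live."""
    actions: List[str] = []
    i = 0
    while i < len(presses):
        current = presses[i]
        if current.up_ms - current.down_ms >= HOLD_MS:
            actions.append("ask")  # hold: record a question until release
            i += 1
            continue
        nxt = presses[i + 1] if i + 1 < len(presses) else None
        if (
            nxt is not None
            and nxt.up_ms - nxt.down_ms < HOLD_MS
            and nxt.down_ms - current.up_ms <= DOUBLE_TAP_MS
        ):
            actions.append("toggle_live")  # two quick taps
            i += 2
        else:
            actions.append("describe")  # single tap
            i += 1
    return actions
```

This version classifies presses after the fact. A live interface would have to wait up to `DOUBLE_TAP_MS` after a tap before treating it as a single tap, which is the usual cost of supporting double-tap on the same surface.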

## Built for a blind user first
Accessibility shaped the whole interface, because the person it was made for asked for it:
- **The whole screen is one button.** Tap to describe, hold to ask, double-tap for live mode. Nothing small to find, no menus.
- **It talks first.** A spoken welcome on the first tap, and you **choose your language by voice**.
- **Hands-free.** In live mode it listens continuously, so there are no buttons to press.
- **For low vision too:** large buttons with clear labels and real SVG icons, plus a **high-contrast and larger-text** mode.
- **Standards:** keyboard focus rings, ARIA live regions, haptic feedback, and it honours the system's reduced-motion and contrast settings.

## How it works: small models only, ≤ 32B total
| Stage | Model | Params |
|---|---|---|
| Speech-to-text | Whisper small (faster-whisper) | ~0.24B |
| Vision-language | **Qwen3-VL-2B-Instruct** | ~2B |
| Text-to-speech | Piper (pt_BR / en_US) | <1B |

**About 2.5B parameters in total**, well within the ≤ 32B budget, and **every model
is ≤ 4B** (Tiny Titan). The voice-first frontend is custom, built on **`gr.Server`**
(Off-Brand). Inference runs in the Space on **ZeroGPU**, with no third-party model APIs.

## Architecture: a small perception-action agent
Iris is more than one model call. It orchestrates four tools and runs a control loop:
- **Role prompts** define what each model does: read money and bills, describe a scene for a blind person, report only what is new.
- **Intent routing** turns a spoken phrase into an action: describe, answer a question, or toggle live mode (forgiving of transcription errors).
- **Tools it drives:** Whisper to hear, Qwen3-VL to see and read, Piper to speak, and an on-device detector (COCO-SSD) to watch for change.
- **A live loop** that perceives (camera + detector), decides whether something new is worth saying, acts (calls the vision model and speaks), and remembers what it already said so it doesn't repeat.
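
The intent-routing step can be sketched with fuzzy string matching, so that a slightly wrong transcription still lands on the right action. The trigger phrases, the intent names, and the 0.8 cutoff below are assumptions for illustration; the README does not list the phrases or the matching method Iris actually uses.

```python
import difflib
import unicodedata

# Hypothetical trigger phrases; anything that matches none of them
# is treated as a free-form question for the vision model.
TRIGGERS = {
    "toggle_live": ["live mode"],
    "stop_live": ["stop"],
    "describe": ["describe", "what do you see"],
}


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(text.split())


def route(transcript: str, cutoff: float = 0.8) -> str:
    """Return the first intent whose phrase is similar enough to the transcript."""
    spoken = normalize(transcript)
    for intent, phrases in TRIGGERS.items():
        for phrase in phrases:
            similarity = difflib.SequenceMatcher(None, spoken, phrase).ratio()
            if similarity >= cutoff:
                return intent
    return "question"


print(route("Live mod."))                  # toggle_live (one letter dropped by the transcriber)
print(route("Stop!"))                      # stop_live
print(route("What color is this shirt?"))  # question
```

Whole-phrase similarity keeps long questions from accidentally triggering a short command, at the cost of missing commands buried inside a longer sentence.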

## Safety
Iris describes surroundings and reads text. Do not rely on it for navigation or for
avoiding obstacles: it cannot judge distance reliably, so it is not safe to walk by.

## Run locally
```bash
pip install -r requirements.txt
IRIS_WARMUP=1 python app.py     # http://localhost:7860  (warmup preloads the models)
```
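
For readers wondering how a flag like `IRIS_WARMUP` might gate model preloading, here is a minimal sketch. The accepted truthy values, the `warmup_enabled` and `maybe_warm_up` helpers, and the `load_models` callable are all assumptions; how `app.py` actually reads the variable is not shown in this README.

```python
import os
from typing import Callable, Mapping


def warmup_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True when IRIS_WARMUP is set to a truthy value such as "1"."""
    return env.get("IRIS_WARMUP", "").strip().lower() in {"1", "true", "yes"}


def maybe_warm_up(
    load_models: Callable[[], None],
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Call the (hypothetical) model loader at startup only if warmup is requested."""
    if warmup_enabled(env):
        load_models()
        return True
    return False
```

Preloading moves the slow model load to startup, so the first tap is answered without that delay.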

## Credits
Built by **Marcus Ramalho** for his father Marcos, with **Claude Code (Claude Opus 4.8)**.
The build is documented as an open [agent trace](https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace).
STT: OpenAI Whisper (via faster-whisper) · Vision: **Qwen3-VL** · TTS: **Piper** · UI: Gradio (`gr.Server`).