metadata
title: Iris
emoji: 👁️
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: true
license: apache-2.0
short_description: Your father's eyes, by voice. Reads bills & money aloud.
tags:
  - backyard-ai
  - tiny-titan
  - off-brand
  - off-the-grid
  - best-demo
  - best-agent
  - sharing-is-caring
  - community-choice

👁️ Iris: your father's eyes, by voice

Built for the Build Small Hackathon · Backyard AI track · for my father, who is blind.

  • Try it live: https://huggingface.co/spaces/build-small-hackathon/iris (open on a phone)
  • How it was built (agent trace): https://huggingface.co/datasets/build-small-hackathon/iris-agent-trace
  • Demo video: ‹add link› · Social post: ‹add link›

Iris is a voice-first assistant for blind and low-vision people. Open it on a phone, point the camera, and it tells you what's around you, out loud, in your language. The whole screen is the button, so there's nothing small to aim for. In live mode it just listens and answers.

What it does

  • 👁️ Describe: tap anywhere. "A table ahead with a mug on the right."
  • 🎤 Ask, hands-free: in live mode, just speak. "What color is this shirt?", "read this label", "is anyone here?"
  • 💵 Read money and bills: "how much do I have?" counts the banknotes. Point at an electricity bill and it reads the amount and due date.
  • 💊 Read medicine: reads the dose and instructions on a box, exactly as written.
  • 📡 Live mode: double-tap, or say "live mode". Iris describes the scene once, then speaks up only when something new comes into view.

How to use it

  • Tap anywhere → describe what's in front of you.
  • Hold → ask a question (release to send).
  • Double-tap → toggle live mode (hands-free listening + new-thing alerts). Say "stop" to turn it off.
  • First run: choose your language by voice ("say your language"). Language & accessibility toggles sit in the top corners.

Built for a blind user first

Accessibility shaped the whole interface, because the person it was made for asked for it:

  • The whole screen is one button. Tap to describe, hold to ask, double-tap for live mode. Nothing small to find, no menus.
  • It talks first. A spoken welcome on the first tap, and you choose your language by voice.
  • Hands-free. In live mode it listens continuously, so there are no buttons to press.
  • For low vision too: large buttons with clear labels and real SVG icons, plus a high-contrast and larger-text mode.
  • Standards: keyboard focus rings, ARIA live regions, haptic feedback, and respect for the system's reduced-motion and contrast settings.

How it works: small models only, ≤ 32B total

| Stage | Model | Params |
| --- | --- | --- |
| Speech-to-text | Whisper small (faster-whisper) | ~0.24B |
| Vision-language | Qwen3-VL-2B-Instruct | ~2B |
| Text-to-speech | Piper (pt_BR / en_US) | <1B |

About 2.5B parameters in total, and every model is ≤ 4B (Tiny Titan). The voice-first frontend is custom, built on gr.Server (Off-Brand). Inference runs in the Space on ZeroGPU, with no third-party model APIs.
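The size limits can be checked with a little arithmetic. The sketch below is illustrative only: `MODELS_B` and `check_budget` are hypothetical names, and because Piper is listed only as "<1B", it is entered at a loose 1.0B ceiling. The result is therefore a worst-case total, not the ~2.5B estimate quoted above.

```python
# Parameter counts in billions, copied from the model list above.
# Piper is listed only as "<1B", so 1.0 is a deliberately loose ceiling.
MODELS_B = {
    "whisper-small": 0.24,
    "qwen3-vl-2b-instruct": 2.0,
    "piper": 1.0,
}

PER_MODEL_CAP_B = 4.0   # Tiny Titan: every model <= 4B
TOTAL_CAP_B = 32.0      # overall limit named in the section heading


def check_budget(models: dict[str, float]) -> float:
    """Return the total size in billions, raising if either cap is exceeded."""
    for name, size in models.items():
        if size > PER_MODEL_CAP_B:
            raise ValueError(f"{name} is {size}B, over the {PER_MODEL_CAP_B}B per-model cap")
    total = sum(models.values())
    if total > TOTAL_CAP_B:
        raise ValueError(f"total {total}B is over the {TOTAL_CAP_B}B cap")
    return total


if __name__ == "__main__":
    print(f"worst-case total: {check_budget(MODELS_B):.2f}B")
```

Even with the pessimistic figure for Piper, the stack stays far below both limits.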

Architecture: a small perception-action agent

Iris is more than one model call. It orchestrates four tools and runs a control loop:

  • Role prompts define what each model does: read money and bills, describe a scene for a blind person, report only what is new.
  • Intent routing turns a spoken phrase into an action (describe, answer a question, or toggle live mode) and is forgiving of transcription errors.
  • Tools it drives: Whisper to hear, Qwen3-VL to see and read, Piper to speak, and an on-device detector (COCO-SSD) to watch for change.
  • A live loop perceives (camera + detector), decides whether something new is worth saying, acts (calls the vision model and speaks), and remembers what it already said so it doesn't repeat itself.
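The intent routing and live loop described above could look roughly like the following. This is a minimal sketch, not the code in app.py: the command vocabulary, the `route_intent` and `LiveLoop` names, and the use of `difflib` for fuzzy matching are all assumptions made for illustration. The vision and speech tools are passed in as plain callables, and detector output is assumed to be a set of label strings.

```python
import difflib

# Hypothetical command phrases; the app's real vocabulary is not shown here.
COMMANDS = {
    "live mode": "toggle_live",
    "stop": "stop_live",
    "describe": "describe",
}


def route_intent(transcript: str, cutoff: float = 0.75) -> str:
    """Map a transcript to an action, tolerating small transcription errors."""
    text = transcript.lower().strip(" .!?")
    match = difflib.get_close_matches(text, list(COMMANDS), n=1, cutoff=cutoff)
    if match:
        return COMMANDS[match[0]]
    # Anything that is not close to a known command is treated as a question.
    return "answer_question"


class LiveLoop:
    """Perceive -> decide -> act -> remember, with injected tool callables."""

    def __init__(self, describe, speak):
        self.describe = describe            # stand-in for the vision-model call
        self.speak = speak                  # stand-in for text-to-speech
        self.announced: set[str] = set()    # labels that were already spoken

    def step(self, frame, detected_labels: set[str]) -> bool:
        """Process one frame; return True if something was spoken."""
        new = detected_labels - self.announced   # decide: is anything new?
        if not new:
            return False                          # nothing new, stay silent
        self.speak(self.describe(frame, sorted(new)))   # act
        self.announced |= new                     # remember
        return True
```

With this sketch, `route_intent("Life mode.")` still resolves to `toggle_live`, while a full question falls through to `answer_question`. The memory here never forgets a label; a real loop would probably expire entries once an object leaves view, so that it is announced again when it returns.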

Safety

Iris describes surroundings and reads text. Don't rely on it to get around or avoid obstacles: it can't judge distance reliably and isn't safe to navigate by.

Run locally

pip install -r requirements.txt
IRIS_WARMUP=1 python app.py     # http://localhost:7860  (warmup preloads the models)
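
One way an environment flag like IRIS_WARMUP might gate model preloading is sketched below. Only the variable name comes from the command above. The helper names, the accepted truthy values, and the loader mapping are assumptions for illustration, and app.py may handle this differently.

```python
import os


def warmup_enabled(env=os.environ) -> bool:
    """True when IRIS_WARMUP is set to a truthy value such as "1"."""
    # Accepting "true" and "yes" as well as "1" is an assumption of this sketch.
    return env.get("IRIS_WARMUP", "").strip().lower() in {"1", "true", "yes"}


def maybe_warmup(loaders, env=os.environ) -> list[str]:
    """Call each loader once if warmup is enabled; return the names loaded."""
    if not warmup_enabled(env):
        return []
    loaded = []
    for name, load in loaders.items():
        load()              # e.g. build the model so the first request is fast
        loaded.append(name)
    return loaded
```

Without the flag, models would load lazily on first use, which trades a faster startup for a slower first response.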

Credits

Built by Marcus Ramalho for his father Marcos, with Claude Code (Claude Opus 4.8). The build is documented as an open agent trace.

STT: OpenAI Whisper (via faster-whisper) · Vision: Qwen3-VL · TTS: Piper · UI: Gradio (gr.Server).