Spaces: build-small-hackathon/iris
Marcus Ramalho, Claude Opus 4.8 committed on Jun 10

Commit 26dae50 · 0 parent(s)

Iris: voice-first assistant for the blind (gr.Server + Qwen2.5-VL + Whisper + Piper)

- gr.Server backend: /describe API (image + optional voice -> PT description + speech)
- voice-first frontend: live camera, tap-to-describe / hold-to-ask, high-contrast, EN/PT
- core/ pipeline vendored (STT Whisper, VLM Qwen2.5-VL-3B, TTS Piper pt_BR)
- runs in-Space on ZeroGPU, <=32B total, no third-party model APIs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Files changed (12)

.gitignore +6 -0
README.md +43 -0
app.py +68 -0
core/__init__.py +0 -0
core/gpu.py +33 -0
core/stt.py +34 -0
core/tts.py +41 -0
core/vlm.py +103 -0
frontend/app.js +139 -0
frontend/index.html +29 -0
frontend/style.css +108 -0
requirements.txt +13 -0

.gitignore ADDED

	@@ -0,0 +1,6 @@

+__pycache__/
+*.pyc
+.venv/
+.DS_Store
+*.wav
+/tmp/

README.md ADDED

	@@ -0,0 +1,43 @@

+---
+title: Iris
+emoji: 👁️
+colorFrom: indigo
+colorTo: orange
+sdk: gradio
+sdk_version: 6.17.3
+app_file: app.py
+pinned: false
+license: apache-2.0
+short_description: Your father's eyes, by voice. Describe & ask about the world.
+---
+# Iris — your father's eyes, by voice
+Iris is a voice-first assistant for blind and low-vision users. Open it on a
+phone, point the camera, **tap to describe** what's in front of you or **hold to
+ask a question** ("what color is this?", "read this label", "is anyone here?").
+It answers out loud, in Portuguese or English.
+Built for the **Build Small Hackathon** (Backyard AI track) — for my father.
+## How it works (all small models, ≤ 32B total)
+- **Speech-to-text:** Whisper small (faster-whisper) — understands the question.
+- **Vision-language:** Qwen2.5-VL-3B-Instruct — describes the scene / reads text, in PT.
+- **Text-to-speech:** Piper (pt_BR) — speaks the answer.
+Custom voice-first frontend via `gr.Server` (tap-anywhere, high-contrast,
+camera auto-on). Inference runs in-Space on ZeroGPU — **no third-party model APIs**.
+### Parameter budget
+Whisper-small (~0.24B) + Qwen2.5-VL-3B (~3B) ≈ **~3.3B total** — well under 32B.
+## Not a mobility aid
+Iris describes the environment and reads text. It is **not** an obstacle-avoidance
+or navigation device and should not be relied on for physical safety.
+## Run locally
+```bash
+pip install -r requirements.txt
+python app.py            # http://localhost:7860
+# IRIS_WARMUP=1 preloads the models
+```
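
The README's parameter budget is plain addition, so it is easy to re-check. A minimal sketch, using only the approximate sizes quoted in the README; the names `BUDGET_B` and `LIMIT_B` are illustrative, and Piper is left out because the README does not state its size:

```python
# Approximate parameter counts in billions, as quoted in the README.
BUDGET_B = {
    "whisper-small": 0.24,
    "qwen2.5-vl-3b": 3.0,
}
LIMIT_B = 32.0  # cap stated in the README ("<= 32B total")

total_b = sum(BUDGET_B.values())
headroom_b = LIMIT_B - total_b

# Fail loudly if a model swap ever pushes the stack over the cap.
assert total_b <= LIMIT_B, "model budget exceeds the cap"
print(f"total ~{total_b:.2f}B, headroom ~{headroom_b:.2f}B")
```

Switching to the larger backend named in `core/vlm.py` would only require updating the dictionary entry to keep the check meaningful.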

app.py ADDED

	@@ -0,0 +1,68 @@

+"""Iris: your father's eyes, by voice (HF Space, ZeroGPU).
+gr.Server: serves a custom voice-first frontend (frontend/index.html) and exposes the
+`describe` API (image + optional voice question -> answer in PT + audio).
+Pipeline: Whisper (STT) -> Qwen2.5-VL (description/VQA in PT) -> Piper (TTS).
+"""
+import os
+import tempfile
+from pathlib import Path
+os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
+from fastapi.responses import HTMLResponse  # noqa: E402
+from fastapi.staticfiles import StaticFiles  # noqa: E402
+from gradio import Server  # noqa: E402
+from gradio.data_classes import FileData  # noqa: E402
+from core import stt, tts, vlm  # noqa: E402
+HERE = Path(__file__).parent
+app = Server(title="Iris")
+def _path(f):
+    """gr.Server delivers FileData as a dict; accept either a dict or an object."""
+    if f is None:
+        return None
+    return f["path"] if isinstance(f, dict) else f.path
+@app.api(name="describe")
+def describe(image: FileData, audio: FileData | None = None) -> dict:
+    """Takes a camera frame (+ optional voice question) and returns the
+    description in PT + the spoken audio."""
+    apath = _path(audio)
+    question = stt.transcribe(apath) if apath else ""
+    answer = vlm.describe(_path(image), question)
+    if not answer.strip():
+        answer = "Não consegui descrever isso. Tente de novo."
+    wav = tts.synthesize(answer)
+    print(f"[describe] q={question!r} a={answer!r}", flush=True)
+    return {
+        "question": question,
+        "answer": answer,
+        "audio": FileData(path=wav) if wav else None,
+    }
+app.mount("/static", StaticFiles(directory=str(HERE / "frontend")), name="static")
+@app.get("/")
+def index():
+    return HTMLResponse((HERE / "frontend" / "index.html").read_text(encoding="utf-8"))
+if __name__ == "__main__":
+    if os.environ.get("IRIS_WARMUP") == "1":
+        print("Warmup...", flush=True)
+        try:
+            vlm.describe(str(HERE.parent / "viability" / "samples" / "indoor.jpg"), "teste")
+            stt.transcribe(tts.synthesize("teste"))
+            print("Warmup OK", flush=True)
+        except Exception as e:
+            print("Warmup failed:", e, flush=True)
+    port = int(os.environ.get("GRADIO_SERVER_PORT", os.environ.get("PORT", 7860)))
+    app.launch(server_name="0.0.0.0", server_port=port, show_error=True,
+               allowed_paths=[tempfile.gettempdir()])
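
The `describe` handler above chains three stages: STT, then VLM, then TTS, with a fixed fallback sentence when the VLM returns nothing. A minimal sketch of that flow with the stages injected as plain callables, so it can be exercised without loading any model; `run_pipeline` and the lambda stand-ins are hypothetical names, not part of the commit:

```python
from typing import Callable, Optional

# Same fallback sentence as app.py (spoken by the pt_BR voice).
FALLBACK_ANSWER = "Não consegui descrever isso. Tente de novo."


def run_pipeline(
    image_path: str,
    audio_path: Optional[str],
    transcribe: Callable[[str], str],
    describe: Callable[[str, str], str],
    synthesize: Callable[[str], Optional[str]],
) -> dict:
    """Chain STT -> VLM -> TTS in the same order as the handler in app.py."""
    # No audio means a plain "describe" tap, so the question stays empty.
    question = transcribe(audio_path) if audio_path else ""
    answer = describe(image_path, question)
    if not answer.strip():
        answer = FALLBACK_ANSWER
    return {"question": question, "answer": answer, "audio": synthesize(answer)}


# Stand-in stages (hypothetical): the VLM stub returns only whitespace,
# which should trigger the fallback sentence.
result = run_pipeline(
    "frame.jpg",
    None,
    transcribe=lambda path: "o que é isso?",
    describe=lambda image, question: "   ",
    synthesize=lambda text: "/tmp/answer.wav",
)
print(result)
```

Injecting the stages this way is only one possible design; the commit calls `stt`, `vlm`, and `tts` directly as modules.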

core/__init__.py ADDED

File without changes

core/gpu.py ADDED

	@@ -0,0 +1,33 @@

+"""ZeroGPU helpers.
+`gpu` is a decorator: on HF Spaces (ZeroGPU) it allocates a GPU per call;
+locally (without the `spaces` package) it becomes a transparent no-op. That way the
+same code runs on the local 3060s and in the Space.
+"""
+import functools
+try:
+    import spaces  # only available on ZeroGPU Spaces
+    _HAS_SPACES = True
+except Exception:
+    _HAS_SPACES = False
+def gpu(duration: int = 60):
+    """Allocate a GPU for `duration` seconds per call (ZeroGPU). No-op locally."""
+    def decorate(fn):
+        if _HAS_SPACES:
+            return spaces.GPU(duration=duration)(fn)
+        @functools.wraps(fn)
+        def wrapper(*args, **kwargs):
+            return fn(*args, **kwargs)
+        return wrapper
+    return decorate
+def device() -> str:
+    import torch
+    return "cuda" if torch.cuda.is_available() else "cpu"
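
The optional-decorator pattern in `core/gpu.py` can be checked without the `spaces` package by injecting the backend. A minimal sketch, assuming a hypothetical `make_gpu_decorator` factory and a `fake_backend` that stands in for `spaces.GPU` and merely records the requested duration:

```python
import functools


def make_gpu_decorator(backend=None):
    """Build a `gpu(duration)` decorator factory.

    `backend` stands in for `spaces.GPU` (hypothetical injection point);
    when it is None the decorator is a transparent pass-through.
    """
    def gpu(duration: int = 60):
        def decorate(fn):
            if backend is not None:
                return backend(duration=duration)(fn)

            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                return fn(*args, **kwargs)
            return wrapper
        return decorate
    return gpu


calls = []


def fake_backend(duration):
    """Records the requested duration instead of allocating a GPU."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            calls.append(duration)
            return fn(*args, **kwargs)
        return wrapper
    return decorate


local_gpu = make_gpu_decorator()              # no backend: no-op path
space_gpu = make_gpu_decorator(fake_backend)  # simulated ZeroGPU path


@local_gpu(duration=30)
def add_local(a, b):
    return a + b


@space_gpu(duration=30)
def add_space(a, b):
    return a + b
```

Calling `add_local(1, 2)` leaves `calls` empty, while `add_space(2, 3)` appends `30` to it, which mirrors the two branches of the original `gpu` helper.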

core/stt.py ADDED

	@@ -0,0 +1,34 @@

+"""Speech-to-text (Whisper via faster-whisper). Portuguese by default.
+Does not use torch: faster-whisper runs on CTranslate2. The device comes from
+IRIS_STT_DEVICE (default "cpu"); "cuda" falls back to CPU if the model fails to load.
+"""
+import os
+_model = None
+def _load():
+    global _model
+    if _model is None:
+        from faster_whisper import WhisperModel
+        size = os.environ.get("IRIS_STT_MODEL", "small")
+        # CTranslate2 needs the CUDA 12 libs (cublas/cudnn). If they are missing, use CPU.
+        device = os.environ.get("IRIS_STT_DEVICE", "cpu")
+        if device == "cuda":
+            try:
+                _model = WhisperModel(size, device="cuda", compute_type="float16")
+            except Exception:
+                device = "cpu"
+        if device != "cuda":
+            _model = WhisperModel(size, device="cpu", compute_type="int8")
+    return _model
+def transcribe(audio_path: str, language: str = "pt") -> str:
+    if not audio_path or not os.path.exists(audio_path):
+        print(f"[stt] no audio: {audio_path!r}", flush=True)
+        return ""
+    segments, info = _load().transcribe(audio_path, language=language)
+    text = " ".join(s.text for s in segments).strip()
+    print(f"[stt] {audio_path} ({getattr(info, 'duration', '?')}s) -> {text!r}", flush=True)
+    return text
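
The loader above tries CUDA with `float16` and drops to CPU with `int8` if that fails. The same idea can be written as an ordered list of attempts. A minimal sketch, where `load_with_fallback` and `fake_factory` are hypothetical helpers and the factory simulates a machine without CUDA libraries:

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    factory: Callable[[str, str], T],
    attempts: Sequence[Tuple[str, str]],
) -> Tuple[T, str]:
    """Try (device, compute_type) pairs in order; return the first that loads."""
    last_error = None
    for device, compute_type in attempts:
        try:
            return factory(device, compute_type), device
        except Exception as exc:  # e.g. missing CUDA libraries
            last_error = exc
    raise RuntimeError("no device could load the model") from last_error


def fake_factory(device: str, compute_type: str) -> str:
    """Hypothetical loader that simulates a machine without CUDA."""
    if device == "cuda":
        raise RuntimeError("CUDA libraries not found")
    return f"model[{device}/{compute_type}]"


model, used = load_with_fallback(
    fake_factory, [("cuda", "float16"), ("cpu", "int8")]
)
print(model, used)
```

In the real module the factory would be a call such as `WhisperModel(size, device=..., compute_type=...)`; returning the device that actually loaded makes a silent CPU fallback visible in logs.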

core/tts.py ADDED

	@@ -0,0 +1,41 @@

+"""Text-to-speech (Piper, pt_BR). Returns the path to a .wav file.
+Piper runs on onnxruntime (no torch). The voice is downloaded from the rhasspy/piper-voices repo.
+"""
+import os
+import tempfile
+import wave
+_voice = None
+_REPO = "rhasspy/piper-voices"
+# default pt_BR voice; override via IRIS_TTS_VOICE
+_VOICE = os.environ.get("IRIS_TTS_VOICE", "pt/pt_BR/faber/medium/pt_BR-faber-medium")
+def _load():
+    global _voice
+    if _voice is None:
+        from huggingface_hub import hf_hub_download
+        from piper import PiperVoice
+        onnx = hf_hub_download(_REPO, f"{_VOICE}.onnx")
+        conf = hf_hub_download(_REPO, f"{_VOICE}.onnx.json")
+        _voice = PiperVoice.load(onnx, config_path=conf)
+    return _voice
+def synthesize(text: str) -> str | None:
+    if not text or not text.strip():
+        return None
+    voice = _load()
+    chunks = list(voice.synthesize(text))
+    if not chunks:
+        print(f"[tts] no audio for text: {text!r}", flush=True)
+        return None
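+    # mktemp() is deprecated and race-prone; create the file atomically instead.
+    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
+        path = tmp.name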
+    with wave.open(path, "wb") as wf:
+        wf.setnchannels(chunks[0].sample_channels)
+        wf.setsampwidth(chunks[0].sample_width)
+        wf.setframerate(chunks[0].sample_rate)
+        for ch in chunks:
+            wf.writeframes(ch.audio_int16_bytes)
+    return path
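
`synthesize` streams audio chunks into a WAV container using the standard `wave` module. A minimal, stdlib-only sketch of that step with placeholder silence instead of Piper output; `write_wav` is a hypothetical helper, and the 22050 Hz mono 16-bit defaults are assumptions for illustration (the real code reads these values from the first chunk):

```python
import os
import tempfile
import wave


def write_wav(chunks, sample_rate=22050, channels=1, sample_width=2):
    """Write raw PCM byte chunks to a new temporary .wav and return its path."""
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        path = tmp.name
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        for chunk in chunks:
            wf.writeframes(chunk)
    return path


# Two chunks of 16-bit silence, 100 frames each (placeholder audio).
path = write_wav([b"\x00\x00" * 100, b"\x00\x00" * 100])
with wave.open(path, "rb") as wf:
    frames, rate = wf.getnframes(), wf.getframerate()
os.remove(path)  # the caller owns cleanup because delete=False was used
print(frames, rate)
```

Because the file is created with `delete=False`, something has to remove it eventually; the commit leaves the generated `.wav` files in the temp directory, which the server exposes through `allowed_paths`.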

core/vlm.py ADDED

	@@ -0,0 +1,103 @@

+"""Vision-language: describes / answers questions about an image, in PT.
+Pluggable backend via IRIS_VLM_MODEL:
+  - Qwen/Qwen2.5-VL-3B-Instruct (default)
+  - openbmb/MiniCPM-V-4.6  -> OpenBMB track ($2.5k); called here via apply_chat_template() + generate()
+The VLM IS the generator of the text that gets spoken: no separate narrator LLM.
+"""
+import os
+_model = None
+_aux = None  # AutoProcessor for either model family
+MODEL_ID = os.environ.get("IRIS_VLM_MODEL", "Qwen/Qwen2.5-VL-3B-Instruct")
+SYSTEM_PT = (
+    "Você é os olhos de uma pessoa cega. RESPONDA OBRIGATORIAMENTE EM PORTUGUÊS "
+    "DO BRASIL, em no máximo duas frases curtas, dizendo só o que é relevante e "
+    "útil sobre a cena. Não comece com 'a imagem mostra'. Se houver texto "
+    "importante (rótulo, placa, remédio), leia-o exatamente como está."
+)
+DOWNSAMPLE = os.environ.get("IRIS_DOWNSAMPLE", "4x")  # 4x = fine detail (OCR); 16x = fast
+def _family() -> str:
+    return "minicpm" if "minicpm" in MODEL_ID.lower() else "qwen"
+def _load():
+    global _model, _aux
+    if _model is None:
+        import torch
+        if _family() == "minicpm":
+            from transformers import AutoModelForImageTextToText, AutoProcessor
+            _model = AutoModelForImageTextToText.from_pretrained(
+                MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16,
+                low_cpu_mem_usage=True, device_map="cuda:0",
+            ).eval()
+            _aux = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
+        else:
+            from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+            _model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+                MODEL_ID, torch_dtype=torch.float16, device_map="cuda:0",
+                low_cpu_mem_usage=True,
+            ).eval()
+            _aux = AutoProcessor.from_pretrained(MODEL_ID)
+    return _model, _aux
+def _to_pil(image):
+    from PIL import Image
+    if isinstance(image, str):
+        image = Image.open(image)
+    elif not isinstance(image, Image.Image):
+        image = Image.fromarray(image)  # numpy frame from the webcam
+    image = image.convert("RGB")
+    image.thumbnail((1024, 1024))  # fewer vision tokens -> faster; keeps OCR usable
+    return image
+from .gpu import gpu
+@gpu(duration=60)
+def describe(image, question: str = "") -> str:
+    image = _to_pil(image)
+    user = (question or "").strip() or "O que há à minha frente?"
+    model, aux = _load()
+    if _family() == "minicpm":
+        import torch
+        messages = [{"role": "user", "content": [
+            {"type": "image", "image": image},
+            {"type": "text", "text": f"{SYSTEM_PT}\n\nPergunta: {user}"},
+        ]}]
+        inputs = aux.apply_chat_template(
+            messages, tokenize=True, add_generation_prompt=True,
+            return_dict=True, return_tensors="pt",
+            downsample_mode=DOWNSAMPLE, max_slice_nums=36,
+        ).to(model.device)
+        with torch.no_grad():
+            generated = model.generate(**inputs, downsample_mode=DOWNSAMPLE,
+                                       max_new_tokens=96, do_sample=False)
+        trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated)]
+        return aux.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
+    # Qwen2.5-VL
+    import torch
+    from qwen_vl_utils import process_vision_info
+    messages = [
+        {"role": "system", "content": SYSTEM_PT},
+        {"role": "user", "content": [
+            {"type": "image", "image": image},
+            {"type": "text", "text": user},
+        ]},
+    ]
+    text = aux.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    image_inputs, video_inputs = process_vision_info(messages)
+    inputs = aux(text=[text], images=image_inputs, videos=video_inputs,
+                 padding=True, return_tensors="pt").to(model.device)
+    with torch.no_grad():
+        generated = model.generate(**inputs, max_new_tokens=96, do_sample=False)
+    trimmed = generated[:, inputs.input_ids.shape[1]:]
+    return aux.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
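
Both branches of `describe` slice the prompt off the generated sequence before decoding, because `generate` returns the prompt ids followed by the continuation. A minimal sketch of that trimming on plain Python lists, with made-up token ids and a hypothetical `trim_prompt` helper in place of tensors:

```python
def trim_prompt(input_ids, generated):
    """Drop each sequence's prompt prefix, keeping only newly generated ids."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated)]


# Toy token ids (made up): a 3-token prompt followed by 2 generated tokens.
input_ids = [[101, 7, 8]]
generated = [[101, 7, 8, 42, 43]]
new_tokens = trim_prompt(input_ids, generated)
print(new_tokens)
```

Without this step the decoded answer would start with the system prompt and the user question, and the TTS stage would read them aloud.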

frontend/app.js ADDED

	@@ -0,0 +1,139 @@

+import { Client, handle_file } from "https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js";
+const $ = (id) => document.getElementById(id);
+const cam = $("cam"), canvas = $("canvas"), statusEl = $("status"),
+      answerEl = $("answer"), hintEl = $("hint"), langBtn = $("lang"),
+      player = $("player"), stage = $("stage");
+// ---- i18n (EN default + PT) ----
+let lang = "en";
+const T = {
+  en: { idle: "Iris", listening: "Listening…", thinking: "Thinking…",
+        hint: "Tap to describe · hold to ask", camErr: "Camera blocked — allow access",
+        err: "Something went wrong" },
+  pt: { idle: "Iris", listening: "Ouvindo…", thinking: "Pensando…",
+        hint: "Toque para descrever · segure para perguntar", camErr: "Câmera bloqueada — permita o acesso",
+        err: "Algo deu errado" },
+};
+const t = (k) => T[lang][k];
+function setState(s, msg) {
+  document.body.dataset.state = s || "";
+  if (msg !== undefined) statusEl.textContent = msg;
+}
+langBtn.onclick = (e) => {
+  e.stopPropagation();
+  lang = lang === "en" ? "pt" : "en";
+  langBtn.textContent = lang.toUpperCase();
+  hintEl.textContent = t("hint");
+  document.documentElement.lang = lang;
+  if (!busy) statusEl.textContent = t("idle");
+};
+// ---- live camera (auto) ----
+async function startCamera() {
+  const tries = [
+    { video: { facingMode: { ideal: "environment" } }, audio: false },
+    { video: true, audio: false },
+  ];
+  for (const c of tries) {
+    try { cam.srcObject = await navigator.mediaDevices.getUserMedia(c); return; }
+    catch (e) { /* try the next one */ }
+  }
+  setState("", t("camErr"));
+}
+function grabFrame() {
+  const w = cam.videoWidth, h = cam.videoHeight;
+  if (!w || !h) return Promise.resolve(null);
+  canvas.width = w; canvas.height = h;
+  canvas.getContext("2d").drawImage(cam, 0, 0, w, h);
+  return new Promise((res) => canvas.toBlob(res, "image/jpeg", 0.85));
+}
+// ---- microphone (records while held) ----
+let micStream = null, recorder = null, chunks = [];
+async function startRec() {
+  try { if (!micStream) micStream = await navigator.mediaDevices.getUserMedia({ audio: true }); }
+  catch (e) { return false; }
+  chunks = [];
+  recorder = new MediaRecorder(micStream);
+  recorder.ondataavailable = (e) => { if (e.data.size) chunks.push(e.data); };
+  recorder.start();
+  return true;
+}
+function stopRec() {
+  return new Promise((res) => {
+    if (!recorder || recorder.state === "inactive") return res(null);
+    recorder.onstop = () => res(chunks.length ? new Blob(chunks, { type: recorder.mimeType || "audio/webm" }) : null);
+    recorder.stop();
+  });
+}
+// ---- backend ----
+let client = null;
+let busy = false;
+async function send(frame, audio) {
+  if (busy) return;
+  busy = true;
+  answerEl.textContent = "";
+  setState("thinking", t("thinking"));
+  try {
+    const payload = { image: handle_file(frame) };
+    if (audio) payload.audio = handle_file(audio);
+    const result = await client.predict("/describe", payload);
+    const out = Array.isArray(result.data) ? result.data[0] : result.data;
+    answerEl.textContent = (out && out.answer) || "";
+    setState("speaking", "");
+    const url = out && out.audio && (out.audio.url || out.audio.path);
+    if (url) { player.src = url; await player.play().catch(() => {}); }
+    else { resetSoon(); }
+  } catch (e) {
+    console.error(e);
+    setState("", t("err"));
+    resetSoon();
+  } finally {
+    busy = false;
+  }
+}
+function resetSoon() { setTimeout(() => { if (!busy) setState("", t("idle")); }, 600); }
+player.addEventListener("ended", () => { if (!busy) setState("", t("idle")); });
+// ---- interaction: tap = describe · hold = ask ----
+let holding = false, recording = false, holdTimer = null;
+const HOLD_MS = 350;
+stage.addEventListener("pointerdown", () => {
+  if (busy) return;
+  holding = true;
+  stage.classList.add("armed");
+  holdTimer = setTimeout(async () => {
+    recording = await startRec();
+    if (recording) setState("listening", t("listening"));
+  }, HOLD_MS);
+});
+async function endPress() {
+  if (!holding) return;
+  holding = false;
+  stage.classList.remove("armed");
+  clearTimeout(holdTimer);
+  const frame = await grabFrame();
+  if (!frame) { setState("", t("camErr")); return; }
+  if (recording) { const audio = await stopRec(); recording = false; send(frame, audio); }
+  else { send(frame, null); }  // quick tap = describe only
+}
+stage.addEventListener("pointerup", endPress);
+stage.addEventListener("pointercancel", endPress);
+// ---- boot ----
+(async () => {
+  langBtn.textContent = lang.toUpperCase();
+  hintEl.textContent = t("hint");
+  statusEl.textContent = t("idle");
+  await startCamera();
+  try { client = await Client.connect(window.location.origin); }
+  catch (e) { console.error(e); setState("", t("err")); }
+})();
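
The tap-versus-hold rule in `app.js` reduces to a duration threshold: a press shorter than `HOLD_MS` describes the scene, a longer one records a question. A simplified sketch of that rule, written in Python for consistency with the backend; `classify_press` is a hypothetical helper, and it ignores one detail of the real handler, which falls back to "describe" when the microphone cannot be opened:

```python
HOLD_MS = 350  # same threshold as app.js


def classify_press(down_ms: float, up_ms: float, hold_ms: float = HOLD_MS) -> str:
    """Return "ask" if the press outlasted the hold threshold, else "describe"."""
    if up_ms < down_ms:
        raise ValueError("release cannot precede press")
    return "ask" if (up_ms - down_ms) >= hold_ms else "describe"


print(classify_press(0, 100))  # short tap
print(classify_press(0, 900))  # long hold
```

A pure function like this could be unit-tested separately from the pointer-event wiring, which is the part that depends on the browser.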

frontend/index.html ADDED

	@@ -0,0 +1,29 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8" />
+  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no, viewport-fit=cover" />
+  <title>Iris</title>
+  <link rel="stylesheet" href="/static/style.css" />
+</head>
+<body>
+  <!-- live camera (background) -->
+  <video id="cam" autoplay playsinline muted></video>
+  <canvas id="canvas" hidden></canvas>
+  <!-- single target: the whole screen is the button -->
+  <main id="stage" role="application" aria-label="Iris">
+    <div id="halo"></div>
+    <p id="status" aria-live="assertive">Iris</p>
+    <p id="answer" aria-live="polite"></p>
+    <p id="hint">Tap to describe · hold to ask</p>
+  </main>
+  <!-- language -->
+  <button id="lang" aria-label="Language">EN</button>
+  <audio id="player" playsinline></audio>
+  <script type="module" src="/static/app.js"></script>
+</body>
+</html>

frontend/style.css ADDED

	@@ -0,0 +1,108 @@

+:root {
+  --bg: #05060a;
+  --fg: #ffffff;
+  --accent: #ff7a18;
+  --accent-2: #00d4ff;
+  --ok: #36d399;
+}
+* { box-sizing: border-box; margin: 0; padding: 0; }
+html, body {
+  height: 100%;
+  background: var(--bg);
+  color: var(--fg);
+  font-family: system-ui, -apple-system, "Segoe UI", Roboto, sans-serif;
+  overflow: hidden;
+  -webkit-tap-highlight-color: transparent;
+  user-select: none;
+}
+/* live camera as the background, darkened for text contrast */
+#cam {
+  position: fixed;
+  inset: 0;
+  width: 100%;
+  height: 100%;
+  object-fit: cover;
+  filter: brightness(0.45) saturate(1.05);
+  z-index: 0;
+}
+/* the whole screen is the button */
+#stage {
+  position: fixed;
+  inset: 0;
+  z-index: 1;
+  display: flex;
+  flex-direction: column;
+  align-items: center;
+  justify-content: center;
+  text-align: center;
+  padding: 6vmin;
+  gap: 3vmin;
+  cursor: pointer;
+}
+/* central halo that reacts to the state */
+#halo {
+  position: absolute;
+  width: 64vmin;
+  height: 64vmin;
+  border-radius: 50%;
+  background: radial-gradient(circle, rgba(255,122,24,0.18), transparent 62%);
+  transition: transform .25s ease, background .25s ease;
+  pointer-events: none;
+}
+body[data-state="listening"] #halo { background: radial-gradient(circle, rgba(0,212,255,0.30), transparent 60%); transform: scale(1.12); animation: pulse 1.1s ease-in-out infinite; }
+body[data-state="thinking"] #halo { background: radial-gradient(circle, rgba(255,122,24,0.28), transparent 60%); animation: spin 1.4s linear infinite; }
+body[data-state="speaking"] #halo { background: radial-gradient(circle, rgba(54,211,153,0.28), transparent 60%); transform: scale(1.06); }
+@keyframes pulse { 0%,100%{transform:scale(1.10)} 50%{transform:scale(1.22)} }
+@keyframes spin { to { transform: rotate(360deg) } }
+#status {
+  position: relative;
+  z-index: 2;
+  font-size: clamp(22px, 6vmin, 46px);
+  font-weight: 800;
+  letter-spacing: .01em;
+  text-shadow: 0 2px 18px rgba(0,0,0,.8);
+}
+#answer {
+  position: relative;
+  z-index: 2;
+  font-size: clamp(20px, 4.6vmin, 34px);
+  font-weight: 600;
+  line-height: 1.3;
+  max-width: 92vw;
+  text-shadow: 0 2px 18px rgba(0,0,0,.9);
+  min-height: 1.2em;
+}
+#hint {
+  position: fixed;
+  bottom: max(5vmin, env(safe-area-inset-bottom, 0));
+  z-index: 2;
+  font-size: clamp(14px, 3vmin, 20px);
+  opacity: .8;
+  text-shadow: 0 2px 12px rgba(0,0,0,.9);
+}
+#lang {
+  position: fixed;
+  top: max(4vmin, env(safe-area-inset-top, 0));
+  right: 5vmin;
+  z-index: 3;
+  background: rgba(255,255,255,.12);
+  color: var(--fg);
+  border: 2px solid rgba(255,255,255,.35);
+  border-radius: 999px;
+  padding: 10px 18px;
+  font-size: 18px;
+  font-weight: 700;
+  cursor: pointer;
+}
+#stage.armed { background: rgba(0,212,255,.05); }

requirements.txt ADDED

	@@ -0,0 +1,13 @@

+gradio>=6
+spaces
+torch
+torchvision
+transformers>=4.49
+accelerate
+qwen-vl-utils
+faster-whisper
+piper-tts
+pillow
+soundfile
+numpy
+huggingface_hub