Marcus Ramalho Claude Opus 4.8 committed on
Commit 26dae50 · 0 Parent(s)

Iris: voice-first assistant for the blind (gr.Server + Qwen2.5-VL + Whisper + Piper)

- gr.Server backend: /describe API (image + optional voice -> PT description + speech)
- voice-first frontend: live camera, tap-to-describe / hold-to-ask, high-contrast, EN/PT
- core/ pipeline vendored (STT Whisper, VLM Qwen2.5-VL-3B, TTS Piper pt_BR)
- runs in-Space on ZeroGPU, <=32B total, no third-party model APIs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Files changed (12)
  1. .gitignore +6 -0
  2. README.md +43 -0
  3. app.py +68 -0
  4. core/__init__.py +0 -0
  5. core/gpu.py +33 -0
  6. core/stt.py +34 -0
  7. core/tts.py +41 -0
  8. core/vlm.py +103 -0
  9. frontend/app.js +139 -0
  10. frontend/index.html +29 -0
  11. frontend/style.css +108 -0
  12. requirements.txt +13 -0
.gitignore ADDED
@@ -0,0 +1,6 @@
+ __pycache__/
+ *.pyc
+ .venv/
+ .DS_Store
+ *.wav
+ /tmp/
README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ title: Iris
+ emoji: 👁️
+ colorFrom: indigo
+ colorTo: orange
+ sdk: gradio
+ sdk_version: 6.17.3
+ app_file: app.py
+ pinned: false
+ license: apache-2.0
+ short_description: Your father's eyes, by voice. Describe & ask about the world.
+ ---
+
+ # Iris — your father's eyes, by voice
+
+ Iris is a voice-first assistant for blind and low-vision users. Open it on a
+ phone, point the camera, **tap to describe** what's in front of you or **hold to
+ ask a question** ("what color is this?", "read this label", "is anyone here?").
+ It answers out loud in Portuguese; the on-screen text switches between English and Portuguese.
+
+ Built for the **Build Small Hackathon** (Backyard AI track) — for my father.
+
+ ## How it works (all small models, ≤ 32B total)
+ - **Speech-to-text:** Whisper small (faster-whisper) — understands the question.
+ - **Vision-language:** Qwen2.5-VL-3B-Instruct — describes the scene / reads text, in PT.
+ - **Text-to-speech:** Piper (pt_BR) — speaks the answer.
+
+ Custom voice-first frontend via `gr.Server` (tap-anywhere, high-contrast,
+ camera auto-on). Inference runs in-Space on ZeroGPU — **no third-party model APIs**.
+
+ ### Parameter budget
+ Whisper-small (~0.24B) + Qwen2.5-VL-3B (~3B) ≈ **~3.3B total** — well under 32B.
+
+ ## Not a mobility aid
+ Iris describes the environment and reads text. It is **not** an obstacle-avoidance
+ or navigation device and should not be relied on for physical safety.
+
+ ## Run locally
+ ```bash
+ pip install -r requirements.txt
+ python app.py # http://localhost:7860
+ # IRIS_WARMUP=1 preloads the models
+ ```
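Note on the README's parameter budget: the check is plain addition over the models it lists. A minimal sketch using the README's approximate figures follows. The names `PARAMS_B`, `LIMIT_B`, `total_params_b` and `within_budget` are illustrative and not part of the repo, and Piper is left out because the README gives no figure for it.

```python
# Approximate parameter counts in billions, as quoted in the README.
PARAMS_B = {
    "whisper-small": 0.24,
    "qwen2.5-vl-3b": 3.0,
}
LIMIT_B = 32.0  # cap stated in the README ("<= 32B total")


def total_params_b(models: dict[str, float]) -> float:
    """Sum the per-model parameter counts (in billions)."""
    return sum(models.values())


def within_budget(models: dict[str, float], limit_b: float = LIMIT_B) -> bool:
    """True when the combined size stays at or under the cap."""
    return total_params_b(models) <= limit_b
```

Swapping in another backend (for example through `IRIS_VLM_MODEL`) would mean updating the dictionary entry and re-running the same check.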
app.py ADDED
@@ -0,0 +1,68 @@
+ """Iris: your father's eyes, by voice (HF Space, ZeroGPU).
+
+ gr.Server serves a custom voice-first frontend (frontend/index.html) and exposes
+ the `describe` API (image + optional voice question -> answer in PT + audio).
+ Pipeline: Whisper (STT) -> Qwen2.5-VL (description/VQA in PT) -> Piper (TTS).
+ """
+ import os
+ import tempfile
+ from pathlib import Path
+
+ os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
+
+ from fastapi.responses import HTMLResponse  # noqa: E402
+ from fastapi.staticfiles import StaticFiles  # noqa: E402
+ from gradio import Server  # noqa: E402
+ from gradio.data_classes import FileData  # noqa: E402
+
+ from core import stt, tts, vlm  # noqa: E402
+
+ HERE = Path(__file__).parent
+ app = Server(title="Iris")
+
+
+ def _path(f):
+     """gr.Server delivers FileData as a dict; accept a dict or an object."""
+     if f is None:
+         return None
+     return f["path"] if isinstance(f, dict) else f.path
+
+
+ @app.api(name="describe")
+ def describe(image: FileData, audio: FileData | None = None) -> dict:
+     """Take a camera frame (+ optional voice question) and return the
+     description in PT + the spoken audio."""
+     apath = _path(audio)
+     question = stt.transcribe(apath) if apath else ""
+     answer = vlm.describe(_path(image), question)
+     if not answer.strip():
+         answer = "Não consegui descrever isso. Tente de novo."  # spoken by the pt_BR voice
+     wav = tts.synthesize(answer)
+     print(f"[describe] q={question!r} a={answer!r}", flush=True)
+     return {
+         "question": question,
+         "answer": answer,
+         "audio": FileData(path=wav) if wav else None,
+     }
+
+
+ app.mount("/static", StaticFiles(directory=str(HERE / "frontend")), name="static")
+
+
+ @app.get("/")
+ def index():
+     return HTMLResponse((HERE / "frontend" / "index.html").read_text(encoding="utf-8"))
+
+
+ if __name__ == "__main__":
+     if os.environ.get("IRIS_WARMUP") == "1":
+         print("Warmup...", flush=True)
+         try:
+             vlm.describe(str(HERE.parent / "viability" / "samples" / "indoor.jpg"), "teste")
+             stt.transcribe(tts.synthesize("teste"))
+             print("Warmup OK", flush=True)
+         except Exception as e:
+             print("Warmup failed:", e, flush=True)
+     port = int(os.environ.get("GRADIO_SERVER_PORT", os.environ.get("PORT", 7860)))
+     app.launch(server_name="0.0.0.0", server_port=port, show_error=True,
+                allowed_paths=[tempfile.gettempdir()])
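Note on app.py: the `describe` handler chains three stages and substitutes a fixed Portuguese sentence when the VLM returns nothing. The sketch below isolates that control flow with the stages injected as plain callables, so it runs without any model. `run_pipeline`, `FALLBACK_PT` and the stub lambdas are hypothetical names for illustration, not part of the commit.

```python
from typing import Callable, Optional

FALLBACK_PT = "Não consegui descrever isso. Tente de novo."


def run_pipeline(
    image_path: str,
    audio_path: Optional[str],
    transcribe: Callable[[str], str],
    describe: Callable[[str, str], str],
    synthesize: Callable[[str], Optional[str]],
) -> dict:
    """STT (only when audio was sent) -> VLM -> TTS, in the same order as app.py."""
    question = transcribe(audio_path) if audio_path else ""
    answer = describe(image_path, question)
    if not answer.strip():
        # Always give the user something to hear, even on an empty VLM reply.
        answer = FALLBACK_PT
    wav = synthesize(answer)
    return {"question": question, "answer": answer, "audio": wav}


# Stub stages standing in for core.stt, core.vlm and core.tts.
result = run_pipeline(
    "frame.jpg",
    None,  # quick tap: no recorded question
    transcribe=lambda path: "que cor é isto?",
    describe=lambda path, question: "   ",  # simulate an empty VLM answer
    synthesize=lambda text: "/tmp/out.wav",
)
```

Because no audio path is passed, the transcriber stub is never called, and the whitespace-only answer is replaced by the fallback before synthesis.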
core/__init__.py ADDED
File without changes
core/gpu.py ADDED
@@ -0,0 +1,33 @@
+ """ZeroGPU helpers.
+
+ `gpu` is a decorator: on HF Spaces (ZeroGPU) it allocates a GPU per call;
+ locally (without the `spaces` package) it becomes a transparent no-op. That way
+ the same code runs on the local 3060s and in the Space.
+ """
+ import functools
+
+ try:
+     import spaces  # only available on ZeroGPU Spaces
+     _HAS_SPACES = True
+ except Exception:
+     _HAS_SPACES = False
+
+
+ def gpu(duration: int = 60):
+     """Allocate a GPU for `duration` seconds per call (ZeroGPU). Local no-op."""
+     def decorate(fn):
+         if _HAS_SPACES:
+             return spaces.GPU(duration=duration)(fn)
+
+         @functools.wraps(fn)
+         def wrapper(*args, **kwargs):
+             return fn(*args, **kwargs)
+
+         return wrapper
+
+     return decorate
+
+
+ def device() -> str:
+     import torch
+     return "cuda" if torch.cuda.is_available() else "cpu"
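Note on core/gpu.py: the optional-dependency decorator can be exercised without the `spaces` package by injecting the backend. In this sketch `make_gpu` and `FakeSpaces` are hypothetical; `FakeSpaces` only mimics the `GPU(duration=...)` call shape that gpu.py relies on and is not the real library.

```python
import functools
from typing import Any, Callable, Optional


def make_gpu(backend: Optional[Any] = None) -> Callable[..., Callable]:
    """Build a `gpu(duration)` decorator factory.

    `backend` stands in for the `spaces` module: anything exposing
    `GPU(duration=...)`. Passing None reproduces the local no-op path.
    """
    def gpu(duration: int = 60):
        def decorate(fn):
            if backend is not None:
                return backend.GPU(duration=duration)(fn)

            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                return fn(*args, **kwargs)

            return wrapper

        return decorate

    return gpu


class FakeSpaces:
    """Hypothetical test double that records each requested duration."""
    requested: list = []

    @classmethod
    def GPU(cls, duration: int):
        def decorate(fn):
            cls.requested.append(duration)
            return fn
        return decorate


local_gpu = make_gpu(None)
space_gpu = make_gpu(FakeSpaces)


@local_gpu(duration=30)
def add_local(a, b):
    return a + b


@space_gpu(duration=45)
def add_space(a, b):
    return a + b
```

Both decorated functions behave like the originals; only the backend path records that a 45-second allocation was requested.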
core/stt.py ADDED
@@ -0,0 +1,34 @@
+ """Speech-to-text (Whisper via faster-whisper). Portuguese by default.
+
+ Does not use torch: faster-whisper runs on CTranslate2 (CPU by default; GPU when IRIS_STT_DEVICE=cuda).
+ """
+ import os
+
+ _model = None
+
+
+ def _load():
+     global _model
+     if _model is None:
+         from faster_whisper import WhisperModel
+         size = os.environ.get("IRIS_STT_MODEL", "small")
+         # CTranslate2 needs the CUDA 12 libs (cublas/cudnn). If they are missing, use CPU.
+         device = os.environ.get("IRIS_STT_DEVICE", "cpu")
+         if device == "cuda":
+             try:
+                 _model = WhisperModel(size, device="cuda", compute_type="float16")
+             except Exception:
+                 device = "cpu"
+         if device != "cuda":
+             _model = WhisperModel(size, device="cpu", compute_type="int8")
+     return _model
+
+
+ def transcribe(audio_path: str, language: str = "pt") -> str:
+     if not audio_path or not os.path.exists(audio_path):
+         print(f"[stt] no audio: {audio_path!r}", flush=True)
+         return ""
+     segments, info = _load().transcribe(audio_path, language=language)
+     text = " ".join(s.text for s in segments).strip()
+     print(f"[stt] {audio_path} ({getattr(info, 'duration', '?')}s) -> {text!r}", flush=True)
+     return text
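Note on core/stt.py: `_load` tries the requested device and falls back to CPU int8 if construction fails. The same pattern, sketched with an injected constructor so it runs without faster-whisper; `load_with_fallback` and `fake_factory` are hypothetical names, and the fake simply raises on "cuda" to simulate missing CUDA libraries.

```python
from typing import Any, Callable


def load_with_fallback(factory: Callable[..., Any], size: str, device: str = "cpu") -> Any:
    """Try the requested device first; on any failure, fall back to CPU int8."""
    if device == "cuda":
        try:
            return factory(size, device="cuda", compute_type="float16")
        except Exception:
            pass  # e.g. missing CUDA libraries; fall through to the CPU path
    return factory(size, device="cpu", compute_type="int8")


def fake_factory(size: str, device: str, compute_type: str) -> dict:
    """Hypothetical stand-in for a model constructor on a machine with no usable GPU."""
    if device == "cuda":
        raise RuntimeError("CUDA libraries not found")
    return {"size": size, "device": device, "compute_type": compute_type}


model = load_with_fallback(fake_factory, "small", device="cuda")
```

Catching a broad `Exception` mirrors stt.py; a stricter version might log the original error before falling back so that GPU misconfiguration does not go unnoticed.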
core/tts.py ADDED
@@ -0,0 +1,41 @@
+ """Text-to-speech (Piper, pt_BR). Returns the path to a .wav file.
+
+ Piper runs on onnxruntime (no torch). The voice is downloaded from the rhasspy/piper-voices repo.
+ """
+ import os
+ import tempfile
+ import wave
+
+ _voice = None
+ _REPO = "rhasspy/piper-voices"
+ # default pt_BR voice; override via IRIS_TTS_VOICE
+ _VOICE = os.environ.get("IRIS_TTS_VOICE", "pt/pt_BR/faber/medium/pt_BR-faber-medium")
+
+
+ def _load():
+     global _voice
+     if _voice is None:
+         from huggingface_hub import hf_hub_download
+         from piper import PiperVoice
+         onnx = hf_hub_download(_REPO, f"{_VOICE}.onnx")
+         conf = hf_hub_download(_REPO, f"{_VOICE}.onnx.json")
+         _voice = PiperVoice.load(onnx, config_path=conf)
+     return _voice
+
+
+ def synthesize(text: str) -> str | None:
+     if not text or not text.strip():
+         return None
+     voice = _load()
+     chunks = list(voice.synthesize(text))
+     if not chunks:
+         print(f"[tts] no audio for text: {text!r}", flush=True)
+         return None
+     tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)  # mktemp is deprecated and race-prone
+     with tmp, wave.open(tmp, "wb") as wf:
+         wf.setnchannels(chunks[0].sample_channels)
+         wf.setsampwidth(chunks[0].sample_width)
+         wf.setframerate(chunks[0].sample_rate)
+         for ch in chunks:
+             wf.writeframes(ch.audio_int16_bytes)
+     return tmp.name
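Note on core/tts.py: the WAV-writing step of `synthesize` can be tested without Piper. This sketch writes raw 16-bit PCM chunks with the standard `wave` module and creates the output through `NamedTemporaryFile(delete=False)`, which creates the file atomically (unlike the deprecated `tempfile.mktemp`). `write_wav` is a hypothetical helper, and the 22050 Hz default is an arbitrary placeholder rather than a value taken from the voice config.

```python
import tempfile
import wave


def write_wav(pcm_chunks, sample_rate=22050, channels=1, sample_width=2):
    """Write 16-bit PCM byte chunks to a fresh temp .wav and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    # `with tmp` closes the file handle; wave only finalizes the header.
    with tmp, wave.open(tmp, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        for chunk in pcm_chunks:
            wf.writeframes(chunk)
    return tmp.name


# Two chunks of silence: 100 mono frames each, 2 bytes per frame.
path = write_wav([b"\x00\x00" * 100, b"\x00\x00" * 100])

with wave.open(path, "rb") as rf:
    frames = rf.getnframes()
    rate = rf.getframerate()
```

The file is left on disk on purpose, since the caller hands the path to the web layer; a long-running service would want some cleanup policy for these files.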
core/vlm.py ADDED
@@ -0,0 +1,103 @@
+ """Vision-language: describes / answers questions about an image, in PT.
+
+ Pluggable backend via IRIS_VLM_MODEL:
+   - Qwen/Qwen2.5-VL-3B-Instruct (default)
+   - openbmb/MiniCPM-V-4.6 -> OpenBMB track ($2.5k); API model.chat()
+ The VLM itself generates the text that is spoken; there is no separate narrator LLM.
+ """
+ import os
+
+ _model = None
+ _aux = None  # AutoProcessor (loaded for either backend)
+ MODEL_ID = os.environ.get("IRIS_VLM_MODEL", "Qwen/Qwen2.5-VL-3B-Instruct")
+
+ SYSTEM_PT = (
+     "Você é os olhos de uma pessoa cega. RESPONDA OBRIGATORIAMENTE EM PORTUGUÊS "
+     "DO BRASIL, em no máximo duas frases curtas, dizendo só o que é relevante e "
+     "útil sobre a cena. Não comece com 'a imagem mostra'. Se houver texto "
+     "importante (rótulo, placa, remédio), leia-o exatamente como está."
+ )
+
+ DOWNSAMPLE = os.environ.get("IRIS_DOWNSAMPLE", "4x")  # 4x = fine detail (OCR); 16x = fast
+
+
+ def _family() -> str:
+     return "minicpm" if "minicpm" in MODEL_ID.lower() else "qwen"
+
+
+ def _load():
+     global _model, _aux
+     if _model is None:
+         import torch
+         if _family() == "minicpm":
+             from transformers import AutoModelForImageTextToText, AutoProcessor
+             _model = AutoModelForImageTextToText.from_pretrained(
+                 MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16,
+                 low_cpu_mem_usage=True, device_map="cuda:0",
+             ).eval()
+             _aux = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
+         else:
+             from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+             _model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+                 MODEL_ID, torch_dtype=torch.float16, device_map="cuda:0",
+                 low_cpu_mem_usage=True,
+             ).eval()
+             _aux = AutoProcessor.from_pretrained(MODEL_ID)
+     return _model, _aux
+
+
+ def _to_pil(image):
+     from PIL import Image
+     if isinstance(image, str):
+         image = Image.open(image)
+     elif not isinstance(image, Image.Image):
+         image = Image.fromarray(image)  # numpy frame from the webcam
+     image = image.convert("RGB")
+     image.thumbnail((1024, 1024))  # fewer vision tokens -> faster; still enough for OCR
+     return image
+
+
+ from .gpu import gpu
+
+
+ @gpu(duration=60)
+ def describe(image, question: str = "") -> str:
+     image = _to_pil(image)
+     user = (question or "").strip() or "O que há à minha frente?"
+     model, aux = _load()
+
+     if _family() == "minicpm":
+         import torch
+         messages = [{"role": "user", "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": f"{SYSTEM_PT}\n\nPergunta: {user}"},
+         ]}]
+         inputs = aux.apply_chat_template(
+             messages, tokenize=True, add_generation_prompt=True,
+             return_dict=True, return_tensors="pt",
+             downsample_mode=DOWNSAMPLE, max_slice_nums=36,
+         ).to(model.device)
+         with torch.no_grad():
+             generated = model.generate(**inputs, downsample_mode=DOWNSAMPLE,
+                                        max_new_tokens=96, do_sample=False)
+         trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated)]
+         return aux.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
+
+     # Qwen2.5-VL
+     import torch
+     from qwen_vl_utils import process_vision_info
+     messages = [
+         {"role": "system", "content": SYSTEM_PT},
+         {"role": "user", "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": user},
+         ]},
+     ]
+     text = aux.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     image_inputs, video_inputs = process_vision_info(messages)
+     inputs = aux(text=[text], images=image_inputs, videos=video_inputs,
+                  padding=True, return_tensors="pt").to(model.device)
+     with torch.no_grad():
+         generated = model.generate(**inputs, max_new_tokens=96, do_sample=False)
+     trimmed = generated[:, inputs.input_ids.shape[1]:]
+     return aux.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
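Note on core/vlm.py: both branches slice the prompt off the generated sequence before decoding, because `generate` on these decoder-style models returns the prompt tokens followed by the continuation. A minimal sketch of that trimming step on plain Python lists (toy token ids, no tensors); `trim_prompt` is a hypothetical helper name.

```python
def trim_prompt(input_ids, generated):
    """Drop the echoed prompt tokens from each generated sequence."""
    return [out[len(prompt):] for prompt, out in zip(input_ids, generated)]


# Toy token ids: the model output starts with a copy of the prompt.
input_ids = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43, 2]]
new_tokens = trim_prompt(input_ids, generated)
```

The per-row slice matches the MiniCPM branch; the Qwen branch gets the same result with a single column slice because the batch shares one padded prompt length.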
frontend/app.js ADDED
@@ -0,0 +1,139 @@
+ import { Client, handle_file } from "https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js";
+
+ const $ = (id) => document.getElementById(id);
+ const cam = $("cam"), canvas = $("canvas"), statusEl = $("status"),
+       answerEl = $("answer"), hintEl = $("hint"), langBtn = $("lang"),
+       player = $("player"), stage = $("stage");
+
+ // ---- i18n (EN default + PT) ----
+ let lang = "en";
+ const T = {
+   en: { idle: "Iris", listening: "Listening…", thinking: "Thinking…",
+         hint: "Tap to describe · hold to ask", camErr: "Camera blocked — allow access",
+         err: "Something went wrong" },
+   pt: { idle: "Iris", listening: "Ouvindo…", thinking: "Pensando…",
+         hint: "Toque para descrever · segure para perguntar", camErr: "Câmera bloqueada — permita o acesso",
+         err: "Algo deu errado" },
+ };
+ const t = (k) => T[lang][k];
+
+ function setState(s, msg) {
+   document.body.dataset.state = s || "";
+   if (msg !== undefined) statusEl.textContent = msg;
+ }
+
+ langBtn.onclick = (e) => {
+   e.stopPropagation();
+   lang = lang === "en" ? "pt" : "en";
+   langBtn.textContent = lang.toUpperCase();
+   hintEl.textContent = t("hint");
+   document.documentElement.lang = lang;
+   if (!busy) statusEl.textContent = t("idle");
+ };
+
+ // ---- live camera (auto) ----
+ async function startCamera() {
+   const tries = [
+     { video: { facingMode: { ideal: "environment" } }, audio: false },
+     { video: true, audio: false },
+   ];
+   for (const c of tries) {
+     try { cam.srcObject = await navigator.mediaDevices.getUserMedia(c); return; }
+     catch (e) { /* try the next constraint set */ }
+   }
+   setState("", t("camErr"));
+ }
+
+ function grabFrame() {
+   const w = cam.videoWidth, h = cam.videoHeight;
+   if (!w || !h) return Promise.resolve(null);
+   canvas.width = w; canvas.height = h;
+   canvas.getContext("2d").drawImage(cam, 0, 0, w, h);
+   return new Promise((res) => canvas.toBlob(res, "image/jpeg", 0.85));
+ }
+
+ // ---- microphone (records while held) ----
+ let micStream = null, recorder = null, chunks = [];
+ async function startRec() {
+   try { if (!micStream) micStream = await navigator.mediaDevices.getUserMedia({ audio: true }); }
+   catch (e) { return false; }
+   chunks = [];
+   recorder = new MediaRecorder(micStream);
+   recorder.ondataavailable = (e) => { if (e.data.size) chunks.push(e.data); };
+   recorder.start();
+   return true;
+ }
+ function stopRec() {
+   return new Promise((res) => {
+     if (!recorder || recorder.state === "inactive") return res(null);
+     recorder.onstop = () => res(chunks.length ? new Blob(chunks, { type: recorder.mimeType || "audio/webm" }) : null);
+     recorder.stop();
+   });
+ }
+
+ // ---- backend ----
+ let client = null;
+ let busy = false;
+
+ async function send(frame, audio) {
+   if (busy) return;
+   busy = true;
+   answerEl.textContent = "";
+   setState("thinking", t("thinking"));
+   try {
+     const payload = { image: handle_file(frame) };
+     if (audio) payload.audio = handle_file(audio);
+     const result = await client.predict("/describe", payload);
+     const out = Array.isArray(result.data) ? result.data[0] : result.data;
+     answerEl.textContent = (out && out.answer) || "";
+     setState("speaking", "");
+     const url = out && out.audio && (out.audio.url || out.audio.path);
+     if (url) { player.src = url; await player.play().catch(() => {}); }
+     else { resetSoon(); }
+   } catch (e) {
+     console.error(e);
+     setState("", t("err"));
+     resetSoon();
+   } finally {
+     busy = false;
+   }
+ }
+ function resetSoon() { setTimeout(() => { if (!busy) setState("", t("idle")); }, 600); }
+ player.addEventListener("ended", () => { if (!busy) setState("", t("idle")); });
+
+ // ---- interaction: tap = describe · hold = ask ----
+ let holding = false, recording = false, holdTimer = null;
+ const HOLD_MS = 350;
+
+ stage.addEventListener("pointerdown", () => {
+   if (busy) return;
+   holding = true;
+   stage.classList.add("armed");
+   holdTimer = setTimeout(async () => {
+     recording = (await startRec()) && holding; // ignore a recorder that starts after release
+     if (recording) setState("listening", t("listening")); else stopRec();
+   }, HOLD_MS);
+ });
+
+ async function endPress() {
+   if (!holding) return;
+   holding = false;
+   stage.classList.remove("armed");
+   clearTimeout(holdTimer);
+   const frame = await grabFrame();
+   if (!frame) { setState("", t("camErr")); return; }
+   if (recording) { const audio = await stopRec(); recording = false; send(frame, audio); }
+   else { send(frame, null); } // quick tap = describe only
+ }
+ stage.addEventListener("pointerup", endPress);
+ stage.addEventListener("pointercancel", endPress);
+
+ // ---- boot ----
+ (async () => {
+   langBtn.textContent = lang.toUpperCase();
+   hintEl.textContent = t("hint");
+   statusEl.textContent = t("idle");
+   await startCamera();
+   try { client = await Client.connect(window.location.origin); }
+   catch (e) { console.error(e); setState("", t("err")); }
+ })();
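Note on frontend/app.js: the gesture rule is a duration threshold. A press released before `HOLD_MS` sends only the frame (describe); a press held past it also records a question (ask). The sketch below models just that decision in Python, so it can be unit-tested away from the browser. `classify_press` is a hypothetical function, and it deliberately ignores the microphone-permission step, which in the real frontend can still turn a long press into a plain describe.

```python
HOLD_MS = 350  # same threshold as frontend/app.js


def classify_press(down_ms: float, up_ms: float, hold_ms: int = HOLD_MS) -> str:
    """Return "ask" when the press lasts at least hold_ms, else "describe"."""
    if up_ms < down_ms:
        raise ValueError("release cannot precede press")
    return "ask" if (up_ms - down_ms) >= hold_ms else "describe"
```

For example, `classify_press(0, 120)` models a quick tap and `classify_press(1000, 1900)` a 900 ms hold. Treating a release at exactly the threshold as a hold is a choice made for this sketch; in the browser that boundary depends on timer scheduling.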
frontend/index.html ADDED
@@ -0,0 +1,29 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+   <meta charset="utf-8" />
+   <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no, viewport-fit=cover" />
+   <title>Iris</title>
+   <link rel="stylesheet" href="/static/style.css" />
+ </head>
+ <body>
+   <!-- live camera (background) -->
+   <video id="cam" autoplay playsinline muted></video>
+   <canvas id="canvas" hidden></canvas>
+
+   <!-- single target: the whole screen is the button -->
+   <main id="stage" role="application" aria-label="Iris">
+     <div id="halo"></div>
+     <p id="status" aria-live="assertive">Iris</p>
+     <p id="answer" aria-live="polite"></p>
+     <p id="hint">Tap to describe · hold to ask</p>
+   </main>
+
+   <!-- language -->
+   <button id="lang" aria-label="Language">EN</button>
+
+   <audio id="player" playsinline></audio>
+
+   <script type="module" src="/static/app.js"></script>
+ </body>
+ </html>
frontend/style.css ADDED
@@ -0,0 +1,108 @@
+ :root {
+   --bg: #05060a;
+   --fg: #ffffff;
+   --accent: #ff7a18;
+   --accent-2: #00d4ff;
+   --ok: #36d399;
+ }
+
+ * { box-sizing: border-box; margin: 0; padding: 0; }
+
+ html, body {
+   height: 100%;
+   background: var(--bg);
+   color: var(--fg);
+   font-family: system-ui, -apple-system, "Segoe UI", Roboto, sans-serif;
+   overflow: hidden;
+   -webkit-tap-highlight-color: transparent;
+   user-select: none;
+ }
+
+ /* live camera as the background, darkened for text contrast */
+ #cam {
+   position: fixed;
+   inset: 0;
+   width: 100%;
+   height: 100%;
+   object-fit: cover;
+   filter: brightness(0.45) saturate(1.05);
+   z-index: 0;
+ }
+
+ /* the whole screen is the button */
+ #stage {
+   position: fixed;
+   inset: 0;
+   z-index: 1;
+   display: flex;
+   flex-direction: column;
+   align-items: center;
+   justify-content: center;
+   text-align: center;
+   padding: 6vmin;
+   gap: 3vmin;
+   cursor: pointer;
+ }
+
+ /* central halo that reacts to the state */
+ #halo {
+   position: absolute;
+   width: 64vmin;
+   height: 64vmin;
+   border-radius: 50%;
+   background: radial-gradient(circle, rgba(255,122,24,0.18), transparent 62%);
+   transition: transform .25s ease, background .25s ease;
+   pointer-events: none;
+ }
+ body[data-state="listening"] #halo { background: radial-gradient(circle, rgba(0,212,255,0.30), transparent 60%); transform: scale(1.12); animation: pulse 1.1s ease-in-out infinite; }
+ body[data-state="thinking"] #halo { background: radial-gradient(circle, rgba(255,122,24,0.28), transparent 60%); animation: spin 1.4s linear infinite; }
+ body[data-state="speaking"] #halo { background: radial-gradient(circle, rgba(54,211,153,0.28), transparent 60%); transform: scale(1.06); }
+
+ @keyframes pulse { 0%,100%{transform:scale(1.10)} 50%{transform:scale(1.22)} }
+ @keyframes spin { to { transform: rotate(360deg) } }
+
+ #status {
+   position: relative;
+   z-index: 2;
+   font-size: clamp(22px, 6vmin, 46px);
+   font-weight: 800;
+   letter-spacing: .01em;
+   text-shadow: 0 2px 18px rgba(0,0,0,.8);
+ }
+
+ #answer {
+   position: relative;
+   z-index: 2;
+   font-size: clamp(20px, 4.6vmin, 34px);
+   font-weight: 600;
+   line-height: 1.3;
+   max-width: 92vw;
+   text-shadow: 0 2px 18px rgba(0,0,0,.9);
+   min-height: 1.2em;
+ }
+
+ #hint {
+   position: fixed;
+   bottom: max(5vmin, env(safe-area-inset-bottom, 0px));
+   z-index: 2;
+   font-size: clamp(14px, 3vmin, 20px);
+   opacity: .8;
+   text-shadow: 0 2px 12px rgba(0,0,0,.9);
+ }
+
+ #lang {
+   position: fixed;
+   top: max(4vmin, env(safe-area-inset-top, 0px));
+   right: 5vmin;
+   z-index: 3;
+   background: rgba(255,255,255,.12);
+   color: var(--fg);
+   border: 2px solid rgba(255,255,255,.35);
+   border-radius: 999px;
+   padding: 10px 18px;
+   font-size: 18px;
+   font-weight: 700;
+   cursor: pointer;
+ }
+
+ #stage.armed { background: rgba(0,212,255,.05); }
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ gradio>=6
+ spaces
+ torch
+ torchvision
+ transformers>=4.49
+ accelerate
+ qwen-vl-utils
+ faster-whisper
+ piper-tts
+ pillow
+ soundfile
+ numpy
+ huggingface_hub