Committed by Marcus Ramalho and Claude Opus 4.8
Commit 26dae50 · 0 parent(s)
Iris: voice-first assistant for the blind (gr.Server + Qwen2.5-VL + Whisper + Piper)
- gr.Server backend: /describe API (image + optional voice -> PT description + speech)
- voice-first frontend: live camera, tap-to-describe / hold-to-ask, high-contrast, EN/PT
- core/ pipeline vendored (STT Whisper, VLM Qwen2.5-VL-3B, TTS Piper pt_BR)
- runs in-Space on ZeroGPU, <=32B total, no third-party model APIs
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- .gitignore +6 -0
- README.md +43 -0
- app.py +68 -0
- core/__init__.py +0 -0
- core/gpu.py +33 -0
- core/stt.py +34 -0
- core/tts.py +41 -0
- core/vlm.py +103 -0
- frontend/app.js +139 -0
- frontend/index.html +29 -0
- frontend/style.css +108 -0
- requirements.txt +13 -0
.gitignore
ADDED
@@ -0,0 +1,6 @@
__pycache__/
*.pyc
.venv/
.DS_Store
*.wav
/tmp/
README.md
ADDED
@@ -0,0 +1,43 @@
---
title: Iris
emoji: 👁️
colorFrom: indigo
colorTo: orange
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: false
license: apache-2.0
short_description: Your father's eyes, by voice. Describe & ask about the world.
---

# Iris: your father's eyes, by voice

Iris is a voice-first assistant for blind and low-vision users. Open it on a
phone, point the camera, **tap to describe** what's in front of you or **hold to
ask a question** ("what color is this?", "read this label", "is anyone here?").
It answers out loud, in Portuguese or English.

Built for the **Build Small Hackathon** (Backyard AI track), for my father.

## How it works (all small models, ≤ 32B total)
- **Speech-to-text:** Whisper small (faster-whisper), which understands the question.
- **Vision-language:** Qwen2.5-VL-3B-Instruct, which describes the scene / reads text, in PT.
- **Text-to-speech:** Piper (pt_BR), which speaks the answer.

Custom voice-first frontend via `gr.Server` (tap-anywhere, high-contrast,
camera auto-on). Inference runs in-Space on ZeroGPU, with **no third-party model APIs**.

### Parameter budget
Whisper-small (~0.24B) + Qwen2.5-VL-3B (~3B) ≈ **3.3B total**, well under 32B.

## Not a mobility aid
Iris describes the environment and reads text. It is **not** an obstacle-avoidance
or navigation device and should not be relied on for physical safety.

## Run locally
```bash
pip install -r requirements.txt
python app.py  # http://localhost:7860
# IRIS_WARMUP=1 preloads the models
```
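As a side note on the README's "Parameter budget" section, the arithmetic can be checked in a few lines. This is a minimal sketch that assumes the approximate sizes quoted in the README; the names `MODEL_PARAMS_B` and `BUDGET_B` are illustrative and not part of the repository, and the Piper voice is left out, as it is in the README.

```python
# Approximate parameter counts in billions, as quoted in the README.
# These names are illustrative; nothing here is imported from the project.
MODEL_PARAMS_B = {
    "whisper-small": 0.24,
    "qwen2.5-vl-3b": 3.0,
}
BUDGET_B = 32.0  # cap stated in the README ("<= 32B total")

total_b = sum(MODEL_PARAMS_B.values())
within_budget = total_b <= BUDGET_B
print(f"total = {total_b:.2f}B, within budget: {within_budget}")
```

Swapping in a larger VLM through `IRIS_VLM_MODEL` would change the second entry; the check itself stays the same.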
app.py
ADDED
@@ -0,0 +1,68 @@
"""Iris: your father's eyes, by voice (HF Space, ZeroGPU).

gr.Server: serves a custom voice-first frontend (frontend/index.html) and exposes the
`describe` API (image + optional voice question -> answer in PT + audio).
Pipeline: Whisper (STT) -> Qwen2.5-VL (description/VQA in PT) -> Piper (TTS).
"""
import os
import tempfile
from pathlib import Path

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

from fastapi.responses import HTMLResponse  # noqa: E402
from fastapi.staticfiles import StaticFiles  # noqa: E402
from gradio import Server  # noqa: E402
from gradio.data_classes import FileData  # noqa: E402

from core import stt, tts, vlm  # noqa: E402

HERE = Path(__file__).parent
app = Server(title="Iris")


def _path(f):
    """gr.Server delivers FileData as a dict; accept either a dict or an object."""
    if f is None:
        return None
    return f["path"] if isinstance(f, dict) else f.path


@app.api(name="describe")
def describe(image: FileData, audio: FileData | None = None) -> dict:
    """Receive a camera frame (+ optional voice question) and return the
    description in PT + the spoken audio."""
    apath = _path(audio)
    question = stt.transcribe(apath) if apath else ""
    answer = vlm.describe(_path(image), question)
    if not answer.strip():
        # Portuguese fallback ("I couldn't describe that. Try again."), spoken by the pt_BR voice.
        answer = "Não consegui descrever isso. Tente de novo."
    wav = tts.synthesize(answer)
    print(f"[describe] q={question!r} a={answer!r}", flush=True)
    return {
        "question": question,
        "answer": answer,
        "audio": FileData(path=wav) if wav else None,
    }


app.mount("/static", StaticFiles(directory=str(HERE / "frontend")), name="static")


@app.get("/")
def index():
    return HTMLResponse((HERE / "frontend" / "index.html").read_text(encoding="utf-8"))


if __name__ == "__main__":
    if os.environ.get("IRIS_WARMUP") == "1":
        print("Warmup...", flush=True)
        try:
            vlm.describe(str(HERE.parent / "viability" / "samples" / "indoor.jpg"), "teste")
            stt.transcribe(tts.synthesize("teste"))
            print("Warmup OK", flush=True)
        except Exception as e:
            print("Warmup failed:", e, flush=True)
    port = int(os.environ.get("GRADIO_SERVER_PORT", os.environ.get("PORT", 7860)))
    app.launch(server_name="0.0.0.0", server_port=port, show_error=True,
               allowed_paths=[tempfile.gettempdir()])
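The control flow of the `describe` endpoint in app.py (optional speech-to-text, then the VLM, then a spoken fallback when the model returns nothing) can be exercised without any models. The sketch below is a hypothetical, dependency-injected rewrite, not the code in the commit: `path_of`, `run_pipeline` and the stub lambdas are illustrative names, and the result carries a plain path where the real endpoint wraps it in `FileData`.

```python
from typing import Callable, Optional

# Same fallback string as app.py; it is spoken by the pt_BR voice.
FALLBACK = "Não consegui descrever isso. Tente de novo."


def path_of(f) -> Optional[str]:
    """Mirror of app.py's _path: accept None, a dict with "path", or an object with .path."""
    if f is None:
        return None
    return f["path"] if isinstance(f, dict) else f.path


def run_pipeline(image, audio,
                 transcribe: Callable[[str], str],
                 describe: Callable[[str, str], str],
                 synthesize: Callable[[str], Optional[str]]) -> dict:
    """Hypothetical version of the endpoint with the three stages passed in."""
    apath = path_of(audio)
    question = transcribe(apath) if apath else ""  # STT runs only when audio was sent
    answer = describe(path_of(image), question)
    if not answer.strip():                         # blank model output -> fallback text
        answer = FALLBACK
    return {"question": question, "answer": answer, "audio": synthesize(answer)}


# Stub stages, for illustration only: the "VLM" returns whitespace.
result = run_pipeline(
    {"path": "frame.jpg"}, None,
    transcribe=lambda p: "o que é isto?",
    describe=lambda img, q: "   ",
    synthesize=lambda text: "/tmp/out.wav",
)
print(result["answer"])
```

Passing the stages as arguments is only a testing convenience here; app.py calls `stt`, `vlm` and `tts` directly.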
core/__init__.py
ADDED
File without changes
core/gpu.py
ADDED
@@ -0,0 +1,33 @@
"""ZeroGPU helpers.

`gpu` is a decorator: on HF Spaces (ZeroGPU) it allocates a GPU per call;
locally (without the `spaces` package) it becomes a transparent no-op. That way
the same code runs on the local 3060s and in the Space.
"""
import functools

try:
    import spaces  # only available on ZeroGPU Spaces
    _HAS_SPACES = True
except Exception:
    _HAS_SPACES = False


def gpu(duration: int = 60):
    """Allocate a GPU for `duration` seconds per call (ZeroGPU). No-op locally."""
    def decorate(fn):
        if _HAS_SPACES:
            return spaces.GPU(duration=duration)(fn)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)

        return wrapper

    return decorate


def device() -> str:
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"
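The local branch of the `gpu` decorator in core/gpu.py is a plain pass-through, which is easy to confirm in isolation. The sketch below reproduces only that fallback branch under the illustrative name `gpu_noop`; it does not import `spaces`, and `add` is a made-up function used for the demonstration.

```python
import functools


def gpu_noop(duration: int = 60):
    """Stand-in for the fallback branch of core/gpu.py (no `spaces` package)."""
    def decorate(fn):
        @functools.wraps(fn)  # keeps the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)  # no GPU allocation: just forward the call
        return wrapper
    return decorate


@gpu_noop(duration=30)
def add(a, b):
    """Add two numbers."""
    return a + b


print(add(2, 3), add.__name__)
```

Because `functools.wraps` preserves the metadata, the decorated function looks the same to callers whether or not a GPU is being allocated behind it.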
core/stt.py
ADDED
@@ -0,0 +1,34 @@
"""Speech-to-text (Whisper via faster-whisper). Portuguese by default.

Does not use torch: faster-whisper runs on CTranslate2 (GPU when IRIS_STT_DEVICE=cuda
and the CUDA libs load, otherwise CPU).
"""
import os

_model = None


def _load():
    global _model
    if _model is None:
        from faster_whisper import WhisperModel
        size = os.environ.get("IRIS_STT_MODEL", "small")
        # CTranslate2 needs the CUDA 12 libs (cublas/cudnn). If they are missing, use CPU.
        device = os.environ.get("IRIS_STT_DEVICE", "cpu")
        if device == "cuda":
            try:
                _model = WhisperModel(size, device="cuda", compute_type="float16")
            except Exception:
                device = "cpu"
        if device != "cuda":
            _model = WhisperModel(size, device="cpu", compute_type="int8")
    return _model


def transcribe(audio_path: str, language: str = "pt") -> str:
    if not audio_path or not os.path.exists(audio_path):
        print(f"[stt] no audio: {audio_path!r}", flush=True)
        return ""
    segments, info = _load().transcribe(audio_path, language=language)
    text = " ".join(s.text for s in segments).strip()
    print(f"[stt] {audio_path} ({getattr(info, 'duration', '?')}s) -> {text!r}", flush=True)
    return text
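`transcribe` in core/stt.py joins segment texts with a single space and strips the ends. If segment texts carry their own leading or trailing whitespace (Whisper-style output often does, though that is an assumption here), the joined string can contain doubled spaces. A minimal sketch of a whitespace-normalizing join follows; `FakeSegment` and `join_segments` are illustrative names and not part of faster-whisper or the repository.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeSegment:
    """Stand-in for a transcription segment; only the .text attribute is used."""
    text: str


def join_segments(segments: Iterable[FakeSegment]) -> str:
    """Join segment texts and collapse every run of whitespace into one space."""
    joined = " ".join(s.text for s in segments)
    return " ".join(joined.split())


segs = [FakeSegment(" O que é"), FakeSegment("  isto aqui? ")]
print(repr(join_segments(segs)))
```

Whether this matters in practice depends on how the VLM prompt tolerates extra spaces; the question text is only passed through, so the effect is likely cosmetic.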
core/tts.py
ADDED
@@ -0,0 +1,41 @@
"""Text-to-speech (Piper, pt_BR). Returns the path to a .wav file.

Piper runs on onnxruntime (no torch). The voice is downloaded from the rhasspy/piper-voices repo.
"""
import os
import tempfile
import wave

_voice = None
_REPO = "rhasspy/piper-voices"
# default pt_BR voice; override via IRIS_TTS_VOICE
_VOICE = os.environ.get("IRIS_TTS_VOICE", "pt/pt_BR/faber/medium/pt_BR-faber-medium")


def _load():
    global _voice
    if _voice is None:
        from huggingface_hub import hf_hub_download
        from piper import PiperVoice
        onnx = hf_hub_download(_REPO, f"{_VOICE}.onnx")
        conf = hf_hub_download(_REPO, f"{_VOICE}.onnx.json")
        _voice = PiperVoice.load(onnx, config_path=conf)
    return _voice


def synthesize(text: str) -> str | None:
    if not text or not text.strip():
        return None
    voice = _load()
    chunks = list(voice.synthesize(text))
    if not chunks:
        print(f"[tts] no audio for text: {text!r}", flush=True)
        return None
    path = tempfile.mktemp(suffix=".wav")
    with wave.open(path, "wb") as wf:
        wf.setnchannels(chunks[0].sample_channels)
        wf.setsampwidth(chunks[0].sample_width)
        wf.setframerate(chunks[0].sample_rate)
        for ch in chunks:
            wf.writeframes(ch.audio_int16_bytes)
    return path
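The WAV-writing step in core/tts.py can be tried with the standard library alone. The sketch below feeds synthetic silence instead of Piper output and is an illustration, not the committed code: `write_wav` is a hypothetical helper, the 22050 Hz default is an assumed sample rate, and it uses `tempfile.NamedTemporaryFile(delete=False)` as one possible alternative to `tempfile.mktemp`, which the Python documentation marks as deprecated.

```python
import os
import tempfile
import wave


def write_wav(chunks, channels=1, sample_width=2, sample_rate=22050) -> str:
    """Write raw 16-bit PCM byte chunks to a temporary .wav file and return its path."""
    # delete=False creates the file up front and leaves it on disk for the caller.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        path = tmp.name
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        for chunk in chunks:
            wf.writeframes(chunk)
    return path


# Two chunks of silence, 100 mono frames each (2 bytes per frame).
silence = [b"\x00\x00" * 100, b"\x00\x00" * 100]
wav_path = write_wav(silence)
with wave.open(wav_path, "rb") as rf:
    n_frames, rate = rf.getnframes(), rf.getframerate()
os.remove(wav_path)  # the caller owns cleanup
print(n_frames, rate)
```

In the Space, the generated files live in the system temp directory, which is why app.py passes `tempfile.gettempdir()` in `allowed_paths`.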
core/vlm.py
ADDED
@@ -0,0 +1,103 @@
"""Vision-language: describes / answers questions about an image, in PT.

Pluggable backend via IRIS_VLM_MODEL:
  - Qwen/Qwen2.5-VL-3B-Instruct (default)
  - openbmb/MiniCPM-V-4.6 -> OpenBMB track ($2.5k); model.chat() API
The VLM IS the generator of the text that goes to the voice: no separate narrator LLM.
"""
import os

_model = None
_aux = None  # processor (qwen) or tokenizer (minicpm)
MODEL_ID = os.environ.get("IRIS_VLM_MODEL", "Qwen/Qwen2.5-VL-3B-Instruct")

SYSTEM_PT = (
    "Você é os olhos de uma pessoa cega. RESPONDA OBRIGATORIAMENTE EM PORTUGUÊS "
    "DO BRASIL, em no máximo duas frases curtas, dizendo só o que é relevante e "
    "útil sobre a cena. Não comece com 'a imagem mostra'. Se houver texto "
    "importante (rótulo, placa, remédio), leia-o exatamente como está."
)

DOWNSAMPLE = os.environ.get("IRIS_DOWNSAMPLE", "4x")  # 4x = fine detail (OCR); 16x = fast


def _family() -> str:
    return "minicpm" if "minicpm" in MODEL_ID.lower() else "qwen"


def _load():
    global _model, _aux
    if _model is None:
        import torch
        if _family() == "minicpm":
            from transformers import AutoModelForImageTextToText, AutoProcessor
            _model = AutoModelForImageTextToText.from_pretrained(
                MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16,
                low_cpu_mem_usage=True, device_map="cuda:0",
            ).eval()
            _aux = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
        else:
            from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
            _model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                MODEL_ID, torch_dtype=torch.float16, device_map="cuda:0",
                low_cpu_mem_usage=True,
            ).eval()
            _aux = AutoProcessor.from_pretrained(MODEL_ID)
    return _model, _aux


def _to_pil(image):
    from PIL import Image
    if isinstance(image, str):
        image = Image.open(image)
    elif not isinstance(image, Image.Image):
        image = Image.fromarray(image)  # numpy frame from the webcam
    image = image.convert("RGB")
    image.thumbnail((1024, 1024))  # fewer vision tokens -> faster; keeps OCR usable
    return image


from .gpu import gpu


@gpu(duration=60)
def describe(image, question: str = "") -> str:
    image = _to_pil(image)
    # Default question in Portuguese: "What is in front of me?"
    user = (question or "").strip() or "O que há à minha frente?"
    model, aux = _load()

    if _family() == "minicpm":
        import torch
        messages = [{"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": f"{SYSTEM_PT}\n\nPergunta: {user}"},
        ]}]
        inputs = aux.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True,
            return_dict=True, return_tensors="pt",
            downsample_mode=DOWNSAMPLE, max_slice_nums=36,
        ).to(model.device)
        with torch.no_grad():
            generated = model.generate(**inputs, downsample_mode=DOWNSAMPLE,
                                       max_new_tokens=96, do_sample=False)
        trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated)]
        return aux.batch_decode(trimmed, skip_special_tokens=True)[0].strip()

    # Qwen2.5-VL
    import torch
    from qwen_vl_utils import process_vision_info
    messages = [
        {"role": "system", "content": SYSTEM_PT},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user},
        ]},
    ]
    text = aux.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = aux(text=[text], images=image_inputs, videos=video_inputs,
                 padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=96, do_sample=False)
    trimmed = generated[:, inputs.input_ids.shape[1]:]
    return aux.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
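Both branches of `describe` in core/vlm.py trim the prompt off the generated sequence before decoding, since `generate()` returns the prompt tokens followed by the new ones. The idea can be shown with plain lists instead of tensors. This is a toy sketch: `trim_prompt` and the token ids are made up for illustration.

```python
from typing import List


def trim_prompt(input_ids: List[List[int]], generated: List[List[int]]) -> List[List[int]]:
    """Drop the prompt prefix from each generated sequence, keeping only new tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated)]


prompt = [[101, 7, 8, 9]]           # toy prompt token ids (batch of one)
full = [[101, 7, 8, 9, 42, 43, 2]]  # prompt followed by three newly generated ids
new_tokens = trim_prompt(prompt, full)
print(new_tokens)
```

Without this step the decoded text would begin with the system prompt and the question, and the TTS stage would read them aloud before the answer.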
frontend/app.js
ADDED
@@ -0,0 +1,139 @@
import { Client, handle_file } from "https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js";

const $ = (id) => document.getElementById(id);
const cam = $("cam"), canvas = $("canvas"), statusEl = $("status"),
      answerEl = $("answer"), hintEl = $("hint"), langBtn = $("lang"),
      player = $("player"), stage = $("stage");

// ---- i18n (EN default + PT) ----
let lang = "en";
const T = {
  en: { idle: "Iris", listening: "Listening…", thinking: "Thinking…",
        hint: "Tap to describe · hold to ask", camErr: "Camera blocked — allow access",
        err: "Something went wrong" },
  pt: { idle: "Iris", listening: "Ouvindo…", thinking: "Pensando…",
        hint: "Toque para descrever · segure para perguntar", camErr: "Câmera bloqueada — permita o acesso",
        err: "Algo deu errado" },
};
const t = (k) => T[lang][k];

function setState(s, msg) {
  document.body.dataset.state = s || "";
  if (msg !== undefined) statusEl.textContent = msg;
}

langBtn.onclick = (e) => {
  e.stopPropagation();
  lang = lang === "en" ? "pt" : "en";
  langBtn.textContent = lang.toUpperCase();
  hintEl.textContent = t("hint");
  document.documentElement.lang = lang;
  if (!busy) statusEl.textContent = t("idle");
};

// ---- live camera (auto) ----
async function startCamera() {
  const tries = [
    { video: { facingMode: { ideal: "environment" } }, audio: false },
    { video: true, audio: false },
  ];
  for (const c of tries) {
    try { cam.srcObject = await navigator.mediaDevices.getUserMedia(c); return; }
    catch (e) { /* try the next one */ }
  }
  setState("", t("camErr"));
}

function grabFrame() {
  const w = cam.videoWidth, h = cam.videoHeight;
  if (!w || !h) return Promise.resolve(null);
  canvas.width = w; canvas.height = h;
  canvas.getContext("2d").drawImage(cam, 0, 0, w, h);
  return new Promise((res) => canvas.toBlob(res, "image/jpeg", 0.85));
}

// ---- microphone (records while held) ----
let micStream = null, recorder = null, chunks = [];
async function startRec() {
  try { if (!micStream) micStream = await navigator.mediaDevices.getUserMedia({ audio: true }); }
  catch (e) { return false; }
  chunks = [];
  recorder = new MediaRecorder(micStream);
  recorder.ondataavailable = (e) => { if (e.data.size) chunks.push(e.data); };
  recorder.start();
  return true;
}
function stopRec() {
  return new Promise((res) => {
    if (!recorder || recorder.state === "inactive") return res(null);
    recorder.onstop = () => res(chunks.length ? new Blob(chunks, { type: recorder.mimeType || "audio/webm" }) : null);
    recorder.stop();
  });
}

// ---- backend ----
let client = null;
let busy = false;

async function send(frame, audio) {
  if (busy) return;
  busy = true;
  answerEl.textContent = "";
  setState("thinking", t("thinking"));
  try {
    const payload = { image: handle_file(frame) };
    if (audio) payload.audio = handle_file(audio);
    const result = await client.predict("/describe", payload);
    const out = Array.isArray(result.data) ? result.data[0] : result.data;
    answerEl.textContent = (out && out.answer) || "";
    setState("speaking", "");
    const url = out && out.audio && (out.audio.url || out.audio.path);
    if (url) { player.src = url; await player.play().catch(() => {}); }
    else { resetSoon(); }
  } catch (e) {
    console.error(e);
    setState("", t("err"));
    resetSoon();
  } finally {
    busy = false;
  }
}
function resetSoon() { setTimeout(() => { if (!busy) setState("", t("idle")); }, 600); }
player.addEventListener("ended", () => { if (!busy) setState("", t("idle")); });

// ---- interaction: tap = describe · hold = ask ----
let holding = false, recording = false, holdTimer = null;
const HOLD_MS = 350;

stage.addEventListener("pointerdown", () => {
  if (busy) return;
  holding = true;
  stage.classList.add("armed");
  holdTimer = setTimeout(async () => {
    recording = await startRec();
    if (recording) setState("listening", t("listening"));
  }, HOLD_MS);
});

async function endPress() {
  if (!holding) return;
  holding = false;
  stage.classList.remove("armed");
  clearTimeout(holdTimer);
  const frame = await grabFrame();
  if (!frame) { setState("", t("camErr")); return; }
  if (recording) { const audio = await stopRec(); recording = false; send(frame, audio); }
  else { send(frame, null); }  // quick tap = describe only
}
stage.addEventListener("pointerup", endPress);
stage.addEventListener("pointercancel", endPress);

// ---- boot ----
(async () => {
  langBtn.textContent = lang.toUpperCase();
  hintEl.textContent = t("hint");
  statusEl.textContent = t("idle");
  await startCamera();
  try { client = await Client.connect(window.location.origin); }
  catch (e) { console.error(e); setState("", t("err")); }
})();
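The tap-versus-hold rule in frontend/app.js reduces to a threshold check on press duration plus microphone availability. The following Python model of that rule is a simplification, not a port of the frontend: `classify_press` is a hypothetical function, and it ignores the time the browser needs to start the recorder, so a press released just after the threshold may still behave like a tap in the real UI.

```python
HOLD_MS = 350  # same threshold as frontend/app.js


def classify_press(duration_ms: float, mic_available: bool = True) -> str:
    """Return "ask" for a hold that could start recording, otherwise "describe"."""
    held_long_enough = duration_ms >= HOLD_MS
    if held_long_enough and mic_available:
        return "ask"       # frame + recorded question are sent
    return "describe"      # frame only


for ms in (120, 900):
    print(ms, classify_press(ms))
```

When the microphone is blocked, a long press falls back to a plain description, which matches the frontend sending the frame with no audio.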
frontend/index.html
ADDED
@@ -0,0 +1,29 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no, viewport-fit=cover" />
  <title>Iris</title>
  <link rel="stylesheet" href="/static/style.css" />
</head>
<body>
  <!-- live camera (background) -->
  <video id="cam" autoplay playsinline muted></video>
  <canvas id="canvas" hidden></canvas>

  <!-- single target: the whole screen is the button -->
  <main id="stage" role="application" aria-label="Iris">
    <div id="halo"></div>
    <p id="status" aria-live="assertive">Iris</p>
    <p id="answer" aria-live="polite"></p>
    <p id="hint">Toque para descrever · segure para perguntar</p>
  </main>

  <!-- language -->
  <button id="lang" aria-label="Idioma">PT</button>

  <audio id="player" playsinline></audio>

  <script type="module" src="/static/app.js"></script>
</body>
</html>
frontend/style.css
ADDED
@@ -0,0 +1,108 @@
:root {
  --bg: #05060a;
  --fg: #ffffff;
  --accent: #ff7a18;
  --accent-2: #00d4ff;
  --ok: #36d399;
}

* { box-sizing: border-box; margin: 0; padding: 0; }

html, body {
  height: 100%;
  background: var(--bg);
  color: var(--fg);
  font-family: system-ui, -apple-system, "Segoe UI", Roboto, sans-serif;
  overflow: hidden;
  -webkit-tap-highlight-color: transparent;
  user-select: none;
}

/* live camera as the background, darkened for text contrast */
#cam {
  position: fixed;
  inset: 0;
  width: 100%;
  height: 100%;
  object-fit: cover;
  filter: brightness(0.45) saturate(1.05);
  z-index: 0;
}

/* the whole screen is the button */
#stage {
  position: fixed;
  inset: 0;
  z-index: 1;
  display: flex;
  flex-direction: column;
  align-items: center;
  justify-content: center;
  text-align: center;
  padding: 6vmin;
  gap: 3vmin;
  cursor: pointer;
}

/* central halo that reacts to the state */
#halo {
  position: absolute;
  width: 64vmin;
  height: 64vmin;
  border-radius: 50%;
  background: radial-gradient(circle, rgba(255,122,24,0.18), transparent 62%);
  transition: transform .25s ease, background .25s ease;
  pointer-events: none;
}
body[data-state="listening"] #halo { background: radial-gradient(circle, rgba(0,212,255,0.30), transparent 60%); transform: scale(1.12); animation: pulse 1.1s ease-in-out infinite; }
body[data-state="thinking"] #halo { background: radial-gradient(circle, rgba(255,122,24,0.28), transparent 60%); animation: spin 1.4s linear infinite; }
body[data-state="speaking"] #halo { background: radial-gradient(circle, rgba(54,211,153,0.28), transparent 60%); transform: scale(1.06); }

@keyframes pulse { 0%,100%{transform:scale(1.10)} 50%{transform:scale(1.22)} }
@keyframes spin { to { transform: rotate(360deg) } }

#status {
  position: relative;
  z-index: 2;
  font-size: clamp(22px, 6vmin, 46px);
  font-weight: 800;
  letter-spacing: .01em;
  text-shadow: 0 2px 18px rgba(0,0,0,.8);
}

#answer {
  position: relative;
  z-index: 2;
  font-size: clamp(20px, 4.6vmin, 34px);
  font-weight: 600;
  line-height: 1.3;
  max-width: 92vw;
  text-shadow: 0 2px 18px rgba(0,0,0,.9);
  min-height: 1.2em;
}

#hint {
  position: fixed;
  bottom: max(5vmin, env(safe-area-inset-bottom, 0));
  z-index: 2;
  font-size: clamp(14px, 3vmin, 20px);
  opacity: .8;
  text-shadow: 0 2px 12px rgba(0,0,0,.9);
}

#lang {
  position: fixed;
  top: max(4vmin, env(safe-area-inset-top, 0));
  right: 5vmin;
  z-index: 3;
  background: rgba(255,255,255,.12);
  color: var(--fg);
  border: 2px solid rgba(255,255,255,.35);
  border-radius: 999px;
  padding: 10px 18px;
  font-size: 18px;
  font-weight: 700;
  cursor: pointer;
}

#stage.armed { background: rgba(0,212,255,.05); }
requirements.txt
ADDED
@@ -0,0 +1,13 @@
gradio>=6
spaces
torch
torchvision
transformers>=4.49
accelerate
qwen-vl-utils
faster-whisper
piper-tts
pillow
soundfile
numpy
huggingface_hub