# Avatar Interface - WebRTC Streaming with VP9

## Overview

Real-time avatar system that uses WebRTC for low-latency video streaming.
The backend does all the video mixing (idle + lip-sync) and sends a single unified stream to the frontend.

**WebRTC framework:** [aiortc](https://github.com/aiortc/aiortc)

---

## Architecture

```
                             FRONTEND (Browser)
                   ┌─────────────────────────────────────┐
                   │                                     │
                   │          <video autoplay>           │
                   │                                     │
                   │       Only renders the stream       │
                   │         WebRTC (VP9 + Opus)         │
                   │                                     │
                   └──────────────────▲──────────────────┘
                                      │
                                      │ WebRTC
                                      │ (VP9 video + Opus audio)
                                      │
══════════════════════════════════════╧══════════════════════════════════════

                          BACKEND (Python + aiortc)
┌───────────────────────────────────────────────────────────────────────────┐
│                                                                           │
│                          INTERFACE SERVER (8080)                          │
│                                                                           │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────────────┐  │
│  │  idle.mp4   │────►│    MIXER    │────►│        WebRTC Tracks        │  │
│  │  (frames)   │     │             │     │                             │  │
│  └─────────────┘     │  Switches   │     │  AvatarVideoTrack (VP9)  ───┼──► WebRTC
│                      │ idle/speak  │     │  AvatarAudioTrack (Opus)    │  │
│  ┌─────────────┐     │             │     │                             │  │
│  │   Wav2Lip   │────►│             │     │  25 fps, low latency        │  │
│  │  (frames)   │     └─────────────┘     └─────────────────────────────┘  │
│  └──────┬──────┘                                                          │
│         │                                                                 │
│  ┌──────▼──────┐                                                          │
│  │    Audio    │                                                          │
│  │   Orpheus   │                                                          │
│  └─────────────┘                                                          │
│                                                                           │
└─────────────────────────────────────┬─────────────────────────────────────┘
                                      │
                                      │ WebSocket
                                      ▼
                   ┌─────────────────────────────────────┐
                   │        WAV2LIP SERVER (8082)        │
                   │                                     │
                   │      Generates lip-sync frames      │
                   │    Calls Orpheus TTS internally     │
                   │                                     │
                   └─────────────────────────────────────┘
```

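The MIXER box in the diagram can be illustrated with a minimal sketch. It assumes frames are opaque objects and that the video track pulls one frame per tick (1/25 s). `FrameMixer`, `push_speech`, and `next_frame` are hypothetical names, not taken from `server.py`, and resuming the idle loop where it stopped (rather than restarting it) is also an assumption.

```python
from collections import deque


class FrameMixer:
    """Pick the next frame: queued lip-sync frames first, otherwise the idle loop."""

    def __init__(self, idle_frames):
        if not idle_frames:
            raise ValueError("idle_frames must not be empty")
        self._idle = list(idle_frames)
        self._idle_pos = 0
        self._speech = deque()

    def push_speech(self, frames):
        """Queue lip-sync frames (hypothetically, as received from the Wav2Lip server)."""
        self._speech.extend(frames)

    @property
    def speaking(self):
        """True while lip-sync frames are still waiting to be sent."""
        return bool(self._speech)

    def next_frame(self):
        """Return the frame for the current tick."""
        if self._speech:
            return self._speech.popleft()
        frame = self._idle[self._idle_pos]
        # Wrap around so the idle video plays in a loop.
        self._idle_pos = (self._idle_pos + 1) % len(self._idle)
        return frame


# Two idle ticks, then a two-frame utterance, then back to idle automatically.
mixer = FrameMixer(["idle0", "idle1", "idle2"])
timeline = [mixer.next_frame() for _ in range(2)]
mixer.push_speech(["lip0", "lip1"])
timeline += [mixer.next_frame() for _ in range(4)]
print(timeline)
```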
---

## How It Works

### 1. WebRTC Connection

```
Client                             Server
  │                                   │
  │  POST /offer (SDP offer)          │
  │ ────────────────────────────────► │
  │                                   │  Creates RTCPeerConnection
  │                                   │  Creates VideoTrack + AudioTrack
  │                                   │
  │  SDP answer + session_id          │
  │ ◄──────────────────────────────── │
  │                                   │
  │  WebRTC connected                 │
  │ ◄───────────────────────────────► │  Video stream starts
  │                                   │  (idle frames in a loop)
```

### 2. Speech Generation

```
Client                             Server                         Wav2Lip
  │                                   │                              │
  │  POST /generate                   │                              │
  │  {text, voice, session_id}        │                              │
  │ ────────────────────────────────► │                              │
  │                                   │                              │
  │                                   │  WS: generate                │
  │                                   │ ───────────────────────────► │
  │                                   │                              │
  │                                   │  frames + audio              │
  │                                   │ ◄─────────────────────────── │
  │                                   │                              │
  │  Stream switches to lip-sync      │                              │
  │ ◄───────────────────────────────► │                              │
  │  (synchronized video + audio)     │                              │
  │                                   │                              │
  │  Returns to idle automatically    │                              │
  │ ◄───────────────────────────────► │                              │
```

---

## API Endpoints

### POST /offer
Starts the WebRTC connection (signaling).

**Request:**
```json
{
"sdp": "v=0\r\no=- ...",
"type": "offer"
}
```

**Response:**
```json
{
"sdp": "v=0\r\no=- ...",
"type": "answer",
"session_id": "session-uuid"
}
```

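As a rough illustration of the `/offer` exchange, the sketch below serializes the request body and checks the response fields documented above. The helper names `build_offer_request` and `parse_offer_response` are hypothetical, and the strictness of the validation is an assumption; the real server may tolerate or return more fields.

```python
import json


def build_offer_request(sdp: str) -> bytes:
    """Serialize the JSON body for POST /offer."""
    return json.dumps({"sdp": sdp, "type": "offer"}).encode("utf-8")


def parse_offer_response(body: bytes) -> tuple[str, str]:
    """Return (answer_sdp, session_id) from the /offer response body."""
    data = json.loads(body)
    if data.get("type") != "answer":
        raise ValueError(f"expected an SDP answer, got type={data.get('type')!r}")
    for key in ("sdp", "session_id"):
        if not isinstance(data.get(key), str) or not data[key]:
            raise ValueError(f"missing or empty field: {key}")
    return data["sdp"], data["session_id"]


# The session_id returned here is what POST /generate expects later.
request_body = build_offer_request("v=0\r\n")
answer_sdp, session_id = parse_offer_response(
    b'{"sdp": "v=0\\r\\n", "type": "answer", "session_id": "session-uuid"}'
)
print(session_id)
```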

### POST /generate
Generates speech with lip-sync.

**Request:**
```json
{
"session_id": "session-uuid",
"text": "Hello, I am an avatar!",
"voice": "tara"
}
```

**Response:**
```json
{
"status": "generating"
}
```

### GET /health
Server status.

**Response:**
```json
{
"status": "ok",
"mode": "webrtc",
"connections": 2
}
```

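A health probe along these lines could be used after each change. This is only a sketch: `is_healthy` and `check_health` are hypothetical helpers, the `http://localhost:8080` base URL is an assumption derived from the documented port, and treating `"mode": "webrtc"` as part of the healthy state is a design choice rather than documented behavior.

```python
import json
import urllib.request


def is_healthy(body: bytes) -> bool:
    """True when the /health payload reports status "ok" in WebRTC mode."""
    try:
        data = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return False
    return (
        isinstance(data, dict)
        and data.get("status") == "ok"
        and data.get("mode") == "webrtc"
    )


def check_health(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Fetch GET /health and validate it (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return resp.status == 200 and is_healthy(resp.read())


# Offline check against the sample response shown above.
sample = b'{"status": "ok", "mode": "webrtc", "connections": 2}'
print(is_healthy(sample))
```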
---

## Codec Configuration

### Video (VP9)
- **Codec:** libvpx-vp9
- **FPS:** 25
- **Latency:** ~50-100 ms
- **Quality:** High (temporal compression)

### Audio (Opus)
- **Codec:** Opus
- **Sample rate:** 24000 Hz (resampled to 48000 Hz by WebRTC)
- **Channels:** Mono
- **Mode:** Low delay

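
The numbers above imply some simple timing arithmetic that is useful when keeping audio and video in step. The sketch below derives it; the 90 kHz video clock is an assumption (it is the usual RTP clock rate for video, not something this document states), and `lipsync_frame_count` is a hypothetical helper.

```python
import math
from fractions import Fraction

FPS = 25                     # video frame rate from this document
TTS_SAMPLE_RATE = 24_000     # Orpheus output rate from this document
WEBRTC_SAMPLE_RATE = 48_000  # rate after resampling, from this document
VIDEO_CLOCK_RATE = 90_000    # assumption: standard 90 kHz RTP video clock

frame_duration = Fraction(1, FPS)                     # seconds per video frame
tts_samples_per_frame = TTS_SAMPLE_RATE // FPS        # audio samples per frame at 24 kHz
webrtc_samples_per_frame = WEBRTC_SAMPLE_RATE // FPS  # same time span at 48 kHz
pts_step = VIDEO_CLOCK_RATE // FPS                    # timestamp ticks per frame


def lipsync_frame_count(num_samples: int, sample_rate: int = TTS_SAMPLE_RATE,
                        fps: int = FPS) -> int:
    """Number of video frames needed to cover num_samples of audio, rounded up."""
    return math.ceil(Fraction(num_samples * fps, sample_rate))


print(float(frame_duration) * 1000, "ms per frame")
print(tts_samples_per_frame, webrtc_samples_per_frame, pts_step)
print(lipsync_frame_count(2 * TTS_SAMPLE_RATE), "frames for 2 s of speech")
```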
---

## File Structure

```
/workspace/interface/
├── CLAUDE.md            # This documentation
├── server.py            # WebRTC server built on aiortc
├── index.html           # WebRTC frontend
├── idle.mp4             # Idle loop video
└── requirements.txt     # Python dependencies
```

---

## Dependencies

### Python
```
aiohttp>=3.9.0
aiortc>=1.6.0
opencv-python>=4.8.0
numpy>=1.24.0
av>=10.0.0
```

### System (Ubuntu)
```bash
apt install -y libavdevice-dev libavfilter-dev libopus-dev libvpx-dev libsrtp2-dev
```

---

## How to Run

```bash
cd /workspace/interface
pip install -r requirements.txt
python3 server.py
```

**Expected output** (the server prints its messages in Portuguese):
```
==================================================
Interface Server - WebRTC VP9 Streaming
==================================================
Porta: 8080
Idle Video: /workspace/interface/idle.mp4
Wav2Lip: ws://localhost:8082/ws
==================================================
Endpoints:
POST /offer - WebRTC signaling
POST /generate - Gerar fala
==================================================
Carregando idle frames...
[Idle] Carregados 1368 frames
==================================================
```

---

## Available Voices (Orpheus TTS)

| Voice | Gender |
|-------|--------|
| tara  | Female |
| leah  | Female |
| jess  | Female |
| leo   | Male   |
| dan   | Male   |

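
A client could check the voice name before calling `/generate`. The sketch below is an illustration only: `build_generate_request` is a hypothetical helper, and whether the server itself rejects unknown voices or empty text is not stated in this document.

```python
import json

# Voice names from the table above.
VOICES = ("tara", "leah", "jess", "leo", "dan")


def build_generate_request(session_id: str, text: str, voice: str) -> bytes:
    """Validate the inputs and serialize the JSON body for POST /generate."""
    if not session_id:
        raise ValueError("session_id is required (returned by POST /offer)")
    if not text.strip():
        raise ValueError("text must not be empty")
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {', '.join(VOICES)}")
    payload = {"session_id": session_id, "text": text, "voice": voice}
    return json.dumps(payload).encode("utf-8")


body = build_generate_request("session-uuid", "Hello, I am an avatar!", "tara")
print(body.decode("utf-8"))
```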
---

## Ports

| Service          | Port |
|------------------|------|
| Interface Server | 8080 |
| Orpheus TTS      | 8081 |
| Wav2Lip          | 8082 |

---

## Advantages of WebRTC over WebSocket+JPEG

| Aspect          | WebSocket+JPEG    | WebRTC+VP9                 |
|-----------------|-------------------|----------------------------|
| Bandwidth       | ~1.25 MB/s        | ~200 KB/s (about 6x less)  |
| Latency         | ~50 ms            | ~50-100 ms                 |
| Browser CPU     | High (JS decode)  | Low (GPU decode)           |
| Audio/Video     | Separate          | Synchronized               |
| Quality         | Good              | Excellent                  |

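
The bandwidth row can be broken down per frame with a few lines of arithmetic. This sketch assumes decimal units (1 MB = 10^6 bytes) and that both pipelines run at the 25 fps stated for the VP9 stream; the frame rate of the JPEG pipeline is not given in this document.

```python
FPS = 25                        # assumption: same frame rate for both pipelines
JPEG_BYTES_PER_SEC = 1.25e6     # ~1.25 MB/s from the table
VP9_BYTES_PER_SEC = 200e3       # ~200 KB/s from the table

jpeg_kb_per_frame = JPEG_BYTES_PER_SEC / FPS / 1000   # average JPEG frame size
vp9_kb_per_frame = VP9_BYTES_PER_SEC / FPS / 1000     # average VP9 frame size
savings_ratio = JPEG_BYTES_PER_SEC / VP9_BYTES_PER_SEC
jpeg_mbit = JPEG_BYTES_PER_SEC * 8 / 1e6              # link rate in Mbit/s
vp9_mbit = VP9_BYTES_PER_SEC * 8 / 1e6

print(f"JPEG: {jpeg_kb_per_frame:.0f} KB/frame, {jpeg_mbit:.1f} Mbit/s")
print(f"VP9:  {vp9_kb_per_frame:.0f} KB/frame, {vp9_mbit:.1f} Mbit/s")
print(f"Ratio: {savings_ratio:.2f}x")
```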
---

## Simplified Frontend

The frontend only:
1. Sends the SDP offer
2. Receives the SDP answer
3. Renders `<video>`
4. Sends text to /generate

All mixing, encoding, and timing logic lives in the backend.

---

## Important Fixes

### Audio Pop/Click at the Start (WAV Header)

**Problem:** Audio from Orpheus TTS arrives with a 44-byte WAV header. When the frontend interprets those bytes as PCM data, it produces a pop/click at the start of playback.

**Solution:** Detect the WAV header (the bytes `RIFF`) and skip the first 44 bytes before processing the PCM:

```javascript
// `bytes` is assumed to be a Uint8Array holding the received audio payload.
// If it starts with a WAV header ("RIFF"), skip the 44 header bytes.
let pcmOffset = 0;
if (bytes.length > 44 &&
    bytes[0] === 0x52 && bytes[1] === 0x49 &&
    bytes[2] === 0x46 && bytes[3] === 0x46) { // "RIFF"
  console.log('WAV header detected, skipping 44 bytes');
  pcmOffset = 44;
}

// Build the view relative to bytes.byteOffset with an explicit sample count,
// so an odd trailing byte does not throw a RangeError. The start offset must
// still be 2-byte aligned.
const sampleCount = Math.floor((bytes.byteLength - pcmOffset) / 2);
const pcmData = new Int16Array(bytes.buffer, bytes.byteOffset + pcmOffset, sampleCount);
```

**Additionally:** Apply a gentle fade-in/fade-out to smooth out any remaining discontinuity:
- Fade-in: 50 ms with a quadratic curve
- Fade-out: 30 ms linear

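
For reference, the header skip and the two fades can be sketched on the Python side as well. This is an illustration under stated assumptions, not the code used in `index.html` or `server.py`: it assumes 16-bit little-endian mono PCM at 24000 Hz and the canonical 44-byte header layout (real WAV files can carry extra chunks that make the header longer). `strip_wav_header` and `apply_fades` are hypothetical names.

```python
import struct
import sys
from array import array

SAMPLE_RATE = 24_000  # Orpheus output rate stated in this document


def strip_wav_header(data: bytes) -> bytes:
    """Drop a leading 44-byte RIFF/WAVE header if one is present."""
    if len(data) > 44 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return data[44:]
    return data


def apply_fades(pcm: bytes, fade_in_ms: int = 50, fade_out_ms: int = 30,
                rate: int = SAMPLE_RATE) -> bytes:
    """Quadratic fade-in and linear fade-out on 16-bit little-endian mono PCM."""
    samples = array("h")
    samples.frombytes(pcm[: len(pcm) // 2 * 2])  # ignore an odd trailing byte
    if sys.byteorder == "big":
        samples.byteswap()                       # WAV PCM is little-endian
    n = len(samples)
    n_in = min(n, rate * fade_in_ms // 1000)
    n_out = min(n, rate * fade_out_ms // 1000)
    for i in range(n_in):
        samples[i] = int(samples[i] * (i / n_in) ** 2)   # gain rises as t^2
    for i in range(n_out):
        idx = n - 1 - i
        samples[idx] = int(samples[idx] * (i / n_out))   # gain falls linearly to 0
    if sys.byteorder == "big":
        samples.byteswap()
    return samples.tobytes()


# Demo: a 100 ms constant signal behind a 44-byte header (made-up test data).
header = b"RIFF" + struct.pack("<I", 36 + 4800) + b"WAVE" + bytes(32)
pcm_in = struct.pack("<2400h", *([10_000] * 2400))
out = struct.unpack("<2400h", apply_fades(strip_wav_header(header + pcm_in)))
print(out[0], out[600], out[1200], out[-1])
```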
---

## Important Rules

1. **DO NOT MODIFY FILES OUTSIDE `/workspace/interface`**
2. **The backend does all the mixing** - the frontend only renders
3. **Keep the ports fixed** - 8080, 8081, 8082
4. **Test after every change** - check /health