# Avatar Interface - WebRTC Streaming with VP9

## Overview

Real-time avatar system that uses WebRTC for low-latency video streaming. The backend does all the video compositing (idle + lip-sync) and sends a single unified stream to the frontend.

WebRTC framework: aiortc

## Architecture

                                 FRONTEND (Browser)
                       ┌─────────────────────────────────────┐
                       │                                     │
                       │          <video autoplay>           │
                       │                                     │
                       │       Only renders the stream       │
                       │         WebRTC (VP9 + Opus)         │
                       │                                     │
                       └──────────────────▲──────────────────┘
                                          │
                                          │ WebRTC
                                          │ (VP9 video + Opus audio)
                                          │
    ══════════════════════════════════════╧══════════════════════════════════════
                              BACKEND (Python + aiortc)
    ┌───────────────────────────────────────────────────────────────────────────┐
    │                                                                           │
    │                          INTERFACE SERVER (8080)                          │
    │                                                                           │
    │  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────────────┐  │
    │  │  idle.mp4   │────►│    MIXER    │────►│        WebRTC Tracks        │  │
    │  │  (frames)   │     │             │     │                             │  │
    │  └─────────────┘     │  Switches   │     │  AvatarVideoTrack (VP9)   ──┼──┼──► WebRTC
    │                      │ idle/speak  │     │  AvatarAudioTrack (Opus)    │  │
    │  ┌─────────────┐     │             │     │                             │  │
    │  │   Wav2Lip   │────►│             │     │  25 fps, low latency        │  │
    │  │  (frames)   │     └─────────────┘     └─────────────────────────────┘  │
    │  └─────────────┘                                                          │
    │         │                                                                 │
    │  ┌──────▼──────┐                                                          │
    │  │    Audio    │                                                          │
    │  │   Orpheus   │                                                          │
    │  └─────────────┘                                                          │
    │                                                                           │
    └───────────────────────────────────────────────────────────────────────────┘
                                          │
                                          │ WebSocket
                                          ▼
                       ┌─────────────────────────────────────┐
                       │        WAV2LIP SERVER (8082)        │
                       │                                     │
                       │      Generates lip-sync frames      │
                       │    Calls Orpheus TTS internally     │
                       │                                     │
                       └─────────────────────────────────────┘

## How It Works

### 1. WebRTC Connection

    Client                              Server
      │                                   │
      │  POST /offer (SDP offer)          │
      │ ────────────────────────────────► │
      │                                   │  Creates RTCPeerConnection
      │                                   │  Creates VideoTrack + AudioTrack
      │                                   │
      │  SDP answer + session_id          │
      │ ◄──────────────────────────────── │
      │                                   │
      │  WebRTC connected                 │
      │ ◄───────────────────────────────► │  Video stream starts
      │                                   │  (idle frames in a loop)

### 2. Speech Generation

    Client                              Server                      Wav2Lip
      │                                   │                           │
      │  POST /generate                   │                           │
      │  {text, voice, session_id}        │                           │
      │ ────────────────────────────────► │                           │
      │                                   │                           │
      │                                   │  WS: generate             │
      │                                   │ ────────────────────────► │
      │                                   │                           │
      │                                   │  frames + audio           │
      │                                   │ ◄──────────────────────── │
      │                                   │                           │
      │  Stream switches to lip-sync      │                           │
      │ ◄───────────────────────────────► │                           │
      │  (synchronized video + audio)     │                           │
      │                                   │                           │
      │  Returns to idle automatically    │                           │
      │ ◄───────────────────────────────► │                           │

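
The switching behavior in the flow above (loop idle frames, play lip-sync frames while they are available, then fall back to idle) can be sketched as follows. This is a minimal illustration, not the code in `server.py`: the `FrameMixer` class and its method names are hypothetical, and frames are represented by plain strings instead of decoded images.

```python
from collections import deque
from itertools import cycle


class FrameMixer:
    """Hypothetical mixer: loops idle frames, but plays queued speech frames first."""

    def __init__(self, idle_frames):
        if not idle_frames:
            raise ValueError("at least one idle frame is required")
        self._idle = cycle(idle_frames)  # endless idle loop
        self._speech = deque()           # lip-sync frames waiting to be shown

    def push_speech(self, frames):
        """Queue lip-sync frames (for example, those received from Wav2Lip)."""
        self._speech.extend(frames)

    @property
    def speaking(self):
        return bool(self._speech)

    def next_frame(self):
        """Return the frame for the next video tick (every 40 ms at 25 fps)."""
        if self._speech:
            return self._speech.popleft()
        return next(self._idle)


mixer = FrameMixer(["idle0", "idle1"])
shown = [mixer.next_frame() for _ in range(2)]   # idle loop
mixer.push_speech(["talk0", "talk1"])
shown += [mixer.next_frame() for _ in range(3)]  # speech, then idle again
print(shown)  # ['idle0', 'idle1', 'talk0', 'talk1', 'idle0']
```

Because the idle iterator only advances while no speech is queued, the idle loop resumes where it left off once the speech frames run out.
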
## API Endpoints

### POST /offer

Starts the WebRTC connection (signaling).

Request:

    {
      "sdp": "v=0\r\no=- ...",
      "type": "offer"
    }

Response:

    {
      "sdp": "v=0\r\no=- ...",
      "type": "answer",
      "session_id": "uuid-da-sessao"
    }

### POST /generate

Generates speech with lip-sync.

Request:

    {
      "session_id": "uuid-da-sessao",
      "text": "Hello, I am an avatar!",
      "voice": "tara"
    }

Response:

    {
      "status": "generating"
    }

### GET /health

Server status.

Response:

    {
      "status": "ok",
      "mode": "webrtc",
      "connections": 2
    }

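
For quick manual testing, the `/generate` and `/health` calls can be built with the Python standard library. The sketch below is an illustration under assumptions: the helper names are hypothetical, the server is assumed to listen on `localhost:8080`, and it is assumed to accept a JSON body with a `Content-Type: application/json` header. The `session_id` must come from a real `/offer` response, so the value used here is only the placeholder from the examples.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # Interface Server port from this document


def build_generate_request(session_id, text, voice="tara", base_url=BASE_URL):
    """Build (but do not send) the POST /generate request."""
    body = json.dumps({"session_id": session_id, "text": text, "voice": voice})
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed to be accepted
        method="POST",
    )


def build_health_request(base_url=BASE_URL):
    """Build the GET /health request."""
    return urllib.request.Request(f"{base_url}/health", method="GET")


def send(request, timeout=5.0):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_generate_request("uuid-da-sessao", "Hello, I am an avatar!")
print(req.get_method(), req.full_url)
```

With the server running, `send(build_health_request())` should return the `/health` payload, and `send(req)` should return the `/generate` status.
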
## Codec Configuration

### Video (VP9)

- Codec: libvpx-vp9
- FPS: 25
- Latency: ~50-100 ms
- Quality: high (temporal compression)

### Audio (Opus)

- Codec: Opus
- Sample rate: 24000 Hz (resampled to 48000 Hz by WebRTC)
- Channels: mono
- Mode: low delay
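
The timing relationships implied by these settings can be worked out with a few lines of arithmetic. The sketch below uses the 25 fps, 24000 Hz and 48000 Hz figures from this section; the 90 kHz video clock is the standard RTP video clock rate, and the 20 ms Opus packet duration is an assumption about the audio track, not something this document states.

```python
from fractions import Fraction

VIDEO_FPS = 25             # from the video settings
TTS_SAMPLE_RATE = 24_000   # Orpheus output rate
OPUS_SAMPLE_RATE = 48_000  # rate after resampling
VIDEO_CLOCK_RATE = 90_000  # standard RTP clock rate for video
OPUS_FRAME_MS = 20         # assumption: 20 ms Opus packets

frame_duration = Fraction(1, VIDEO_FPS)                        # seconds per video frame
video_pts_step = int(VIDEO_CLOCK_RATE * frame_duration)        # RTP ticks per video frame
tts_samples_per_frame = int(TTS_SAMPLE_RATE * frame_duration)  # 24 kHz samples per video frame
opus_samples_per_packet = OPUS_SAMPLE_RATE * OPUS_FRAME_MS // 1000
packets_per_video_frame = Fraction(1000, VIDEO_FPS) / OPUS_FRAME_MS

print(float(frame_duration * 1000))  # 40.0 (ms per video frame)
print(video_pts_step)                # 3600
print(tts_samples_per_frame)         # 960
print(opus_samples_per_packet)       # 960
print(packets_per_video_frame)       # 2
```

Under these assumptions, each video frame spans exactly two audio packets, which keeps lip-sync bookkeeping simple: every lip-sync frame pairs with 960 samples of 24 kHz TTS audio.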

## File Structure

    /workspace/interface/
    ├── CLAUDE.md          # This documentation
    ├── server.py          # WebRTC server built on aiortc
    ├── index.html         # WebRTC frontend
    ├── idle.mp4           # Idle loop video
    └── requirements.txt   # Python dependencies

## Dependencies

### Python

    aiohttp>=3.9.0
    aiortc>=1.6.0
    opencv-python>=4.8.0
    numpy>=1.24.0
    av>=10.0.0

### System (Ubuntu)

    apt install -y libavdevice-dev libavfilter-dev libopus-dev libvpx-dev libsrtp2-dev

## How to Run

    cd /workspace/interface
    pip install -r requirements.txt
    python3 server.py

Expected output:

```
Interface Server - WebRTC VP9 Streaming
Porta: 8080
Idle Video: /workspace/interface/idle.mp4
Wav2Lip: ws://localhost:8082/ws
Endpoints:
  POST /offer - WebRTC signaling
  POST /generate - Gerar fala
Carregando idle frames...
[Idle] Carregados 1368 frames
```

---
## Available Voices (Orpheus TTS)

| Voice | Gender |
|-------|--------|
| tara  | Female |
| leah  | Female |
| jess  | Female |
| leo   | Male   |
| dan   | Male   |

---
## Ports

| Service          | Port |
|------------------|------|
| Interface Server | 8080 |
| Orpheus TTS      | 8081 |
| Wav2Lip          | 8082 |

---
## Advantages of WebRTC over WebSocket+JPEG

| Aspect      | WebSocket+JPEG   | WebRTC+VP9                  |
|-------------|------------------|-----------------------------|
| Bandwidth   | ~1.25 MB/s       | ~200 KB/s (about 6x less)   |
| Latency     | ~50 ms           | ~50-100 ms                  |
| Browser CPU | High (JS decode) | Low (GPU decode)            |
| Audio/Video | Separate         | Synchronized                |
| Quality     | Good             | Excellent                   |
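
The bandwidth figures in the table can be broken down per frame with a short calculation. This is only a back-of-the-envelope sketch: it assumes the WebSocket+JPEG stream also ran at 25 fps, which this document does not state, and it treats MB and KB as decimal units.

```python
FPS = 25                        # assumption: the JPEG stream also ran at 25 fps
JPEG_BYTES_PER_SEC = 1_250_000  # ~1.25 MB/s from the table
VP9_BYTES_PER_SEC = 200_000     # ~200 KB/s from the table

jpeg_bytes_per_frame = JPEG_BYTES_PER_SEC / FPS
vp9_bytes_per_frame = VP9_BYTES_PER_SEC / FPS
ratio = JPEG_BYTES_PER_SEC / VP9_BYTES_PER_SEC
vp9_kbit_per_sec = VP9_BYTES_PER_SEC * 8 / 1000

print(jpeg_bytes_per_frame)  # 50000.0 bytes per JPEG frame
print(vp9_bytes_per_frame)   # 8000.0 bytes per VP9 frame on average
print(ratio)                 # 6.25
print(vp9_kbit_per_sec)      # 1600.0 kbit/s
```
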
---
## Simplified Frontend

The frontend only:

1. Sends the SDP offer
2. Receives the SDP answer
3. Renders the `<video>` element
4. Sends text to `/generate`

All compositing, encoding, and timing logic lives in the backend.

---
## Important Fixes

### Audio Pop/Click at Start (WAV Header)

**Problem:** Audio from Orpheus TTS arrives with a 44-byte WAV header. When the frontend interprets those bytes as PCM data, it produces a pop/click at the start of playback.

**Solution:** Detect the WAV header (the bytes `RIFF`) and skip the first 44 bytes before processing the PCM:

```javascript
// Check for a WAV header ("RIFF") and skip it if present
let pcmOffset = 0;
if (bytes.length > 44 &&
    bytes[0] === 0x52 && bytes[1] === 0x49 &&
    bytes[2] === 0x46 && bytes[3] === 0x46) {  // "RIFF"
  console.log('WAV header detected, skipping 44 bytes');
  pcmOffset = 44;
}
// Respect the view's own byteOffset and drop a trailing odd byte, if any
const sampleCount = Math.floor((bytes.length - pcmOffset) / 2);
const pcmData = new Int16Array(bytes.buffer, bytes.byteOffset + pcmOffset, sampleCount);
```

**Additionally:** Apply a gentle fade-in/fade-out to smooth out any remaining discontinuity:

- Fade-in: 50 ms, quadratic curve
- Fade-out: 30 ms, linear
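
The fade-in/fade-out described above can be sketched in Python for 16-bit mono PCM. This is an illustrative version, not the project's implementation: the `apply_fades` function is hypothetical, and the 24000 Hz default assumes the fades are applied to the Orpheus output before resampling.

```python
from array import array


def apply_fades(pcm, sample_rate=24_000, fade_in_ms=50, fade_out_ms=30):
    """Return a copy of 16-bit mono PCM with a quadratic fade-in and a linear fade-out."""
    out = array("h", pcm)
    n_in = min(len(out), sample_rate * fade_in_ms // 1000)
    n_out = min(len(out), sample_rate * fade_out_ms // 1000)
    for i in range(n_in):
        gain = (i / n_in) ** 2       # quadratic ramp from 0 towards 1
        out[i] = int(out[i] * gain)
    for i in range(n_out):
        gain = i / n_out             # linear ramp, counted back from the last sample
        idx = len(out) - 1 - i
        out[idx] = int(out[idx] * gain)
    return out


tone = [10_000] * 2_400              # 100 ms of constant signal at 24 kHz
faded = apply_fades(tone)
print(faded[0], faded[600], faded[1200], faded[-1])  # 0 2500 10000 0
```

At 24 kHz the fade-in covers the first 1200 samples and the fade-out the last 720, so the middle of the clip is left untouched.
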

---

## Important Rules

- DO NOT MODIFY FILES OUTSIDE `/workspace/interface`
- The backend does all the compositing; the frontend only renders
- Keep the ports fixed: 8080, 8081, 8082
- Test after every change: check `/health`