# Avatar Interface - WebRTC Streaming with VP9
## Overview
Real-time avatar system that uses WebRTC for low-latency video streaming.
The backend performs all video mixing (idle + lip-sync) and sends a single unified stream to the frontend.
**WebRTC framework:** [aiortc](https://github.com/aiortc/aiortc)
---
## Architecture
```
                              FRONTEND (Browser)
                    ┌─────────────────────────────────────┐
                    │                                     │
                    │          <video autoplay>           │
                    │                                     │
                    │       Only renders the stream       │
                    │         WebRTC (VP9 + Opus)         │
                    │                                     │
                    └──────────────────▲──────────────────┘
                                       │
                                       │ WebRTC
                                       │ (VP9 video + Opus audio)
                                       │
═══════════════════════════════════════╧═════════════════════════════════════
                          BACKEND (Python + aiortc)
┌───────────────────────────────────────────────────────────────────────────┐
│                                                                           │
│                          INTERFACE SERVER (8080)                          │
│                                                                           │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────────────┐  │
│  │  idle.mp4   │────►│    MIXER    │────►│        WebRTC Tracks        │  │
│  │  (frames)   │     │             │     │                             │  │
│  └─────────────┘     │  Switches   │     │  AvatarVideoTrack (VP9)     │──┼──► WebRTC
│                      │ idle/speak  │     │  AvatarAudioTrack (Opus)    │  │
│  ┌─────────────┐     │             │     │                             │  │
│  │   Wav2Lip   │────►│             │     │  25fps, low latency         │  │
│  │  (frames)   │     └─────────────┘     └─────────────────────────────┘  │
│  └─────────────┘                                                          │
│         │                                                                 │
│   ┌─────▼───────┐                                                         │
│   │    Audio    │                                                         │
│   │   Orpheus   │                                                         │
│   └─────────────┘                                                         │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
                                       │
                                       │ WebSocket
                                       ▼
                    ┌─────────────────────────────────────┐
                    │        WAV2LIP SERVER (8082)        │
                    │                                     │
                    │      Generates lip-sync frames      │
                    │    Calls Orpheus TTS internally     │
                    │                                     │
                    └─────────────────────────────────────┘
```
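The MIXER box switches between idle frames and lip-sync frames. A minimal sketch of that switching logic follows. The `FrameMixer` class and its method names are hypothetical (they are not taken from `server.py`), and plain strings stand in for decoded video frames.

```python
from collections import deque
from typing import Deque, List


class FrameMixer:
    """Hypothetical mixer: serves queued lip-sync frames, otherwise loops idle frames."""

    def __init__(self, idle_frames: List[object]) -> None:
        if not idle_frames:
            raise ValueError("idle_frames must not be empty")
        self._idle = idle_frames
        self._idle_index = 0
        self._speech: Deque[object] = deque()

    def push_speech_frames(self, frames: List[object]) -> None:
        """Queue lip-sync frames (for example, received from the Wav2Lip server)."""
        self._speech.extend(frames)

    @property
    def speaking(self) -> bool:
        """True while lip-sync frames are still waiting to be sent."""
        return bool(self._speech)

    def next_frame(self) -> object:
        """Return the frame for the next tick (one call per 1/25 s at 25 fps)."""
        if self._speech:
            return self._speech.popleft()
        frame = self._idle[self._idle_index]
        self._idle_index = (self._idle_index + 1) % len(self._idle)
        return frame


mixer = FrameMixer(["idle0", "idle1", "idle2"])
sequence = [mixer.next_frame() for _ in range(2)]
mixer.push_speech_frames(["speak0", "speak1"])
sequence += [mixer.next_frame() for _ in range(4)]
print(sequence)
# ['idle0', 'idle1', 'speak0', 'speak1', 'idle2', 'idle0']
```

Because the idle index is only advanced while idle frames are being served, the idle loop resumes where it left off once the speech queue drains.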
---
## How It Works
### 1. WebRTC Connection
```
Client                              Server
   │                                   │
   │  POST /offer (SDP offer)          │
   │ ─────────────────────────────►    │
   │                                   │  Creates RTCPeerConnection
   │                                   │  Creates VideoTrack + AudioTrack
   │                                   │
   │  SDP answer + session_id          │
   │ ◄─────────────────────────────    │
   │                                   │
   │  WebRTC connected                 │
   │ ◄════════════════════════════►    │  Video stream starts
   │                                   │  (idle frames in a loop)
```
### 2. Speech Generation
```
Client                              Server                         Wav2Lip
   │                                   │                              │
   │  POST /generate                   │                              │
   │  {text, voice, session_id}        │                              │
   │ ─────────────────────────────►    │                              │
   │                                   │                              │
   │                                   │  WS: generate                │
   │                                   │ ────────────────────────►    │
   │                                   │                              │
   │                                   │  frames + audio              │
   │                                   │ ◄────────────────────────    │
   │                                   │                              │
   │  Stream switches to lip-sync      │                              │
   │ ◄════════════════════════════►    │                              │
   │  (video + audio in sync)          │                              │
   │                                   │                              │
   │  Returns to idle automatically    │                              │
   │ ◄════════════════════════════►    │                              │
```
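One way to keep lip-sync video and audio aligned is to hand out, with every frame, exactly the audio that covers the same 40 ms. The sketch below illustrates that idea under the rates stated in this document (25 fps, 24000 Hz mono). `pair_frames_with_audio` is a hypothetical helper; whether `server.py` pairs data this way is an assumption.

```python
from typing import Iterator, List, Sequence, Tuple


def pair_frames_with_audio(
    frames: Sequence[object],
    samples: Sequence[int],
    fps: int = 25,
    sample_rate: int = 24000,
) -> Iterator[Tuple[object, List[int]]]:
    """Yield (frame, audio_chunk) pairs that cover the same time span."""
    per_frame = sample_rate // fps  # 24000 // 25 = 960 samples per video frame
    for i, frame in enumerate(frames):
        chunk = list(samples[i * per_frame:(i + 1) * per_frame])
        # Pad with silence if the audio ends before the video does.
        chunk.extend([0] * (per_frame - len(chunk)))
        yield frame, chunk


# Three frames (120 ms of video) but only 2000 samples (~83 ms) of audio.
pairs = list(pair_frames_with_audio(["f0", "f1", "f2"], list(range(2000))))
print([(frame, len(chunk)) for frame, chunk in pairs])
```

Padding the final chunk keeps every pair the same length, so the audio track never runs short while lip-sync frames are still being shown.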
---
## API Endpoints
### POST /offer
Starts the WebRTC connection (signaling).
**Request:**
```json
{
"sdp": "v=0\r\no=- ...",
"type": "offer"
}
```
**Response:**
```json
{
"sdp": "v=0\r\no=- ...",
"type": "answer",
"session_id": "session-uuid"
}
```
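The request and response shapes above can be checked and produced with a few lines of standard-library Python. This is only a sketch of the JSON handling: `parse_offer` and `build_answer` are hypothetical helpers, and the SDP answer itself would come from the WebRTC stack (aiortc in this project), which is not shown here.

```python
import json
import uuid
from typing import Dict


def parse_offer(body: bytes) -> Dict[str, str]:
    """Validate a /offer request body and return its sdp/type fields."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")
    if data.get("type") != "offer":
        raise ValueError('"type" must be "offer"')
    sdp = data.get("sdp")
    # Every SDP description starts with the protocol version line "v=0".
    if not isinstance(sdp, str) or not sdp.startswith("v=0"):
        raise ValueError('"sdp" must be an SDP string starting with "v=0"')
    return {"sdp": sdp, "type": "offer"}


def build_answer(answer_sdp: str) -> Dict[str, str]:
    """Build the /offer response body with a freshly generated session id."""
    return {
        "sdp": answer_sdp,
        "type": "answer",
        "session_id": str(uuid.uuid4()),
    }


offer = parse_offer(b'{"sdp": "v=0\\r\\no=- 0 0 IN IP4 127.0.0.1", "type": "offer"}')
answer = build_answer("v=0\r\no=- 1 1 IN IP4 127.0.0.1")
print(offer["type"], answer["type"], answer["session_id"])
```

Generating the session id with `uuid.uuid4()` is an assumption suggested by the `session_id` placeholder in the response example.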
### POST /generate
Generates speech with lip-sync.
**Request:**
```json
{
"session_id": "session-uuid",
"text": "Hello, I am an avatar!",
"voice": "tara"
}
```
**Response:**
```json
{
"status": "generating"
}
```
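A server would normally validate this body before forwarding the text to Wav2Lip. The sketch below shows one possible check. `parse_generate` is a hypothetical helper, the voice names come from the voices table in this document, and falling back to `tara` when `voice` is missing is an assumption.

```python
import json
from typing import Dict, Set

# Voice names listed in this document's Orpheus TTS table.
KNOWN_VOICES = frozenset({"tara", "leah", "jess", "leo", "dan"})


def parse_generate(body: bytes, active_sessions: Set[str]) -> Dict[str, str]:
    """Validate a /generate request body against the known sessions and voices."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")
    session_id = data.get("session_id")
    if not isinstance(session_id, str) or session_id not in active_sessions:
        raise ValueError("unknown session_id")
    text = data.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError('"text" must be a non-empty string')
    voice = data.get("voice", "tara")  # assumed default, not confirmed by the source
    if voice not in KNOWN_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    return {"session_id": session_id, "text": text.strip(), "voice": voice}


sessions = {"abc-123"}
request = parse_generate(
    b'{"session_id": "abc-123", "text": "Hello, I am an avatar!", "voice": "tara"}',
    sessions,
)
print(request)
```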
### GET /health
Server status.
**Response:**
```json
{
"status": "ok",
"mode": "webrtc",
"connections": 2
}
```
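A small script can poll this endpoint after each change. The sketch below uses only the standard library. `is_healthy` and `check_health` are hypothetical names, the base URL assumes the server runs locally on port 8080, and treating `"status": "ok"` together with `"mode": "webrtc"` as "healthy" is an assumption based on the example response.

```python
import json
import urllib.request


def is_healthy(body: bytes) -> bool:
    """Return True if a /health response body reports a working WebRTC server."""
    try:
        data = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return False
    return (
        isinstance(data, dict)
        and data.get("status") == "ok"
        and data.get("mode") == "webrtc"
    )


def check_health(base_url: str = "http://localhost:8080", timeout: float = 2.0) -> bool:
    """Fetch /health and evaluate it; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return is_healthy(resp.read())
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False


sample = b'{"status": "ok", "mode": "webrtc", "connections": 2}'
print(is_healthy(sample))
```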
---
## Codec Configuration
### Video (VP9)
- **Codec:** libvpx-vp9
- **FPS:** 25
- **Latency:** ~50-100 ms
- **Quality:** High (temporal compression)
### Audio (Opus)
- **Codec:** Opus
- **Sample rate:** 24000 Hz (resampled to 48000 Hz by WebRTC)
- **Channels:** Mono
- **Mode:** Low delay
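The numbers above imply some simple timing arithmetic, worked through in the sketch below. It assumes the standard 90 kHz RTP clock for video and a 20 ms audio packet duration (a common WebRTC default, not stated in this document). `upsample_2x` is a deliberately naive linear-interpolation stand-in for illustration only; a real resampler also applies filtering.

```python
from typing import List

VIDEO_CLOCK_RATE = 90_000  # RTP clock rate for video, in ticks per second
FPS = 25
SOURCE_RATE = 24_000       # Orpheus output rate stated above
WEBRTC_RATE = 48_000       # rate after resampling
AUDIO_PTIME_MS = 20        # assumed audio packet duration


def video_pts(frame_index: int) -> int:
    """Presentation timestamp of a frame, in 90 kHz clock ticks."""
    return frame_index * VIDEO_CLOCK_RATE // FPS


def samples_per_packet(rate: int, ptime_ms: int = AUDIO_PTIME_MS) -> int:
    """Number of mono samples in one audio packet at the given rate."""
    return rate * ptime_ms // 1000


def upsample_2x(samples: List[int]) -> List[int]:
    """Double the sample rate by inserting the midpoint between neighbours."""
    out: List[int] = []
    for i, current in enumerate(samples):
        following = samples[i + 1] if i + 1 < len(samples) else current
        out.append(current)
        out.append((current + following) // 2)
    return out


print(video_pts(1))                     # 3600 ticks per frame at 25 fps
print(samples_per_packet(SOURCE_RATE))  # 480 samples per 20 ms at 24 kHz
print(samples_per_packet(WEBRTC_RATE))  # 960 samples per 20 ms at 48 kHz
print(upsample_2x([0, 10, 20]))         # [0, 5, 10, 15, 20, 20]
```

At 25 fps each video frame lasts 40 ms, so under these assumptions one frame spans exactly two 20 ms audio packets.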
---
## File Structure
```
/workspace/interface/
├── CLAUDE.md          # This documentation
├── server.py          # WebRTC server using aiortc
├── index.html         # WebRTC frontend
├── idle.mp4           # Idle loop video
└── requirements.txt   # Python dependencies
```
---
## Dependencies
### Python
```
aiohttp>=3.9.0
aiortc>=1.6.0
opencv-python>=4.8.0
numpy>=1.24.0
av>=10.0.0
```
### System (Ubuntu)
```bash
apt install -y libavdevice-dev libavfilter-dev libopus-dev libvpx-dev libsrtp2-dev
```
---
## How to Run
```bash
cd /workspace/interface
pip install -r requirements.txt
python3 server.py
```
**Expected output:**
```
==================================================
Interface Server - WebRTC VP9 Streaming
==================================================
Porta: 8080
Idle Video: /workspace/interface/idle.mp4
Wav2Lip: ws://localhost:8082/ws
==================================================
Endpoints:
POST /offer - WebRTC signaling
POST /generate - Gerar fala
==================================================
Carregando idle frames...
[Idle] Carregados 1368 frames
==================================================
```
---
## Available Voices (Orpheus TTS)
| Voice | Gender |
|-------|--------|
| tara  | Female |
| leah  | Female |
| jess  | Female |
| leo   | Male   |
| dan   | Male   |
---
## Ports
| Service          | Port |
|------------------|------|
| Interface Server | 8080 |
| Orpheus TTS      | 8081 |
| Wav2Lip          | 8082 |
---
## Advantages of WebRTC over WebSocket+JPEG
| Aspect      | WebSocket+JPEG   | WebRTC+VP9                 |
|-------------|------------------|----------------------------|
| Bandwidth   | ~1.25 MB/s       | ~200 KB/s (about 6x less)  |
| Latency     | ~50 ms           | ~50-100 ms                 |
| Browser CPU | High (JS decode) | Low (GPU decode)           |
| Audio/Video | Separate         | Synchronized               |
| Quality     | Good             | Excellent                  |
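
The bandwidth row can be sanity-checked with a little arithmetic. The sketch below takes the table's approximate figures at face value and assumes decimal units (1 MB = 10^6 bytes, 1 KB = 10^3 bytes), which the document does not specify.

```python
FPS = 25
JPEG_BYTES_PER_S = 1.25e6  # ~1.25 MB/s for WebSocket+JPEG (decimal units assumed)
VP9_BYTES_PER_S = 200e3    # ~200 KB/s for WebRTC+VP9 (decimal units assumed)

# Average payload per frame implied by each figure.
avg_jpeg_frame_kb = JPEG_BYTES_PER_S / FPS / 1e3
avg_vp9_frame_kb = VP9_BYTES_PER_S / FPS / 1e3

# Reduction factor and the VP9 rate expressed as a bitrate.
ratio = JPEG_BYTES_PER_S / VP9_BYTES_PER_S
vp9_kbit_per_s = VP9_BYTES_PER_S * 8 / 1e3

print(f"JPEG: {avg_jpeg_frame_kb:.0f} KB per frame")
print(f"VP9:  {avg_vp9_frame_kb:.0f} KB per frame on average")
print(f"Reduction: {ratio:.2f}x, VP9 bitrate: {vp9_kbit_per_s:.0f} kbit/s")
```

Under these assumptions each JPEG frame averages about 50 KB, while VP9 averages about 8 KB per frame because most frames only encode changes from the previous one.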
---
## Simplified Frontend
The frontend only:
1. Sends the SDP offer
2. Receives the SDP answer
3. Renders `<video>`
4. Sends text to /generate
All mixing, encoding, and timing logic lives in the backend.
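The two HTTP calls the frontend makes can also be reproduced from a script, which is handy for testing without a browser. The sketch below only builds the requests and does not send them. `build_post` is a hypothetical helper, the base URL assumes a local server on port 8080, and the SDP string is a placeholder: a real client needs a WebRTC stack to produce the offer.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8080"  # assumed local Interface Server


def build_post(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Steps 1-2: signaling. The SDP value here is a placeholder, not a valid offer.
offer_req = build_post("/offer", {"sdp": "v=0\r\n...", "type": "offer"})

# Step 4: ask the avatar to speak, using the session_id returned by /offer.
generate_req = build_post(
    "/generate",
    {"session_id": "session-uuid", "text": "Hello, I am an avatar!", "voice": "tara"},
)

print(offer_req.get_method(), offer_req.full_url)
print(generate_req.get_method(), generate_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it.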
---
## Important Fixes
### Audio Pop/Click at Start (WAV Header)
**Problem:** Audio from Orpheus TTS arrives with a 44-byte WAV header. When the frontend interprets those bytes as PCM data, the result is a noise/click at the start of playback.
**Solution:** Detect the WAV header (`RIFF` magic bytes) and skip the first 44 bytes before processing the PCM:
```javascript
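// Check for a WAV header ("RIFF" magic bytes) and skip it if present
let pcmOffset = 0;
if (bytes.length > 44 &&
    bytes[0] === 0x52 && bytes[1] === 0x49 &&
    bytes[2] === 0x46 && bytes[3] === 0x46) { // "RIFF"
  console.log('WAV header detected, skipping 44 bytes');
  pcmOffset = 44;
}
// Copy the payload so the view starts at byte 0 of its own buffer, and drop a
// trailing odd byte: Int16Array requires a byte length that is a multiple of 2.
const pcmLength = (bytes.length - pcmOffset) & ~1;
const pcmData = new Int16Array(bytes.slice(pcmOffset, pcmOffset + pcmLength).buffer);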
```
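The fixed 44-byte skip matches the canonical WAV layout, but WAV files may carry extra chunks before the audio data. If the header is stripped on the backend instead, the `data` chunk can be located by walking the RIFF chunks. The Python sketch below shows that alternative; `find_pcm_offset` is a hypothetical helper and is not part of the fix described above.

```python
import struct


def find_pcm_offset(data: bytes) -> int:
    """Return the byte offset where PCM samples start (0 if there is no WAV header)."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        return 0  # raw PCM, nothing to skip
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (chunk_size,) = struct.unpack_from("<I", data, pos + 4)
        if chunk_id == b"data":
            return pos + 8
        # Chunks are word-aligned: an odd-sized chunk is followed by one pad byte.
        pos += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("WAV header found but no data chunk")


# Canonical header: RIFF(12) + fmt chunk(8 + 16) + data chunk header(8) = 44 bytes.
fmt_body = struct.pack("<HHIIHH", 1, 1, 24000, 48000, 2, 16)
pcm = b"\x01\x00" * 4
wav = (
    b"RIFF" + struct.pack("<I", 36 + len(pcm)) + b"WAVE"
    + b"fmt " + struct.pack("<I", len(fmt_body)) + fmt_body
    + b"data" + struct.pack("<I", len(pcm)) + pcm
)
print(find_pcm_offset(wav))  # 44
print(find_pcm_offset(pcm))  # 0
```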
**Additionally:** Apply a gentle fade-in/fade-out to smooth any remaining discontinuity:
- Fade-in: 50 ms with a quadratic curve
- Fade-out: 30 ms, linear
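
The two fades can be sketched as follows. This is a Python illustration of the envelope math only, assuming 16-bit mono samples at 24000 Hz. `apply_fades` is a hypothetical helper and the actual frontend implementation may differ.

```python
from typing import List


def apply_fades(
    samples: List[int],
    sample_rate: int = 24000,
    fade_in_ms: int = 50,
    fade_out_ms: int = 30,
) -> List[int]:
    """Return a copy with a quadratic fade-in and a linear fade-out applied."""
    out = list(samples)
    n = len(out)
    fade_in = min(n, sample_rate * fade_in_ms // 1000)    # 1200 samples at 24 kHz
    fade_out = min(n, sample_rate * fade_out_ms // 1000)  # 720 samples at 24 kHz
    for i in range(fade_in):
        gain = (i / fade_in) ** 2  # quadratic curve: starts at 0, rises slowly
        out[i] = int(out[i] * gain)
    for i in range(fade_out):
        gain = i / fade_out        # linear ramp: 0 at the very last sample
        index = n - 1 - i
        out[index] = int(out[index] * gain)
    return out


one_second = [1000] * 24000
faded = apply_fades(one_second)
print(faded[0], faded[600], faded[1200], faded[-361], faded[-1])
# 0 250 1000 500 0
```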
---
## Important Rules
1. **DO NOT MODIFY FILES OUTSIDE `/workspace/interface`**
2. **The backend does all the mixing** - the frontend only renders
3. **Keep the ports fixed** - 8080, 8081, 8082
4. **Test after every change** - check /health