Avatar Interface - WebRTC Streaming with VP9

Overview

Real-time avatar system that uses WebRTC for low-latency video streaming. The backend does all the video compositing (idle + lip-sync) and sends a single unified stream to the frontend.

WebRTC framework: aiortc


Architecture

                              FRONTEND (Browser)
                    ┌─────────────────────────────────────┐
                    │                                     │
                    │   <video autoplay>                  │
                    │                                     │
                    │   Only renders the WebRTC           │
                    │   stream (VP9 + Opus)               │
                    │                                     │
                    └──────────────────▲──────────────────┘
                                       │
                                       │ WebRTC
                                       │ (VP9 video + Opus audio)
                                       │
═══════════════════════════════════════╧════════════════════════════════════

                              BACKEND (Python + aiortc)
┌───────────────────────────────────────────────────────────────────────────┐
│                                                                           │
│                        INTERFACE SERVER (8080)                            │
│                                                                           │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────────────┐  │
│  │  idle.mp4   │────►│   MIXER     │────►│   WebRTC Tracks             │  │
│  │  (frames)   │     │             │     │                             │  │
│  └─────────────┘     │  Switches   │     │   AvatarVideoTrack (VP9)    │──┼──► WebRTC
│                      │  idle/speak │     │   AvatarAudioTrack (Opus)   │  │
│  ┌─────────────┐     │             │     │                             │  │
│  │  Wav2Lip    │────►│             │     │   25 fps, low latency       │  │
│  │  (frames)   │     └─────────────┘     └─────────────────────────────┘  │
│  └─────────────┘                                                          │
│        │                                                                  │
│  ┌─────▼───────┐                                                          │
│  │   Audio     │                                                          │
│  │   Orpheus   │                                                          │
│  └─────────────┘                                                          │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
                                       │
                                       │ WebSocket
                                       ▼
                    ┌─────────────────────────────────────┐
                    │          WAV2LIP SERVER (8082)      │
                    │                                     │
                    │  Generates lip-sync frames          │
                    │  Calls Orpheus TTS internally       │
                    │                                     │
                    └─────────────────────────────────────┘

How It Works

1. WebRTC Connection

Client                           Server
   │                                │
   │  POST /offer (SDP offer)       │
   │ ─────────────────────────────► │
   │                                │ Creates RTCPeerConnection
   │                                │ Creates VideoTrack + AudioTrack
   │                                │
   │  SDP answer + session_id       │
   │ ◄───────────────────────────── │
   │                                │
   │  WebRTC connected              │
   │ ◄════════════════════════════► │ Video stream starts
   │                                │ (idle frames in a loop)

2. Speech Generation

Client                           Server                      Wav2Lip
   │                                │                           │
   │  POST /generate                │                           │
   │  {text, voice, session_id}     │                           │
   │ ─────────────────────────────► │                           │
   │                                │                           │
   │                                │  WS: generate             │
   │                                │ ────────────────────────► │
   │                                │                           │
   │                                │  frames + audio           │
   │                                │ ◄──────────────────────── │
   │                                │                           │
   │  Stream switches to lip-sync   │                           │
   │ ◄════════════════════════════► │                           │
   │  (synchronized video + audio)  │                           │
   │                                │                           │
   │  Returns to idle automatically │                           │
   │ ◄════════════════════════════► │                           │
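The idle/speak switching in the flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FrameMixer`, `push_speech`, and `next_frame` are hypothetical names, frames are stood in by strings, and the choice to pause the idle position while speech plays (rather than keep advancing it) is a guess, not something stated here.

```python
from collections import deque
from typing import Deque, Generic, List, TypeVar

Frame = TypeVar("Frame")


class FrameMixer(Generic[Frame]):
    """Hypothetical mixer: serves queued speech frames first, idle frames otherwise."""

    def __init__(self, idle_frames: List[Frame]) -> None:
        if not idle_frames:
            raise ValueError("at least one idle frame is required")
        self._idle = idle_frames
        self._idle_pos = 0
        self._speech: Deque[Frame] = deque()

    def push_speech(self, frames: List[Frame]) -> None:
        """Queue lip-sync frames (for example, those received from Wav2Lip)."""
        self._speech.extend(frames)

    @property
    def speaking(self) -> bool:
        return bool(self._speech)

    def next_frame(self) -> Frame:
        """Return the frame for the next tick (one tick = 1/25 s at 25 fps)."""
        if self._speech:
            return self._speech.popleft()
        frame = self._idle[self._idle_pos]
        # Wrap around so the idle video plays in a loop.
        self._idle_pos = (self._idle_pos + 1) % len(self._idle)
        return frame


mixer = FrameMixer(["idle0", "idle1", "idle2"])
sequence = [mixer.next_frame() for _ in range(2)]
mixer.push_speech(["talk0", "talk1"])
sequence += [mixer.next_frame() for _ in range(5)]
print(sequence)
```

Because the queue drains on its own, the return to idle needs no explicit signal: once the last speech frame is popped, the next tick falls through to the idle loop.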

API Endpoints

POST /offer

Starts the WebRTC connection (signaling).

Request:

{
  "sdp": "v=0\r\no=- ...",
  "type": "offer"
}

Response:

{
  "sdp": "v=0\r\no=- ...",
  "type": "answer",
  "session_id": "session-uuid"
}
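Since /offer returns a `session_id` that /generate later refers to, the server needs some way to map ids to live connections. A minimal sketch of one possible approach follows; `SessionRegistry` and its methods are hypothetical names, the in-memory dict is an assumption, and the real server.py may track sessions differently.

```python
import uuid
from typing import Any, Dict


class SessionRegistry:
    """Hypothetical in-memory map from session_id to per-connection state."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Any] = {}

    def create(self, connection: Any) -> str:
        """Register a new connection and return the id sent back from /offer."""
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = connection
        return session_id

    def get(self, session_id: str) -> Any:
        """Look up the connection for a /generate call; KeyError if unknown."""
        return self._sessions[session_id]

    def close(self, session_id: str) -> None:
        """Forget a session once its peer connection has closed."""
        self._sessions.pop(session_id, None)

    def health(self) -> Dict[str, Any]:
        """Build a payload shaped like the /health response documented here."""
        return {"status": "ok", "mode": "webrtc", "connections": len(self._sessions)}


registry = SessionRegistry()
first = registry.create(connection="peer-connection-1")   # placeholder objects
second = registry.create(connection="peer-connection-2")
print(registry.health())
```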

POST /generate

Generates speech with lip-sync.

Request:

{
  "session_id": "session-uuid",
  "text": "Hello, I am an avatar!",
  "voice": "tara"
}

Response:

{
  "status": "generating"
}
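As an illustration of the request shape, the sketch below builds a /generate call with the standard library. The helper name `build_generate_request` is hypothetical, the client-side voice check against the five documented voices is an added assumption (the server may or may not validate), and the JSON `Content-Type` header is assumed rather than documented.

```python
import json
import urllib.request

# Voice names taken from the voices table in this document.
KNOWN_VOICES = {"tara", "leah", "jess", "leo", "dan"}


def build_generate_request(base_url: str, session_id: str, text: str,
                           voice: str = "tara") -> urllib.request.Request:
    """Build (but do not send) a POST /generate request."""
    if voice not in KNOWN_VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if not text.strip():
        raise ValueError("text must not be empty")
    body = json.dumps({"session_id": session_id, "text": text, "voice": voice})
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not documented
        method="POST",
    )


# session_id would be the value returned by POST /offer.
req = build_generate_request("http://localhost:8080", "session-uuid",
                             "Hello, I am an avatar!")
# To send it against a running server: urllib.request.urlopen(req, timeout=10)
```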

GET /health

Server status.

Response:

{
  "status": "ok",
  "mode": "webrtc",
  "connections": 2
}

Codec Configuration

Video (VP9)

  • Codec: libvpx-vp9
  • FPS: 25
  • Latency: ~50-100 ms
  • Quality: High (temporal compression)

Audio (Opus)

  • Codec: Opus
  • Sample rate: 24000 Hz (resampled to 48000 Hz by WebRTC)
  • Channels: Mono
  • Mode: Low delay
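The numbers above imply a fixed amount of audio per video frame, which is what keeps the two tracks aligned. The sketch below works that out; the 16-bit sample width and the 20 ms packet duration are assumptions (neither is stated here), and the function names are hypothetical.

```python
VIDEO_FPS = 25
SOURCE_RATE_HZ = 24_000   # Orpheus output rate listed above
WEBRTC_RATE_HZ = 48_000   # rate after resampling listed above
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM
PACKET_MS = 20            # assumption: 20 ms Opus packets


def samples_per_video_frame(rate_hz: int, fps: int = VIDEO_FPS) -> int:
    """Audio samples that span exactly one video frame."""
    if rate_hz % fps:
        raise ValueError("rate is not a whole multiple of fps")
    return rate_hz // fps


def samples_per_packet(rate_hz: int, packet_ms: int = PACKET_MS) -> int:
    """Audio samples carried by one packet of the given duration."""
    return rate_hz * packet_ms // 1000


def split_pcm_by_video_frame(pcm: bytes, rate_hz: int = SOURCE_RATE_HZ) -> list:
    """Cut raw PCM into chunks that each cover one 40 ms video frame."""
    step = samples_per_video_frame(rate_hz) * BYTES_PER_SAMPLE
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]


print(samples_per_video_frame(SOURCE_RATE_HZ))   # 960 samples per frame at 24 kHz
print(samples_per_video_frame(WEBRTC_RATE_HZ))   # 1920 samples per frame at 48 kHz
print(samples_per_packet(WEBRTC_RATE_HZ))        # 960 samples per 20 ms packet
```

Under these assumptions one second of 24 kHz audio (48000 bytes) splits into 25 chunks of 1920 bytes, one per video frame.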

File Structure

/workspace/interface/
├── CLAUDE.md           # This documentation
├── server.py           # WebRTC server using aiortc
├── index.html          # WebRTC frontend
├── idle.mp4            # Idle loop video
└── requirements.txt    # Python dependencies

Dependencies

Python

aiohttp>=3.9.0
aiortc>=1.6.0
opencv-python>=4.8.0
numpy>=1.24.0
av>=10.0.0

System (Ubuntu)

apt install -y libavdevice-dev libavfilter-dev libopus-dev libvpx-dev libsrtp2-dev

How to Run

cd /workspace/interface
pip install -r requirements.txt
python3 server.py

Expected output:

```
Interface Server - WebRTC VP9 Streaming

Porta: 8080
Idle Video: /workspace/interface/idle.mp4
Wav2Lip: ws://localhost:8082/ws

Endpoints:
  POST /offer    - WebRTC signaling
  POST /generate - Gerar fala

Carregando idle frames...
[Idle] Carregados 1368 frames
```


---

## Available Voices (Orpheus TTS)

| Voice | Gender |
|-------|--------|
| tara  | Female |
| leah  | Female |
| jess  | Female |
| leo   | Male   |
| dan   | Male   |

---

## Ports

| Service          | Port |
|------------------|------|
| Interface Server | 8080 |
| Orpheus TTS      | 8081 |
| Wav2Lip          | 8082 |

---

## Advantages of WebRTC over WebSocket+JPEG

| Aspect      | WebSocket+JPEG   | WebRTC+VP9           |
|-------------|------------------|----------------------|
| Bandwidth   | ~1.25 MB/s       | ~200 KB/s (~6x less) |
| Latency     | ~50 ms           | ~50-100 ms           |
| Browser CPU | High (JS decode) | Low (GPU decode)     |
| Audio/Video | Separate         | Synchronized         |
| Quality     | Good             | Excellent            |
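To make the bandwidth row concrete, the arithmetic below converts the two approximate rates into per-frame sizes at 25 fps. It takes 1 MB = 1000 KB, which is an assumption about how the figures in the table were rounded.

```python
FPS = 25
JPEG_KB_PER_S = 1250   # ~1.25 MB/s from the table, taking 1 MB = 1000 KB
VP9_KB_PER_S = 200     # ~200 KB/s from the table

jpeg_kb_per_frame = JPEG_KB_PER_S / FPS   # average size of one JPEG frame
vp9_kb_per_frame = VP9_KB_PER_S / FPS     # average VP9 payload per frame
ratio = JPEG_KB_PER_S / VP9_KB_PER_S      # how many times less data VP9 needs
vp9_kbit_per_s = VP9_KB_PER_S * 8         # same rate expressed in kbit/s

print(jpeg_kb_per_frame, vp9_kb_per_frame, ratio, vp9_kbit_per_s)
```

With these inputs each JPEG frame averages 50 KB against 8 KB per frame for VP9, a ratio of 6.25, which the table rounds to 6x.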

---

## Simplified Frontend

The frontend only:
1. Sends the SDP offer
2. Receives the SDP answer
3. Renders `<video>`
4. Sends text to /generate

All compositing, encoding, and timing logic lives in the backend.

---

## Important Fixes

### Audio Pop/Click at Start (WAV Header)

**Problem:** The audio from Orpheus TTS arrives with a 44-byte WAV header. When the frontend interprets those bytes as PCM data, it produces a pop/click at the start of playback.

**Solution:** Detect the WAV header (the `RIFF` magic bytes) and skip the first 44 bytes before processing the PCM:

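```javascript
// Check for a WAV header ("RIFF" magic) and skip it if present
let pcmOffset = 0;
if (bytes.length > 44 &&
    bytes[0] === 0x52 && bytes[1] === 0x49 &&
    bytes[2] === 0x46 && bytes[3] === 0x46) {  // "RIFF"
    console.log('WAV header detected, skipping 44 bytes');
    pcmOffset = 44;
}

// Copy the payload so the Int16Array starts at an aligned offset, and
// drop a trailing odd byte, which would otherwise throw a RangeError.
const payload = bytes.slice(pcmOffset);
const pcmData = new Int16Array(payload.buffer, 0, payload.length >> 1);
```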
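The fixed 44-byte skip matches the canonical PCM WAV layout, but a WAV file may carry extra chunks (for example `LIST`) before the audio. As a hedged alternative, the Python sketch below walks the RIFF chunks and returns the offset of the `data` payload. The function name `find_pcm_offset` is hypothetical, and this is not the fix the project applies; it only illustrates a more general way to locate the PCM start.

```python
import struct


def find_pcm_offset(data: bytes) -> int:
    """Return the byte offset where PCM samples start; 0 if not RIFF/WAVE."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        return 0  # headerless PCM: nothing to skip
    pos = 12  # first chunk follows "RIFF" + size + "WAVE"
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"data":
            return pos + 8  # payload begins right after the chunk header
        # Skip this chunk; chunks are padded to an even number of bytes.
        pos += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("RIFF/WAVE header found but no 'data' chunk")


# Build a canonical 44-byte header around four bytes of PCM to try it out.
pcm = b"\x01\x00\x02\x00"
fmt_body = struct.pack("<HHIIHH", 1, 1, 24000, 48000, 2, 16)  # PCM, mono, 24 kHz, 16-bit
wav = (b"RIFF" + struct.pack("<I", 36 + len(pcm)) + b"WAVE"
       + b"fmt " + struct.pack("<I", len(fmt_body)) + fmt_body
       + b"data" + struct.pack("<I", len(pcm)) + pcm)
offset = find_pcm_offset(wav)
print(offset, wav[offset:] == pcm)
```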

**Additional:** Apply a gentle fade-in/fade-out to smooth out any remaining discontinuity:

  • Fade-in: 50 ms with a quadratic curve
  • Fade-out: 30 ms linear
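The two fades can be sketched as gain ramps over the first and last samples of a clip. This is an illustration in Python rather than the frontend's actual code: `apply_fades` is a hypothetical name, the 24 kHz rate is taken from the codec settings, and "quadratic" is read as gain = t² (an assumption; an ease-out curve would also fit the description).

```python
from array import array

SAMPLE_RATE_HZ = 24_000
FADE_IN_MS = 50
FADE_OUT_MS = 30


def apply_fades(pcm: array, rate_hz: int = SAMPLE_RATE_HZ) -> array:
    """Return a copy of 16-bit PCM with a quadratic fade-in and linear fade-out."""
    if pcm.typecode != "h":
        raise TypeError("expected array('h') of signed 16-bit samples")
    out = array("h", pcm)  # copy so the caller's buffer is left untouched
    n = len(out)
    fade_in = min(n, rate_hz * FADE_IN_MS // 1000)    # 1200 samples at 24 kHz
    fade_out = min(n, rate_hz * FADE_OUT_MS // 1000)  # 720 samples at 24 kHz
    for i in range(fade_in):
        gain = (i / fade_in) ** 2       # quadratic ramp from 0 towards 1
        out[i] = int(out[i] * gain)
    for i in range(fade_out):
        gain = i / fade_out             # linear ramp, counted from the last sample
        idx = n - 1 - i
        out[idx] = int(out[idx] * gain)
    return out


tone = array("h", [10000] * SAMPLE_RATE_HZ)  # one second of a constant level
faded = apply_fades(tone)
print(faded[0], faded[600], faded[12000], faded[-1])
```

Both ramps start at zero gain on the outermost sample, so the clip begins and ends at silence and the middle of the signal is left unchanged.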

---

## Important Rules

1. DO NOT MODIFY FILES OUTSIDE /workspace/interface
2. The backend does all the compositing - the frontend only renders
3. Keep the ports fixed - 8080, 8081, 8082
4. Test after every change - check /health