Fix audio pop/click by skipping WAV header

- Detect and skip 44-byte WAV header from Orpheus TTS audio
- The header was being interpreted as PCM data causing noise
- Add fade-in (50ms quadratic) and fade-out (30ms) for smoother audio
- Document fix in CLAUDE.md for future reference

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (2)

interface/CLAUDE.md +228 -323
interface/index_streaming.html +25 -1

interface/CLAUDE.md CHANGED

@@ -1,384 +1,243 @@
-# Avatar Interface - Video Streaming System with Lip Sync
-## MAIN ARCHITECTURE
-```
-┌─────────────┐      ┌─────────────────────────────────┐      ┌─────────────────┐
-│  Frontend   │◄────►│      Interface Server           │◄────►│  Orpheus TTS    │
-│  (Browser)  │  WS  │        (port 8080)              │  WS  │  (port 8081)    │
-└─────────────┘      │                                 │      │  audio chunks   │
-                     │  1. Receives text from frontend │      └─────────────────┘
-                     │  2. Connects Orpheus + Wav2Lip  │
-                     │     IN PARALLEL                 │      ┌─────────────────┐
-                     │  3. Receives chunks as they     │◄────►│  Wav2Lip        │
-                     │     arrive from both            │  WS  │  (port 8082)    │
-                     │  4. Builds: Orpheus audio +     │      │  JPEG frames    │
-                     │     Wav2Lip frames              │      │  (DO NOT MODIFY)│
-                     │  5. Sends chunk IMMEDIATELY     │      └─────────────────┘
-                     └─────────────────────────────────┘
-```
-## STREAMING FLOW
-```
-1. Frontend sends: {"action": "generate", "text": "Hello", "voice": "tara"}
-   │
-   ▼
-2. Interface connects IN PARALLEL:
-   ├── Orpheus WS (8081) → receives 24kHz PCM audio chunks
-   └── Wav2Lip WS (8082) → receives JPEG frames (lip sync with internal eSpeak)
-   │
-   ▼
-3. As data arrives, it accumulates in buffers:
-   - audio_buffer: PCM chunks from Orpheus
-   - frame_buffer: JPEG frames from Wav2Lip
-   │
-   ▼
-4. When 1 frame + its matching audio (~1920 bytes = 40ms) are available:
-   - Build binary chunk: [audio_orpheus + frame_wav2lip]
-   - Send to frontend IMMEDIATELY
-   │
-   ▼
-5. Frontend plays in real time (does not wait for everything to arrive)
-```
-## CRITICAL RULES
-1. **DO NOT MODIFY WAV2LIP** - It already generates lip sync with its internal eSpeak. Only use the JPEG frames.
-2. **AUDIO COMES FROM ORPHEUS** - Wav2Lip's robotic audio is DISCARDED. Final audio = Orpheus.
-3. **PARALLEL CONNECTIONS** - Orpheus and Wav2Lip must be called at the same time.
-4. **IMMEDIATE STREAMING** - Build and send chunks as data arrives; do not wait for everything.
-5. **SYNCHRONIZATION** - Compute audio_por_frame = total_audio / total_frames to align.
----
-## IMPORTANT RULES
-1. **DO NOT CHANGE FILES OUTSIDE `/workspace/interface`** - This project is isolated in the `/workspace/interface` directory. Do not modify, create, or delete files in other directories on the system.
-2. **Always read CLAUDE.md before making changes** - This file contains the project's architecture and rules. Consult it before modifying any code.
-3. **Keep the ports fixed** - TTS=8081, Wav2Lip=8082, Interface=8080. Never change these ports without explicit confirmation from the user.
-4. **Do not refactor without being asked** - Focus only on what was requested. Do not reorganize code, rename variables, or "improve" things that were not requested.
-5. **Test after every change** - After modifying code, verify that the server still starts and responds on `/health`.
-6. **Keep the file structure simple** - Only `server.py`, `index.html`, `CLAUDE.md`. Do not create new files without approval.
-7. **Small, descriptive commits** - When committing, make one commit per feature; do not bundle several changes together.
 ---
-## Progressive Streaming Architecture
-```
-┌─────────────────┐       ┌──────────────────────┐       ┌─────────────────┐
-│   Frontend      │◄─────►│   Interface Server   │◄─────►│   TTS (Orpheus) │
-│   (Browser)     │  WS   │   (port 8080)        │  WS   │   (port 8081)   │
-│                 │       │                      │       │   PCM 24kHz     │
-│   Renders       │       │   BUILDS CHUNKS:     │       │   (final audio) │
-│   chunks        │       │   Orpheus audio +    │       └─────────────────┘
-│   as they       │       │   Wav2Lip frames     │
-│   arrive        │       │                      │       ┌─────────────────┐
-└─────────────────┘       │   DISCARDS robotic   │◄─────►│   Wav2Lip       │
-                          │   audio!             │  WS   │   (port 8082)   │
-                          └──────────────────────┘       │   JPEG 25fps    │
-                                                         │   (frames only!)│
-                                                         └─────────────────┘
 ```
-## Progressive Streaming Flow
-```
-1. Frontend sends: {"action": "generate", "text": "Hello", "voice": "tara"}
-   │
-   ▼
-2. Interface connects to TTS (8081) and Wav2Lip (8082) IN PARALLEL
-   │
-   ├──────────────────────────────────────────────────────────┐
-   │                                                          │
-   ▼                                                          ▼
-3a. Orpheus TTS generates audio               3b. Wav2Lip generates frames
-    in chunks (~100ms each)                       based on internal audio
-    │                                             (robotic eSpeak - DISCARD)
-    │                                             │
-    ▼                                             ▼
-4. Interface Server accumulates data in buffers:
-   - audio_buffer: chunks from Orpheus
-   - frame_buffer: JPEG frames from Wav2Lip
-   │
-   ▼
-5. When enough data is available (~100ms):
-   a. Take audio from Orpheus (e.g. 4800 bytes = 100ms)
-   b. Take frames from Wav2Lip (e.g. 2-3 frames = 80-120ms)
-   c. Build chunk: [audio_orpheus + frames]
-   d. Send to frontend IMMEDIATELY
-   │
-   ▼
-6. Frontend receives the chunk and renders it:
-   - Decodes the audio and appends it to the playback buffer
-   - Decodes the JPEG frames and appends them to the display queue
-   - Starts playback as soon as ~200ms of data is available
-   │
-   ▼
-7. Repeat until "done" is received from both services
 ```
-## WebSocket Protocols
-### TTS Server (port 8081) - Orpheus
-**Connection:** `ws://localhost:8081/ws`
-**Request:**
 ```json
 {
-  "action": "synthesize",
-  "text": "Hello world",
-  "voice": "tara",
-  "stream": true
 }
 ```
-**Responses (streaming):**
 ```json
 {
-  "type": "audio_chunk",
-  "audio": "<base64 PCM>",
-  "chunk_index": 1,
-  "bytes": 4800,
-  "sample_rate": 24000,
-  "channels": 1,
-  "bits_per_sample": 16
 }
 ```
-```json
-{
-  "type": "done",
-  "total_chunks": 5,
-  "total_bytes": 24000
-}
-```
-**Audio format:** PCM 24kHz, 16-bit signed little-endian, mono
-**Voices:** tara, leo, leah, jess, dan, mia, zac, zoe
----
-### Wav2Lip Server (port 8082)
-**File:** `/home/marcosavatar/realtimeWav2lip/websocket_server.py`
-**IMPORTANT:** Wav2Lip uses internal audio (eSpeak) to generate lip sync.
-This robotic audio MUST BE DISCARDED - use ONLY the JPEG frames!
-**Send audio (chunk by chunk):**
 ```json
 {
-  "action": "process_audio",
-  "audio": "<base64 PCM>",
-  "sample_rate": 16000
 }
 ```
-**Response (frames) - USE ONLY THE FRAMES:**
-```json
-{"type": "frame", "frame": "<base64 JPEG>", "index": 0}
-{"type": "frame", "frame": "<base64 JPEG>", "index": 1}
-...
-```
-**Finish:**
-```json
-{"action": "end"}
-```
-**Final response:**
 ```json
 {
-  "type": "done",
-  "frames": 25,
-  "audio_duration_ms": 1000,
-  "total_time_ms": 500
 }
 ```
-**IMPORTANT:** Wav2Lip expects **16kHz** audio. The TTS sends **24kHz**. The server resamples automatically.
----
-### Interface Server (port 8080) - Frontend
-**Request from the frontend:**
-```json
-{"action": "generate", "text": "Hello world", "voice": "tara"}
-{"action": "stop"}
-{"action": "ping"}
-```
-**Responses to the frontend (PROGRESSIVE STREAMING):**
-```json
-{"type": "status", "message": "Conectando aos servicos..."}
-```
-```json
-{"type": "stream_start", "ttfb_ms": 150}
-```
-```json
-{
-  "type": "chunk",
-  "chunk_index": 1,
-  "audio_size": 4800,
-  "audio_duration_ms": 100,
-  "num_frames": 2,
-  "data": "<base64 of the assembled chunk>"
-}
-```
 ```json
 {
-  "type": "done",
-  "total_duration_ms": 5000,
-  "total_frames": 125,
-  "total_chunks": 50,
-  "elapsed_ms": 3500
 }
 ```
-```json
-{"type": "error", "message": "Error description"}
-```
-### Assembled Chunk Format (binary, base64-encoded)
-```
-[audio_size: 4 bytes big-endian]
-[audio_data: PCM 24kHz from ORPHEUS, 16-bit, mono]
-[num_frames: 4 bytes big-endian]
-[frame_1_size: 4 bytes big-endian]
-[frame_1_data: JPEG bytes from WAV2LIP]
-[frame_2_size: 4 bytes big-endian]
-[frame_2_data: JPEG bytes from WAV2LIP]
-...
-```
-## Chunk Assembly Algorithm
-```python
-CHUNK_DURATION_MS = 100  # Target duration of each chunk
-AUDIO_SAMPLE_RATE = 24000  # Orpheus
-VIDEO_FPS = 25  # Wav2Lip
-BYTES_PER_SAMPLE = 2  # 16-bit
-# Buffers
-audio_buffer = bytearray()  # Audio from Orpheus
-frame_buffer = []  # Frames from Wav2Lip
-async def process_streaming():
-    # Connect in PARALLEL
-    orpheus_task = asyncio.create_task(connect_orpheus())
-    wav2lip_task = asyncio.create_task(connect_wav2lip())
-    while not done:
-        # Check whether there is enough data to build a chunk
-        audio_bytes_needed = int(CHUNK_DURATION_MS * AUDIO_SAMPLE_RATE * BYTES_PER_SAMPLE / 1000)
-        frames_needed = int(CHUNK_DURATION_MS * VIDEO_FPS / 1000)
-        if len(audio_buffer) >= audio_bytes_needed and len(frame_buffer) >= frames_needed:
-            # Extract data from the buffers
-            audio_chunk = audio_buffer[:audio_bytes_needed]
-            del audio_buffer[:audio_bytes_needed]
-            frames_chunk = frame_buffer[:frames_needed]
-            del frame_buffer[:frames_needed]
-            # Build and send the chunk
-            chunk_data = build_chunk(audio_chunk, frames_chunk)
-            await ws.send_json({
-                "type": "chunk",
-                "chunk_index": chunk_index,
-                "audio_duration_ms": CHUNK_DURATION_MS,
-                "num_frames": len(frames_chunk),
-                "data": base64.b64encode(chunk_data).decode()
-            })
-            chunk_index += 1
-```
-## Audio/Video Synchronization
-### Calculations
-```
-TTS Orpheus: 24000 Hz (24000 samples/second)
-Wav2Lip: 16000 Hz (expects 16000 samples/second for lip sync)
-Video: 25 fps (1 frame every 40ms)
-For a 100ms chunk:
-- Orpheus audio: 100 * 24000 / 1000 * 2 = 4800 bytes
-- Wav2Lip frames: 100 / 40 = 2.5 ≈ 2-3 frames
 ```
-### Chunk Example
 ```
-100ms chunk:
-- Audio: 4800 bytes of 24kHz PCM from Orpheus (high-quality voice)
-- Frames: 2-3 JPEGs from Wav2Lip (lip sync)
-Total per chunk: ~50-80 KB (depends on JPEG compression)
-```
-## Project Structure
 ```
-/workspace/interface/
-├── CLAUDE.md           # This documentation
-├── server.py           # WebSocket server with chunk assembly
-├── index.html          # Frontend with progressive playback
-├── idle.mp4            # Idle loop video
-└── static/             # Static files
 ```
-## Configuration
 ```bash
-# Environment variables (optional)
-TTS_WS=ws://localhost:8081/ws
-WAV2LIP_WS=ws://localhost:8082/ws
-PORT=8080
-CHUNK_DURATION_MS=100
 ```
 ## How to Run
 ```bash
 cd /workspace/interface
 python3 server.py
 ```
 **Expected output:**
 ```
 ==================================================
-Interface Server - Streaming Progressivo
 ==================================================
 Porta: 8080
-TTS (Orpheus): ws://localhost:8081/ws (24kHz)
-Wav2Lip: ws://localhost:8082/ws (16kHz, apenas frames)
-Chunk duration: 100ms
-Video: 25fps (40ms/frame)
 ==================================================
 ```
-## Available Voices (TTS Orpheus)
 | Voice | Gender    |
 |-------|-----------|
@@ -388,26 +247,72 @@ Video: 25fps (40ms/frame)
 | leo   | Male      |
 | dan   | Male      |
-## Port Summary
-| Service          | Port  | Audio         | Video      |
-|------------------|-------|---------------|------------|
-| Interface Server | 8080  | Builds chunks | -          |
-| TTS (Orpheus)    | 8081  | PCM 24kHz     | -          |
-| Wav2Lip          | 8082  | DISCARD!      | JPEG 25fps |
-## Python Dependencies
-```
-aiohttp>=3.9.0
-```
-## Important Notes
-1. **NEVER use Wav2Lip's audio in the frontend** - Wav2Lip uses eSpeak internally to generate lip sync. This audio is robotic and must be DISCARDED. Use ONLY the JPEG frames.
-2. **ALWAYS use Orpheus audio** - The final audio sent to the frontend must come from Orpheus TTS, which produces high-quality voice.
-3. **Progressive streaming is mandatory** - Do not wait for all the audio/video to be ready. Build and send ~100ms chunks as data arrives.
-4. **The frontend must start playback early** - As soon as ~200ms of data (2 chunks) has been received, start playback while continuing to receive.

+# Avatar Interface - WebRTC Streaming with VP9
+## Overview
+Real-time avatar system that uses WebRTC for low-latency video streaming.
+The backend performs all video fusion (idle + lip-sync) and sends a single unified stream to the frontend.
+**Framework WebRTC:** [aiortc](https://github.com/aiortc/aiortc)
 ---
+## Architecture
+```
+                              FRONTEND (Browser)
+                    ┌─────────────────────────────────────┐
+                    │                                     │
+                    │   <video autoplay>                  │
+                    │                                     │
+                    │   Only renders the WebRTC           │
+                    │   stream (VP9 + Opus)               │
+                    │                                     │
+                    └──────────────────▲──────────────────┘
+                                       │
+                                       │ WebRTC
+                                       │ (VP9 video + Opus audio)
+                                       │
+═══════════════════════════════════════╧════════════════════════════════════
+                              BACKEND (Python + aiortc)
+┌───────────────────────────────────────────────────────────────────────────┐
+│                                                                           │
+│                        INTERFACE SERVER (8080)                            │
+│                                                                           │
+│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────────────┐  │
+│  │  idle.mp4   │────►│   MIXER     │────►│   WebRTC Tracks             │  │
+│  │  (frames)   │     │             │     │                             │  │
+│  └─────────────┘     │  Switches   │     │   AvatarVideoTrack (VP9)    │──┼──► WebRTC
+│                      │  idle/speak │     │   AvatarAudioTrack (Opus)   │  │
+│  ┌─────────────┐     │             │     │                             │  │
+│  │  Wav2Lip    │────►│             │     │   25fps, low latency        │  │
+│  │  (frames)   │     └─────────────┘     └─────────────────────────────┘  │
+│  └─────────────┘                                                          │
+│        │                                                                  │
+│  ┌─────▼───────┐                                                          │
+│  │   Audio     │                                                          │
+│  │   Orpheus   │                                                          │
+│  └─────────────┘                                                          │
+│                                                                           │
+└───────────────────────────────────────────────────────────────────────────┘
+                                       │
+                                       │ WebSocket
+                                       ▼
+                    ┌─────────────────────────────────────┐
+                    │          WAV2LIP SERVER (8082)      │
+                    │                                     │
+                    │  Generates lip-sync frames          │
+                    │  Calls Orpheus TTS internally       │
+                    │                                     │
+                    └─────────────────────────────────────┘
 ```
+---
+## Operation Flow
+### 1. WebRTC Connection
+```
+Client                           Server
+   │                                │
+   │  POST /offer (SDP offer)       │
+   │ ─────────────────────────────► │
+   │                                │ Creates RTCPeerConnection
+   │                                │ Creates VideoTrack + AudioTrack
+   │                                │
+   │  SDP answer + session_id       │
+   │ ◄───────────────────────────── │
+   │                                │
+   │  WebRTC connected              │
+   │ ◄════════════════════════════► │ Video stream starts
+   │                                │ (idle frames in a loop)
+```
+### 2. Speech Generation
+```
+Client                           Server                      Wav2Lip
+   │                                │                           │
+   │  POST /generate                │                           │
+   │  {text, voice, session_id}     │                           │
+   │ ─────────────────────────────► │                           │
+   │                                │                           │
+   │                                │  WS: generate             │
+   │                                │ ────────────────────────► │
+   │                                │                           │
+   │                                │  frames + audio           │
+   │                                │ ◄──────────────────────── │
+   │                                │                           │
+   │  Stream switches to lip-sync   │                           │
+   │ ◄════════════════════════════► │                           │
+   │  (synchronized video + audio)  │                           │
+   │                                │                           │
+   │  Returns to idle automatically │                           │
+   │ ◄════════════════════════════► │                           │
 ```
+---
+## API Endpoints
+### POST /offer
+Starts the WebRTC connection (signaling).
+**Request:**
 ```json
 {
+  "sdp": "v=0\r\no=- ...",
+  "type": "offer"
 }
 ```
+**Response:**
 ```json
 {
+  "sdp": "v=0\r\no=- ...",
+  "type": "answer",
+  "session_id": "session-uuid"
 }
 ```
+### POST /generate
+Generates speech with lip-sync.
+**Request:**
 ```json
 {
+  "session_id": "session-uuid",
+  "text": "Hello, I am an avatar!",
+  "voice": "tara"
 }
 ```
+**Response:**
 ```json
 {
+  "status": "generating"
 }
 ```
+### GET /health
+Server status.
+**Response:**
 ```json
 {
+  "status": "ok",
+  "mode": "webrtc",
+  "connections": 2
 }
 ```
+---
+## Codec Configuration
+### Video (VP9)
+- **Codec:** libvpx-vp9
+- **FPS:** 25
+- **Latency:** ~50-100ms
+- **Quality:** High (temporal compression)
+### Audio (Opus)
+- **Codec:** Opus
+- **Sample Rate:** 24000 Hz (resampled to 48000 Hz by WebRTC)
+- **Channels:** Mono
+- **Mode:** Low delay
+---
+## File Structure
 ```
+/workspace/interface/
+├── CLAUDE.md           # This documentation
+├── server.py           # WebRTC server built on aiortc
+├── index.html          # WebRTC frontend
+├── idle.mp4            # Idle loop video
+└── requirements.txt    # Python dependencies
 ```
+---
+## Dependencies
+### Python
 ```
+aiohttp>=3.9.0
+aiortc>=1.6.0
+opencv-python>=4.8.0
+numpy>=1.24.0
+av>=10.0.0
 ```
+### System (Ubuntu)
 ```bash
+apt install -y libavdevice-dev libavfilter-dev libopus-dev libvpx-dev libsrtp2-dev
 ```
+---
 ## How to Run
 ```bash
 cd /workspace/interface
+pip install -r requirements.txt
 python3 server.py
 ```
 **Expected output:**
 ```
 ==================================================
+Interface Server - WebRTC VP9 Streaming
 ==================================================
 Porta: 8080
+Idle Video: /workspace/interface/idle.mp4
+Wav2Lip: ws://localhost:8082/ws
+==================================================
+Endpoints:
+  POST /offer - WebRTC signaling
+  POST /generate - Gerar fala
+==================================================
+Carregando idle frames...
+[Idle] Carregados 1368 frames
 ==================================================
 ```
+---
+## Available Voices (Orpheus TTS)
 | Voice | Gender    |
 |-------|-----------|
 | leo   | Male      |
 | dan   | Male      |
+---
+## Ports
+| Service          | Port  |
+|------------------|-------|
+| Interface Server | 8080  |
+| Orpheus TTS      | 8081  |
+| Wav2Lip          | 8082  |
+---
+## Advantages of WebRTC over WebSocket+JPEG
+| Aspect          | WebSocket+JPEG    | WebRTC+VP9          |
+|-----------------|-------------------|---------------------|
+| Bandwidth       | ~1.25 MB/s        | ~200 KB/s (6x less) |
+| Latency         | ~50ms             | ~50-100ms           |
+| Browser CPU     | High (JS decode)  | Low (GPU decode)    |
+| Audio/Video     | Separate          | Synchronized        |
+| Quality         | Good              | Excellent           |
+---
+## Simplified Frontend
+The frontend only:
+1. Sends the SDP offer
+2. Receives the SDP answer
+3. Renders `<video>`
+4. Sends text to /generate
+All fusion, encoding, and timing logic lives in the backend.
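The four frontend steps listed in this hunk could look roughly like the sketch below. It is an illustration only: `connectAvatar`, `speak`, `waitForIceGathering`, the `baseUrl` parameter, and the injectable `fetchFn` are hypothetical names, while the request and response fields (`sdp`, `type`, `session_id`, `text`, `voice`) are taken from the endpoint descriptions in CLAUDE.md. Waiting for ICE gathering before posting the offer is an assumption about what the aiortc server needs, since the documented `/offer` endpoint has no channel for trickled candidates.

```javascript
// Resolves once the browser has finished gathering ICE candidates.
function waitForIceGathering(pc) {
    if (pc.iceGatheringState === 'complete') return Promise.resolve();
    return new Promise((resolve) => {
        const check = () => {
            if (pc.iceGatheringState === 'complete') {
                pc.removeEventListener('icegatheringstatechange', check);
                resolve();
            }
        };
        pc.addEventListener('icegatheringstatechange', check);
    });
}

// Steps 1-3: send the SDP offer, apply the SDP answer, attach the stream to <video>.
// `pc` is an RTCPeerConnection created by the caller.
async function connectAvatar(pc, videoElement, baseUrl = '', fetchFn = fetch) {
    // Receive-only: the backend sends the fused video and audio tracks.
    pc.addTransceiver('video', { direction: 'recvonly' });
    pc.addTransceiver('audio', { direction: 'recvonly' });
    pc.ontrack = (event) => { videoElement.srcObject = event.streams[0]; };

    await pc.setLocalDescription(await pc.createOffer());
    await waitForIceGathering(pc);

    const response = await fetchFn(`${baseUrl}/offer`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ sdp: pc.localDescription.sdp, type: pc.localDescription.type }),
    });
    if (!response.ok) throw new Error('POST /offer failed');
    const answer = await response.json();
    await pc.setRemoteDescription({ sdp: answer.sdp, type: answer.type });
    return answer.session_id;
}

// Step 4: ask the backend to speak; the lip-synced result arrives on the same stream.
async function speak(sessionId, text, voice = 'tara', baseUrl = '', fetchFn = fetch) {
    const response = await fetchFn(`${baseUrl}/generate`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ session_id: sessionId, text, voice }),
    });
    if (!response.ok) throw new Error('POST /generate failed');
    return response.json();
}
```

In a browser this would be called as `connectAvatar(new RTCPeerConnection(), document.querySelector('video'))`, followed by `speak(sessionId, 'Hello')` with the returned session id.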
+---
+## Important Fixes
+### Audio Pop/Click at Start (WAV Header)
+**Problem:** Audio from Orpheus TTS arrives with a 44-byte WAV header. When the frontend interprets those bytes as PCM data, it produces a noise/click at the start of playback.
+**Solution:** Detect the WAV header (`RIFF` bytes) and skip the first 44 bytes before processing the PCM:
+```javascript
+// Check for a WAV header (RIFF) and skip it if present
+let pcmOffset = 0;
+if (bytes.length > 44 &&
+    bytes[0] === 0x52 && bytes[1] === 0x49 &&
+    bytes[2] === 0x46 && bytes[3] === 0x46) {  // "RIFF"
+    console.log('WAV header detected, skipping 44 bytes');
+    pcmOffset = 44;
+}
+const pcmData = new Int16Array(bytes.buffer, pcmOffset);
+```
+**Additional:** Apply a gentle fade-in/fade-out to avoid any remaining discontinuity:
+- Fade-in: 50ms with a quadratic curve
+- Fade-out: 30ms linear
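The fade durations above convert to sample counts as duration × sample rate: 50 ms at 24 kHz is 1200 samples and 30 ms is 720 samples. A minimal sketch of the envelope as a standalone helper follows. `applyFades` and its parameters are hypothetical names, not part of the commit, and unlike the committed code it clamps both ramps to the buffer length, which is one possible way to handle clips shorter than the fade.

```javascript
// Hypothetical helper: applies a quadratic fade-in and a linear fade-out in place.
function applyFades(samples, sampleRate = 24000, fadeInMs = 50, fadeOutMs = 30) {
    // Convert milliseconds to samples, never exceeding the buffer length.
    const fadeIn = Math.min(Math.round(sampleRate * fadeInMs / 1000), samples.length);
    const fadeOut = Math.min(Math.round(sampleRate * fadeOutMs / 1000), samples.length);

    // Quadratic fade-in: gain goes 0 -> 1 as t * t.
    for (let i = 0; i < fadeIn; i++) {
        const t = i / fadeIn;
        samples[i] *= t * t;
    }

    // Linear fade-out over the last `fadeOut` samples: gain goes 1 -> ~0.
    const start = samples.length - fadeOut;
    for (let i = 0; i < fadeOut; i++) {
        samples[start + i] *= (fadeOut - i) / fadeOut;
    }
    return samples;
}
```

For a buffer longer than both ramps this matches the gains in the commit; the two only differ for very short clips.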
+---
+## Important Rules
+1. **DO NOT CHANGE FILES OUTSIDE `/workspace/interface`**
+2. **The backend does all the fusion** - The frontend only renders
+3. **Keep the ports fixed** - 8080, 8081, 8082
+4. **Test after every change** - Check /health

interface/index_streaming.html CHANGED

@@ -235,13 +235,37 @@ async function startSyncedPlayback(base64Audio, durationMs) {
             bytes[i] = binaryString.charCodeAt(i);
         }
         // PCM 16-bit mono 24kHz -> AudioBuffer
-        const pcmData = new Int16Array(bytes.buffer);
         const floatData = new Float32Array(pcmData.length);
         for (let i = 0; i < pcmData.length; i++) {
             floatData[i] = pcmData[i] / 32768.0;
         }
         audioBuffer = audioContext.createBuffer(1, floatData.length, 24000);
         audioBuffer.getChannelData(0).set(floatData);

             bytes[i] = binaryString.charCodeAt(i);
         }
+        // Check for a WAV header (RIFF) and skip it if present
+        let pcmOffset = 0;
+        if (bytes.length > 44 &&
+            bytes[0] === 0x52 && bytes[1] === 0x49 &&
+            bytes[2] === 0x46 && bytes[3] === 0x46) {  // "RIFF"
+            console.log('WAV header detected, skipping 44 bytes');
+            pcmOffset = 44;
+        }
         // PCM 16-bit mono 24kHz -> AudioBuffer
+        const pcmData = new Int16Array(bytes.buffer, pcmOffset);
         const floatData = new Float32Array(pcmData.length);
         for (let i = 0; i < pcmData.length; i++) {
             floatData[i] = pcmData[i] / 32768.0;
         }
+        // Apply a gentle fade-in to avoid a click at the start (50ms @ 24kHz = 1200 samples)
+        const fadeInSamples = 1200;
+        for (let i = 0; i < Math.min(fadeInSamples, floatData.length); i++) {
+            // Use a quadratic curve for a smoother fade
+            const t = i / fadeInSamples;
+            floatData[i] *= t * t;  // Quadratic curve (smoother than linear)
+        }
+        // Apply a gentle fade-out to avoid a click at the end (30ms @ 24kHz = 720 samples)
+        const fadeOutSamples = 720;
+        const fadeOutStart = floatData.length - fadeOutSamples;
+        for (let i = 0; i < fadeOutSamples && fadeOutStart + i < floatData.length; i++) {
+            floatData[fadeOutStart + i] *= (fadeOutSamples - i) / fadeOutSamples;
+        }
         audioBuffer = audioContext.createBuffer(1, floatData.length, 24000);
         audioBuffer.getChannelData(0).set(floatData);
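
The hunk above assumes the header is exactly 44 bytes, which holds for a canonical PCM WAV but not when extra chunks (such as `LIST`) precede the audio. A more defensive variant, sketched here under that assumption, walks the RIFF chunks until it finds `data`. `findPcmRange` and `pcm16ToFloat32` are hypothetical names, not part of the commit, and whether Orpheus ever emits non-canonical headers is not established by this diff.

```javascript
// Hypothetical helper: locate the PCM payload inside a byte buffer.
// Returns { offset, length } in bytes, or null if a RIFF header has no data chunk.
function findPcmRange(bytes) {
    const tag = (pos) =>
        String.fromCharCode(bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]);

    // No RIFF/WAVE signature: treat the whole buffer as raw PCM.
    if (bytes.length < 12 || tag(0) !== 'RIFF' || tag(8) !== 'WAVE') {
        return { offset: 0, length: bytes.length };
    }

    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    let pos = 12; // first sub-chunk follows "RIFF" <size> "WAVE"
    while (pos + 8 <= bytes.length) {
        const size = view.getUint32(pos + 4, true); // chunk sizes are little-endian
        if (tag(pos) === 'data') {
            const start = pos + 8;
            // The declared size can exceed what was received; clamp to the buffer.
            return { offset: start, length: Math.min(size, bytes.length - start) };
        }
        pos += 8 + size + (size % 2); // chunks are padded to an even length
    }
    return null;
}

// Convert 16-bit PCM (platform byte order, little-endian in practice) to floats in [-1, 1).
function pcm16ToFloat32(bytes) {
    const range = findPcmRange(bytes);
    if (range === null) return new Float32Array(0);
    // Drop a trailing odd byte; Int16Array needs a whole number of samples.
    const evenLength = range.length - (range.length % 2);
    // slice() copies, so the Int16Array view always starts at an aligned offset.
    const copy = bytes.slice(range.offset, range.offset + evenLength);
    const pcm = new Int16Array(copy.buffer);
    const out = new Float32Array(pcm.length);
    for (let i = 0; i < pcm.length; i++) {
        out[i] = pcm[i] / 32768;
    }
    return out;
}
```

Copying with `slice()` costs one extra allocation per clip but avoids the `RangeError` that `new Int16Array(buffer, offset)` throws when the offset is odd or the remaining byte count is not a multiple of 2.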