lonesamurai/rvq_proxy_network

@@ -3,66 +3,155 @@ tags:
 - tts
 - voice-conversion
 - speech-synthesis
-- differentiable-proxy
 - qwen3-tts
 license: mit
 ---
-# RVQ Proxy Network
-A lightweight differentiable surrogate network that maps from Qwen3-TTS RVQ embedding space to high-level perceptual audio features: speaker embedding, wav2vec2 content features, and mel spectrogram.
-## Purpose
-During voice conversion training, the standard pipeline (logits → argmax → RVQ tokens → decoder → waveform → feature extractors) is non-differentiable. The RVQ Proxy replaces this with a tiny differentiable network, enabling end-to-end training without audio decoding.
 ```
-model logits → softmax → E_soft → RVQProxy → speaker / wav2vec / mel
 ```
 ## Architecture
-- **Shared temporal encoder:** 3-layer 1D conv (receptive field ~560ms) with GroupNorm + GELU
-- **Speaker head:** 2-layer MLP + mean pooling → 2048-dim speaker embedding
-- **Wav2vec head:** Single linear projection → 768-dim features
-- **Mel head:** 2-layer MLP → 80-bin mel spectrogram
-**Parameters:** ~6.7M
 ## Checkpoints
 | File | Description |
 |------|-------------|
-| `rvq_proxy_10k.pt` | Best checkpoint (val speaker cosine = 0.9925) |
-| `rvq_proxy_10k_final.pt` | Final epoch checkpoint (epoch 20) |
-Both checkpoints include metadata (`input_dim`, `num_speaker_dims`) for easy loading.
 ## Usage
 ```python
-from exiv.components.models.qwen3_tts.sern.rvq_proxy import RVQProxy
 import torch
-ckpt = torch.load("rvq_proxy_10k.pt", map_location="cpu")
-proxy = RVQProxy(
-    input_dim=ckpt["input_dim"],
-    num_speaker_dims=ckpt["num_speaker_dims"]
 )
-proxy.load_state_dict(ckpt["proxy_state"])
 proxy.eval().cuda()
-# Forward pass
-out = proxy(E_soft, mask=mask)  # E_soft: [B, T, 512]
-speaker = out["speaker"]        # [B, 2048]
-wav2vec = out["wav2vec"]        # [B, T, 768]
-mel = out["mel"]                # [B, T, 80]
 ```
 ## Requirements
 - PyTorch ≥ 2.0
-- See [Exiv](https://github.com/piyushK52/Exiv) for full integration with Qwen3-TTS
 ## License

 - tts
 - voice-conversion
 - speech-synthesis
+- speaker-embedding
+- speaker-proxy
+- ecapa-tdnn
 - qwen3-tts
 license: mit
 ---
+# Speaker Proxy Network (RVQ → Speaker Embedding)
+A lightweight differentiable surrogate that maps **Qwen3-TTS RVQ embeddings** directly to **speaker embeddings**, bypassing the expensive audio-decoding → feature-extraction pipeline during voice-conversion training.
+> ⚠️ **Note:** This repository contains **only the Speaker Proxy**. The full RVQ proxy (speaker + wav2vec + mel) is a separate effort. This checkpoint is the standalone speaker branch, trained with a pure contrastive objective on real speaker labels.
+---
+## Why a Speaker Proxy?
+During voice-conversion training, the standard pipeline is:
 ```
+model logits → argmax → RVQ tokens → decoder → waveform → ECAPA-TDNN → speaker embedding
 ```
+This pipeline is **non-differentiable** because of `argmax` and the audio decoder. The Speaker Proxy replaces it with:
+```
+model logits → softmax → RVQ sum embedding → SpeakerProxyECAPA → L2-normalized speaker embedding
+```
+Every step in this path, from `softmax` onward, is differentiable, so gradients from the voice-conversion objective can backpropagate end-to-end to the model logits.
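The `softmax → RVQ sum embedding` step can be sketched as follows. This is a minimal illustration, not the Exiv implementation: it assumes one set of logits and one embedding table per codebook, and the helper name `soft_rvq_sum`, the toy vocabulary size, and the random tables are all hypothetical.

```python
import torch

def soft_rvq_sum(logits, tables):
    """Differentiable stand-in for argmax + embedding lookup (illustrative).

    logits: [B, T, K, V] per-codebook logits (K codebooks, V entries each)
    tables: [K, V, D] codebook embedding tables
    returns: [B, T, D] sum over codebooks of the expected embedding
    """
    probs = torch.softmax(logits, dim=-1)  # [B, T, K, V]
    # Expected embedding per codebook, summed over the K codebooks
    return torch.einsum("btkv,kvd->btd", probs, tables)

# Toy sizes for illustration only; V is an assumption
B, T, K, V, D = 2, 5, 16, 32, 2048
torch.manual_seed(0)
logits = torch.randn(B, T, K, V, requires_grad=True)
tables = torch.randn(K, V, D)

E_soft = soft_rvq_sum(logits, tables)  # [B, T, 2048]
E_soft.sum().backward()                # gradients reach the logits
```

When the logits are sharply peaked, the softmax approaches a one-hot vector and `E_soft` approaches the hard token lookup summed over codebooks.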
+---
 ## Architecture
+**SpeakerProxyECAPA** — an ECAPA-TDNN-style network adapted for RVQ-sum inputs.
+| Component | Details |
+|-----------|---------|
+| Input | `[B, T, 2048]` RVQ sum embedding (sum of 16 learned codebook embeddings) |
+| Front-end | Conv1d projection + SE-Res2Blocks (dilations 2, 3, 4) |
+| Pooling | Attentive Statistics Pooling (mean + std, attention-weighted) |
+| Bottleneck | FC → 192-dim |
+| Output | L2-normalized 192-dim speaker embedding |
+| **Parameters** | **~4.6M** |
+The architecture mirrors the original SpeechBrain ECAPA-TDNN but is trained end-to-end on RVQ inputs rather than on spectrogram features computed from raw audio.
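For readers unfamiliar with the pooling stage, here is a simplified sketch of attentive statistics pooling followed by a 192-dim bottleneck. It is an assumption-laden illustration, not the `SpeakerProxyECAPA` code: the class name, the `attn_dim` parameter, and the omission of ECAPA's global-context features are all choices made here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and std over time: [B, C, T] -> [B, 2C]."""

    def __init__(self, channels, attn_dim=128):  # attn_dim is hypothetical
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(channels, attn_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(attn_dim, channels, kernel_size=1),
        )

    def forward(self, x):
        w = torch.softmax(self.attn(x), dim=-1)  # weights sum to 1 over time
        mean = (w * x).sum(dim=-1)
        var = (w * x.pow(2)).sum(dim=-1) - mean.pow(2)
        std = var.clamp(min=1e-6).sqrt()         # clamp avoids sqrt of tiny negatives
        return torch.cat([mean, std], dim=-1)

torch.manual_seed(0)
pool = AttentiveStatsPooling(channels=512)
feats = torch.randn(4, 512, 50)                  # [B, C, T] frame-level features
pooled = pool(feats)                             # [4, 1024]
emb = F.normalize(nn.Linear(1024, 192)(pooled), dim=-1)  # [4, 192], unit norm
```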
+---
+## Training
+| Detail | Value |
+|--------|-------|
+| Dataset | `lonesamurai/emilia_clean_10k` (10,000 clips, 200 speakers) |
+| Train / Val split | 8,000 / 2,000 clips |
+| Epochs | ~200 |
+| Loss | Pure contrastive — `(1−cos)²` alignment + `λ·ReLU(cos−margin)²` repulsion |
+| λ (repel) | 5.0 |
+| Optimizer | AdamW, lr = 1e-4, weight_decay = 1e-5 |
+| Best val separation | **0.8141** |
+### Validation performance (contrastive separation metric)
+- **Best checkpoint:** epoch ~140, separation = **0.8141**
+- **Final checkpoint:** epoch ~197, separation ≈ 0.73 (plateaued)
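One way to read the loss row in the table above is sketched below. The pairing scheme is an assumption: alignment is applied to same-speaker pairs in a batch and repulsion to different-speaker pairs. The margin value is not stated in this card, so `margin=0.0` is a placeholder, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_speaker_loss(emb, labels, lam=5.0, margin=0.0):
    """emb: [N, D] embeddings; labels: [N] integer speaker IDs."""
    emb = F.normalize(emb, dim=-1)
    cos = emb @ emb.T                                  # [N, N] pairwise cosine
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    pos = same & ~eye                                  # same speaker, different clip
    neg = ~same                                        # different speakers
    zero = cos.new_zeros(())
    align = (1.0 - cos[pos]).pow(2).mean() if pos.any() else zero
    repel = F.relu(cos[neg] - margin).pow(2).mean() if neg.any() else zero
    return align + lam * repel

torch.manual_seed(0)
emb = torch.randn(8, 192, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = contrastive_speaker_loss(emb, labels)
loss.backward()
```

Under these assumptions the loss is zero when clips of the same speaker coincide and different speakers have cosine similarity at or below the margin.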
+---
+## Comparison with Original ECAPA-TDNN
+Tested on 5 seen + 5 unseen speakers from EMILIA. Off-diagonal rows report similarity between different speakers, so lower is better:
+| Metric | SpeakerProxy (Ours) | Original ECAPA-TDNN |
+|--------|---------------------|---------------------|
+| Seen-Seen off-diag mean | **0.050** | 0.094 |
+| Unseen-Unseen off-diag mean | **−0.026** | 0.060 |
+| Seen-Unseen off-diag mean | **−0.026** | 0.033 |
+| **All off-diag mean** | **−0.009** | 0.053 |
+| Off-diag std | 0.156 | **0.098** |
+| Worst confusion (max) | 0.420 | **0.327** |
+| Per-speaker separation (seen avg) | **0.992** | 0.940 |
+| Per-speaker separation (unseen avg) | **1.024** | 0.955 |
+**Takeaway:** On this 10-speaker test, our proxy yields lower mean cross-speaker similarity than the original audio-based ECAPA (−0.009 vs. 0.053 over all pairs), most visibly on **unseen speakers** (negative vs. positive mean similarity). The trade-off is a wider spread: the off-diagonal std is higher (0.156 vs. 0.098) and the worst-case pair is more confusable (0.420 vs. 0.327), so a few outlier pairs remain close even though most speaker pairs are pushed farther apart.
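The off-diagonal statistics in the table can be computed along these lines. This is a sketch under assumptions: it treats each speaker as a single embedding (for example a per-speaker mean) and uses cosine similarity. The card does not say how clips were aggregated, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def off_diagonal_stats(centroids):
    """centroids: [S, D], one embedding per speaker (assumed aggregation)."""
    c = F.normalize(centroids, dim=-1)
    sim = c @ c.T                                   # [S, S] cosine similarity
    mask = ~torch.eye(sim.shape[0], dtype=torch.bool)
    off = sim[mask]                                 # drop self-similarity
    return {
        "mean": off.mean().item(),
        "std": off.std().item(),
        "max": off.max().item(),                    # worst confusion
    }

torch.manual_seed(0)
stats = off_diagonal_stats(torch.randn(10, 192))    # random stand-in embeddings
```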
+---
 ## Checkpoints
 | File | Description |
 |------|-------------|
+| `speaker_proxy_10k_best.pt` | **Best checkpoint** (val separation = 0.8141, ~epoch 140) |
+The checkpoint contains:
+- `model_state_dict`: full network weights
+- `config`: architecture hyperparameters
+- `epoch`: training epoch at save time
+- `val_separation`: best validation metric
+---
 ## Usage
 ```python
 import torch
+from exiv.components.models.qwen3_tts.sern.speaker_proxy_ecapa import SpeakerProxyECAPA
+# Load checkpoint
+checkpoint = torch.load("speaker_proxy_10k_best.pt", map_location="cpu")
+config = checkpoint["config"]
+# Build model
+proxy = SpeakerProxyECAPA(
+    input_dim=config["input_dim"],      # 2048
+    embed_dim=config["embed_dim"],      # 192
+    channels=config["channels"],        # 512
+    num_blocks=config["num_blocks"],    # 3
 )
+proxy.load_state_dict(checkpoint["model_state_dict"])
 proxy.eval().cuda()
+# Forward pass: E_rvq is the per-frame sum of the 16 RVQ codebook embeddings
+# E_rvq: [B, T, 2048], built from Qwen3-TTS RVQ tokens; must be on the same device as proxy
+speaker_embedding = proxy(E_rvq)  # [B, 192], L2-normalized
+```
+### Computing RVQ sum embeddings from Qwen3-TTS tokens
+```python
+# Extract the 16 embedding tables from Qwen3-TTS
+embedding_tables = [
+    model.model.embed_tokens[i].weight for i in range(16)
+]
+# tokens: [B, T, 16] integer RVQ indices
+E_rvq = torch.stack([
+    embedding_tables[i][tokens[..., i]] for i in range(16)
+], dim=-1).sum(dim=-1)  # [B, T, 2048]
 ```
+---
 ## Requirements
 - PyTorch ≥ 2.0
+- See [Exiv](https://github.com/piyushK52/Exiv) for full integration with the Qwen3-TTS SERN adapter
+---
 ## License