---
tags:
- tts
- voice-conversion
- speech-synthesis
- speaker-embedding
- speaker-proxy
- ecapa-tdnn
- qwen3-tts
license: mit
---

# Speaker Proxy Network (RVQ → Speaker Embedding)

A lightweight differentiable surrogate that maps **Qwen3-TTS RVQ embeddings** directly to **speaker embeddings**, bypassing the expensive audio-decoding → feature-extraction pipeline during voice-conversion training.

> ⚠️ **Note:** This repository contains **only the Speaker Proxy**. The full RVQ proxy (speaker + wav2vec + mel) is a separate effort. This checkpoint is the standalone speaker branch, trained with a pure contrastive objective on real speaker labels.

---

## Why a Speaker Proxy?

During voice-conversion training, the standard pipeline is:

```
model logits → argmax → RVQ tokens → decoder → waveform → ECAPA-TDNN → speaker embedding
```

This pipeline is **non-differentiable** because of `argmax` and the audio decoder. The Speaker Proxy replaces it with:

```
model logits → softmax → RVQ sum embedding → SpeakerProxyECAPA → L2-normalized speaker embedding
```

Everything after `softmax` is now differentiable, enabling end-to-end backpropagation through the entire voice-conversion objective.

---

## Architecture

**SpeakerProxyECAPA** — an ECAPA-TDNN-style network adapted for RVQ-sum inputs.

| Component | Details |
|-----------|---------|
| Input | `[B, T, 2048]` RVQ sum embedding (sum of 16 learned codebook embeddings) |
| Front-end | Conv1d projection + SE-Res2Blocks (dilations 2, 3, 4) |
| Pooling | Attentive Statistics Pooling (mean + std, attention-weighted) |
| Bottleneck | FC → 192-dim |
| Output | L2-normalized 192-dim speaker embedding |
| **Parameters** | **~4.6M** |

The architecture mirrors the original SpeechBrain ECAPA-TDNN but is trained end-to-end on RVQ inputs rather than raw audio spectrograms.

---

## Training

| Detail | Value |
|--------|-------|
| Dataset | `lonesamurai/emilia_clean_10k` (10,000 clips, 200 speakers) |
| Train / Val split | 8,000 / 2,000 clips |
| Epochs | ~200 |
| Loss | Pure contrastive — `(1−cos)²` alignment + `λ·ReLU(cos−margin)²` repulsion |
| λ (repel) | 5.0 |
| Optimizer | AdamW, lr = 1e-4, weight_decay = 1e-5 |
| Best val separation | **0.8141** |

### Validation performance (contrastive separation metric)

- **Best checkpoint:** epoch ~140, separation = **0.8141**
- **Final checkpoint:** epoch ~197, separation ≈ 0.73 (plateaued)

---

## Comparison with Original ECAPA-TDNN

Tested on 5 seen + 5 unseen speakers from EMILIA:

| Metric | SpeakerProxy (Ours) | Original ECAPA-TDNN |
|--------|---------------------|---------------------|
| Seen-Seen off-diag mean | **0.050** | 0.094 |
| Unseen-Unseen off-diag mean | **−0.026** | 0.060 |
| Seen-Unseen off-diag mean | **−0.026** | 0.033 |
| **All off-diag mean** | **−0.009** | 0.053 |
| Off-diag std | 0.156 | **0.098** |
| Worst confusion (max) | 0.420 | **0.327** |
| Per-speaker separation (seen avg) | **0.992** | 0.940 |
| Per-speaker separation (unseen avg) | **1.024** | 0.955 |

**Takeaway:** Our proxy achieves **stronger average separation** than the original audio-based ECAPA, especially on **unseen speakers** (negative mean similarity vs. positive). The trade-off is slightly higher variance — a few outlier pairs show stronger confusion, but the vast majority of speaker pairs are pushed farther apart.

---

## Checkpoints

| File | Description |
|------|-------------|
| `speaker_proxy_10k_best.pt` | **Best checkpoint** (val separation = 0.8141, ~epoch 140) |

The checkpoint contains:
- `model_state_dict`: full network weights
- `config`: architecture hyperparameters
- `epoch`: training epoch at save time
- `val_separation`: best validation metric

---

## Usage

```python
import torch
from exiv.components.models.qwen3_tts.sern.speaker_proxy_ecapa import SpeakerProxyECAPA

# Load checkpoint
checkpoint = torch.load("speaker_proxy_10k_best.pt", map_location="cpu")
config = checkpoint["config"]

# Build model
proxy = SpeakerProxyECAPA(
    input_dim=config["input_dim"],      # 2048
    embed_dim=config["embed_dim"],      # 192
    channels=config["channels"],        # 512
    num_blocks=config["num_blocks"],    # 3
)
proxy.load_state_dict(checkpoint["model_state_dict"])
proxy.eval().cuda()

# Forward pass — E_rvq is the sum of 16 RVQ embedding tables
# E_rvq: [B, T, 2048] from Qwen3-TTS RVQ tokens
speaker_embedding = proxy(E_rvq)  # [B, 192], L2-normalized
```

### Computing RVQ sum embeddings from Qwen3-TTS tokens

```python
# Extract the 16 embedding tables from Qwen3-TTS
embedding_tables = [
    model.model.embed_tokens[i].weight for i in range(16)
]

# tokens: [B, T, 16] integer RVQ indices
E_rvq = torch.stack([
    embedding_tables[i][tokens[..., i]] for i in range(16)
], dim=-1).sum(dim=-1)  # [B, T, 2048]
```

---

## Requirements

- PyTorch ≥ 2.0
- See [Exiv](https://github.com/piyushK52/Exiv) for full integration with Qwen3-TTS SERN adapter

---

## License

MIT