---
license: mit
language:
- en
tags:
- speaker-verification
- speaker-recognition
- ecapa-tdnn
- quantization
- qat
- mixed-precision
- edge
datasets:
- voxceleb2
metrics:
- eer
---

# ECAPA-QAT

![ECAPA-QAT banner](banner.svg)

**Quantization-Aware Trained ECAPA-TDNN for Speaker Verification**

A mixed-precision W(4/8)A32 speaker embedding model (4- or 8-bit weights, 32-bit activations) trained with a 5-phase progressive QAT strategy and cosine distillation. It achieves **2.61% EER** on VoxCeleb1-O while fitting in **4 MB** on disk and **7.6 MB** in RAM, which makes it suitable for CPU-only servers, edge devices, and mobile deployment.

---

## Highlights

| | FP32 baseline | **ECAPA-QAT** |
|---|---|---|
| EER (VoxCeleb1-O) | 3.05 % | **2.61 %** |
| File size | 28.2 MB | **4.0 MB** |
| RAM (weights) | 80 MB | **7.6 MB** |
| CPU latency (3 s audio) | — | **66 ms** |
| Parallel sessions / 1 GB RAM | 23 | **61–87** |

> The quantized model **outperforms** its FP32 counterpart (2.61 % vs. 3.05 % EER), which suggests that quantization-aware training acts as an implicit regularizer.

---

## Architecture

ECAPA-QAT is based on [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) (Desplanques et al., Interspeech 2020) with C=512 channels.

```
Input: mel-spectrogram (80 filters, 25 ms window, 10 ms hop, 16 kHz)
  │
  ├─ block0   Conv1D + ReLU + BN  (k=5)          →  INT8
  ├─ block1   SE-Res2Block        (k=3, d=2)     →  INT8
  ├─ block2   SE-Res2Block        (k=3, d=3)     →  INT4
  ├─ block3   SE-Res2Block        (k=3, d=4)     →  INT4
  ├─ mfa      Conv1D + ReLU       (k=1, MFA)     →  INT4
  ├─ asp      Attentive Stat Pooling + BN        →  INT8
  └─ fn       FC + BN                            →  INT4

Output: 192-dim L2-normalized speaker embedding
```

Mixed-precision assignment is based on a per-block sensitivity analysis:
blocks whose EER degrades by more than 1 percentage point (ΔEER > 1 pp) under INT4 are kept at INT8; the rest use INT4.
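
The assignment rule can be sketched in a few lines of Python. This is only an illustration of the thresholding step: the function name `assign_precision`, the block names, and the sensitivity values are hypothetical placeholders, not measurements from this model.

```python
from typing import Dict

# Threshold from the rule above: more than 1 percentage point of EER
# degradation under INT4 keeps the block at INT8.
DELTA_EER_THRESHOLD_PP = 1.0


def assign_precision(delta_eer_pp: Dict[str, float],
                     threshold_pp: float = DELTA_EER_THRESHOLD_PP) -> Dict[str, str]:
    """Map each block name to "INT8" or "INT4" from its INT4 sensitivity."""
    return {
        block: "INT8" if delta > threshold_pp else "INT4"
        for block, delta in delta_eer_pp.items()
    }


# Placeholder sensitivities in percentage points (NOT measured values).
example_delta_eer_pp = {"block_a": 1.8, "block_b": 0.3, "block_c": 1.0}
plan = assign_precision(example_delta_eer_pp)
print(plan)
```

Because the comparison is strict, a block that degrades by exactly 1 pp still goes to INT4 in this sketch.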

---

## Training

### Teacher pretraining
- Dataset: VoxCeleb2 (5 994 speakers)
- Loss: ArcFace (s=64, m=0.2 rad)
- Output: FP32 teacher model

### QAT with cosine distillation
The student (quantized) model is trained to reproduce the FP32 teacher embeddings:

```
L_QAT = 1 − (1/B) Σ cos(e_fp32, e_qat)
```
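
A minimal PyTorch sketch of this loss follows. The function name `cosine_distillation_loss` is hypothetical, and detaching the teacher embedding is an assumption about how the teacher is kept fixed; the actual training code may differ.

```python
import torch
import torch.nn.functional as F


def cosine_distillation_loss(e_fp32: torch.Tensor, e_qat: torch.Tensor) -> torch.Tensor:
    """L_QAT = 1 - mean over the batch of cos(e_fp32, e_qat).

    Both inputs have shape [B, D]. The teacher embedding is detached
    (an assumption) so gradients flow only into the quantized student.
    """
    cos = F.cosine_similarity(e_fp32.detach(), e_qat, dim=-1)  # shape [B]
    return 1.0 - cos.mean()


# Identical embeddings give a loss near 0; opposite embeddings give a loss near 2.
e = torch.randn(4, 192)
loss_same = cosine_distillation_loss(e, e.clone())
loss_opposite = cosine_distillation_loss(e, -e)
```

The loss depends only on the direction of the embeddings, which matches how they are compared at verification time.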

Weights are quantized via FakeQuantize with the Straight-Through Estimator (STE), shown here for the INT4 case (integer range −8 to 7):

```
FQ(w) = s × clamp(round(w / s), −8, 7)
s     = max(|w_j|) / 7        # per-group scale, G = 128
```
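
The formula can be sketched in PyTorch as below. The function name `fake_quantize_int4` is hypothetical, and grouping over the flattened weight tensor is an assumption, since the grouping axis is not stated here. The STE is written with the common detach trick: the forward pass returns the quantized value, while the backward pass treats the operation as the identity.

```python
import torch


def fake_quantize_int4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric per-group INT4 fake quantization with a straight-through estimator.

    Assumes w.numel() is divisible by group_size and that groups are formed
    over the flattened weight (an assumption for this sketch).
    """
    groups = w.reshape(-1, group_size)
    # Per-group scale s = max(|w_j|) / 7; clamp_min avoids division by zero.
    s = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(groups / s), -8, 7) * s
    # STE: forward value is q, gradient w.r.t. w is the identity.
    out = groups + (q - groups).detach()
    return out.reshape(w.shape)


w = torch.randn(4, 256, requires_grad=True)
w_q = fake_quantize_int4(w)
w_q.sum().backward()  # w.grad is all ones because of the STE
```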

### Multi-Stage Fine-Tuning (MSFT) — 5 phases, 85 epochs

| Phase | Epochs | Active QAT layers | lr |
|---|---|---|---|
| 1 | 15 | asp_bn, fn (2 / 69) | 1e-4 |
| 2 | 15 | + block2, block3 (42 / 69) | 1e-4 |
| 3 | 20 | + block1, mfa, asp (67 / 69) | 6e-4 |
| 4 | 20 | + block0 (all 69) | 4e-4 |
| 5 | 15 | all 69 — fine-tune | 1e-5 |

QAT is enabled from the least sensitive layers to the most sensitive ones; each phase adds layers to those already active.
BatchNorm (BN) statistics are frozen (eval mode) in all phases.
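
The schedule in the table can be expressed as a small lookup, sketched below in plain Python. The names `Phase`, `SCHEDULE`, and `phase_for_epoch` are hypothetical; the epoch counts, layer groups, and learning rates are copied from the table, and epochs are assumed to be numbered from 0.

```python
from typing import List, NamedTuple, Tuple


class Phase(NamedTuple):
    epochs: int
    newly_enabled: Tuple[str, ...]  # groups whose weights start being fake-quantized
    lr: float


# Values copied from the table above.
SCHEDULE: List[Phase] = [
    Phase(15, ("asp_bn", "fn"), 1e-4),
    Phase(15, ("block2", "block3"), 1e-4),
    Phase(20, ("block1", "mfa", "asp"), 6e-4),
    Phase(20, ("block0",), 4e-4),
    Phase(15, (), 1e-5),  # fine-tune with all layers active
]


def phase_for_epoch(epoch: int) -> Tuple[int, List[str], float]:
    """Return (phase number, cumulative active groups, lr) for a 0-based epoch."""
    active: List[str] = []
    first_epoch = 0
    for number, phase in enumerate(SCHEDULE, start=1):
        active.extend(phase.newly_enabled)
        if epoch < first_epoch + phase.epochs:
            return number, list(active), phase.lr
        first_epoch += phase.epochs
    raise ValueError(f"epoch {epoch} is outside the {first_epoch}-epoch schedule")


total_epochs = sum(p.epochs for p in SCHEDULE)  # 85
```

A training loop could call `phase_for_epoch(epoch)` at the start of each epoch to decide which groups are fake-quantized and which learning rate to set.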

---

## Quick Start

### Requirements

```bash
pip install torch torchaudio torchao
```

### Load the model

```python
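import torch
import torchaudio
# The class definition must be importable so torch.load can unpickle the model.
from model import EcapaTdnn  # noqa: F401  (your ECAPA-TDNN definition)

# Load packed INT4/INT8 weights.
# weights_only=False unpickles arbitrary Python objects, so only load files you trust.
model = torch.load("ecapa_qat_packed.pt", map_location="cpu", weights_only=False)
model.eval()

# Load audio, downmix to mono, and resample to 16 kHz
wav, sr = torchaudio.load("audio.wav")   # shape: [channels, samples]
wav = wav.mean(dim=0, keepdim=True)       # shape: [1, samples]
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Extract embedding
with torch.no_grad():
    embedding = model(wav)          # shape: [1, 192]
    embedding = torch.nn.functional.normalize(embedding, dim=-1)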
```

### Speaker verification

```python
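import torch
import torch.nn.functional as F

# wav_a and wav_b are 16 kHz waveforms of shape [1, samples],
# loaded as in the previous snippet.
with torch.no_grad():
    emb_a = model(wav_a)
    emb_b = model(wav_b)

score = F.cosine_similarity(emb_a, emb_b).item()
decision = "ACCEPT" if score > 0.25 else "REJECT"
print(f"Score: {score:.4f} → {decision}")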
```

---

## Evaluation

Evaluated on **VoxCeleb1-O** (original trial list, 7 097 pairs, 10 speakers).

```
EER  = 2.61 %
```
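
For reference, the EER is the error rate at the threshold where the false-accept rate equals the false-reject rate. The sketch below computes it from a list of trial scores and labels in plain Python. The function name `compute_eer` is hypothetical, the toy scores are placeholders rather than model output, and `eval_qat.py` may compute the metric differently (for example, with interpolation).

```python
from typing import Sequence, Tuple


def compute_eer(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (EER, threshold) for trials labeled 1 = same speaker, 0 = different.

    A trial is accepted when score >= threshold. The EER is taken at the
    candidate threshold where false-accept and false-reject rates are closest.
    """
    n_target = sum(labels)
    n_impostor = len(labels) - n_target
    if n_target == 0 or n_impostor == 0:
        raise ValueError("need at least one target and one impostor trial")

    best_gap, best_eer, best_thr = float("inf"), 1.0, float("inf")
    for thr in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        far = false_accepts / n_impostor
        frr = false_rejects / n_target
        if abs(far - frr) < best_gap:
            best_gap, best_eer, best_thr = abs(far - frr), (far + frr) / 2, thr
    return best_eer, best_thr


# Toy scores (placeholders, not model output).
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
eer, thr = compute_eer(toy_scores, toy_labels)
```

This version is quadratic in the number of trials, which keeps it short; a single pass over sorted scores would be faster on large trial lists.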

To reproduce:

```bash
python eval_qat.py --ckpt models/ecapa_teacher_qat_w4a4/ecapa_teacher_qat_w4a4_phase5_best.pt
```

---

## Model Files

| File | Description | Size |
|---|---|---|
| `ecapa_teacher_qat_w4a4_phase5_best.pt` | Training checkpoint (phase 5 best) | ~28 MB (FP32 layout) |
| `ecapa_qat_packed.pt` | Inference-ready packed INT4/INT8 weights | **4 MB** |

Use `ecapa_qat_packed.pt` for inference. The checkpoint file is provided for reproducibility and further fine-tuning.
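
Much of the size reduction comes from storing two 4-bit codes per byte instead of one 32-bit float per weight. The sketch below shows one possible packing in plain Python. The function names and the nibble order (low nibble first) are assumptions for illustration; the actual layout inside `ecapa_qat_packed.pt` may differ.

```python
from typing import List


def pack_int4(values: List[int]) -> bytes:
    """Pack signed 4-bit integers in [-8, 7] two per byte (low nibble first)."""
    if len(values) % 2:
        values = values + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("INT4 values must lie in [-8, 7]")
        # Two's-complement nibbles: mask to 4 bits, then combine.
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)


def unpack_int4(packed: bytes, count: int) -> List[int]:
    """Inverse of pack_int4; `count` drops any padding nibble."""
    values: List[int] = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


codes = [-8, 7, 0, -1, 3]
packed = pack_int4(codes)  # 3 bytes for 5 codes
assert unpack_int4(packed, len(codes)) == codes
```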

---

## Citation

If you use ECAPA-QAT in your work, please cite the original ECAPA-TDNN paper:

```bibtex
@inproceedings{desplanques2020ecapa,
  title     = {{ECAPA-TDNN}: Emphasized Channel Attention, Propagation and Aggregation in {TDNN} Based Speaker Verification},
  author    = {Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  booktitle = {Proc. Interspeech},
  year      = {2020}
}
```

---

## License

MIT