---
license: mit
language:
- en
tags:
- speaker-verification
- speaker-recognition
- ecapa-tdnn
- quantization
- qat
- mixed-precision
- edge
datasets:
- voxceleb2
metrics:
- eer
---

# ECAPA-QAT

**Quantization-Aware Trained ECAPA-TDNN for Speaker Verification**

A mixed-precision W(4/8)A32 speaker embedding model (4- or 8-bit weights, 32-bit activations) trained with a 5-phase progressive QAT strategy and cosine distillation. It achieves **2.61% EER** on VoxCeleb1-O while fitting in **4 MB** on disk and **7.6 MB** of RAM, which makes it suitable for CPU-only servers, edge devices, and mobile deployment.

---

## Highlights

| Metric | FP32 baseline | **ECAPA-QAT** |
|---|---|---|
| EER (VoxCeleb1-O) | 3.05 % | **2.61 %** |
| File size | 28.2 MB | **4.0 MB** |
| RAM (weights) | 80 MB | **7.6 MB** |
| CPU latency (3 s audio) | — | **66 ms** |
| Parallel sessions / 1 GB RAM | 23 | **61–87** |

> The quantized model **outperforms** its FP32 counterpart by 0.44 pp EER, which suggests that quantization-aware training acts as an implicit regularizer.

---

## Architecture

ECAPA-QAT is based on [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) (Desplanques et al., Interspeech 2020) with C=512 channels.

```
Input: mel-spectrogram (80 filters, 25 ms window, 10 ms hop, 16 kHz)
  │
  ├─ block0   Conv1D + ReLU + BN           (k=5)        → INT8
  ├─ block1   SE-Res2Block                 (k=3, d=2)   → INT8
  ├─ block2   SE-Res2Block                 (k=3, d=3)   → INT4
  ├─ block3   SE-Res2Block                 (k=3, d=4)   → INT4
  ├─ mfa      Conv1D + ReLU                (k=1, MFA)   → INT4
  ├─ asp      Attentive Stat Pooling + BN               → INT8
  └─ fn       FC + BN                                   → INT4

Output: 192-dim L2-normalized speaker embedding
```

The mixed-precision assignment is based on a per-block sensitivity analysis: blocks whose EER degrades by more than 1 percentage point (ΔEER > 1 pp) under INT4 are kept at INT8; the rest use INT4.

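
As a minimal sketch of this rule, the assignment could be expressed as a simple threshold test. The function name `assign_bit_widths` is hypothetical, and the ΔEER numbers are made up for illustration; they are not measurements from this model.

```python
def assign_bit_widths(delta_eer_pp, threshold_pp=1.0):
    """Map each block to 8 bits if INT4 hurts it by more than the threshold, else 4."""
    return {
        name: 8 if delta > threshold_pp else 4
        for name, delta in delta_eer_pp.items()
    }


# Made-up sensitivity numbers in percentage points, for illustration only.
hypothetical_delta_eer = {"block0": 2.3, "block2": 0.4, "fn": 0.1}
print(assign_bit_widths(hypothetical_delta_eer))
```
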

---

## Training

### Teacher pretraining
- Dataset: VoxCeleb2 (5 994 speakers)
- Loss: ArcFace (s=64, m=0.2 rad)
- Output: FP32 teacher model

### QAT with cosine distillation
The quantized student model is trained to reproduce the embeddings of the FP32 teacher. The loss is one minus the mean cosine similarity over a batch of B utterances:

```
L_QAT = 1 − (1/B) Σ cos(e_fp32, e_qat)
```

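
A dependency-free sketch of this loss, written in plain Python for readability. The helper names are hypothetical; an actual training loop would more likely compute the same quantity on batched tensors, for example with `torch.nn.functional.cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distillation_loss(teacher_embs, student_embs):
    """L_QAT = 1 - mean over the batch of cos(e_fp32, e_qat)."""
    sims = [cosine(t, s) for t, s in zip(teacher_embs, student_embs)]
    return 1.0 - sum(sims) / len(sims)


# Toy 2-dim embeddings (the real ones are 192-dim).
teacher = [[1.0, 0.0], [0.0, 2.0]]
student = [[2.0, 0.0], [1.0, 0.0]]  # first pair aligned, second pair orthogonal
print(cosine_distillation_loss(teacher, student))
```
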

Weights are quantized via FakeQuantize with the Straight-Through Estimator (STE). The formula below uses the signed 4-bit range [−8, 7], and each group of G = 128 weights shares one scale s:

```
FQ(w) = s × clamp(round(w / s), −8, 7)
s     = max(|w_j|) / 7        # per-group scale, G = 128
```

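
The forward pass of this quantizer can be sketched in plain Python as follows. This is an illustration under assumptions, not the training code: the function names are hypothetical, the `bits` parameter generalizes the 4-bit formula to other widths using the usual symmetric convention, and the STE (which treats `round` as the identity during backpropagation) is not shown because it only affects gradients.

```python
def fake_quantize_group(weights, bits=4):
    """Quantize-dequantize one group of weights with a shared symmetric scale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # (-8, 7) for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)  # all-zero group: nothing to quantize
    scale = max_abs / qmax  # s = max|w_j| / 7 for 4 bits
    return [scale * min(max(round(w / scale), qmin), qmax) for w in weights]


def fake_quantize(weights, group_size=128, bits=4):
    """Apply fake quantization group by group over a flat list of weights."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(fake_quantize_group(weights[start:start + group_size], bits))
    return out


# One toy group of four weights; values snap to multiples of the scale (about 0.1).
print(fake_quantize([0.7, -0.32, 0.11, 0.0], group_size=4))
```
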

### Multi-Stage Fine-Tuning (MSFT): 5 phases, 85 epochs

| Phase | Epochs | Active QAT layers (cumulative count / total) | Learning rate |
|---|---|---|---|
| 1 | 15 | asp_bn, fn (2 / 69) | 1e-4 |
| 2 | 15 | + block2, block3 (42 / 69) | 1e-4 |
| 3 | 20 | + block1, mfa, asp (67 / 69) | 6e-4 |
| 4 | 20 | + block0 (all 69) | 4e-4 |
| 5 | 15 | all 69, fine-tune | 1e-5 |

Layers are activated from least sensitive to most sensitive.
BN statistics are frozen (eval mode) in all phases.

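
One way to encode this schedule is a small table in code plus a lookup from global epoch to phase. The names `PHASES` and `phase_at` are hypothetical, and the sketch only selects which layer groups are active; it does not perform any training.

```python
# (epochs, newly activated layer groups, learning rate), copied from the table above.
PHASES = [
    (15, ["asp_bn", "fn"], 1e-4),
    (15, ["block2", "block3"], 1e-4),
    (20, ["block1", "mfa", "asp"], 6e-4),
    (20, ["block0"], 4e-4),
    (15, [], 1e-5),  # phase 5: nothing new, fine-tune all layers
]


def phase_at(epoch):
    """Return (phase number, active groups, lr) for a 0-based global epoch."""
    active, first_epoch = [], 0
    for number, (epochs, new_groups, lr) in enumerate(PHASES, start=1):
        active = active + new_groups
        if epoch < first_epoch + epochs:
            return number, active, lr
        first_epoch += epochs
    raise ValueError("epoch is past the end of the schedule")


# The phase lengths add up to the 85 epochs stated in the heading.
assert sum(epochs for epochs, _, _ in PHASES) == 85
print(phase_at(0))
print(phase_at(40))
```
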

---

## Quick Start

### Requirements

```bash
pip install torch torchaudio torchao
```


### Load the model

```python
import torch
import torchaudio
from model import EcapaTdnn  # your ECAPA-TDNN definition; must be importable to unpickle the model

# Load the packed INT4/INT8 model. weights_only=False unpickles arbitrary
# Python objects, so only load files from a source you trust.
model = torch.load("ecapa_qat_packed.pt", map_location="cpu", weights_only=False)
model.eval()

# Load audio, mix down to mono, and resample to 16 kHz if needed
wav, sr = torchaudio.load("audio.wav")
wav = wav.mean(dim=0, keepdim=True)  # [channels, T] -> [1, T]
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Extract the speaker embedding
with torch.no_grad():
    embedding = model(wav)  # shape: [1, 192]
    embedding = torch.nn.functional.normalize(embedding, dim=-1)
```


### Speaker verification

```python
import torch
import torch.nn.functional as F

# wav_a and wav_b are 16 kHz mono waveforms, prepared like `wav` above
with torch.no_grad():
    emb_a = model(wav_a)
    emb_b = model(wav_b)

score = F.cosine_similarity(emb_a, emb_b).item()
# 0.25 is the decision threshold; adjust it for your own error trade-off
decision = "ACCEPT" if score > 0.25 else "REJECT"
print(f"Score: {score:.4f} → {decision}")
```

---

## Evaluation

Evaluated on **VoxCeleb1-O** (original trial list, 7 097 pairs, 10 speakers).

```
EER = 2.61 %
```

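
For reference, the EER is the operating point where the false-acceptance rate equals the false-rejection rate. The sketch below computes it from a list of trial scores and labels using a simple threshold sweep. It is an assumption about how such a metric can be computed, not a copy of `eval_qat.py`; the function name is hypothetical and the toy scores are invented for the example. The sweep is quadratic in the number of trials, which is fine for illustration but slow for large trial lists.

```python
def equal_error_rate(scores, labels):
    """Return the EER given similarity scores and labels (1 = same speaker, 0 = different)."""
    n_target = sum(labels)
    n_impostor = len(labels) - n_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is >= threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        far = false_accepts / n_impostor
        frr = false_rejects / n_target
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Invented toy trials: three same-speaker pairs and three different-speaker pairs.
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(f"EER = {100 * equal_error_rate(toy_scores, toy_labels):.2f} %")
```
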

To reproduce:

```bash
python eval_qat.py --ckpt models/ecapa_teacher_qat_w4a4/ecapa_teacher_qat_w4a4_phase5_best.pt
```

---

## Model Files

| File | Description | Size |
|---|---|---|
| `ecapa_teacher_qat_w4a4_phase5_best.pt` | Training checkpoint (phase 5 best) | ~28 MB (FP32 layout) |
| `ecapa_qat_packed.pt` | Inference-ready packed INT4/INT8 weights | **4 MB** |

Use `ecapa_qat_packed.pt` for inference. The checkpoint file is provided for reproducibility and further fine-tuning.

---

## Citation

If you use ECAPA-QAT in your work, please cite the original ECAPA-TDNN paper:

```bibtex
@inproceedings{desplanques2020ecapa,
  title     = {{ECAPA-TDNN}: Emphasized Channel Attention, Propagation and Aggregation in {TDNN} Based Speaker Verification},
  author    = {Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  booktitle = {Proc. Interspeech},
  year      = {2020}
}
```

---

## License

MIT