---
license: mit
language:
- en
tags:
- speaker-verification
- speaker-recognition
- ecapa-tdnn
- quantization
- qat
- mixed-precision
- edge
datasets:
- voxceleb2
metrics:
- eer
---

# ECAPA-QAT

![ECAPA-QAT banner](banner.svg)

**Quantization-Aware Trained ECAPA-TDNN for Speaker Verification**

A mixed-precision W(4/8)A32 speaker embedding model (4- or 8-bit weights, 32-bit activations) trained with a 5-phase progressive QAT strategy and cosine distillation. It achieves **2.61 % EER** on VoxCeleb1-O while fitting in **4 MB** on disk and **7.6 MB** in RAM, which makes it suitable for CPU-only servers, edge devices, and mobile deployment.

---

## Highlights

| | FP32 baseline | **ECAPA-QAT** |
|---|---|---|
| EER (VoxCeleb1-O) | 3.05 % | **2.61 %** |
| File size | 28.2 MB | **4.0 MB** |
| RAM (weights) | 80 MB | **7.6 MB** |
| CPU latency (3 s audio) | n/a | **66 ms** |
| Parallel sessions / 1 GB RAM | 23 | **61–87** |

> The quantized model **outperforms** its FP32 counterpart (2.61 % vs 3.05 % EER), which suggests that quantization-aware training acts as an implicit regularizer.

---

## Architecture

ECAPA-QAT is based on [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) (Desplanques et al., Interspeech 2020) with C=512 channels.

```
Input: mel-spectrogram (80 filters, 25 ms window, 10 ms hop, 16 kHz)
  │
  ├─ block0   Conv1D + ReLU + BN           (k=5)        → INT8
  ├─ block1   SE-Res2Block                 (k=3, d=2)   → INT8
  ├─ block2   SE-Res2Block                 (k=3, d=3)   → INT4
  ├─ block3   SE-Res2Block                 (k=3, d=4)   → INT4
  ├─ mfa      Conv1D + ReLU                (k=1, MFA)   → INT4
  ├─ asp      Attentive Stat Pooling + BN               → INT8
  └─ fn       FC + BN                                   → INT4

Output: 192-dim L2-normalized speaker embedding
```

Bit widths are assigned from a per-block sensitivity analysis: a block whose EER degrades by more than 1 percentage point (ΔEER > 1 pp) under INT4 stays at INT8; every other block uses INT4.
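The threshold rule above can be sketched in a few lines. This is only an illustration: `assign_bit_widths` is a hypothetical helper, and the ΔEER values are made-up placeholders chosen to reproduce the assignment in the diagram, not measurements from this model.

```python
def assign_bit_widths(delta_eer_pp, threshold_pp=1.0):
    """Return 8 bits for blocks whose INT4 ΔEER exceeds threshold_pp, else 4 bits."""
    return {
        name: 8 if delta > threshold_pp else 4
        for name, delta in delta_eer_pp.items()
    }


# Hypothetical sensitivity results (ΔEER in percentage points under INT4).
# These numbers are placeholders for illustration only.
example_delta_eer = {
    "block0": 1.8,
    "block1": 1.3,
    "block2": 0.4,
    "block3": 0.3,
    "mfa": 0.2,
    "asp": 1.5,
    "fn": 0.1,
}

bits = assign_bit_widths(example_delta_eer)
for name, width in bits.items():
    print(f"{name}: INT{width}")
```

With a real sensitivity sweep, the dictionary would be filled by quantizing one block at a time to INT4 and recording the EER change against the FP32 baseline.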
---

## Training

### Teacher pretraining

- Dataset: VoxCeleb2 (5 994 speakers)
- Loss: ArcFace (s=64, m=0.2 rad)
- Output: FP32 teacher model

### QAT with cosine distillation

The student (quantized) model is trained to reproduce the FP32 teacher embeddings by minimizing one minus the mean cosine similarity over a batch of B utterances:

```
L_QAT = 1 − (1/B) Σ cos(e_fp32, e_qat)
```

Weights are quantized via FakeQuantize with the Straight-Through Estimator (STE), shown here for the signed 4-bit range [−8, 7]:

```
FQ(w) = s × clamp(round(w / s), −8, 7)
s     = max(|w_j|) / 7      # per-group scale, G = 128
```

### Multi-Stage Fine-Tuning (MSFT): 5 phases, 85 epochs

| Phase | Epochs | Active QAT layers | Learning rate |
|---|---|---|---|
| 1 | 15 | asp_bn, fn (2 / 69) | 1e-4 |
| 2 | 15 | + block2, block3 (42 / 69) | 1e-4 |
| 3 | 20 | + block1, mfa, asp (67 / 69) | 6e-4 |
| 4 | 20 | + block0 (all 69) | 4e-4 |
| 5 | 15 | all 69 (fine-tune) | 1e-5 |

Layers are activated from least sensitive to most sensitive. BN statistics are frozen (eval mode) in all phases.

---

## Quick Start

### Requirements

```bash
pip install torch torchaudio torchao
```

### Load the model

```python
import torch
import torchaudio

# The class must be importable so that torch.load can unpickle the module
from model import EcapaTdnn  # your ECAPA-TDNN definition

# Load packed INT4/INT8 weights.
# weights_only=False unpickles arbitrary objects, so only load files you trust.
model = torch.load("ecapa_qat_packed.pt", map_location="cpu", weights_only=False)
model.eval()

# Load audio and resample to 16 kHz if needed
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Extract embedding
with torch.no_grad():
    embedding = model(wav)  # shape: [1, 192] for mono audio
    embedding = torch.nn.functional.normalize(embedding, dim=-1)
```

### Speaker verification

```python
import torch
import torch.nn.functional as F

# wav_a and wav_b are 16 kHz waveforms loaded as in the previous snippet
with torch.no_grad():
    emb_a = model(wav_a)
    emb_b = model(wav_b)

score = F.cosine_similarity(emb_a, emb_b).item()
# Threshold from the original example; tune it on your own trials
decision = "ACCEPT" if score > 0.25 else "REJECT"
print(f"Score: {score:.4f} → {decision}")
```

---

## Evaluation

Evaluated on **VoxCeleb1-O** (original trial list, 7 097 pairs, 10 speakers).
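For reference, the EER is the operating point at which the false-accept rate equals the false-reject rate. The sketch below shows one simple way to estimate it from a list of trial scores; `compute_eer` is a hypothetical helper written for illustration, not the logic used in `eval_qat.py`, and the toy trials are made up.

```python
def compute_eer(scores, labels):
    """Estimate EER from similarity scores and labels (1 = same speaker, 0 = different).

    Sweeps the threshold over the observed scores (accept when score >= threshold)
    and returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Highest scores first, so lowering the threshold accepts more trials
    trials = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    accepted_target = 0
    accepted_nontarget = 0
    best_gap, eer, threshold = float("inf"), None, None

    i = 0
    while i < len(trials):
        thr = trials[i][0]
        # Accept every trial whose score ties with the current threshold
        while i < len(trials) and trials[i][0] == thr:
            if trials[i][1] == 1:
                accepted_target += 1
            else:
                accepted_nontarget += 1
            i += 1
        far = accepted_nontarget / n_nontarget            # false-accept rate
        frr = (n_target - accepted_target) / n_target     # false-reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer, threshold = abs(far - frr), (far + frr) / 2, thr
    return eer, threshold


# Toy trials (made up): one impostor pair scores above one genuine pair
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, thr = compute_eer(toy_scores, toy_labels)
print(f"EER = {100 * eer:.2f} % at threshold {thr}")
```

Evaluation toolkits often interpolate between adjacent thresholds, so their reported EER can differ slightly from this nearest-point estimate.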
```
EER = 2.61 %
```

To reproduce:

```bash
python eval_qat.py --ckpt models/ecapa_teacher_qat_w4a4/ecapa_teacher_qat_w4a4_phase5_best.pt
```

---

## Model Files

| File | Description | Size |
|---|---|---|
| `ecapa_teacher_qat_w4a4_phase5_best.pt` | Training checkpoint (phase 5 best) | ~28 MB (FP32 layout) |
| `ecapa_qat_packed.pt` | Inference-ready packed INT4/INT8 weights | **4 MB** |

Use `ecapa_qat_packed.pt` for inference. The checkpoint file is provided for reproducibility and further fine-tuning.

---

## Citation

If you use ECAPA-QAT in your work, please cite the original ECAPA-TDNN paper:

```bibtex
@inproceedings{desplanques2020ecapa,
  title     = {{ECAPA-TDNN}: Emphasized Channel Attention, Propagation and Aggregation in {TDNN} Based Speaker Verification},
  author    = {Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  booktitle = {Proc. Interspeech},
  year      = {2020}
}
```

---

## License

MIT