ECAPA-QAT
Quantization-Aware Trained ECAPA-TDNN for Speaker Verification
A mixed-precision W(4/8)A32 speaker embedding model trained with a 5-phase progressive QAT strategy and cosine distillation. Achieves 2.61% EER on VoxCeleb1-O while fitting in 4 MB on disk and 7.6 MB in RAM — making it suitable for CPU-only servers, edge devices, and mobile deployment.
Highlights
| Metric | FP32 baseline | ECAPA-QAT |
|---|---|---|
| EER (VoxCeleb1-O) | 3.05 % | 2.61 % |
| File size | 28.2 MB | 4.0 MB |
| RAM (weights) | 80 MB | 7.6 MB |
| CPU latency (3 s audio) | — | 66 ms |
| Parallel sessions / 1 GB RAM | 23 | 61–87 |
The quantized model reaches a lower EER than its FP32 counterpart (2.61 % vs. 3.05 %), which suggests that quantization-aware training acts as an implicit regularizer.
Architecture
ECAPA-QAT is based on ECAPA-TDNN (Desplanques et al., Interspeech 2020) with C=512 channels.
Input: mel-spectrogram (80 filters, 25 ms window, 10 ms hop, 16 kHz)
│
├─ block0 Conv1D + ReLU + BN (k=5) → INT8
├─ block1 SE-Res2Block (k=3, d=2) → INT8
├─ block2 SE-Res2Block (k=3, d=3) → INT4
├─ block3 SE-Res2Block (k=3, d=4) → INT4
├─ mfa Conv1D + ReLU (k=1, MFA) → INT4
├─ asp Attentive Stat Pooling + BN → INT8
└─ fn FC + BN → INT4
Output: 192-dim L2-normalized speaker embedding
Mixed precision assignment is based on per-block sensitivity analysis: blocks with ΔEER > 1 pp under INT4 are kept at INT8; the rest use INT4.
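The assignment rule above can be sketched in a few lines. This is a minimal illustration, not the project's analysis script: `assign_precision` is a hypothetical helper, and the ΔEER values are placeholders chosen only so that the result matches the diagram above. They are not measured numbers.

```python
# Sketch of the sensitivity-based precision rule described above.
# NOTE: the delta-EER values are illustrative placeholders, not measurements.

THRESHOLD_PP = 1.0  # blocks losing more than 1 pp EER under INT4 stay at INT8


def assign_precision(delta_eer_pp, threshold_pp=THRESHOLD_PP):
    """Map each block name to "INT8" or "INT4" from its EER loss under INT4."""
    return {
        name: "INT8" if delta > threshold_pp else "INT4"
        for name, delta in delta_eer_pp.items()
    }


# Placeholder sensitivities in percentage points (hypothetical values).
placeholder_delta_eer_pp = {
    "block0": 2.0, "block1": 1.5, "block2": 0.4, "block3": 0.3,
    "mfa": 0.2, "asp": 1.2, "fn": 0.1,
}

precision = assign_precision(placeholder_delta_eer_pp)
for name, bits in precision.items():
    print(f"{name:7s} -> {bits}")
```

Because the comparison is strict, a block that degrades by exactly 1 pp would still be assigned INT4 under this reading of the rule.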
Training
Teacher pretraining
- Dataset: VoxCeleb2 (5 994 speakers)
- Loss: ArcFace (s=64, m=0.2 rad)
- Output: FP32 teacher model
QAT with cosine distillation
The student (quantized) model is trained to reproduce the FP32 teacher embeddings:
L_QAT = 1 − (1/B) Σ cos(e_fp32, e_qat)
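As a sanity check on the formula, here is a dependency-free sketch of the loss. The function names and the toy embeddings are made up for illustration. In PyTorch the same quantity would typically be written as `1 - F.cosine_similarity(e_fp32, e_qat, dim=-1).mean()`, presumably with the teacher embeddings detached from the graph.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distillation_loss(teacher_embs, student_embs):
    """L_QAT = 1 - mean over the batch of cos(e_fp32, e_qat)."""
    sims = [cosine(t, s) for t, s in zip(teacher_embs, student_embs)]
    return 1.0 - sum(sims) / len(sims)


# Toy batch of B = 2 embeddings (hypothetical values).
teacher = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]  # first aligned, second orthogonal

loss = cosine_distillation_loss(teacher, student)
print(loss)  # 0.5
```

The loss is 0 when every student embedding points in the same direction as its teacher embedding, regardless of magnitude, and grows toward 2 as the directions become opposite.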
Weights are quantized via FakeQuantize with the Straight-Through Estimator (STE):
FQ(w) = s × clamp(round(w / s), −8, 7)
s = max(|w_j|) / 7 # per-group scale, G = 128
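The two formulas above describe the INT4 case (integer codes in [−8, 7]). A plain-Python sketch of the forward pass, assuming weights are flattened and grouped consecutively, might look like the following. The function name and the zero-group guard are assumptions, not part of the original training code.

```python
def fake_quantize_int4(weights, group_size=128):
    """Quantize-dequantize a flat list of weights with one scale per group."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # s = max(|w_j|) / 7; fall back to 1.0 for an all-zero group (assumption)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        for w in group:
            q = max(-8, min(7, round(w / scale)))  # integer code in [-8, 7]
            out.append(scale * q)                  # dequantized value
    return out


# Two groups of four weights each (toy values, G = 4 for readability).
weights = [7.0, -3.4, 0.6, -6.2, 14.0, 5.2, -1.4, 0.0]
print(fake_quantize_int4(weights, group_size=4))
# [7.0, -3.0, 1.0, -6.0, 14.0, 6.0, -2.0, 0.0]
```

The second group has a larger maximum, so it gets a coarser scale (2.0 instead of 1.0). During training, the Straight-Through Estimator treats the rounding as identity in the backward pass; in PyTorch this is commonly expressed as `w + (fq(w) - w).detach()`.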
Multi-Stage Fine-Tuning (MSFT) — 5 phases, 85 epochs
| Phase | Epochs | Active QAT layers | lr |
|---|---|---|---|
| 1 | 15 | asp_bn, fn (2 / 69) | 1e-4 |
| 2 | 15 | + block2, block3 (42 / 69) | 1e-4 |
| 3 | 20 | + block1, mfa, asp (67 / 69) | 6e-4 |
| 4 | 20 | + block0 (all 69) | 4e-4 |
| 5 | 15 | all 69 — fine-tune | 1e-5 |
Layers are activated from least sensitive to most sensitive. BN statistics are frozen (eval mode) in all phases.
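The schedule in the table can be written down as data, which makes the 85-epoch total easy to check. This is only a sketch: `PHASES` and `phase_for_epoch` are hypothetical names, and the tuple layout is an assumption about how one might encode the table.

```python
# Phase table transcribed from the schedule above:
# (epochs, number of active QAT layers out of 69, learning rate).
PHASES = [
    (15, 2, 1e-4),
    (15, 42, 1e-4),
    (20, 67, 6e-4),
    (20, 69, 4e-4),
    (15, 69, 1e-5),
]


def phase_for_epoch(epoch):
    """Return (phase_number, active_layers, lr) for a 0-based global epoch."""
    first = 0
    for number, (epochs, active, lr) in enumerate(PHASES, start=1):
        if epoch < first + epochs:
            return number, active, lr
        first += epochs
    raise ValueError(f"epoch {epoch} is past the end of the schedule")


total_epochs = sum(epochs for epochs, _, _ in PHASES)
print(total_epochs)          # 85
print(phase_for_epoch(0))    # (1, 2, 0.0001)
print(phase_for_epoch(30))   # (3, 67, 0.0006)
print(phase_for_epoch(84))   # (5, 69, 1e-05)
```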
Quick Start
Requirements
pip install torch torchaudio torchao
Load the model
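import torch
import torchaudio
from model import EcapaTdnn  # your ECAPA-TDNN definition, needed so torch.load can unpickle the model
# Load packed INT4/INT8 weights.
# weights_only=False unpickles arbitrary objects, so only load files you trust.
model = torch.load("ecapa_qat_packed.pt", map_location="cpu", weights_only=False)
model.eval()
# Load audio, downmix to mono, and resample to 16 kHz
wav, sr = torchaudio.load("audio.wav")  # shape: [channels, samples]
wav = wav.mean(dim=0, keepdim=True)     # shape: [1, samples]
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
# Extract embedding
with torch.no_grad():
    embedding = model(wav)  # shape: [1, 192]
    embedding = torch.nn.functional.normalize(embedding, dim=-1)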
Speaker verification
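import torch
import torch.nn.functional as F
# wav_a and wav_b are 16 kHz mono waveforms of shape [1, samples], prepared as in "Load the model"
with torch.no_grad():
    emb_a = model(wav_a)
    emb_b = model(wav_b)
score = F.cosine_similarity(emb_a, emb_b).item()
decision = "ACCEPT" if score > 0.25 else "REJECT"  # 0.25 is the decision threshold
print(f"Score: {score:.4f} → {decision}")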
Evaluation
Evaluated on VoxCeleb1-O (original trial list, 7 097 pairs, 10 speakers).
EER = 2.61 %
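For readers unfamiliar with the metric, the EER is the operating point where the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to compute it from trial scores. It is not the code in `eval_qat.py`: the function name and the toy trials are made up, and tied scores are not given any special handling.

```python
def compute_eer(scores, labels):
    """Equal error rate for similarity scores; labels are 1 (same speaker) or 0."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    # Sweep the threshold from high to low, accepting one more trial per step.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    accepted_targets = 0
    accepted_nontargets = 0
    best_gap, eer = float("inf"), 1.0
    for i in order:
        if labels[i] == 1:
            accepted_targets += 1
        else:
            accepted_nontargets += 1
        far = accepted_nontargets / n_nontarget   # false acceptance rate
        frr = 1.0 - accepted_targets / n_target   # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy trial list (hypothetical scores): one non-target outranks one target.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(compute_eer(scores, labels))  # 0.25
```

In the toy list, accepting the top four trials admits one of four non-targets and misses one of four targets, so both error rates are 25 %.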
To reproduce:
python eval_qat.py --ckpt models/ecapa_teacher_qat_w4a4/ecapa_teacher_qat_w4a4_phase5_best.pt
Model Files
| File | Description | Size |
|---|---|---|
| ecapa_teacher_qat_w4a4_phase5_best.pt | Training checkpoint (phase 5 best) | ~28 MB (FP32 layout) |
| ecapa_qat_packed.pt | Inference-ready packed INT4/INT8 weights | 4 MB |
Use ecapa_qat_packed.pt for inference. The checkpoint file is provided for reproducibility and further fine-tuning.
Citation
If you use ECAPA-QAT in your work, please cite the original ECAPA-TDNN paper:
@inproceedings{desplanques2020ecapa,
title = {{ECAPA-TDNN}: Emphasized Channel Attention, Propagation and Aggregation in {TDNN} Based Speaker Verification},
author = {Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
booktitle = {Proc. Interspeech},
year = {2020}
}
License
MIT