ECAPA-QAT

Quantization-Aware Trained ECAPA-TDNN for Speaker Verification

A mixed-precision W(4/8)A32 speaker embedding model trained with a 5-phase progressive QAT strategy and cosine distillation. Achieves 2.61% EER on VoxCeleb1-O while fitting in 4 MB on disk and 7.6 MB in RAM — making it suitable for CPU-only servers, edge devices, and mobile deployment.

Highlights

Metric	FP32 baseline	ECAPA-QAT
EER (VoxCeleb1-O)	3.05 %	2.61 %
File size	28.2 MB	4.0 MB
RAM (weights)	80 MB	7.6 MB
CPU latency (3 s audio)	—	66 ms
Parallel sessions / 1 GB RAM	23	61–87

The quantized model outperforms its FP32 counterpart (2.61 % vs. 3.05 % EER), which suggests that quantization-aware training acts as an implicit regularizer.

Architecture

ECAPA-QAT is based on ECAPA-TDNN (Desplanques et al., Interspeech 2020) with C=512 channels.

Input: mel-spectrogram (80 filters, 25 ms window, 10 ms hop, 16 kHz)
  │
  ├─ block0   Conv1D + ReLU + BN  (k=5)          →  INT8
  ├─ block1   SE-Res2Block        (k=3, d=2)     →  INT8
  ├─ block2   SE-Res2Block        (k=3, d=3)     →  INT4
  ├─ block3   SE-Res2Block        (k=3, d=4)     →  INT4
  ├─ mfa      Conv1D + ReLU       (k=1, MFA)     →  INT4
  ├─ asp      Attentive Stat Pooling + BN        →  INT8
  └─ fn       FC + BN                            →  INT4

Output: 192-dim L2-normalized speaker embedding
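
As a quick check on the front-end settings above, the window and hop can be converted to samples and the frame count for a 3 s clip estimated. This is only a sketch: the helper names are hypothetical, and the frame-count formula assumes a centered (padded) STFT, which is the torchaudio default and may not match the feature extractor actually used by this model.

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate / 1000))


def num_frames_centered(num_samples: int, hop: int) -> int:
    """Frame count for a centered STFT: one frame per hop, plus one."""
    return 1 + num_samples // hop


SAMPLE_RATE = 16_000
win = ms_to_samples(25, SAMPLE_RATE)   # 25 ms window in samples
hop = ms_to_samples(10, SAMPLE_RATE)   # 10 ms hop in samples
frames = num_frames_centered(3 * SAMPLE_RATE, hop)  # frames for 3 s of audio

print(f"window={win} samples, hop={hop} samples, frames={frames}")
```

Under these assumptions a 3 s utterance becomes a feature matrix of 80 mel bins by roughly 300 frames.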

Mixed-precision assignment is based on per-block sensitivity analysis: blocks whose EER degrades by more than 1 percentage point under INT4 (ΔEER > 1 pp) are kept at INT8; the rest use INT4.
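
The assignment rule can be sketched as a simple threshold on per-block ΔEER. The function name and the ΔEER values below are hypothetical placeholders, invented so that the rule reproduces the bit widths shown in the diagram; they are not measured results.

```python
def assign_bit_widths(delta_eer_pp: dict[str, float],
                      threshold_pp: float = 1.0) -> dict[str, int]:
    """Keep a block at INT8 if INT4 costs more than `threshold_pp` EER points."""
    return {name: 8 if delta > threshold_pp else 4
            for name, delta in delta_eer_pp.items()}


# Placeholder sensitivities (percentage points), for illustration only.
example_delta_eer = {
    "block0": 2.4, "block1": 1.3, "block2": 0.4, "block3": 0.3,
    "mfa": 0.2, "asp": 1.8, "fn": 0.1,
}

bits = assign_bit_widths(example_delta_eer)
for name, width in bits.items():
    print(f"{name:7s} -> INT{width}")
```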

Training

Teacher pretraining

Dataset: VoxCeleb2 (5 994 speakers)
Loss: ArcFace (s=64, m=0.2 rad)
Output: FP32 teacher model

QAT with cosine distillation

The student (quantized) model is trained to reproduce the FP32 teacher embeddings:

L_QAT = 1 − (1/B) Σ cos(e_fp32, e_qat)
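
A minimal PyTorch sketch of this loss, assuming embeddings arrive as [B, D] batches and that the teacher is a fixed target (so it is detached). The function name is hypothetical and this is not taken from the training code.

```python
import torch
import torch.nn.functional as F


def cosine_distillation_loss(e_fp32: torch.Tensor, e_qat: torch.Tensor) -> torch.Tensor:
    """L_QAT = 1 - mean over the batch of cos(e_fp32[i], e_qat[i])."""
    # Assumption: no gradient should flow into the teacher embeddings.
    cos = F.cosine_similarity(e_fp32.detach(), e_qat, dim=-1)
    return 1.0 - cos.mean()


torch.manual_seed(0)
teacher = torch.randn(8, 192)                    # stand-in teacher embeddings
student = teacher.clone().requires_grad_(True)   # identical student embeddings
loss = cosine_distillation_loss(teacher, student)
print(f"loss for identical embeddings: {loss.item():.6f}")
```

The loss is 0 when the student matches the teacher direction exactly and 2 when every embedding points the opposite way; it ignores embedding norm.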

Weights are quantized via FakeQuantize with the Straight-Through Estimator (STE). The INT4 case is shown:

FQ(w) = s × clamp(round(w / s), −8, 7)
s     = max(|w_j|) / 7        # per-group scale, G = 128
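
The formula above can be sketched in plain PyTorch as follows. This is an illustrative re-implementation under stated assumptions, not the FakeQuantize module used in training: the function name is hypothetical, the weight tensor is assumed to have a number of elements divisible by G = 128, and the STE is realized with the common detach trick.

```python
import torch


def fake_quantize_int4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Per-group symmetric INT4 fake quantization with a straight-through gradient."""
    groups = w.reshape(-1, group_size)                 # assumes numel % group_size == 0
    s = groups.abs().amax(dim=1, keepdim=True) / 7.0   # s = max(|w_j|) / 7 per group
    s = s.clamp_min(1e-8)                              # guard against all-zero groups
    q = torch.clamp(torch.round(groups / s), -8, 7)    # integer levels in [-8, 7]
    fq = (q * s).reshape(w.shape)                      # dequantized weights
    # STE: the forward value is fq, the backward pass sees an identity function.
    return w + (fq - w).detach()


torch.manual_seed(0)
w = torch.randn(4, 256, requires_grad=True)
w_q = fake_quantize_int4(w)
w_q.sum().backward()
print("max abs error:", (w_q - w).abs().max().item())
print("gradient is identity:", torch.allclose(w.grad, torch.ones_like(w)))
```

Because the scale maps the largest magnitude in each group to level 7, no weight is clipped and the rounding error per weight is at most s / 2.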

Multi-Stage Fine-Tuning (MSFT) — 5 phases, 85 epochs

Phase	Epochs	Active QAT layers	lr
1	15	asp_bn, fn (2 / 69)	1e-4
2	15	+ block2, block3 (42 / 69)	1e-4
3	20	+ block1, mfa, asp (67 / 69)	6e-4
4	20	+ block0 (all 69)	4e-4
5	15	all 69 — fine-tune	1e-5

Layers are activated from least sensitive to most sensitive. BN statistics are frozen (eval mode) in all phases.
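
The schedule in the table can be written down as data, which makes the phase boundaries explicit. The class and function names below are hypothetical and epochs are counted from 0; the block names, epoch counts, and learning rates are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    index: int
    epochs: int
    new_blocks: tuple[str, ...]   # blocks whose QAT is switched on in this phase
    lr: float


PHASES = (
    Phase(1, 15, ("asp_bn", "fn"), 1e-4),
    Phase(2, 15, ("block2", "block3"), 1e-4),
    Phase(3, 20, ("block1", "mfa", "asp"), 6e-4),
    Phase(4, 20, ("block0",), 4e-4),
    Phase(5, 15, (), 1e-5),       # all blocks active, low-lr fine-tune
)


def schedule_at(epoch: int) -> tuple[Phase, list[str]]:
    """Return the phase for a 0-based epoch and the cumulative active blocks."""
    active: list[str] = []
    start = 0
    for phase in PHASES:
        active.extend(phase.new_blocks)
        if epoch < start + phase.epochs:
            return phase, list(active)
        start += phase.epochs
    raise ValueError(f"epoch {epoch} is beyond the {start}-epoch schedule")


total_epochs = sum(p.epochs for p in PHASES)
phase, active = schedule_at(30)
print(f"total={total_epochs}, epoch 30 -> phase {phase.index}, lr={phase.lr}, active={active}")
```

In a PyTorch training loop, freezing BN statistics usually means calling `.eval()` on the BatchNorm modules after `model.train()` at the start of every epoch.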

Quick Start

Requirements

pip install torch torchaudio torchao

Load the model

import torch
import torchaudio
from model import EcapaTdnn  # your ECAPA-TDNN definition; must be importable so torch.load can unpickle the model

# Load packed INT4/INT8 weights
model = torch.load("ecapa_qat_packed.pt", map_location="cpu", weights_only=False)
model.eval()

# Extract embedding
wav, sr = torchaudio.load("audio.wav")       # wav: [channels, samples]
wav = wav.mean(dim=0, keepdim=True)          # downmix to mono: [1, samples]
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

with torch.no_grad():
    embedding = model(wav)          # shape: [1, 192]
    embedding = torch.nn.functional.normalize(embedding, dim=-1)

Speaker verification

import torch.nn.functional as F

# wav_a, wav_b: 16 kHz mono waveforms of shape [1, samples]
with torch.no_grad():
    emb_a = model(wav_a)
    emb_b = model(wav_b)

score = F.cosine_similarity(emb_a, emb_b).item()
decision = "ACCEPT" if score > 0.25 else "REJECT"
print(f"Score: {score:.4f} → {decision}")

Evaluation

Evaluated on VoxCeleb1-O (original trial list, 7 097 pairs, 10 speakers).

EER  = 2.61 %
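
The EER is the operating point at which the false-acceptance rate and the false-rejection rate are equal. A minimal sketch of computing it from trial scores is given here; it is not the code in eval_qat.py, the function name is hypothetical, the toy scores are made up, and tied scores are handled only approximately.

```python
def compute_eer(scores: list[float], labels: list[int]) -> float:
    """Equal error rate from similarity scores; labels are 1 (same speaker) or 0."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Sweep the accept threshold from above the highest score downwards.
    trials = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    false_accepts, misses = 0, n_target      # nothing accepted yet
    best_gap, eer = 1.0, 0.5                 # FAR = 0, FRR = 1 at the start
    for _, label in trials:
        if label == 1:
            misses -= 1                      # this target is now accepted
        else:
            false_accepts += 1               # this impostor is now accepted
        far = false_accepts / n_nontarget
        frr = misses / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy trials, for illustration only.
toy_scores = [0.9, 0.6, 0.7, 0.1]
toy_labels = [1, 1, 0, 0]
toy_eer = compute_eer(toy_scores, toy_labels)
print(f"EER = {100 * toy_eer:.2f} %")
```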

To reproduce:

python eval_qat.py --ckpt models/ecapa_teacher_qat_w4a4/ecapa_teacher_qat_w4a4_phase5_best.pt

Model Files

File	Description	Size
`ecapa_teacher_qat_w4a4_phase5_best.pt`	Training checkpoint (phase 5 best)	~28 MB (FP32 layout)
`ecapa_qat_packed.pt`	Inference-ready packed INT4/INT8 weights	4 MB

Use ecapa_qat_packed.pt for inference. The checkpoint file is provided for reproducibility and further fine-tuning.

Citation

If you use ECAPA-QAT in your work, please cite the original ECAPA-TDNN paper:

@inproceedings{desplanques2020ecapa,
  title     = {{ECAPA-TDNN}: Emphasized Channel Attention, Propagation and Aggregation in {TDNN} Based Speaker Verification},
  author    = {Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  booktitle = {Proc. Interspeech},
  year      = {2020}
}

License

MIT
