	---
	license: mit
	language:
	- en
	tags:
	- speaker-verification
	- speaker-recognition
	- ecapa-tdnn
	- quantization
	- qat
	- mixed-precision
	- edge
	datasets:
	- voxceleb2
	metrics:
	- eer
	---

	# ECAPA-QAT

	![ECAPA-QAT banner](banner.svg)

	Quantization-Aware Trained ECAPA-TDNN for Speaker Verification

	A mixed-precision W(4/8)A32 speaker embedding model (4-bit or 8-bit weights, 32-bit activations) trained with a 5-phase progressive QAT strategy and cosine distillation. It achieves 2.61 % EER on VoxCeleb1-O while fitting in 4 MB on disk and 7.6 MB of RAM, which makes it suitable for CPU-only servers, edge devices, and mobile deployment.

	---

	## Highlights

	| Metric | FP32 baseline | ECAPA-QAT |
	|---|---|---|
	| EER (VoxCeleb1-O) | 3.05 % | 2.61 % |
	| File size | 28.2 MB | 4.0 MB |
	| RAM (weights) | 80 MB | 7.6 MB |
	| CPU latency (3 s audio) | n/a | 66 ms |
	| Parallel sessions / 1 GB RAM | 23 | 61–87 |
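
The headline ratios follow directly from the table. The short script below is only arithmetic on those numbers; the dictionary keys and variable names are illustrative and not part of the released code.

```python
# Values copied from the Highlights table (FP32 baseline vs. ECAPA-QAT).
fp32 = {"eer_pct": 3.05, "file_mb": 28.2, "ram_mb": 80.0}
qat = {"eer_pct": 2.61, "file_mb": 4.0, "ram_mb": 7.6}

file_ratio = fp32["file_mb"] / qat["file_mb"]  # on-disk compression factor
ram_ratio = fp32["ram_mb"] / qat["ram_mb"]  # weight-memory reduction factor
relative_eer_drop = (fp32["eer_pct"] - qat["eer_pct"]) / fp32["eer_pct"]

print(f"File size: {file_ratio:.2f}x smaller")
print(f"RAM (weights): {ram_ratio:.1f}x smaller")
print(f"EER: {relative_eer_drop:.1%} relative reduction")
```

This works out to a file roughly 7x smaller, about 10.5x less weight memory, and a relative EER reduction of about 14 %.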

	> The quantized model outperforms its FP32 counterpart — quantization-aware training acts as an implicit regularizer.

	---

	## Architecture

	ECAPA-QAT is based on [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) (Desplanques et al., Interspeech 2020) with C=512 channels.

	```
	Input: mel-spectrogram (80 filters, 25 ms window, 10 ms hop, 16 kHz)
	│
	├─ block0 Conv1D + ReLU + BN (k=5) → INT8
	├─ block1 SE-Res2Block (k=3, d=2) → INT8
	├─ block2 SE-Res2Block (k=3, d=3) → INT4
	├─ block3 SE-Res2Block (k=3, d=4) → INT4
	├─ mfa Conv1D + ReLU (k=1, MFA) → INT4
	├─ asp Attentive Stat Pooling + BN → INT8
	└─ fn FC + BN → INT4

	Output: 192-dim L2-normalized speaker embedding
	```
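
The front-end parameters in the diagram translate into sample counts as sketched below. This is arithmetic only. The README does not say how frames are padded, so both framing conventions shown here are assumptions: centered framing as in `torch.stft` with `center=True`, and a non-centered variant that uses the 25 ms window as the frame length.

```python
SAMPLE_RATE = 16_000  # Hz
WINDOW_MS = 25
HOP_MS = 10
N_MELS = 80

win_length = SAMPLE_RATE * WINDOW_MS // 1000  # samples per analysis window
hop_length = SAMPLE_RATE * HOP_MS // 1000  # samples between frame starts


def num_frames(num_samples: int, centered: bool = True) -> int:
    """Number of spectrogram frames for a waveform of num_samples samples."""
    if centered:
        # Assumed convention: the signal is padded so that frames are
        # centered on multiples of hop_length.
        return 1 + num_samples // hop_length
    # Without padding, only windows that fit entirely inside the signal count.
    return 1 + (num_samples - win_length) // hop_length


three_seconds = 3 * SAMPLE_RATE
print(win_length, hop_length)  # 400 160
print((N_MELS, num_frames(three_seconds)))  # (80, 301)
```

Under the centered assumption, the 3 s clip used for the latency figure corresponds to an 80 x 301 input.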

	Mixed-precision assignment is based on a per-block sensitivity analysis:
	blocks whose EER degrades by more than 1 percentage point (ΔEER > 1 pp) under INT4 are kept at INT8; the rest use INT4.
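
The rule above can be expressed as a small helper. This is a sketch of the decision rule only: `assign_precision` is a hypothetical name, and the ΔEER values are placeholders chosen to reproduce the assignment in the diagram, not measurements from this model.

```python
def assign_precision(delta_eer_pp, threshold_pp=1.0):
    """Map each block to "INT8" if its INT4 sensitivity exceeds the threshold, else "INT4"."""
    return {
        block: "INT8" if delta > threshold_pp else "INT4"
        for block, delta in delta_eer_pp.items()
    }


# PLACEHOLDER sensitivities in percentage points, NOT measured values.
# They are picked only so the output matches the architecture diagram.
placeholder_delta_eer = {
    "block0": 2.0, "block1": 1.5, "block2": 0.4, "block3": 0.3,
    "mfa": 0.5, "asp": 1.2, "fn": 0.2,
}

print(assign_precision(placeholder_delta_eer))
```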

	---

	## Training

	### Teacher pretraining
	- Dataset: VoxCeleb2 (5 994 speakers)
	- Loss: ArcFace (s=64, m=0.2 rad)
	- Output: FP32 teacher model
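
For reference, the ArcFace objective with the listed hyperparameters can be sketched for a single sample as follows. It is written in plain Python for readability and is not the training code of this repository. The function name and the example cosines are illustrative. Many implementations also add special handling for the case where θ + m exceeds π, which this sketch omits.

```python
import math


def arcface_loss(cosines, target, s=64.0, m=0.2):
    """ArcFace loss for one sample.

    cosines: cos(theta_j) between the L2-normalized embedding and each class center.
    target:  index of the true speaker.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if j == target:
            c = math.cos(math.acos(c) + m)  # angular margin on the true class only
        logits.append(s * c)
    # Numerically stable softmax cross-entropy: logsumexp(logits) - logits[target]
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


cosines = [0.30, 0.25, 0.10]  # illustrative values; the true speaker is class 0
print(arcface_loss(cosines, target=0, m=0.0))  # plain scaled softmax
print(arcface_loss(cosines, target=0, m=0.2))  # larger, since the margin penalizes the true class
```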

	### QAT with cosine distillation
	The student (quantized) model is trained to reproduce the FP32 teacher embeddings:

	```
	L_QAT = 1 − (1/B) Σ cos(e_fp32, e_qat)
	```
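
A minimal sketch of this loss over a batch of embedding pairs is shown below. It uses plain Python lists for clarity. A training loop would presumably use a tensor version (for example `torch.nn.functional.cosine_similarity`) so that gradients flow to the student. The function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distillation_loss(teacher_batch, student_batch):
    """L_QAT = 1 - mean over the batch of cos(e_fp32, e_qat)."""
    sims = [cosine(t, s) for t, s in zip(teacher_batch, student_batch)]
    return 1.0 - sum(sims) / len(sims)


teacher = [[1.0, 0.0], [1.0, 0.0]]
student = [[2.0, 0.0], [0.0, 1.0]]  # first pair aligned, second pair orthogonal
print(cosine_distillation_loss(teacher, student))  # 0.5
```

The loss is 0 when every student embedding points in the same direction as its teacher embedding and reaches 2 when they point in opposite directions. It ignores embedding magnitude.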

	Weights are quantized via FakeQuantize with the Straight-Through Estimator (STE):

	```
	FQ(w) = s × clamp(round(w / s), −8, 7)
	s = max(|w_j|) / 7 # per-group scale, G = 128
	```
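
The forward pass of this quantizer can be sketched in plain Python as below. Two things are assumptions rather than statements from this README: the `bits` parameter, which generalizes the INT4 formula to other widths (for INT8 it would give a range of −128 to 127 and a divisor of 127), and the handling of all-zero groups. The Straight-Through Estimator only affects the backward pass, where rounding is treated as the identity, so it does not appear here.

```python
def fake_quantize(weights, bits=4, group_size=128):
    """Quantize and dequantize a flat list of weights with one scale per group."""
    qmin = -(2 ** (bits - 1))  # -8 for INT4
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax  # s = max(|w_j|) / 7 for INT4
        if scale == 0.0:
            out.extend([0.0] * len(group))  # all-zero group: nothing to quantize
            continue
        for w in group:
            q = min(max(round(w / scale), qmin), qmax)  # clamp(round(w / s), qmin, qmax)
            out.append(scale * q)
    return out


print(fake_quantize([7.0, -3.2, 1.4, 0.0], bits=4, group_size=4))  # [7.0, -3.0, 1.0, 0.0]
```

Because the scale is derived from the largest magnitude in each group, that value maps to ±7 and is reproduced almost exactly, while the other weights in the group are rounded to the nearest multiple of the scale.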

	### Multi-Stage Fine-Tuning (MSFT) — 5 phases, 85 epochs

	| Phase | Epochs | Active QAT layers | Learning rate |
	|---|---|---|---|
	| 1 | 15 | asp_bn, fn (2 / 69) | 1e-4 |
	| 2 | 15 | + block2, block3 (42 / 69) | 1e-4 |
	| 3 | 20 | + block1, mfa, asp (67 / 69) | 6e-4 |
	| 4 | 20 | + block0 (all 69) | 4e-4 |
	| 5 | 15 | all 69 (fine-tune) | 1e-5 |

	Layers are activated from least sensitive to most sensitive.
	BN statistics are frozen (eval mode) in all phases.
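
One way to encode this schedule is as a small table that a training loop can query per epoch. The sketch below is illustrative only: `Phase`, `SCHEDULE`, and `plan_for_epoch` are hypothetical names, and the group names are taken from the phase table rather than from the actual module names in the checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    epochs: int
    new_groups: tuple  # layer groups that start QAT in this phase
    lr: float


SCHEDULE = (
    Phase(15, ("asp_bn", "fn"), 1e-4),
    Phase(15, ("block2", "block3"), 1e-4),
    Phase(20, ("block1", "mfa", "asp"), 6e-4),
    Phase(20, ("block0",), 4e-4),
    Phase(15, (), 1e-5),  # all groups already active; fine-tune only
)


def plan_for_epoch(epoch):
    """Return (phase_number, active_groups, lr) for a 0-based global epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    active = []
    first_epoch = 0
    for number, phase in enumerate(SCHEDULE, start=1):
        active.extend(phase.new_groups)  # groups accumulate across phases
        if epoch < first_epoch + phase.epochs:
            return number, tuple(active), phase.lr
        first_epoch += phase.epochs
    raise ValueError("epoch is past the end of the schedule")


print(sum(p.epochs for p in SCHEDULE))  # 85
print(plan_for_epoch(15))  # (2, ('asp_bn', 'fn', 'block2', 'block3'), 0.0001)
```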

	---

	## Quick Start

	### Requirements

	```bash
	pip install torch torchaudio torchao
	```

	### Load the model

	```python
	import torch
	import torchaudio
	from model import EcapaTdnn  # your ECAPA-TDNN definition; must be importable for unpickling

	# Load packed INT4/INT8 weights.
	# weights_only=False unpickles the full module, so only load files you trust.
	model = torch.load("ecapa_qat_packed.pt", map_location="cpu", weights_only=False)
	model.eval()

	# Load audio and resample to 16 kHz if needed
	wav, sr = torchaudio.load("audio.wav")
	if sr != 16000:
	    wav = torchaudio.functional.resample(wav, sr, 16000)

	# Extract embedding
	with torch.no_grad():
	    embedding = model(wav)  # shape: [1, 192]
	    embedding = torch.nn.functional.normalize(embedding, dim=-1)
	```

	### Speaker verification

	```python
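	import torch
	import torch.nn.functional as F

	# wav_a, wav_b: 16 kHz waveforms loaded as in the previous snippet
	with torch.no_grad():
	    emb_a = model(wav_a)
	    emb_b = model(wav_b)

	# Cosine similarity is scale-invariant, so no explicit L2 normalization is needed here
	score = F.cosine_similarity(emb_a, emb_b).item()
	decision = "ACCEPT" if score > 0.25 else "REJECT"  # tune the threshold for your operating point
	print(f"Score: {score:.4f} → {decision}")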
	```

	---

	## Evaluation

	Evaluated on VoxCeleb1-O (original trial list, 7 097 pairs, 10 speakers).

	```
	EER = 2.61 %
	```

	To reproduce:

	```bash
	python eval_qat.py --ckpt models/ecapa_teacher_qat_w4a4/ecapa_teacher_qat_w4a4_phase5_best.pt
	```
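
For readers unfamiliar with the metric, the equal error rate is the operating point where the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from trial scores. It is not the code in `eval_qat.py`, which may interpolate or break ties differently, and the function name and example scores are illustrative.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate EER by scanning every observed score as a threshold (O(n^2), fine for a sketch)."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False acceptance: different-speaker trials scored at or above the threshold
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: same-speaker trials scored below the threshold
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


same_speaker = [0.9, 0.8, 0.3]  # illustrative scores
different_speaker = [0.7, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))  # one error of each kind out of three, so about 0.333
```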

	---

	## Model Files

	| File | Description | Size |
	|---|---|---|
	| `ecapa_teacher_qat_w4a4_phase5_best.pt` | Training checkpoint (phase 5 best) | ~28 MB (FP32 layout) |
	| `ecapa_qat_packed.pt` | Inference-ready packed INT4/INT8 weights | 4 MB |

	Use `ecapa_qat_packed.pt` for inference. The checkpoint file is provided for reproducibility and further fine-tuning.

	---

	## Citation

	If you use ECAPA-QAT in your work, please cite the original ECAPA-TDNN paper:

	```bibtex
	@inproceedings{desplanques2020ecapa,
	title = {{ECAPA-TDNN}: Emphasized Channel Attention, Propagation and Aggregation in {TDNN} Based Speaker Verification},
	author = {Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
	booktitle = {Proc. Interspeech},
	year = {2020}
	}
	```

	---

	## License

	MIT