---
license: mit
language:
- en
tags:
- speaker-verification
- speaker-recognition
- ecapa-tdnn
- quantization
- qat
- mixed-precision
- edge
datasets:
- voxceleb2
metrics:
- eer
---
# ECAPA-QAT
![ECAPA-QAT banner](banner.svg)
**Quantization-Aware Trained ECAPA-TDNN for Speaker Verification**
A mixed-precision W(4/8)A32 speaker embedding model (4- or 8-bit weights, 32-bit activations) trained with a 5-phase progressive QAT strategy and cosine distillation. It achieves **2.61 % EER** on VoxCeleb1-O while taking **4 MB** on disk and **7.6 MB** of RAM for weights, which makes it suitable for CPU-only servers, edge devices, and mobile deployment.
---
## Highlights
| Metric | FP32 baseline | **ECAPA-QAT** |
|---|---|---|
| EER (VoxCeleb1-O) | 3.05 % | **2.61 %** |
| File size | 28.2 MB | **4.0 MB** |
| RAM (weights) | 80 MB | **7.6 MB** |
| CPU latency (3 s audio) | — | **66 ms** |
| Parallel sessions / 1 GB RAM | 23 | **61–87** |
> The quantized model **outperforms** its FP32 counterpart on this benchmark (2.61 % vs. 3.05 % EER), which suggests that quantization-aware training acts as an implicit regularizer.
---
## Architecture
ECAPA-QAT is based on [ECAPA-TDNN](https://arxiv.org/abs/2005.07143) (Desplanques et al., Interspeech 2020) with C=512 channels.
```
Input: mel-spectrogram (80 filters, 25 ms window, 10 ms hop, 16 kHz)
├─ block0 Conv1D + ReLU + BN (k=5) → INT8
├─ block1 SE-Res2Block (k=3, d=2) → INT8
├─ block2 SE-Res2Block (k=3, d=3) → INT4
├─ block3 SE-Res2Block (k=3, d=4) → INT4
├─ mfa Conv1D + ReLU (k=1, MFA) → INT4
├─ asp Attentive Stat Pooling + BN → INT8
└─ fn FC + BN → INT4
Output: 192-dim L2-normalized speaker embedding
```
Mixed precision assignment is based on per-block sensitivity analysis:
blocks with ΔEER > 1 pp under INT4 are kept at INT8; the rest use INT4.
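As a rough illustration of that rule, the sketch below maps per-block ΔEER values to bit widths. The ΔEER numbers are made-up placeholders chosen only so that the result matches the diagram above; they are not measurements from this project, and `assign_bit_widths` is a hypothetical helper.

```python
# Hypothetical EER increase (percentage points) when a block alone is set to INT4.
# Placeholder values for illustration only, not measured results.
delta_eer_pp = {
    "block0": 1.8,
    "block1": 1.3,
    "block2": 0.4,
    "block3": 0.3,
    "mfa": 0.2,
    "asp": 1.5,
    "fn": 0.1,
}


def assign_bit_widths(delta_eer_pp, threshold_pp=1.0):
    """Keep blocks whose INT4 degradation exceeds the threshold at INT8."""
    return {
        name: 8 if delta > threshold_pp else 4
        for name, delta in delta_eer_pp.items()
    }


bit_widths = assign_bit_widths(delta_eer_pp)
for name, bits in bit_widths.items():
    print(f"{name:7s} -> INT{bits}")
```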
---
## Training
### Teacher pretraining
- Dataset: VoxCeleb2 (5 994 speakers)
- Loss: ArcFace (s=64, m=0.2 rad)
- Output: FP32 teacher model
### QAT with cosine distillation
The student (quantized) model is trained to reproduce the FP32 teacher embeddings:
```
L_QAT = 1 − (1/B) Σ cos(e_fp32, e_qat)
```
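A minimal PyTorch sketch of this loss, assuming both embeddings have shape `[B, D]`. Detaching the teacher output is an assumption made here so that gradients only reach the student; the function name is illustrative and not taken from the training code.

```python
import torch
import torch.nn.functional as F


def cosine_distillation_loss(e_fp32: torch.Tensor, e_qat: torch.Tensor) -> torch.Tensor:
    """L_QAT = 1 - mean over the batch of cos(e_fp32[b], e_qat[b])."""
    # Assumption: the teacher is a fixed target, so its output is detached.
    cos = F.cosine_similarity(e_fp32.detach(), e_qat, dim=-1)  # shape [B]
    return 1.0 - cos.mean()


torch.manual_seed(0)
teacher = torch.randn(4, 192)                      # stand-in for FP32 teacher embeddings
student = torch.randn(4, 192, requires_grad=True)  # stand-in for student embeddings
loss = cosine_distillation_loss(teacher, student)
loss.backward()                                    # gradients flow to the student only
print(f"loss = {loss.item():.4f}")
```

The loss is 0 when student and teacher embeddings point in the same direction and grows toward 2 as they become opposite.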
Weights are quantized via FakeQuantize with the Straight-Through Estimator (STE). The INT4 case, with signed range [−8, 7], is:
```
FQ(w) = s × clamp(round(w / s), −8, 7)
s = max(|w_j|) / 7 # per-group scale, G = 128
```
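The formula can be written out by hand in PyTorch as a sketch. This is not the torchao API and not necessarily how the training code implements it: `fake_quantize_int4` is a hypothetical helper, the grouping over a flattened weight tensor is an assumption, and the small floor on the scale is added here only to avoid division by zero.

```python
import torch


def fake_quantize_int4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric per-group INT4 fake quantization with a straight-through estimator."""
    # Assumption: groups are consecutive runs of the flattened tensor,
    # so w.numel() must be divisible by group_size.
    groups = w.reshape(-1, group_size)
    s = groups.abs().amax(dim=1, keepdim=True) / 7.0      # per-group scale
    s = s.clamp_min(1e-8)                                  # guard against all-zero groups
    q = s * torch.clamp(torch.round(groups / s), -8, 7)    # FQ(w)
    # STE: the forward pass uses q, the backward pass treats FQ as the identity.
    out = groups + (q - groups).detach()
    return out.reshape(w.shape)


torch.manual_seed(0)
w = torch.randn(4, 256, requires_grad=True)
w_q = fake_quantize_int4(w)
w_q.sum().backward()
print(w.grad.unique())  # every gradient entry is 1 because of the STE
```

For the INT8 blocks the same scheme would presumably use the range [−128, 127] and a divisor of 127, but that is not stated in this README.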
### Multi-Stage Fine-Tuning (MSFT) — 5 phases, 85 epochs
| Phase | Epochs | Active QAT layers | lr |
|---|---|---|---|
| 1 | 15 | asp_bn, fn (2 / 69) | 1e-4 |
| 2 | 15 | + block2, block3 (42 / 69) | 1e-4 |
| 3 | 20 | + block1, mfa, asp (67 / 69) | 6e-4 |
| 4 | 20 | + block0 (all 69) | 4e-4 |
| 5 | 15 | all 69 — fine-tune | 1e-5 |
Layers are activated from least sensitive to most sensitive.
BN statistics are frozen (eval mode) in all phases.
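The schedule in the table can be encoded as data, which makes it easy to look up the active layer groups and learning rate for any epoch. This is a sketch: the `Phase` class and `state_at_epoch` are hypothetical names, and how the training script actually switches quantizers on is not shown in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    epochs: int
    new_groups: tuple  # layer groups whose fake-quantizers are switched on in this phase
    lr: float


SCHEDULE = (
    Phase(15, ("asp_bn", "fn"), 1e-4),
    Phase(15, ("block2", "block3"), 1e-4),
    Phase(20, ("block1", "mfa", "asp"), 6e-4),
    Phase(20, ("block0",), 4e-4),
    Phase(15, (), 1e-5),  # all layers already active, fine-tune only
)


def state_at_epoch(epoch: int):
    """Return (phase number, active groups, lr) for a 0-based global epoch."""
    active, start = [], 0
    for number, phase in enumerate(SCHEDULE, start=1):
        active.extend(phase.new_groups)
        if epoch < start + phase.epochs:
            return number, tuple(active), phase.lr
        start += phase.epochs
    raise ValueError(f"epoch {epoch} is past the end of the schedule")


total_epochs = sum(phase.epochs for phase in SCHEDULE)
print(total_epochs, state_at_epoch(30))
```

In PyTorch, one common way to keep BN statistics frozen is to call `.eval()` on every `torch.nn.BatchNorm1d` module after each `model.train()` call.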
---
## Quick Start
### Requirements
```bash
pip install torch torchaudio torchao
```
### Load the model
```python
import torch
import torchaudio
from model import EcapaTdnn  # your ECAPA-TDNN definition; must be importable so torch.load can unpickle the model

# Load packed INT4/INT8 weights.
# weights_only=False unpickles arbitrary Python objects, so only load files you trust.
model = torch.load("ecapa_qat_packed.pt", map_location="cpu", weights_only=False)
model.eval()

# Load audio and resample to 16 kHz if needed
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# Extract embedding
with torch.no_grad():
    embedding = model(wav)  # shape: [1, 192] for mono input
    embedding = torch.nn.functional.normalize(embedding, dim=-1)
```
### Speaker verification
```python
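import torch
import torch.nn.functional as F

# wav_a and wav_b are 16 kHz waveforms, loaded the same way as in "Load the model"
with torch.no_grad():
    emb_a = model(wav_a)
    emb_b = model(wav_b)
score = F.cosine_similarity(emb_a, emb_b).item()
decision = "ACCEPT" if score > 0.25 else "REJECT"  # 0.25 is the decision threshold
print(f"Score: {score:.4f} → {decision}")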
```
---
## Evaluation
Evaluated on **VoxCeleb1-O** (original trial list, 7 097 pairs, 10 speakers).
```
EER = 2.61 %
```
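For reference, EER is the operating point where the false acceptance rate equals the false rejection rate. The sketch below computes it from a list of trial scores and same/different labels in plain Python. It is not the code in `eval_qat.py`, and `equal_error_rate` is a hypothetical helper that picks the threshold where the two rates are closest.

```python
def equal_error_rate(scores, labels):
    """EER for similarity scores; labels are 1 (same speaker) or 0 (different)."""
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need both target and non-target trials")

    # Sweep the threshold downward; a trial is accepted if score >= threshold.
    trials = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    accepted_targets = accepted_nontargets = 0
    best_gap, eer = 1.0, 0.5  # threshold above every score: FAR = 0, FRR = 1
    i = 0
    while i < len(trials):
        threshold = trials[i][0]
        # Tied scores cross the threshold together.
        while i < len(trials) and trials[i][0] == threshold:
            if trials[i][1] == 1:
                accepted_targets += 1
            else:
                accepted_nontargets += 1
            i += 1
        far = accepted_nontargets / n_nontarget      # false acceptance rate
        frr = 1.0 - accepted_targets / n_target      # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy example: one non-target trial scores above one target trial.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {100 * toy_eer:.2f} %")
```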
To reproduce:
```bash
python eval_qat.py --ckpt models/ecapa_teacher_qat_w4a4/ecapa_teacher_qat_w4a4_phase5_best.pt
```
---
## Model Files
| File | Description | Size |
|---|---|---|
| `ecapa_teacher_qat_w4a4_phase5_best.pt` | Training checkpoint (phase 5 best) | ~28 MB (FP32 layout) |
| `ecapa_qat_packed.pt` | Inference-ready packed INT4/INT8 weights | **4 MB** |
Use `ecapa_qat_packed.pt` for inference. The checkpoint file is provided for reproducibility and further fine-tuning.
---
## Citation
If you use ECAPA-QAT in your work, please cite the original ECAPA-TDNN paper:
```bibtex
@inproceedings{desplanques2020ecapa,
title = {{ECAPA-TDNN}: Emphasized Channel Attention, Propagation and Aggregation in {TDNN} Based Speaker Verification},
author = {Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
booktitle = {Proc. Interspeech},
year = {2020}
}
```
---
## License
MIT