---
license: apache-2.0
language:
- en
tags:
- speaker-verification
- speaker-embedding
- speaker-recognition
- audio
- ecapa-tdnn
- pytorch
pipeline_tag: audio-classification
library_name: pytorch
---
# CASE Speaker Embedding v2 (512 channels) [Case Benchmark](https://github.com/gittb/case-benchmark)
**Carrier-Agnostic Speaker Embeddings (CASE)**: a robust speaker embedding model trained to generalize across acoustic carriers, including phone codecs, webcam microphones, speaker-playback chains, and otherwise degraded audio.
## Model Description
This model is based on the ECAPA-TDNN architecture with:
- **512 channels** (~6.2M parameters)
- **192-dimensional** L2-normalized embeddings
- **Global context attention** in the pooling layer
- Trained on **VoxCeleb2** with CASE v2 augmentation pipeline
### CASE v2 Augmentation Pipeline
The model was trained with a 6-mode carrier augmentation strategy designed to simulate real-world acoustic degradation:
| Mode | Distribution in Training Set | Description |
|------|-------------|-------------|
| Clean | 15% | No augmentation |
| Single Codec | 10% | GSM, G.711, Opus, MP3, AAC, G.722 |
| Single Mic | 10% | 10 microphone profiles (webcam, laptop, phone, etc.) |
| Codec + Mic | 15% | VoIP simulation |
| Light Chain | 25% | Reverb → Codec (reverberant room transmitted) |
| Full Chain | 25% | Codec → Speaker → Room → Mic (replay attack) |
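The per-utterance mode sampling above can be sketched as follows. This is a hypothetical illustration (the mode names and `sample_mode` helper are assumptions); only the probabilities come from the table.

```python
import random

# Mode distribution from the CASE v2 augmentation table (assumed names).
MODES = ["clean", "single_codec", "single_mic", "codec_mic", "light_chain", "full_chain"]
WEIGHTS = [0.15, 0.10, 0.10, 0.15, 0.25, 0.25]

def sample_mode(rng: random.Random) -> str:
    """Pick one augmentation mode per training utterance."""
    return rng.choices(MODES, weights=WEIGHTS, k=1)[0]

# Sanity check: sample many utterances and tally the modes.
rng = random.Random(0)
counts = {m: 0 for m in MODES}
for _ in range(10_000):
    counts[sample_mode(rng)] += 1
```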
## Usage
### Installation
```bash
pip install torch torchaudio numpy
```
### Quick Start
```python
from model import CASESpeakerEncoder
# Load model
encoder = CASESpeakerEncoder.from_pretrained("./")
# Extract embedding from audio file
embedding = encoder.encode("audio.wav") # Returns (192,) numpy array
# Verify two speakers
same_speaker = encoder.verify("audio1.wav", "audio2.wav", threshold=0.5)
print(f"Same speaker: {same_speaker}")
# Get similarity score
emb1 = encoder.encode("audio1.wav")
emb2 = encoder.encode("audio2.wav")
similarity = encoder.similarity(emb1, emb2)
print(f"Similarity: {similarity:.3f}")
```
### Direct Model Usage
```python
import torch
import torchaudio
from model import ECAPA_TDNN
# Load model
model = ECAPA_TDNN(channels=512, global_context_att=True)
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()
# Load audio (must be 16 kHz)
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
wav = wav.mean(dim=0)  # Downmix to mono
# Extract embedding
with torch.no_grad():
    embedding = model(wav.unsqueeze(0))  # (1, 192)
```
### Batch Processing
```python
import numpy as np

# Process multiple files
audio_files = ["spk1_utt1.wav", "spk1_utt2.wav", "spk2_utt1.wav"]
embeddings = encoder.encode_batch(audio_files)  # (N, 192)
# Pairwise cosine similarities (embeddings are L2-normalized)
similarity_matrix = np.asarray(embeddings) @ np.asarray(embeddings).T
```
## Input Requirements
| Parameter | Value |
|-----------|-------|
| Sample Rate | 16000 Hz |
| Channels | Mono |
| Format | Float32 in [-1, 1] range |
| Min Duration | ~0.5 seconds recommended |
| Max Duration | Any (uses attention pooling) |
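A minimal sketch of conditioning arbitrary PCM to the requirements above (16 kHz, mono, float32 in [-1, 1]). The `condition` helper is an assumption, and `np.interp` is only a stand-in for a proper resampler such as `torchaudio.transforms.Resample`.

```python
import numpy as np

TARGET_SR = 16000  # required sample rate from the table above

def condition(wav: np.ndarray, sr: int) -> np.ndarray:
    """Convert audio to 16 kHz mono float32 in [-1, 1] (illustrative only)."""
    wav = wav.astype(np.float32)
    if wav.ndim == 2:                 # (channels, samples) -> mono
        wav = wav.mean(axis=0)
    if sr != TARGET_SR:               # naive linear resample; use a real
        n_out = int(round(len(wav) * TARGET_SR / sr))  # resampler in practice
        wav = np.interp(
            np.linspace(0, len(wav) - 1, n_out),
            np.arange(len(wav)),
            wav,
        ).astype(np.float32)
    peak = np.abs(wav).max()
    if peak > 1.0:                    # rescale out-of-range input into [-1, 1]
        wav = wav / peak
    return wav
```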
## Output
- **Embedding dimension**: 192
- **Normalization**: L2-normalized (unit norm)
- **Similarity metric**: Cosine similarity (dot product for normalized vectors)
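Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, as the following sketch shows (random vectors stand in for real 192-dim embeddings):

```python
import numpy as np

# Two unit-norm stand-ins for encoder outputs.
rng = np.random.default_rng(0)
emb1 = rng.standard_normal(192); emb1 /= np.linalg.norm(emb1)
emb2 = rng.standard_normal(192); emb2 /= np.linalg.norm(emb2)

# Dot product vs. the full cosine formula: identical for unit vectors.
dot = float(emb1 @ emb2)
cosine = float(emb1 @ emb2 / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
```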
## Training Details
| Parameter | Value |
|-----------|-------|
| Architecture | ECAPA-TDNN (512 channels) |
| Dataset | VoxCeleb2 (5,994 speakers) |
| Loss | AAM-Softmax (margin=0.2, scale=30) |
| Optimizer | Adam (lr=0.001) |
| Epochs | 70 |
| Augmentation | CASE v2 + MUSAN noise |
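The AAM-Softmax loss in the table applies an additive angular margin to the target class before scaling. A minimal numpy sketch of that logit adjustment, using the margin=0.2 and scale=30 values above (`aam_logits` and `cosines` are illustrative names, not the training code):

```python
import numpy as np

MARGIN, SCALE = 0.2, 30.0  # values from the training table

def aam_logits(cosines: np.ndarray, target: int) -> np.ndarray:
    """Add the angular margin to the target class, then scale all logits."""
    logits = cosines.copy()
    theta = np.arccos(np.clip(cosines[target], -1.0, 1.0))
    logits[target] = np.cos(theta + MARGIN)  # cos(theta + m) penalizes the target
    return SCALE * logits
```

The margin makes the target class harder to satisfy during training, which pushes embeddings of the same speaker closer together on the hypersphere.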
## Benchmark Results (CASE Benchmark)
Evaluated on the [CASE Benchmark](https://github.com/gittb/case-benchmark):
| Metric | Value |
|--------|-------|
| **Clean EER** | 1.22% |
| **Absolute EER** | 3.53% |
| **Degradation** (Absolute - Clean) | +2.31% |

| Category | Avg EER |
|----------|---------|
| Clean | 1.22% |
| Codec | 1.69% |
| Mic | 1.23% |
| Noise | 1.35% |
| Reverb | 6.56% |
| Playback | 9.10% |
**Key Finding:** This model achieves the **lowest degradation** (+2.31 percentage points from clean to absolute EER) among tested models, supporting the carrier-agnostic training approach.
## Intended Use
This model is designed for:
- **Speaker verification**: Determining if two audio samples are from the same speaker
- **Speaker identification**: Matching against a database of enrolled speakers
- **Speaker diarization**: As an embedding extractor for clustering
- **Robustness testing**: Evaluating systems under acoustic degradation
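The speaker-identification use case above amounts to a nearest-neighbor search over enrolled embeddings. A hypothetical sketch (random unit vectors stand in for real encoder outputs; the enrollment names are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def unit(v: np.ndarray) -> np.ndarray:
    """L2-normalize, matching the model's output convention."""
    return v / np.linalg.norm(v)

# Enrollment database: one unit-norm embedding per known speaker.
enrolled = {name: unit(rng.standard_normal(192)) for name in ["alice", "bob", "carol"]}

# A query embedding: a noisy copy of "bob" stands in for a new utterance.
query = unit(enrolled["bob"] + 0.1 * rng.standard_normal(192))

# Identify by highest cosine similarity (dot product, since all unit-norm).
scores = {name: float(emb @ query) for name, emb in enrolled.items()}
best = max(scores, key=scores.get)
```

In practice you would threshold the best score as well, so that unknown speakers are rejected rather than forced onto the nearest enrollment.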
### Robustness Focus
Unlike standard speaker embedding models, CASE is specifically trained to maintain performance when audio is degraded by:
- Telephone codecs (GSM, G.711, AMR)
- VoIP compression (Opus, AAC)
- Microphone variability (webcam, laptop, phone mics)
- Room acoustics and reverberation
- Replay attacks (speaker playback chains)
## Limitations
- Optimized for speech; may not perform well on non-speech audio
- Best performance with audio >1 second
- Not designed for speaker separation or enhancement
- English-centric training data (VoxCeleb)
## Related Resources
| Resource | Description | Link |
|----------|-------------|------|
| **CASE Benchmark** | Evaluation dataset with 24 protocols | [HuggingFace Dataset](https://huggingface.co/datasets/gittb/case-benchmark) |
| **Benchmark Code** | Evaluation scripts and tools | [GitHub](https://github.com/gittb/case-benchmark) |
| **Results** | Full leaderboard and per-protocol breakdowns | [Results](https://github.com/gittb/case-benchmark/tree/master/results) |
| **Metrics Guide** | How to interpret benchmark metrics | [Metrics Documentation](https://github.com/gittb/case-benchmark/blob/master/docs/metrics.md) |
## Citation
If you use this model, please cite:
```bibtex
@misc{case-speaker-embedding,
  title={CASE: Carrier-Agnostic Speaker Embeddings},
  year={2026},
  url={https://github.com/gittb/case-benchmark}
}
```
## License
Apache 2.0
## References
- ECAPA-TDNN: [Desplanques et al., 2020](https://arxiv.org/abs/2005.07143)
- VoxCeleb: [Nagrani et al., 2020](https://arxiv.org/abs/2012.06867)