# CASE Speaker Embedding v2 (512 channels)
**Carrier-Agnostic Speaker Embeddings (CASE)**: a robust speaker embedding model trained to generalize across acoustic carriers, including phone codecs, webcam microphones, loudspeaker playback chains, and otherwise degraded audio.
## Model Description
This model is based on the ECAPA-TDNN architecture with:
- 512 channels (~6.2M parameters)
- 192-dimensional L2-normalized embeddings
- Global context attention in the pooling layer
- Trained on VoxCeleb2 with CASE v2 augmentation pipeline
## CASE v2 Augmentation Pipeline
The model was trained with a 6-mode carrier augmentation strategy designed to simulate real-world acoustic degradation:
| Mode | Probability | Description |
|---|---|---|
| Clean | 15% | No augmentation |
| Single Codec | 10% | GSM, G.711, Opus, MP3, AAC, G.722 |
| Single Mic | 10% | 10 microphone profiles (webcam, laptop, phone, etc.) |
| Codec + Mic | 15% | VoIP simulation |
| Light Chain | 25% | Reverb → Codec (speech from a reverberant room, then transmitted) |
| Full Chain | 25% | Codec → Speaker → Room → Mic (replay attack) |
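The per-utterance mode selection implied by the table can be sketched as weighted sampling. The probabilities and mode names below come from the table; the function name and structure are illustrative, not the actual training code.

```python
import random

# Mode probabilities taken from the CASE v2 augmentation table above.
# Everything else (names, structure) is an illustrative sketch.
CARRIER_MODES = {
    "clean": 0.15,
    "single_codec": 0.10,
    "single_mic": 0.10,
    "codec_plus_mic": 0.15,
    "light_chain": 0.25,   # reverb -> codec
    "full_chain": 0.25,    # codec -> speaker -> room -> mic
}

def sample_carrier_mode(rng: random.Random) -> str:
    """Draw one augmentation mode per training utterance."""
    modes, weights = zip(*CARRIER_MODES.items())
    return rng.choices(modes, weights=weights, k=1)[0]

# Empirical check: frequencies approximate the table probabilities
rng = random.Random(0)
counts = {m: 0 for m in CARRIER_MODES}
for _ in range(10_000):
    counts[sample_carrier_mode(rng)] += 1
```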
## Usage

### Installation

```bash
pip install torch torchaudio numpy
```
### Quick Start

```python
from model import CASESpeakerEncoder

# Load model
encoder = CASESpeakerEncoder.from_pretrained("./")

# Extract embedding from audio file
embedding = encoder.encode("audio.wav")  # Returns (192,) numpy array

# Verify two speakers
same_speaker = encoder.verify("audio1.wav", "audio2.wav", threshold=0.5)
print(f"Same speaker: {same_speaker}")

# Get similarity score
emb1 = encoder.encode("audio1.wav")
emb2 = encoder.encode("audio2.wav")
similarity = encoder.similarity(emb1, emb2)
print(f"Similarity: {similarity:.3f}")
```
### Direct Model Usage

```python
import torch
import torchaudio

from model import ECAPA_TDNN

# Load model
model = ECAPA_TDNN(channels=512, global_context_att=True)
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load audio (must be 16 kHz)
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
wav = wav.mean(dim=0)  # Mono

# Extract embedding
with torch.no_grad():
    embedding = model(wav.unsqueeze(0))  # (1, 192)
```
### Batch Processing

```python
import numpy as np

# Process multiple files
audio_files = ["spk1_utt1.wav", "spk1_utt2.wav", "spk2_utt1.wav"]
embeddings = encoder.encode_batch(audio_files)  # (N, 192)

# Compute pairwise similarities (embeddings are L2-normalized)
similarity_matrix = embeddings @ embeddings.T
```
## Input Requirements
| Parameter | Value |
|---|---|
| Sample Rate | 16000 Hz |
| Channels | Mono |
| Format | Float32 in [-1, 1] range |
| Min Duration | ~0.5 seconds recommended |
| Max Duration | Any (uses attention pooling) |
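A minimal sketch of enforcing the format rows above (mono, float32, values in [-1, 1]) on raw int16 PCM. The function name is illustrative, and resampling to 16 kHz (e.g. with `torchaudio.transforms.Resample`, as in the usage example) is out of scope here.

```python
import numpy as np

def prepare_input(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM (possibly multi-channel) to mono float32 in [-1, 1].

    Illustrative helper, not part of the model package. Assumes audio is
    already at 16 kHz.
    """
    if pcm.dtype == np.int16:
        wav = pcm.astype(np.float32) / 32768.0  # scale int16 range to [-1, 1]
    else:
        wav = pcm.astype(np.float32)
    if wav.ndim == 2:                # (channels, samples) -> mono downmix
        wav = wav.mean(axis=0)
    return np.clip(wav, -1.0, 1.0)

stereo_int16 = np.array([[0, 16384, -32768],
                         [0, 16384, -32768]], dtype=np.int16)
mono = prepare_input(stereo_int16)
```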
## Output
- Embedding dimension: 192
- Normalization: L2-normalized (unit norm)
- Similarity metric: Cosine similarity (dot product for normalized vectors)
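Because the embeddings are unit-norm, cosine similarity and the dot product coincide. A quick check on random 192-d vectors (the dimension matches the model's output; nothing else here is model-specific):

```python
import numpy as np

# For L2-normalized vectors, cos(a, b) = a . b / (||a|| ||b||) = a . b.
rng = np.random.default_rng(0)
a = rng.standard_normal(192)
a /= np.linalg.norm(a)
b = rng.standard_normal(192)
b /= np.linalg.norm(b)

dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```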
## Training Details
| Parameter | Value |
|---|---|
| Architecture | ECAPA-TDNN (512 channels) |
| Dataset | VoxCeleb2 (5,994 speakers) |
| Loss | AAM-Softmax (margin=0.2, scale=30) |
| Optimizer | Adam (lr=0.001) |
| Epochs | 70 |
| Augmentation | CASE v2 + MUSAN noise |
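The AAM-Softmax row above (margin 0.2, scale 30) applies an additive angular margin to the target-class logit before scaling. A numpy sketch of that logit modification, assuming cosine similarities between an embedding and each class weight are already computed; this illustrates the loss geometry, not the actual training code:

```python
import numpy as np

def aam_logits(cosines: np.ndarray, target: int,
               m: float = 0.2, s: float = 30.0) -> np.ndarray:
    """AAM-Softmax logit modification: target class gets cos(theta + m).

    Sketch with m and s from the training table; making the target angle
    appear larger forces a stricter decision boundary than plain softmax.
    """
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))
    logits = cosines.copy()
    logits[target] = np.cos(theta[target] + m)  # additive angular margin
    return s * logits

cos = np.array([0.8, 0.3, -0.1])
logits = aam_logits(cos, target=0)
```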
## Benchmark Results

Evaluated on the CASE Benchmark:
| Metric | Value |
|---|---|
| Clean EER | 1.22% |
| Absolute EER | 3.53% |
| Degradation | +2.31% |
Per-category breakdown:

| Category | Avg EER |
|---|---|
| Clean | 1.22% |
| Codec | 1.69% |
| Mic | 1.23% |
| Noise | 1.35% |
| Reverb | 6.56% |
| Playback | 9.10% |
Key Finding: Achieves the lowest degradation factor (+2.31%) among tested models, validating the carrier-agnostic training approach.
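The degradation factor is the gap, in absolute percentage points, between the overall (absolute) EER and the clean EER, as reported in the table above:

```python
# Values from the benchmark table above.
clean_eer = 1.22      # EER on clean audio, in %
absolute_eer = 3.53   # EER averaged over all conditions, in %

degradation = round(absolute_eer - clean_eer, 2)  # +2.31 percentage points
```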
## Intended Use
This model is designed for:
- Speaker verification: Determining if two audio samples are from the same speaker
- Speaker identification: Matching against a database of enrolled speakers
- Speaker diarization: As an embedding extractor for clustering
- Robustness testing: Evaluating systems under acoustic degradation
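The identification use case reduces to a nearest-neighbor search over enrolled embeddings. A hypothetical sketch with random unit vectors standing in for real encoder outputs; the speaker names and the noisy-query construction are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def unit(v: np.ndarray) -> np.ndarray:
    """L2-normalize, matching the model's unit-norm embeddings."""
    return v / np.linalg.norm(v)

# Hypothetical enrollment database: one 192-d embedding per speaker.
enrolled = {name: unit(rng.standard_normal(192))
            for name in ["alice", "bob", "carol"]}

# Simulate a query from "bob" with small perturbation.
query = unit(enrolled["bob"] + 0.05 * rng.standard_normal(192))

# Identification = argmax of cosine similarity (dot product for unit vectors).
names = list(enrolled)
scores = np.array([float(enrolled[n] @ query) for n in names])
best = names[int(np.argmax(scores))]
```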
## Robustness Focus
Unlike standard speaker embedding models, CASE is specifically trained to maintain performance when audio is degraded by:
- Telephone codecs (GSM, G.711, AMR)
- VoIP compression (Opus, AAC)
- Microphone variability (webcam, laptop, phone mics)
- Room acoustics and reverberation
- Replay attacks (speaker playback chains)
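Of the codecs listed, G.711's μ-law companding is simple enough to sketch directly. The round trip below is the textbook μ-law compress/quantize/expand cycle (μ = 255), shown only to illustrate the kind of quantization distortion such codecs introduce; it is not the model's actual augmentation code.

```python
import numpy as np

MU = 255.0  # mu-law companding constant used by G.711

def mulaw_encode(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Compress to the mu-law domain and quantize to 2**bits levels."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    levels = 2 ** (bits - 1)
    return np.round(y * levels) / levels

def mulaw_decode(y: np.ndarray) -> np.ndarray:
    """Expand quantized mu-law values back to the linear domain."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

# 100 ms of a 440 Hz tone at 16 kHz, degraded by the companding round trip.
t = np.linspace(0, 0.1, 1600, endpoint=False)
wav = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
degraded = mulaw_decode(mulaw_encode(wav))
err = float(np.max(np.abs(degraded - wav)))  # small but nonzero distortion
```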
## Limitations
- Optimized for speech; may not perform well on non-speech audio
- Best performance with audio >1 second
- Not designed for speaker separation or enhancement
- English-centric training data (VoxCeleb)
## Related Resources
| Resource | Description | Link |
|---|---|---|
| CASE Benchmark | Evaluation dataset with 24 protocols | HuggingFace Dataset |
| Benchmark Code | Evaluation scripts and tools | GitHub |
| Results | Full leaderboard and per-protocol breakdowns | Results |
| Metrics Guide | How to interpret benchmark metrics | Metrics Documentation |
## Citation

If you use this model, please cite:

```bibtex
@misc{case-speaker-embedding,
  title={CASE: Carrier-Agnostic Speaker Embeddings},
  year={2026},
  url={https://github.com/gittb/case-benchmark}
}
```
## License
Apache 2.0
## References
- ECAPA-TDNN: Desplanques et al., 2020
- VoxCeleb: Nagrani et al., 2020