---
license: apache-2.0
language:
- en
tags:
- speaker-verification
- speaker-embedding
- speaker-recognition
- audio
- ecapa-tdnn
- pytorch
pipeline_tag: audio-classification
library_name: pytorch
---

# CASE Speaker Embedding v2 (512 channels) · [CASE Benchmark](https://github.com/gittb/case-benchmark)

**Carrier-Agnostic Speaker Embeddings (CASE)** is a robust speaker embedding model trained to generalize across acoustic carriers, including phone codecs, webcam microphones, loudspeaker playback chains, and otherwise degraded audio.

## Model Description

This model is based on the ECAPA-TDNN architecture:
- **512 channels** (~6.2M parameters)
- **192-dimensional** L2-normalized embeddings
- **Global context attention** in the pooling layer
- Trained on **VoxCeleb2** with the CASE v2 augmentation pipeline

### CASE v2 Augmentation Pipeline

The model was trained with a six-mode carrier augmentation strategy designed to simulate real-world acoustic degradation:

| Mode | Share of Training Set | Description |
|------|-----------------------|-------------|
| Clean | 15% | No augmentation |
| Single Codec | 10% | GSM, G.711, Opus, MP3, AAC, G.722 |
| Single Mic | 10% | 10 microphone profiles (webcam, laptop, phone, etc.) |
| Codec + Mic | 15% | VoIP simulation |
| Light Chain | 25% | Reverb → Codec (reverberant room, then transmission) |
| Full Chain | 25% | Codec → Speaker → Room → Mic (replay attack) |

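As an illustration, per-utterance mode sampling with the distribution above can be sketched in Python; the mode names, weights, and `sample_mode` helper here are hypothetical stand-ins, not the actual training pipeline:

```python
import random

# Hypothetical sketch: draw one augmentation mode per utterance according
# to the training distribution in the table. The augmentations themselves
# (codec, mic, reverb, ...) are applied elsewhere and omitted here.
MODES = [
    ("clean", 0.15),
    ("single_codec", 0.10),
    ("single_mic", 0.10),
    ("codec_plus_mic", 0.15),
    ("light_chain", 0.25),
    ("full_chain", 0.25),
]

def sample_mode(rng: random.Random) -> str:
    """Draw one augmentation mode name by its training-set weight."""
    names, weights = zip(*MODES)
    return rng.choices(names, weights=weights, k=1)[0]

# Sanity check: empirical frequencies should match the table's percentages
rng = random.Random(0)
counts = {name: 0 for name, _ in MODES}
for _ in range(10_000):
    counts[sample_mode(rng)] += 1
print(counts)
```
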
## Usage

### Installation

```bash
pip install torch torchaudio numpy
```

### Quick Start

```python
from model import CASESpeakerEncoder

# Load the model from the local checkpoint directory
encoder = CASESpeakerEncoder.from_pretrained("./")

# Extract an embedding from an audio file
embedding = encoder.encode("audio.wav")  # returns a (192,) numpy array

# Verify whether two recordings come from the same speaker
same_speaker = encoder.verify("audio1.wav", "audio2.wav", threshold=0.5)
print(f"Same speaker: {same_speaker}")

# Get a cosine similarity score
emb1 = encoder.encode("audio1.wav")
emb2 = encoder.encode("audio2.wav")
similarity = encoder.similarity(emb1, emb2)
print(f"Similarity: {similarity:.3f}")
```

### Direct Model Usage

```python
import torch
import torchaudio
from model import ECAPA_TDNN

# Load the model and checkpoint weights
model = ECAPA_TDNN(channels=512, global_context_att=True)
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load audio and resample to the required 16 kHz
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
wav = wav.mean(dim=0)  # downmix to mono

# Extract the embedding
with torch.no_grad():
    embedding = model(wav.unsqueeze(0))  # (1, 192)
```

### Batch Processing

```python
# Continuing from Quick Start (encoder = CASESpeakerEncoder.from_pretrained("./"))
audio_files = ["spk1_utt1.wav", "spk1_utt2.wav", "spk2_utt1.wav"]
embeddings = encoder.encode_batch(audio_files)  # (N, 192) numpy array

# Embeddings are L2-normalized, so the dot product is cosine similarity
similarity_matrix = embeddings @ embeddings.T
```

## Input Requirements

| Parameter | Value |
|-----------|-------|
| Sample rate | 16000 Hz |
| Channels | Mono |
| Format | Float32 in [-1, 1] range |
| Min duration | ~0.5 seconds recommended |
| Max duration | Any (attention pooling handles variable length) |

## Output

- **Embedding dimension**: 192
- **Normalization**: L2-normalized (unit norm)
- **Similarity metric**: Cosine similarity (equivalent to the dot product for unit-norm vectors)

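Because the embeddings are unit-norm, scoring reduces to a plain dot product; a quick NumPy sanity check (with random stand-in vectors, not real embeddings):

```python
import numpy as np

# Check that, for L2-normalized vectors, the dot product equals cosine
# similarity. Random 192-d vectors stand in for real embeddings.
rng = np.random.default_rng(0)
a = rng.standard_normal(192)
b = rng.standard_normal(192)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(abs(dot - cosine) < 1e-12)  # True
```
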
## Training Details

| Parameter | Value |
|-----------|-------|
| Architecture | ECAPA-TDNN (512 channels) |
| Dataset | VoxCeleb2 (5,994 speakers) |
| Loss | AAM-Softmax (margin=0.2, scale=30) |
| Optimizer | Adam (lr=0.001) |
| Epochs | 70 |
| Augmentation | CASE v2 + MUSAN noise |

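For readers unfamiliar with AAM-Softmax, the margin logic can be sketched as below; this is a generic ArcFace-style illustration using the stated margin and scale, not the project's actual training code:

```python
import torch
import torch.nn.functional as F

def aam_softmax_logits(embeddings, weight, labels, margin=0.2, scale=30.0):
    """Additive angular margin logits: cos(theta + m) on the target class.

    embeddings: (B, D) speaker embeddings, weight: (C, D) class weights,
    labels: (B,) integer speaker ids.
    """
    # Cosine similarity between normalized embeddings and class weights
    cosine = F.linear(F.normalize(embeddings), F.normalize(weight))
    cosine = cosine.clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cosine)
    # Add the angular margin only on each sample's target class
    target = F.one_hot(labels, num_classes=weight.shape[0]).bool()
    theta = torch.where(target, theta + margin, theta)
    return scale * torch.cos(theta)

# Toy usage: 4 samples, 10 speaker classes, 192-d embeddings
emb = torch.randn(4, 192)
w = torch.randn(10, 192)
y = torch.tensor([0, 3, 7, 7])
logits = aam_softmax_logits(emb, w, y)
loss = F.cross_entropy(logits, y)
print(loss.item() > 0)  # True
```

The margin penalizes the target-class angle, forcing same-speaker embeddings to cluster more tightly than plain softmax would.
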
## Benchmark Results (CASE Benchmark)

Evaluated on the [CASE Benchmark](https://github.com/gittb/case-benchmark):

| Metric | Value |
|--------|-------|
| **Clean EER** | 1.22% |
| **Absolute EER** | 3.53% |
| **Degradation** | +2.31% |

| Category | Avg EER |
|----------|---------|
| Clean | 1.22% |
| Codec | 1.69% |
| Mic | 1.23% |
| Noise | 1.35% |
| Reverb | 6.56% |
| Playback | 9.10% |

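Equal Error Rate (EER), the metric above, is the operating point where the false-acceptance and false-rejection rates are equal. A generic sketch on toy trial scores, not tied to the benchmark's own tooling:

```python
import numpy as np

def compute_eer(scores, labels):
    """scores: similarity scores; labels: 1 = same speaker, 0 = different."""
    order = np.argsort(scores)[::-1]      # sort trials by descending score
    labels = np.asarray(labels)[order]
    # Sweep the accept threshold over every score value
    tp = np.cumsum(labels)                # same-speaker trials accepted
    fp = np.cumsum(1 - labels)            # different-speaker trials accepted
    fnr = 1 - tp / labels.sum()           # false rejection rate
    fpr = fp / (1 - labels).sum()         # false acceptance rate
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2

# Toy trial list: scores with ground-truth same/different labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"EER: {compute_eer(scores, labels):.2%}")  # EER: 25.00%
```
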
**Key finding:** Among the models tested, this model shows the **lowest degradation** from clean to degraded conditions (+2.31% absolute EER), supporting the carrier-agnostic training approach.

## Intended Use

This model is designed for:
- **Speaker verification**: determining whether two audio samples come from the same speaker
- **Speaker identification**: matching against a database of enrolled speakers
- **Speaker diarization**: as an embedding extractor for clustering
- **Robustness testing**: evaluating systems under acoustic degradation

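The identification use case above amounts to a nearest-neighbor lookup over enrolled embeddings; in this sketch the names and vectors are synthetic stand-ins for real encoder output:

```python
import numpy as np

rng = np.random.default_rng(42)

def normalize(x):
    """L2-normalize along the last axis, mimicking the model's unit-norm output."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Enrollment database: one 192-d embedding per enrolled speaker
enrolled = normalize(rng.standard_normal((5, 192)))
names = ["alice", "bob", "carol", "dave", "eve"]

# Probe: a noisy copy of "carol", as if from a second recording
probe = normalize(enrolled[2] + 0.1 * rng.standard_normal(192))

scores = enrolled @ probe          # cosine similarities (unit-norm vectors)
best = int(np.argmax(scores))
print(names[best])                 # carol
```

In practice a minimum-score threshold should also be applied, so that probes from unenrolled speakers are rejected rather than matched to the nearest entry.
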
### Robustness Focus

Unlike standard speaker embedding models, CASE is trained specifically to maintain performance when audio is degraded by:
- Telephone codecs (GSM, G.711, AMR)
- VoIP compression (Opus, AAC)
- Microphone variability (webcam, laptop, and phone mics)
- Room acoustics and reverberation
- Replay attacks (loudspeaker playback chains)

## Limitations

- Optimized for speech; may not perform well on non-speech audio
- Best performance with audio longer than ~1 second
- Not designed for speaker separation or enhancement
- English-centric training data (VoxCeleb2)

## Related Resources

| Resource | Description | Link |
|----------|-------------|------|
| **CASE Benchmark** | Evaluation dataset with 24 protocols | [Hugging Face Dataset](https://huggingface.co/datasets/gittb/case-benchmark) |
| **Benchmark Code** | Evaluation scripts and tools | [GitHub](https://github.com/gittb/case-benchmark) |
| **Results** | Full leaderboard and per-protocol breakdowns | [Results](https://github.com/gittb/case-benchmark/tree/master/results) |
| **Metrics Guide** | How to interpret benchmark metrics | [Metrics Documentation](https://github.com/gittb/case-benchmark/blob/master/docs/metrics.md) |

## Citation

If you use this model, please cite:

```bibtex
@misc{case-speaker-embedding,
  title={CASE: Carrier-Agnostic Speaker Embeddings},
  year={2026},
  url={https://github.com/gittb/case-benchmark}
}
```

## License

Apache 2.0

## References

- ECAPA-TDNN: [Desplanques et al., 2020](https://arxiv.org/abs/2005.07143)
- VoxCeleb: [Nagrani et al., 2020](https://arxiv.org/abs/2012.06867)