# CASE Speaker Embedding v2 (512 channels)
**Carrier-Agnostic Speaker Embeddings (CASE)**: a robust speaker embedding model trained to generalize across acoustic carriers, including phone codecs, webcam microphones, loudspeaker playback chains, and otherwise degraded audio.
## Model Description
This model is based on the ECAPA-TDNN architecture with:
- 512 channels (~6.2M parameters)
- 192-dimensional L2-normalized embeddings
- Global context attention in the pooling layer
- Trained on VoxCeleb2 with CASE v2 augmentation pipeline
## CASE v2 Augmentation Pipeline
The model was trained with a 6-mode carrier augmentation strategy designed to simulate real-world acoustic degradation:
| Mode | Probability | Description |
|---|---|---|
| Clean | 15% | No augmentation |
| Single Codec | 10% | GSM, G.711, Opus, MP3, AAC, G.722 |
| Single Mic | 10% | 10 microphone profiles (webcam, laptop, phone, etc.) |
| Codec + Mic | 15% | VoIP simulation |
| Light Chain | 25% | Reverb → Codec (speech from a reverberant room, then transmitted) |
| Full Chain | 25% | Codec → Speaker → Room → Mic (replay attack) |
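The per-utterance mode selection implied by the table can be sketched as weighted sampling. The probabilities and mode names below come from the table; the function name and structure are illustrative, not the actual training code.

```python
import random

# Mode probabilities taken from the CASE v2 augmentation table above.
# Everything else (names, structure) is an illustrative sketch.
CARRIER_MODES = {
    "clean": 0.15,
    "single_codec": 0.10,
    "single_mic": 0.10,
    "codec_plus_mic": 0.15,
    "light_chain": 0.25,   # reverb -> codec
    "full_chain": 0.25,    # codec -> speaker -> room -> mic
}

def sample_carrier_mode(rng: random.Random) -> str:
    """Draw one augmentation mode per training utterance."""
    modes, weights = zip(*CARRIER_MODES.items())
    return rng.choices(modes, weights=weights, k=1)[0]

# Empirical check: frequencies approximate the table probabilities
rng = random.Random(0)
counts = {m: 0 for m in CARRIER_MODES}
for _ in range(10_000):
    counts[sample_carrier_mode(rng)] += 1
```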
## Usage

### Installation

```bash
pip install torch torchaudio numpy
```
### Quick Start

```python
from model import CASESpeakerEncoder

# Load model
encoder = CASESpeakerEncoder.from_pretrained("./")

# Extract embedding from audio file
embedding = encoder.encode("audio.wav")  # Returns (192,) numpy array

# Verify two speakers
same_speaker = encoder.verify("audio1.wav", "audio2.wav", threshold=0.5)
print(f"Same speaker: {same_speaker}")

# Get similarity score
emb1 = encoder.encode("audio1.wav")
emb2 = encoder.encode("audio2.wav")
similarity = encoder.similarity(emb1, emb2)
print(f"Similarity: {similarity:.3f}")
```
### Direct Model Usage

```python
import torch
import torchaudio

from model import ECAPA_TDNN

# Load model
model = ECAPA_TDNN(channels=512, global_context_att=True)
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load audio (must be 16 kHz)
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
wav = wav.mean(dim=0)  # Mono

# Extract embedding
with torch.no_grad():
    embedding = model(wav.unsqueeze(0))  # (1, 192)
```
### Batch Processing

```python
import numpy as np

# Process multiple files
audio_files = ["spk1_utt1.wav", "spk1_utt2.wav", "spk2_utt1.wav"]
embeddings = encoder.encode_batch(audio_files)  # (N, 192)

# Compute pairwise similarities (embeddings are L2-normalized)
similarity_matrix = embeddings @ embeddings.T
```
## Input Requirements
| Parameter | Value |
|---|---|
| Sample Rate | 16000 Hz |
| Channels | Mono |
| Format | Float32 in [-1, 1] range |
| Min Duration | ~0.5 seconds recommended |
| Max Duration | Any (uses attention pooling) |
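A minimal sketch of enforcing the format rows above (mono, float32, values in [-1, 1]) on raw int16 PCM. The function name is illustrative, and resampling to 16 kHz (e.g. with `torchaudio.transforms.Resample`, as in the usage example) is out of scope here.

```python
import numpy as np

def prepare_input(pcm: np.ndarray) -> np.ndarray:
    """Convert int16 PCM (possibly multi-channel) to mono float32 in [-1, 1].

    Illustrative helper, not part of the model package. Assumes audio is
    already at 16 kHz.
    """
    if pcm.dtype == np.int16:
        wav = pcm.astype(np.float32) / 32768.0  # scale int16 range to [-1, 1]
    else:
        wav = pcm.astype(np.float32)
    if wav.ndim == 2:                # (channels, samples) -> mono downmix
        wav = wav.mean(axis=0)
    return np.clip(wav, -1.0, 1.0)

stereo_int16 = np.array([[0, 16384, -32768],
                         [0, 16384, -32768]], dtype=np.int16)
mono = prepare_input(stereo_int16)
```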
## Output
- Embedding dimension: 192
- Normalization: L2-normalized (unit norm)
- Similarity metric: Cosine similarity (dot product for normalized vectors)
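Because the embeddings are unit-norm, cosine similarity and the dot product coincide. A quick check on random 192-d vectors (the dimension matches the model's output; nothing else here is model-specific):

```python
import numpy as np

# For L2-normalized vectors, cos(a, b) = a . b / (||a|| ||b||) = a . b.
rng = np.random.default_rng(0)
a = rng.standard_normal(192)
a /= np.linalg.norm(a)
b = rng.standard_normal(192)
b /= np.linalg.norm(b)

dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```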
## Training Details
| Parameter | Value |
|---|---|
| Architecture | ECAPA-TDNN (512 channels) |
| Dataset | VoxCeleb2 (5,994 speakers) |
| Loss | AAM-Softmax (margin=0.2, scale=30) |
| Optimizer | Adam (lr=0.001) |
| Epochs | 70 |
| Augmentation | CASE v2 + MUSAN noise |
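The AAM-Softmax row above (margin 0.2, scale 30) applies an additive angular margin to the target-class logit before scaling. A numpy sketch of that logit modification, assuming cosine similarities between an embedding and each class weight are already computed; this illustrates the loss geometry, not the actual training code:

```python
import numpy as np

def aam_logits(cosines: np.ndarray, target: int,
               m: float = 0.2, s: float = 30.0) -> np.ndarray:
    """AAM-Softmax logit modification: target class gets cos(theta + m).

    Sketch with m and s from the training table; making the target angle
    appear larger forces a stricter decision boundary than plain softmax.
    """
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))
    logits = cosines.copy()
    logits[target] = np.cos(theta[target] + m)  # additive angular margin
    return s * logits

cos = np.array([0.8, 0.3, -0.1])
logits = aam_logits(cos, target=0)
```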
## Benchmark Results

Evaluated on the CASE Benchmark:
| Metric | Value |
|---|---|
| Clean EER | 1.22% |
| Absolute EER | 3.53% |
| Degradation | +2.31% |
Per-category breakdown:

| Category | Avg EER |
|---|---|
| Clean | 1.22% |
| Codec | 1.69% |
| Mic | 1.23% |
| Noise | 1.35% |
| Reverb | 6.56% |
| Playback | 9.10% |
Key Finding: Achieves the lowest degradation factor (+2.31%) among tested models, validating the carrier-agnostic training approach.
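The degradation factor is the gap, in absolute percentage points, between the overall (absolute) EER and the clean EER, as reported in the table above:

```python
# Values from the benchmark table above.
clean_eer = 1.22      # EER on clean audio, in %
absolute_eer = 3.53   # EER averaged over all conditions, in %

degradation = round(absolute_eer - clean_eer, 2)  # +2.31 percentage points
```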
## Intended Use
This model is designed for:
- Speaker verification: Determining if two audio samples are from the same speaker
- Speaker identification: Matching against a database of enrolled speakers
- Speaker diarization: As an embedding extractor for clustering
- Robustness testing: Evaluating systems under acoustic degradation
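The identification use case reduces to a nearest-neighbor search over enrolled embeddings. A hypothetical sketch with random unit vectors standing in for real encoder outputs; the speaker names and the noisy-query construction are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def unit(v: np.ndarray) -> np.ndarray:
    """L2-normalize, matching the model's unit-norm embeddings."""
    return v / np.linalg.norm(v)

# Hypothetical enrollment database: one 192-d embedding per speaker.
enrolled = {name: unit(rng.standard_normal(192))
            for name in ["alice", "bob", "carol"]}

# Simulate a query from "bob" with small perturbation.
query = unit(enrolled["bob"] + 0.05 * rng.standard_normal(192))

# Identification = argmax of cosine similarity (dot product for unit vectors).
names = list(enrolled)
scores = np.array([float(enrolled[n] @ query) for n in names])
best = names[int(np.argmax(scores))]
```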
## Robustness Focus
Unlike standard speaker embedding models, CASE is specifically trained to maintain performance when audio is degraded by:
- Telephone codecs (GSM, G.711, AMR)
- VoIP compression (Opus, AAC)
- Microphone variability (webcam, laptop, phone mics)
- Room acoustics and reverberation
- Replay attacks (speaker playback chains)
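Of the codecs listed, G.711's μ-law companding is simple enough to sketch directly. The round trip below is the textbook μ-law compress/quantize/expand cycle (μ = 255), shown only to illustrate the kind of quantization distortion such codecs introduce; it is not the model's actual augmentation code.

```python
import numpy as np

MU = 255.0  # mu-law companding constant used by G.711

def mulaw_encode(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Compress to the mu-law domain and quantize to 2**bits levels."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    levels = 2 ** (bits - 1)
    return np.round(y * levels) / levels

def mulaw_decode(y: np.ndarray) -> np.ndarray:
    """Expand quantized mu-law values back to the linear domain."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

# 100 ms of a 440 Hz tone at 16 kHz, degraded by the companding round trip.
t = np.linspace(0, 0.1, 1600, endpoint=False)
wav = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
degraded = mulaw_decode(mulaw_encode(wav))
err = float(np.max(np.abs(degraded - wav)))  # small but nonzero distortion
```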
## Limitations
- Optimized for speech; may not perform well on non-speech audio
- Best performance with audio >1 second
- Not designed for speaker separation or enhancement
- English-centric training data (VoxCeleb)
## Related Resources
| Resource | Description | Link |
|---|---|---|
| CASE Benchmark | Evaluation dataset with 24 protocols | HuggingFace Dataset |
| Benchmark Code | Evaluation scripts and tools | GitHub |
| Results | Full leaderboard and per-protocol breakdowns | Results |
| Metrics Guide | How to interpret benchmark metrics | Metrics Documentation |
## Citation

If you use this model, please cite:

```bibtex
@misc{case-speaker-embedding,
  title={CASE: Carrier-Agnostic Speaker Embeddings},
  year={2026},
  url={https://github.com/gittb/case-benchmark}
}
```
## License
Apache 2.0
## References
- ECAPA-TDNN: Desplanques et al., 2020
- VoxCeleb: Nagrani et al., 2020