CASE Speaker Embedding v2 (512 channels)

Carrier-Agnostic Speaker Embeddings (CASE): a robust speaker embedding model trained to generalize across acoustic carriers, including phone codecs, webcam microphones, speaker-playback chains, and otherwise degraded audio conditions.

Model Description

This model is based on the ECAPA-TDNN architecture with:

  • 512 channels (~6.2M parameters)
  • 192-dimensional L2-normalized embeddings
  • Global context attention in the pooling layer
  • Trained on VoxCeleb2 with CASE v2 augmentation pipeline

CASE v2 Augmentation Pipeline

The model was trained with a 6-mode carrier augmentation strategy designed to simulate real-world acoustic degradation:

| Mode | Probability | Description |
|---|---|---|
| Clean | 15% | No augmentation |
| Single Codec | 10% | GSM, G.711, Opus, MP3, AAC, G.722 |
| Single Mic | 10% | One of 10 microphone profiles (webcam, laptop, phone, etc.) |
| Codec + Mic | 15% | VoIP simulation |
| Light Chain | 25% | Reverb → Codec (reverberant room, then transmission) |
| Full Chain | 25% | Codec → Speaker → Room → Mic (replay attack) |

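The mode-sampling step of the table above can be sketched as a weighted draw per training utterance. This is an illustrative reconstruction, not the released training code; the mode names here are labels of my own, and the weights come directly from the table.

```python
import random

# Hypothetical sketch of CASE v2 mode sampling: each training utterance
# draws one of six carrier modes with the table's probabilities.
MODES = [
    ("clean",        0.15),  # no augmentation
    ("single_codec", 0.10),  # GSM, G.711, Opus, MP3, AAC, G.722
    ("single_mic",   0.10),  # one of 10 microphone profiles
    ("codec_mic",    0.15),  # VoIP simulation
    ("light_chain",  0.25),  # reverb -> codec
    ("full_chain",   0.25),  # codec -> speaker -> room -> mic
]

def sample_mode(rng: random.Random) -> str:
    """Draw one augmentation mode according to the table's probabilities."""
    names, weights = zip(*MODES)
    return rng.choices(names, weights=weights, k=1)[0]
```

The probabilities sum to 1, so every utterance receives exactly one mode; roughly half of all training audio passes through a chained (Light or Full) degradation.
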
Usage

Installation

pip install torch torchaudio numpy

Quick Start

from model import CASESpeakerEncoder

# Load model
encoder = CASESpeakerEncoder.from_pretrained("./")

# Extract embedding from audio file
embedding = encoder.encode("audio.wav")  # Returns (192,) numpy array

# Verify two speakers
same_speaker = encoder.verify("audio1.wav", "audio2.wav", threshold=0.5)
print(f"Same speaker: {same_speaker}")

# Get similarity score
emb1 = encoder.encode("audio1.wav")
emb2 = encoder.encode("audio2.wav")
similarity = encoder.similarity(emb1, emb2)
print(f"Similarity: {similarity:.3f}")

Direct Model Usage

import torch
import torchaudio
from model import ECAPA_TDNN

# Load model
model = ECAPA_TDNN(channels=512, global_context_att=True)
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()

# Load audio (must be 16kHz)
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.transforms.Resample(sr, 16000)(wav)
wav = wav.mean(dim=0)  # Mono

# Extract embedding
with torch.no_grad():
    embedding = model(wav.unsqueeze(0))  # (1, 192)

Batch Processing

# Process multiple files
audio_files = ["spk1_utt1.wav", "spk1_utt2.wav", "spk2_utt1.wav"]
embeddings = encoder.encode_batch(audio_files)  # (N, 192)

# Compute pairwise similarities
import numpy as np
similarity_matrix = embeddings @ embeddings.T

Input Requirements

| Parameter | Value |
|---|---|
| Sample rate | 16,000 Hz |
| Channels | Mono |
| Format | Float32 in [-1, 1] range |
| Min duration | ~0.5 seconds recommended |
| Max duration | Unlimited (attention pooling handles any length) |
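A minimal sketch of getting decoded PCM into the mono float32 [-1, 1] layout above (resampling to 16 kHz is not shown; use your audio library's resampler, e.g. `torchaudio.transforms.Resample`, for that). The helper name is illustrative, not part of the released code:

```python
import numpy as np

def to_model_input(samples: np.ndarray) -> np.ndarray:
    """Convert decoded PCM to mono float32 in [-1, 1].

    `samples` may be int16 PCM or float, shaped (channels, time) or (time,).
    Illustrative sketch; assumes the input is already at 16 kHz.
    """
    x = np.asarray(samples)
    if x.dtype == np.int16:                 # scale int16 PCM to [-1, 1]
        x = x.astype(np.float32) / 32768.0
    else:
        x = x.astype(np.float32)
    if x.ndim == 2:                         # (channels, time) -> mono
        x = x.mean(axis=0)
    return np.clip(x, -1.0, 1.0)
```
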

Output

  • Embedding dimension: 192
  • Normalization: L2-normalized (unit norm)
  • Similarity metric: Cosine similarity (dot product for normalized vectors)
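Because the embeddings are unit-norm, cosine similarity reduces to a plain dot product. A sketch of that computation (the normalization here is a no-op for embeddings the model already L2-normalized, but makes the helper safe for arbitrary vectors):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; equals the dot product for unit-norm inputs."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))
```
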

Training Details

| Parameter | Value |
|---|---|
| Architecture | ECAPA-TDNN (512 channels) |
| Dataset | VoxCeleb2 (5,994 speakers) |
| Loss | AAM-Softmax (margin=0.2, scale=30) |
| Optimizer | Adam (lr=0.001) |
| Epochs | 70 |
| Augmentation | CASE v2 + MUSAN noise |
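The AAM-Softmax loss above adds an angular margin to the target class before softmax: the target logit becomes s·cos(θ + m) while all others stay s·cos(θ). A sketch of just that logit adjustment (not the full training loss), using the card's margin=0.2 and scale=30:

```python
import numpy as np

def aam_softmax_logits(cosines: np.ndarray, target: int,
                       margin: float = 0.2, scale: float = 30.0) -> np.ndarray:
    """Apply the additive angular margin to one utterance's class cosines.

    Target class logit -> s * cos(theta + m); all others -> s * cos(theta).
    Sketch of the logit modification only; the actual loss then applies
    cross-entropy over these logits.
    """
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))
    logits = scale * cosines
    logits[target] = scale * np.cos(theta[target] + margin)
    return logits
```

The margin shrinks the target logit, forcing the model to pull same-speaker embeddings tighter together than plain softmax would.
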

Benchmark Results (CASE Benchmark)

Evaluated on the CASE Benchmark:

| Metric | Value |
|---|---|
| Clean EER | 1.22% |
| Absolute EER | 3.53% |
| Degradation | +2.31% |

Per-category breakdown:

| Category | Avg EER |
|---|---|
| Clean | 1.22% |
| Codec | 1.69% |
| Mic | 1.23% |
| Noise | 1.35% |
| Reverb | 6.56% |
| Playback | 9.10% |

Key Finding: Achieves the lowest degradation factor (+2.31%) among tested models, validating the carrier-agnostic training approach.

Intended Use

This model is designed for:

  • Speaker verification: Determining if two audio samples are from the same speaker
  • Speaker identification: Matching against a database of enrolled speakers
  • Speaker diarization: As an embedding extractor for clustering
  • Robustness testing: Evaluating systems under acoustic degradation
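For the identification use case, a minimal sketch of matching a probe embedding against a database of enrolled speakers: with L2-normalized embeddings, one matrix-vector product scores every enrolled speaker at once. Names and the threshold value here are illustrative; in practice the embeddings would come from `encoder.encode`.

```python
import numpy as np

def identify(probe: np.ndarray, enrolled: np.ndarray,
             names: list, threshold: float = 0.5):
    """Match a probe against enrolled speaker embeddings.

    Returns (best_name, score), or (None, score) when the best cosine
    similarity falls below the acceptance threshold.
    """
    scores = enrolled @ probe          # (N,) cosine similarities
    best = int(np.argmax(scores))
    score = float(scores[best])
    return (names[best] if score >= threshold else None), score
```
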

Robustness Focus

Unlike standard speaker embedding models, CASE is specifically trained to maintain performance when audio is degraded by:

  • Telephone codecs (GSM, G.711, AMR)
  • VoIP compression (Opus, AAC)
  • Microphone variability (webcam, laptop, phone mics)
  • Room acoustics and reverberation
  • Replay attacks (speaker playback chains)

Limitations

  • Optimized for speech; may not perform well on non-speech audio
  • Best performance with audio >1 second
  • Not designed for speaker separation or enhancement
  • English-centric training data (VoxCeleb)

Related Resources

| Resource | Description | Link |
|---|---|---|
| CASE Benchmark | Evaluation dataset with 24 protocols | HuggingFace Dataset |
| Benchmark Code | Evaluation scripts and tools | GitHub |
| Results | Full leaderboard and per-protocol breakdowns | Results |
| Metrics Guide | How to interpret benchmark metrics | Metrics Documentation |

Citation

If you use this model, please cite:

@misc{case-speaker-embedding,
  title={CASE: Carrier-Agnostic Speaker Embeddings},
  year={2026},
  url={https://github.com/gittb/case-benchmark}
}

License

Apache 2.0
