|
|
---
license: apache-2.0
---
|
|
# Auden-Voice |
|
|
|
|
|
**Auden-Voice** is a general-purpose voice encoder trained to learn robust speaker representations. |
|
|
|
|
|
The model is trained using **multi-task learning**, where jointly optimizing speaker identification, emotion, gender, and age classification objectives leads to more general and transferable voice representations. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model type**: Voice encoder |
|
|
- **Architecture**: Zipformer |
|
|
- **Embedding dimension**: 768 |
|
|
- **Number of parameters**: ~156M |
|
|
- **Framework**: PyTorch |
|
|
- **Output**: Frame-level embeddings `[B, T, D]` |
|
|
- **Pooling**: User-defined (e.g., mean pooling for utterance-level embeddings) |
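For example, a plain (unmasked) mean pool turns the frame-level output into one vector per utterance; the How to Use section below shows a length-masked variant for padded batches. A minimal sketch with dummy tensors:

```python
import torch

# Dummy frame-level encoder output: batch of 2, 100 frames, 768 dims.
frame_embeddings = torch.randn(2, 100, 768)  # [B, T, D]

# Mean pooling over the time axis gives one utterance-level embedding each.
utterance_embeddings = frame_embeddings.mean(dim=1)  # [B, D]
print(utterance_embeddings.shape)  # torch.Size([2, 768])
```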
|
|
|
|
|
--- |
|
|
|
|
|
## Training |
|
|
|
|
|
### Training Strategy |
|
|
|
|
|
Of the training strategies explored, multi-task learning was found to work best. The model is jointly trained on the following tasks:
|
|
|
|
|
- Speaker identification |
|
|
- Emotion classification |
|
|
- Gender classification |
|
|
- Age classification |
|
|
|
|
|
This setup encourages the encoder to learn robust and general-purpose voice representations. |
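As an illustration only (the actual objective lives in the training code linked below), a setup like this typically attaches one classification head per task to a pooled encoder embedding and minimizes a weighted sum of per-task cross-entropy losses. In the sketch below, the head names, class counts, and loss weights are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical multi-task heads on top of a pooled [B, 768] encoder embedding.
# Class counts are illustrative (5994 ~ VoxCeleb2 dev speakers, 6 ~ CREMA-D
# emotions); the released training code may differ.
class MultiTaskHeads(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.speaker = nn.Linear(dim, 5994)
        self.emotion = nn.Linear(dim, 6)
        self.gender = nn.Linear(dim, 2)
        self.age = nn.Linear(dim, 4)  # hypothetical age bins

def multitask_loss(heads, pooled, labels, weights=(1.0, 1.0, 1.0, 1.0)):
    # Weighted sum of per-task cross-entropies over the shared embedding.
    losses = (
        F.cross_entropy(heads.speaker(pooled), labels["speaker"]),
        F.cross_entropy(heads.emotion(pooled), labels["emotion"]),
        F.cross_entropy(heads.gender(pooled), labels["gender"]),
        F.cross_entropy(heads.age(pooled), labels["age"]),
    )
    return sum(w * l for w, l in zip(weights, losses))
```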
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model is trained on publicly available academic speech datasets, totaling approximately **2050 hours** of audio. |
|
|
|
|
|
| Task | Dataset(s) | #Samples | Hours | |
|
|
|-----|-----------|----------|-------| |
|
|
| Speaker Identification | VoxCeleb2 | 974k | 2026 | |
|
|
| Paralinguistic Tasks | CREMA-D, RAVDESS, IEMOCAP, TESS | 18.3k | 20 | |
|
|
|
|
|
### Training Code |
|
|
|
|
|
Full training scripts and configurations are available at: |
|
|
https://github.com/AudenAI/Auden/tree/main/examples/voice |
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is intended to be used as a **general-purpose voice encoder** for: |
|
|
|
|
|
- Speaker identification and verification |
|
|
- Speaker diarization |
|
|
- Emotion, gender, and age classification |
|
|
- Audio–text and text–audio retrieval |
|
|
- Speech-related downstream tasks that benefit from pretrained voice embeddings |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Load the Encoder |
|
|
|
|
|
```python
import torch

from auden.auto.auto_model import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-voice")
encoder = encoder.to(device)
encoder.eval()
```

### Extract Voice Embeddings

```python
import torch.nn.functional as F

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
embeddings_list = []

for i, audio_file in enumerate(audio_files):
    # Compute input features for a single file (a batch of one).
    x, x_lens = encoder.extract_feature([audio_file])
    x, x_lens = x.to(device), x_lens.to(device)

    with torch.no_grad():
        encoder_output = encoder(x, x_lens)
        frame_embeddings = encoder_output["encoder_out"]  # [B, T, D]

    # Length-masked global average pooling (example for speaker verification).
    T = frame_embeddings.size(1)
    mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
    utterance_embedding = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

    print(f"🎵 Audio {i + 1}:")
    print(f"Frame embeddings shape: {frame_embeddings.shape}")
    print(f"Utterance embedding shape: {utterance_embedding.shape}")
    print()

    embeddings_list.append(utterance_embedding)

embeddings = torch.cat(embeddings_list, dim=0)  # [N, D]
embeddings = F.normalize(embeddings, p=2, dim=-1)

# Cosine similarity between the two L2-normalized utterance embeddings.
similarity = torch.matmul(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.4f}")

# 0.5 is a placeholder decision threshold; calibrate it on held-out pairs.
print(f"Same speaker: {'YES' if similarity > 0.5 else 'NO'}")
```

### Expected Output

```
🎵 Audio 1:
Frame embeddings shape: torch.Size([1, 97, 768])
Utterance embedding shape: torch.Size([1, 768])

🎵 Audio 2:
Frame embeddings shape: torch.Size([1, 138, 768])
Utterance embedding shape: torch.Size([1, 768])

Cosine similarity: 0.7234
Same speaker: YES
```
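Beyond a single pair, pairwise scores for a whole set of utterances reduce to one matrix product once the embeddings are L2-normalized. A minimal sketch continuing from the `embeddings` tensor above:

```python
# Pairwise cosine similarities for all N utterances ([N, N]); valid because
# `embeddings` was L2-normalized above.
sim_matrix = embeddings @ embeddings.T

# Mask the diagonal (self-similarity) before picking the closest pair.
sim_matrix.fill_diagonal_(-1.0)
i, j = divmod(sim_matrix.argmax().item(), sim_matrix.size(1))
print(f"Most similar pair: audio {i + 1} and audio {j + 1} "
      f"(cosine {sim_matrix[i, j].item():.4f})")
```

The same matrix is a natural starting point for clustering-based diarization or embedding retrieval.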
|
|
|
|
|
---

## Performance
|
|
|
|
|
| Task | Dataset | Metric | Score |
|------|---------|--------|-------|
| Speaker Identification | VoxCeleb2 | Accuracy | 95.25% |
| Speaker Verification | VoxCeleb1-O | EER | 3% |
| Speaker Diarization | VoxConverse | DER | 17% |
| Age Classification | CREMA-D | Accuracy | 93.91% |
| Gender Classification | CREMA-D | Accuracy | 99.72% |
| Gender Classification | RAVDESS | Accuracy | 100% |
| Emotion Classification | CREMA-D | Accuracy | 83.99% |
| Emotion Classification | RAVDESS | Accuracy | 89.71% |
| Audio → Text Retrieval | ParaspeechCaps | R@1 | 63.31 |
| Text → Audio Retrieval | ParaspeechCaps | R@1 | 61.69 |
| LLM-QA Emotion | AirBench-MELD | Accuracy | 27.23% |
| LLM-QA Emotion | AirBench-IEMOCAP | Accuracy | 84.70% |
| LLM-QA Gender | AirBench-MELD | Accuracy | 81.58% |
| LLM-QA Gender | AirBench-CommonVoice | Accuracy | 93.15% |
| LLM-QA Age | AirBench-CommonVoice | Accuracy | 58.27% |
|
|
|
|
|
|
|
|
---

## Limitations
|
|
|
|
|
- The model is trained primarily on English speech data and may not generalize well to other languages. |
|
|
- The model is not evaluated on generative tasks such as speech synthesis or voice conversion. |
|
|
- Utterance-level representations depend on the pooling strategy selected by the user. |
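For instance, a learned attentive pooling (hypothetical, not shipped with the model) can replace mean pooling when a downstream task benefits from weighting informative frames:

```python
import torch
import torch.nn as nn

# Hypothetical attention-based pooling over [B, T, D] frame embeddings, as an
# alternative to the mean pooling used in the examples above.
class AttentivePooling(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(frames), dim=1)  # [B, T, 1]
        return (weights * frames).sum(dim=1)                # [B, D]
```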
|
|
|
|
|
---

## Citation
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{huo2025auden, |
|
|
title={Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding}, |
|
|
author={Huo, Mingyue and Tseng, Wei-Cheng and Shao, Yiwen and Zhang, Hao and Yu, Dong}, |
|
|
journal={arXiv preprint arXiv:2511.15145}, |
|
|
year={2025} |
|
|
}
```