---
license: apache-2.0
---
# Auden-Voice
**Auden-Voice** is a general-purpose voice encoder trained to learn robust speaker representations.
The model is trained using **multi-task learning**, where jointly optimizing speaker identification, emotion, gender, and age classification objectives leads to more general and transferable voice representations.

---
## Model Details
- **Model type**: Voice encoder
- **Architecture**: Zipformer
- **Embedding dimension**: 768
- **Number of parameters**: ~156M
- **Framework**: PyTorch
- **Output**: Frame-level embeddings `[B, T, D]`
- **Pooling**: User-defined (e.g., mean pooling for utterance-level embeddings)
---
## Training
### Training Strategy
Multi-task learning was found to work best. The model is jointly trained on the following tasks:
- Speaker identification
- Emotion classification
- Gender classification
- Age classification

This setup encourages the encoder to learn robust and general-purpose voice representations.
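
The joint objective described above can be sketched as a weighted sum of per-task cross-entropy losses. The snippet below is a minimal illustration in plain Python, not the actual training code: the logits, labels, class counts, and equal task weights are all hypothetical placeholders, and the real configuration lives in the training scripts.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def multitask_loss(task_logits, task_labels, task_weights):
    """Weighted sum of the per-task cross-entropy losses."""
    return sum(
        task_weights[task] * cross_entropy(task_logits[task], task_labels[task])
        for task in task_logits
    )

# Hypothetical classifier outputs for one utterance (values are made up).
task_logits = {
    "speaker": [2.0, 0.5, -1.0],
    "emotion": [0.1, 1.2, 0.3, -0.4],
    "gender": [1.5, -1.5],
    "age": [0.0, 0.7, 0.2],
}
task_labels = {"speaker": 0, "emotion": 1, "gender": 0, "age": 1}
# Equal weights are an assumption made for illustration only.
task_weights = {task: 1.0 for task in task_logits}

loss = multitask_loss(task_logits, task_labels, task_weights)
print(f"joint loss: {loss:.4f}")
```

Because every task head reads the same encoder output, the gradient of this summed loss pushes the shared representation to carry speaker, emotion, gender, and age information at once.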
### Training Data
The model is trained on publicly available academic speech datasets, totaling approximately **2050 hours** of audio.
| Task | Dataset(s) | #Samples | Hours |
|-----|-----------|----------|-------|
| Speaker Identification | VoxCeleb2 | 974k | 2026 |
| Paralinguistic Tasks | CREMA-D, RAVDESS, IEMOCAP, TESS | 18.3k | 20 |
### Training Code
Full training scripts and configurations are available at:
https://github.com/AudenAI/Auden/tree/main/examples/voice

---
## Intended Use
This model is intended to be used as a **general-purpose voice encoder** for:
- Speaker identification and verification
- Speaker diarization
- Emotion, gender, and age classification
- Audio–text and text–audio retrieval
- Speech-related downstream tasks that benefit from pretrained voice embeddings
---
## How to Use
### Load the Encoder
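```python
import torch
from auden.auto.auto_model import AutoModel

# Pick the device once and reuse it for the model and the input tensors.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-voice")
encoder = encoder.to(device)
encoder.eval()  # inference mode (disables dropout)
```
### Extract Voice Embeddings
```python
import torch.nn.functional as F

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
embeddings_list = []

for i, audio_file in enumerate(audio_files, start=1):
    x, x_lens = encoder.extract_feature([audio_file])
    x, x_lens = x.to(device), x_lens.to(device)

    with torch.no_grad():
        encoder_output = encoder(x, x_lens)
    frame_embeddings = encoder_output["encoder_out"]  # [B, T, D]

    # Masked mean pooling over time (example for speaker verification).
    # With one file per batch there is no padding, so the mask keeps every frame.
    T = frame_embeddings.size(1)
    mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
    utterance_embedding = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # [B, D]
    embeddings_list.append(utterance_embedding)

    print(f"🎵 Audio {i}:")
    print(f"Frame embeddings shape: {frame_embeddings.shape}")
    print(f"Utterance embedding shape: {utterance_embedding.shape}")

embeddings = torch.cat(embeddings_list, dim=0)  # [N, D]
embeddings = F.normalize(embeddings, p=2, dim=-1)

# The dot product of L2-normalized vectors equals their cosine similarity.
similarity = torch.matmul(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.4f}")

# Illustrative decision threshold (an assumption, not a calibrated value); tune it on your own data.
threshold = 0.5
print(f"Same speaker: {'YES' if similarity >= threshold else 'NO'}")
```
### Expected Output
```text
🎵 Audio 1:
Frame embeddings shape: torch.Size([1, 97, 768])
Utterance embedding shape: torch.Size([1, 768])
🎵 Audio 2:
Frame embeddings shape: torch.Size([1, 138, 768])
Utterance embedding shape: torch.Size([1, 768])
Cosine similarity: 0.7234
Same speaker: YES
```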
## Performance
| Task - Dataset | Metric and Result |
|----------------|--------|
| Speaker Identification - VoxCeleb2 | Accuracy 95.25% |
| Speaker Verification - VoxCeleb1-O | EER 3% |
| Speaker Diarization - VoxConverse | DER 17% |
| Age Classification - CREMA-D | Accuracy 93.91% |
| Gender Classification - CREMA-D | Accuracy 99.72% |
| Gender Classification - RAVDESS | Accuracy 100% |
| Emotion Classification - CREMA-D | Accuracy 83.99% |
| Emotion Classification - RAVDESS | Accuracy 89.71% |
| Audio → Text Retrieval - ParaspeechCaps | R@1 63.31 |
| Text → Audio Retrieval - ParaspeechCaps | R@1 61.69 |
| LLM-QA Emotion - AirBench-MELD | Accuracy 27.23% |
| LLM-QA Emotion - AirBench-IEMOCAP | Accuracy 84.70% |
| LLM-QA Gender - AirBench-MELD | Accuracy 81.58% |
| LLM-QA Gender - AirBench-CommonVoice | Accuracy 93.15% |
| LLM-QA Age - AirBench-CommonVoice | Accuracy 58.27% |
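
For readers unfamiliar with the verification metric in the table, the equal error rate (EER) is the error at the decision threshold where the false acceptance rate equals the false rejection rate. The snippet below is a minimal sketch of that definition on made-up trial scores. It is not the evaluation script behind the reported numbers, and the function name and toy data are hypothetical.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold.

    scores: similarity per trial (higher means more likely the same speaker).
    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need both target and non-target trials")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is at or above the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy cosine similarities for eight verification trials (made-up values).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer = equal_error_rate(scores, labels)
print(f"EER: {eer:.2%}")  # EER: 25.00%
```

On this toy set, a threshold of 0.6 accepts one of four impostor trials and rejects one of four genuine trials, so both error rates are 25%.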
## Limitations
- The model is trained primarily on English speech data and may not generalize well to other languages.
- The model is not evaluated on generative tasks such as speech synthesis or voice conversion.
- Utterance-level representations depend on the pooling strategy selected by the user.
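
To make the last point concrete, the sketch below contrasts two ways of collapsing frame-level embeddings `[T, D]` into one utterance-level vector. Mean pooling is the example used in this card. Statistics pooling (mean concatenated with standard deviation) is a common alternative from the speaker-recognition literature, shown here only for illustration and not evaluated for this model. The function names and toy values are hypothetical.

```python
import math

def mean_pool(frames):
    """Average each embedding dimension over time: [T, D] -> [D]."""
    return [sum(column) / len(column) for column in zip(*frames)]

def stats_pool(frames):
    """Concatenate per-dimension mean and standard deviation: [T, D] -> [2 * D]."""
    means = mean_pool(frames)
    stds = [
        math.sqrt(sum((value - mean) ** 2 for value in column) / len(column))
        for column, mean in zip(zip(*frames), means)
    ]
    return means + stds

# Toy frame-level embeddings with T=3 frames and D=2 dimensions (made-up values).
frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
print(mean_pool(frames))        # [3.0, 0.0]
print(len(stats_pool(frames)))  # 4
```

The two strategies produce vectors of different sizes (`D` versus `2 * D`), so any downstream classifier or similarity threshold has to be fit for the pooling that is actually used.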
## Citation
If you use this model in your research, please cite:
```bibtex
@article{huo2025auden,
  title={Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding},
  author={Huo, Mingyue and Tseng, Wei-Cheng and Shao, Yiwen and Zhang, Hao and Yu, Dong},
  journal={arXiv preprint arXiv:2511.15145},
  year={2025}
}
```