Update Model Card for Auden-Voice #1
by mingyue66 - opened

README.md CHANGED
---
license: apache-2.0
---

# Auden-Voice

**Auden-Voice** is a general-purpose voice encoder trained to learn robust speaker representations.

The model is trained using **multi-task learning**, where jointly optimizing speaker identification, emotion, gender, and age classification objectives leads to more general and transferable voice representations.

---

## Model Details

- **Model type**: Voice encoder
- **Architecture**: Zipformer
- **Embedding dimension**: 768
- **Number of parameters**: ~156M
- **Framework**: PyTorch
- **Output**: Frame-level embeddings `[B, T, D]`
- **Pooling**: User-defined, e.g. mean pooling for utterance-level embeddings (see the sketch below)
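
The frame-level output means pooling is up to the caller. A minimal masked mean-pooling helper, as one possible choice (the function name and the assumption that you track a per-utterance length tensor are illustrative, not part of the Auden API):

```python
import torch

def mean_pool(frames: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Mean-pool frame embeddings [B, T, D] over valid frames only.

    `lengths` gives the number of valid frames per batch item, so padded
    frames do not dilute the utterance embedding.
    """
    T = frames.size(1)
    # [B, T, 1] mask: 1.0 for valid frames, 0.0 for padding
    mask = (torch.arange(T, device=frames.device)[None, :] < lengths[:, None])
    mask = mask.unsqueeze(-1).float()
    return (frames * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)  # [B, D]
```

Other pooling choices (attentive or statistics pooling) plug in at the same point.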

---

## Training

### Training Strategy

Multi-task learning was found to work best. The model is jointly trained on the following tasks:

- Speaker identification
- Emotion classification
- Gender classification
- Age classification

This setup encourages the encoder to learn robust and general-purpose voice representations; a sketch of what such a joint objective looks like follows.
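
For intuition only, here is a minimal version of a joint objective: a shared pooled embedding feeds one linear head per task, and the per-task cross-entropy losses are summed. Head sizes and the equal loss weighting are assumptions for illustration (VoxCeleb2's 5,994 training speakers and CREMA-D's six emotion labels motivate two of them); the actual recipe is in the training code linked below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """Illustrative multi-task heads on top of a pooled [B, D] embedding."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.speaker = nn.Linear(dim, 5994)  # one class per VoxCeleb2 train speaker
        self.emotion = nn.Linear(dim, 6)     # e.g. CREMA-D's six emotion categories
        self.gender = nn.Linear(dim, 2)
        self.age = nn.Linear(dim, 4)         # hypothetical age buckets

    def forward(self, emb, spk, emo, gen, age):
        # Unweighted sum of per-task losses; real setups often tune task weights
        return (F.cross_entropy(self.speaker(emb), spk)
                + F.cross_entropy(self.emotion(emb), emo)
                + F.cross_entropy(self.gender(emb), gen)
                + F.cross_entropy(self.age(emb), age))
```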

### Training Data

The model is trained on publicly available academic speech datasets, totaling approximately **2,050 hours** of audio.

| Task | Dataset(s) | #Samples | Hours |
|------|------------|----------|-------|
| Speaker Identification | VoxCeleb2 | 974k | 2,026 |
| Paralinguistic Tasks | CREMA-D, RAVDESS, IEMOCAP, TESS | 18.3k | 20 |

### Training Code

Full training scripts and configurations are available at
https://github.com/AudenAI/Auden/tree/main/examples/voice

---

## Intended Use

This model is intended to be used as a **general-purpose voice encoder** for:

- Speaker identification and verification
- Speaker diarization
- Emotion, gender, and age classification
- Audio–text and text–audio retrieval
- Speech-related downstream tasks that benefit from pretrained voice embeddings

---

## How to Use

### Load the Encoder

```python
import torch

from auden.auto.auto_model import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-voice")
encoder = encoder.to(device)
```

### Extract Voice Embeddings

```python
import torch.nn.functional as F

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
embeddings_list = []

for i, audio_file in enumerate(audio_files):
    x, x_lens = encoder.extract_feature([audio_file])
    x, x_lens = x.to(device), x_lens.to(device)

    with torch.no_grad():
        encoder_output = encoder(x, x_lens)
    frame_embeddings = encoder_output["encoder_out"]  # [B, T, D]

    # Global average pooling over valid frames (example for speaker verification)
    T = frame_embeddings.size(1)
    mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
    utterance_embedding = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

    print(f"🎵 Audio {i + 1}:")
    print(f"Frame embeddings shape: {frame_embeddings.shape}")
    print(f"Utterance embedding shape: {utterance_embedding.shape}")
    print()

    embeddings_list.append(utterance_embedding)

embeddings = torch.cat(embeddings_list, dim=0)  # [N, D]
embeddings = F.normalize(embeddings, p=2, dim=-1)

# Dot product of L2-normalized embeddings is the cosine similarity
similarity = torch.dot(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.4f}")
# 0.5 is an illustrative decision threshold; tune it for your application
print(f"Same speaker: {'YES' if similarity > 0.5 else 'NO'}")
```

Expected output:

```
🎵 Audio 1:
Frame embeddings shape: torch.Size([1, 97, 768])
Utterance embedding shape: torch.Size([1, 768])

🎵 Audio 2:
Frame embeddings shape: torch.Size([1, 138, 768])
Utterance embedding shape: torch.Size([1, 768])

Cosine similarity: 0.7234
Same speaker: YES
```
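
Building on the snippet above, the same `embeddings` tensor drives the retrieval and diarization use cases: since the rows are L2-normalized, one matrix product yields all pairwise cosine similarities (any clustering or ranking on top of it is application-specific and not shown):

```python
# Continues from the previous snippet: `embeddings` is [N, D], L2-normalized
similarity_matrix = embeddings @ embeddings.T  # [N, N] pairwise cosine similarities
print(similarity_matrix)
```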

## Performance

| Task | Dataset | Metric |
|------|---------|--------|
| Speaker Identification | VoxCeleb2 | Accuracy 95.25% |
| Speaker Verification | VoxCeleb1-O | EER 3% |
| Speaker Diarization | VoxConverse | DER 17% |
| Age Classification | CREMA-D | Accuracy 93.91% |
| Gender Classification | CREMA-D | Accuracy 99.72% |
| Gender Classification | RAVDESS | Accuracy 100% |
| Emotion Classification | CREMA-D | Accuracy 83.99% |
| Emotion Classification | RAVDESS | Accuracy 89.71% |
| Audio → Text Retrieval | ParaspeechCaps | R@1 63.31 |
| Text → Audio Retrieval | ParaspeechCaps | R@1 61.69 |
| LLM-QA Emotion | AirBench-MELD | Accuracy 27.23% |
| LLM-QA Emotion | AirBench-IEMOCAP | Accuracy 84.70% |
| LLM-QA Gender | AirBench-MELD | Accuracy 81.58% |
| LLM-QA Gender | AirBench-CommonVoice | Accuracy 93.15% |
| LLM-QA Age | AirBench-CommonVoice | Accuracy 58.27% |
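
For readers reproducing the verification number: EER (equal error rate) is the operating point where the false-acceptance and false-rejection rates coincide. A common way to compute it from trial scores, sketched here with scikit-learn (an extra dependency, not something the Auden code requires):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER from similarity scores and 0/1 same-speaker trial labels."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FPR and FNR meet
    return float((fpr[idx] + fnr[idx]) / 2.0)
```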

## Limitations

- The model is trained primarily on English speech data and may not generalize well to other languages.
- The model has not been evaluated on generative tasks such as speech synthesis or voice conversion.
- Utterance-level representations depend on the pooling strategy selected by the user.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{huo2025auden,
  title={Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding},
  author={Huo, Mingyue and Tseng, Wei-Cheng and Shao, Yiwen and Zhang, Hao and Yu, Dong},
  journal={arXiv preprint arXiv:2511.15145},
  year={2025}
}
```