---
license: apache-2.0
---
# Auden-Voice

**Auden-Voice** is a general-purpose voice encoder trained to learn robust speaker representations.

The model is trained with **multi-task learning**: jointly optimizing speaker identification, emotion, gender, and age classification objectives yields more general and transferable voice representations.

---

## Model Details

- **Model type**: Voice encoder
- **Architecture**: Zipformer
- **Embedding dimension**: 768
- **Number of parameters**: ~156M
- **Framework**: PyTorch
- **Output**: Frame-level embeddings `[B, T, D]`
- **Pooling**: User-defined (e.g., mean pooling for utterance-level embeddings; see the sketch below)
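
The encoder returns frame-level embeddings; turning them into a fixed-size utterance vector is up to the user. Below is a minimal sketch of two common choices, masked mean pooling and statistics pooling (mean plus standard deviation); the function names are illustrative and not part of the `auden` API:

```python
import torch

def mean_pool(frames: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Length-masked mean over time: [B, T, D] -> [B, D]."""
    T = frames.size(1)
    mask = (torch.arange(T, device=frames.device)[None, :] < lengths[:, None])
    mask = mask.unsqueeze(-1).float()
    return (frames * mask).sum(dim=1) / mask.sum(dim=1)

def stats_pool(frames: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Length-masked mean + standard deviation over time: [B, T, D] -> [B, 2D]."""
    mask = (torch.arange(frames.size(1), device=frames.device)[None, :]
            < lengths[:, None]).unsqueeze(-1).float()
    mean = (frames * mask).sum(dim=1) / mask.sum(dim=1)
    var = ((frames - mean.unsqueeze(1)) ** 2 * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.cat([mean, var.clamp(min=1e-8).sqrt()], dim=-1)
```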

---

## Training

### Training Strategy

Multi-task learning was found to work best among the training strategies explored. The model is jointly trained on the following tasks:

- Speaker identification
- Emotion classification
- Gender classification
- Age classification

This setup encourages the encoder to learn robust and general-purpose voice representations.
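
As an illustration of how such an objective can be composed (the head sizes, class counts, and loss weights below are hypothetical, not the released training configuration), a shared pooled encoder output typically feeds one lightweight head per task, and the per-task cross-entropy losses are combined as a weighted sum:

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Illustrative per-task classification heads over a shared encoder."""
    def __init__(self, dim: int = 768):
        super().__init__()
        # Class counts below are placeholders, not the released configuration
        self.heads = nn.ModuleDict({
            "speaker": nn.Linear(dim, 5994),  # e.g., VoxCeleb2 speaker inventory
            "emotion": nn.Linear(dim, 6),
            "gender": nn.Linear(dim, 2),
            "age": nn.Linear(dim, 4),
        })

    def forward(self, pooled):  # pooled: [B, D] utterance-level embedding
        return {task: head(pooled) for task, head in self.heads.items()}

def multitask_loss(logits, targets, weights):
    """Weighted sum of cross-entropy losses over the tasks labeled in a batch
    (paralinguistic batches may carry no speaker labels, and vice versa)."""
    ce = nn.CrossEntropyLoss()
    return sum(weights[task] * ce(logits[task], targets[task]) for task in targets)
```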

### Training Data

The model is trained on publicly available academic speech datasets, totaling approximately **2050 hours** of audio.

| Task | Dataset(s) | #Samples | Hours |
|-----|-----------|----------|-------|
| Speaker Identification | VoxCeleb2 | 974k | 2026 |
| Paralinguistic Tasks | CREMA-D, RAVDESS, IEMOCAP, TESS | 18.3k | 20 |

### Training Code

Full training scripts and configurations are available at:  
https://github.com/AudenAI/Auden/tree/main/examples/voice

---

## Intended Use

This model is intended to be used as a **general-purpose voice encoder** for:

- Speaker identification and verification
- Speaker diarization (see the clustering sketch after this list)
- Emotion, gender, and age classification
- Audio–text and text–audio retrieval
- Speech-related downstream tasks that benefit from pretrained voice embeddings
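
As one concrete example, diarization can be approximated by windowing a recording, pooling one embedding per window (as in "How to Use" below), and clustering the windows. A minimal sketch with scikit-learn follows; the `diarize` helper, the assumption of a known speaker count, and the windowing step are all illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2

def diarize(window_embeddings: np.ndarray, n_speakers: int) -> np.ndarray:
    """Label each window with a speaker index by agglomerative clustering
    on cosine distance. `window_embeddings` is [N_windows, D], one pooled
    embedding per fixed-length audio window (the windowing step is not shown)."""
    clustering = AgglomerativeClustering(
        n_clusters=n_speakers, metric="cosine", linkage="average"
    )
    return clustering.fit_predict(window_embeddings)
```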

---

## How to Use

### Load the Encoder

```python
import torch

from auden.auto.auto_model import AutoModel

# Select a device once and reuse it below
device = "cuda" if torch.cuda.is_available() else "cpu"

encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-voice")
encoder = encoder.to(device)
encoder.eval()
```

### Extract Voice Embeddings

```python
import torch.nn.functional as F

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
embeddings_list = []

for i, audio_file in enumerate(audio_files, start=1):
    x, x_lens = encoder.extract_feature([audio_file])
    x, x_lens = x.to(device), x_lens.to(device)

    with torch.no_grad():
        encoder_output = encoder(x, x_lens)
        frame_embeddings = encoder_output["encoder_out"]  # [B, T, D]

        # Length-masked global average pooling (example for speaker verification)
        T = frame_embeddings.size(1)
        mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
        utterance_embedding = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

        embeddings_list.append(utterance_embedding)

    print(f"🎵 Audio {i}:")
    print(f"   Frame embeddings shape: {frame_embeddings.shape}")
    print(f"   Utterance embedding shape: {utterance_embedding.shape}")

embeddings = torch.cat(embeddings_list, dim=0)  # [N, D]
embeddings = F.normalize(embeddings, p=2, dim=-1)

# Dot product of L2-normalized vectors is the cosine similarity
similarity = torch.matmul(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.4f}")
# The 0.5 decision threshold is illustrative only; tune it on held-out trials
print(f"Same speaker: {'YES' if similarity > 0.5 else 'NO'}")
```

### Expected Output

```
🎵 Audio 1:
   Frame embeddings shape: torch.Size([1, 97, 768])
   Utterance embedding shape: torch.Size([1, 768])

🎵 Audio 2:
   Frame embeddings shape: torch.Size([1, 138, 768])
   Utterance embedding shape: torch.Size([1, 768])

Cosine similarity: 0.7234
Same speaker: YES
```

## Performance

| Task | Dataset | Metric | Score |
|------|---------|--------|-------|
| Speaker Identification | VoxCeleb2 | Accuracy | 95.25% |
| Speaker Verification | VoxCeleb1-O | EER | 3% |
| Speaker Diarization | VoxConverse | DER | 17% |
| Age Classification | CREMA-D | Accuracy | 93.91% |
| Gender Classification | CREMA-D | Accuracy | 99.72% |
| Gender Classification | RAVDESS | Accuracy | 100% |
| Emotion Classification | CREMA-D | Accuracy | 83.99% |
| Emotion Classification | RAVDESS | Accuracy | 89.71% |
| Audio → Text Retrieval | ParaspeechCaps | R@1 | 63.31 |
| Text → Audio Retrieval | ParaspeechCaps | R@1 | 61.69 |
| LLM-QA Emotion | AirBench-MELD | Accuracy | 27.23% |
| LLM-QA Emotion | AirBench-IEMOCAP | Accuracy | 84.70% |
| LLM-QA Gender | AirBench-MELD | Accuracy | 81.58% |
| LLM-QA Gender | AirBench-CommonVoice | Accuracy | 93.15% |
| LLM-QA Age | AirBench-CommonVoice | Accuracy | 58.27% |
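
For reference, the equal error rate (EER) reported above is the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of computing it from verification trial scores (the `compute_eer` helper is illustrative; scores could be cosine similarities as in the usage example above):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the error rate at the threshold where false-positive and
    false-negative rates are (approximately) equal. `labels` are 1 for
    same-speaker trials and 0 otherwise."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)
```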


## Limitations

- The model is trained primarily on English speech data and may not generalize well to other languages.
- The model is not evaluated on generative tasks such as speech synthesis or voice conversion.
- Utterance-level representations depend on the pooling strategy selected by the user.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{huo2025auden,
  title={Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding},
  author={Huo, Mingyue and Tseng, Wei-Cheng and Shao, Yiwen and Zhang, Hao and Yu, Dong},
  journal={arXiv preprint arXiv:2511.15145},
  year={2025}
}
```