yshao18, mingyue66 committed
Commit 63b0084 (verified) · Parent(s): f862676

Update Model Card for Auden-Voice (#1)

- Update Model Card for Auden-Voice (31b556ccc40cbe93df39cdda7cc10000011f51e3)

Co-authored-by: Mingyue Huo <mingyue66@users.noreply.huggingface.co>

Files changed (1): README.md (+154 −3)

README.md CHANGED

---
license: apache-2.0
---
# Auden-Voice

**Auden-Voice** is a general-purpose voice encoder trained to learn robust speaker representations.

The model is trained using **multi-task learning**, where jointly optimizing speaker identification, emotion, gender, and age classification objectives leads to more general and transferable voice representations.

---

## Model Details

- **Model type**: Voice encoder
- **Architecture**: Zipformer
- **Embedding dimension**: 768
- **Number of parameters**: ~156M
- **Framework**: PyTorch
- **Output**: Frame-level embeddings `[B, T, D]`
- **Pooling**: User-defined (e.g., mean pooling for utterance-level embeddings)

---

## Training

### Training Strategy

Multi-task learning was found to work best. The model is jointly trained on the following tasks:

- Speaker identification
- Emotion classification
- Gender classification
- Age classification

This setup encourages the encoder to learn robust and general-purpose voice representations; a sketch of the joint objective appears below.

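As a rough illustration of such a joint objective, the sketch below combines per-task cross-entropy losses over pooled embeddings. The head sizes, class counts, and loss weights are hypothetical placeholders, not the released training configuration; see the training scripts linked below for the actual setup.

```python
import torch
import torch.nn as nn

# Hypothetical multi-task heads over the shared encoder's pooled output.
# Class counts and loss weights are illustrative, not the released config.
class MultiTaskHeads(nn.Module):
    def __init__(self, dim=768, n_speakers=5994, n_emotions=6, n_genders=2, n_ages=4):
        super().__init__()
        self.heads = nn.ModuleDict({
            "speaker": nn.Linear(dim, n_speakers),
            "emotion": nn.Linear(dim, n_emotions),
            "gender": nn.Linear(dim, n_genders),
            "age": nn.Linear(dim, n_ages),
        })

    def forward(self, pooled):  # pooled: [B, D] utterance embeddings
        return {task: head(pooled) for task, head in self.heads.items()}

criterion = nn.CrossEntropyLoss()
loss_weights = {"speaker": 1.0, "emotion": 1.0, "gender": 1.0, "age": 1.0}

def multitask_loss(logits, labels):
    # Weighted sum of per-task cross-entropy losses.
    # logits/labels: dicts keyed by task name, with values [B, C] and [B].
    return sum(loss_weights[t] * criterion(logits[t], labels[t]) for t in logits)
```
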
### Training Data

The model is trained on publicly available academic speech datasets, totaling approximately **2,050 hours** of audio.

| Task | Dataset(s) | #Samples | Hours |
|------|------------|----------|-------|
| Speaker Identification | VoxCeleb2 | 974k | 2026 |
| Paralinguistic Tasks | CREMA-D, RAVDESS, IEMOCAP, TESS | 18.3k | 20 |

### Training Code

Full training scripts and configurations are available at:
https://github.com/AudenAI/Auden/tree/main/examples/voice

---

## Intended Use

This model is intended to be used as a **general-purpose voice encoder** for:

- Speaker identification and verification
- Speaker diarization (see the clustering sketch after this list)
- Emotion, gender, and age classification
- Audio–text and text–audio retrieval
- Speech-related downstream tasks that benefit from pretrained voice embeddings
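
As a minimal diarization-style sketch, segment-level embeddings (computed as in "How to Use" below) can be grouped by agglomerative clustering. The random array stands in for real embeddings, the cosine-distance threshold is illustrative and must be tuned per domain, and the `metric` argument assumes scikit-learn ≥ 1.2.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in for real segment embeddings extracted with the encoder ([N, 768]).
segment_embeddings = np.random.randn(12, 768).astype(np.float32)
segment_embeddings /= np.linalg.norm(segment_embeddings, axis=1, keepdims=True)

# Group segments into pseudo-speakers by cosine distance; the 0.6 threshold
# is illustrative, not a recommended setting.
clustering = AgglomerativeClustering(
    n_clusters=None,
    metric="cosine",
    linkage="average",
    distance_threshold=0.6,
)
speaker_labels = clustering.fit_predict(segment_embeddings)
print(speaker_labels)  # one cluster ID per segment
```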

---

## How to Use

### Load the Encoder

```python
from auden.auto.auto_model import AutoModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-voice")
encoder = encoder.to(device)
encoder.eval()
```

### Extract Voice Embeddings

```python
import torch.nn.functional as F

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
embeddings_list = []

for i, audio_file in enumerate(audio_files, start=1):
    x, x_lens = encoder.extract_feature([audio_file])
    x, x_lens = x.to(device), x_lens.to(device)

    with torch.no_grad():
        encoder_output = encoder(x, x_lens)
        frame_embeddings = encoder_output["encoder_out"]  # [B, T, D]

    # Global average pooling over valid (unpadded) frames
    # (example pooling for speaker verification)
    T = frame_embeddings.size(1)
    mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
    utterance_embedding = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

    print(f"🎵 Audio {i}:")
    print(f"Frame embeddings shape: {frame_embeddings.shape}")
    print(f"Utterance embedding shape: {utterance_embedding.shape}")
    print()

    embeddings_list.append(utterance_embedding)

embeddings = torch.cat(embeddings_list, dim=0)  # [N, D]
embeddings = F.normalize(embeddings, p=2, dim=-1)

similarity = torch.dot(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.4f}")
# The 0.5 decision threshold is illustrative only; tune it on held-out data.
print(f"Same speaker: {'YES' if similarity > 0.5 else 'NO'}")
```

Expected output:

```
🎵 Audio 1:
Frame embeddings shape: torch.Size([1, 97, 768])
Utterance embedding shape: torch.Size([1, 768])

🎵 Audio 2:
Frame embeddings shape: torch.Size([1, 138, 768])
Utterance embedding shape: torch.Size([1, 768])

Cosine similarity: 0.7234
Same speaker: YES
```
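
Since `extract_feature` accepts a list of paths and returns per-item lengths, it may also pad and batch several files in a single forward pass. The following continues from the snippet above and is an untested sketch under that assumption; fall back to the per-file loop if batching behaves differently.

```python
# Assumed batched variant: extract_feature pads the list and returns true lengths.
x, x_lens = encoder.extract_feature(audio_files)
x, x_lens = x.to(device), x_lens.to(device)

with torch.no_grad():
    frame_embeddings = encoder(x, x_lens)["encoder_out"]  # [N, T_max, D]

# Masked mean pooling, exactly as in the loop above
T = frame_embeddings.size(1)
mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
batch_embeddings = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)  # [N, D]
```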

## Performance

| Task | Dataset | Metric | Score |
|------|---------|--------|-------|
| Speaker Identification | VoxCeleb2 | Accuracy | 95.25% |
| Speaker Verification | VoxCeleb1-O | EER | 3% |
| Speaker Diarization | VoxConverse | DER | 17% |
| Age Classification | CREMA-D | Accuracy | 93.91% |
| Gender Classification | CREMA-D | Accuracy | 99.72% |
| Gender Classification | RAVDESS | Accuracy | 100% |
| Emotion Classification | CREMA-D | Accuracy | 83.99% |
| Emotion Classification | RAVDESS | Accuracy | 89.71% |
| Audio → Text Retrieval | ParaspeechCaps | R@1 | 63.31 |
| Text → Audio Retrieval | ParaspeechCaps | R@1 | 61.69 |
| LLM-QA Emotion | AirBench-MELD | Accuracy | 27.23% |
| LLM-QA Emotion | AirBench-IEMOCAP | Accuracy | 84.70% |
| LLM-QA Gender | AirBench-MELD | Accuracy | 81.58% |
| LLM-QA Gender | AirBench-CommonVoice | Accuracy | 93.15% |
| LLM-QA Age | AirBench-CommonVoice | Accuracy | 58.27% |

## Limitations

- The model is trained primarily on English speech data and may not generalize well to other languages.
- The model is not evaluated on generative tasks such as speech synthesis or voice conversion.
- Utterance-level representations depend on the pooling strategy selected by the user.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{huo2025auden,
  title={Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding},
  author={Huo, Mingyue and Tseng, Wei-Cheng and Shao, Yiwen and Zhang, Hao and Yu, Dong},
  journal={arXiv preprint arXiv:2511.15145},
  year={2025}
}
```