|
|
---
license: apache-2.0
---
|
|
# Auden-Voice |
|
|
|
|
|
**Auden-Voice** is a general-purpose voice encoder trained to learn robust speaker representations. |
|
|
|
|
|
The model is trained using **multi-task learning**, where jointly optimizing speaker identification, emotion, gender, and age classification objectives leads to more general and transferable voice representations. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model type**: Voice encoder |
|
|
- **Architecture**: Zipformer |
|
|
- **Embedding dimension**: 768 |
|
|
- **Number of parameters**: ~156M |
|
|
- **Framework**: PyTorch |
|
|
- **Output**: Frame-level embeddings `[B, T, D]` |
|
|
- **Pooling**: User-defined (e.g., mean pooling for utterance-level embeddings) |
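For example, a plain (unmasked) mean pool turns the frame-level output into one vector per utterance; the How to Use section below shows a length-masked variant for padded batches. A minimal sketch with dummy tensors:

```python
import torch

# Dummy frame-level encoder output: batch of 2, 100 frames, 768 dims.
frame_embeddings = torch.randn(2, 100, 768)  # [B, T, D]

# Mean pooling over the time axis gives one utterance-level embedding each.
utterance_embeddings = frame_embeddings.mean(dim=1)  # [B, D]
print(utterance_embeddings.shape)  # torch.Size([2, 768])
```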
|
|
|
|
|
--- |
|
|
|
|
|
## Training |
|
|
|
|
|
### Training Strategy |
|
|
|
|
|
Of the training strategies explored, multi-task learning was found to work best. The model is jointly trained on the following tasks:
|
|
|
|
|
- Speaker identification |
|
|
- Emotion classification |
|
|
- Gender classification |
|
|
- Age classification |
|
|
|
|
|
This setup encourages the encoder to learn robust and general-purpose voice representations. |
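As an illustration only (the actual objective lives in the training code linked below), a setup like this typically attaches one classification head per task to a pooled encoder embedding and minimizes a weighted sum of per-task cross-entropy losses. In the sketch below, the head names, class counts, and loss weights are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical multi-task heads on top of a pooled [B, 768] encoder embedding.
# Class counts are illustrative (5994 ~ VoxCeleb2 dev speakers, 6 ~ CREMA-D
# emotions); the released training code may differ.
class MultiTaskHeads(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.speaker = nn.Linear(dim, 5994)
        self.emotion = nn.Linear(dim, 6)
        self.gender = nn.Linear(dim, 2)
        self.age = nn.Linear(dim, 4)  # hypothetical age bins

def multitask_loss(heads, pooled, labels, weights=(1.0, 1.0, 1.0, 1.0)):
    # Weighted sum of per-task cross-entropies over the shared embedding.
    losses = (
        F.cross_entropy(heads.speaker(pooled), labels["speaker"]),
        F.cross_entropy(heads.emotion(pooled), labels["emotion"]),
        F.cross_entropy(heads.gender(pooled), labels["gender"]),
        F.cross_entropy(heads.age(pooled), labels["age"]),
    )
    return sum(w * l for w, l in zip(weights, losses))
```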
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model is trained on publicly available academic speech datasets, totaling approximately **2050 hours** of audio. |
|
|
|
|
|
| Task | Dataset(s) | #Samples | Hours | |
|
|
|-----|-----------|----------|-------| |
|
|
| Speaker Identification | VoxCeleb2 | 974k | 2026 | |
|
|
| Paralinguistic Tasks | CREMA-D, RAVDESS, IEMOCAP, TESS | 18.3k | 20 | |
|
|
|
|
|
### Training Code |
|
|
|
|
|
Full training scripts and configurations are available at: |
|
|
https://github.com/AudenAI/Auden/tree/main/examples/voice |
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is intended to be used as a **general-purpose voice encoder** for: |
|
|
|
|
|
- Speaker identification and verification |
|
|
- Speaker diarization |
|
|
- Emotion, gender, and age classification |
|
|
- Audio–text and text–audio retrieval |
|
|
- Speech-related downstream tasks that benefit from pretrained voice embeddings |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Load the Encoder |
|
|
|
|
|
```python
import torch

from auden.auto.auto_model import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-voice")
encoder = encoder.to(device)
encoder.eval()
```

### Extract Voice Embeddings

```python
import torch.nn.functional as F

audio_files = ["/path/to/audio1.wav", "/path/to/audio2.wav"]
embeddings_list = []

for i, audio_file in enumerate(audio_files):
    # Compute input features for a single file (a batch of one).
    x, x_lens = encoder.extract_feature([audio_file])
    x, x_lens = x.to(device), x_lens.to(device)

    with torch.no_grad():
        encoder_output = encoder(x, x_lens)
        frame_embeddings = encoder_output["encoder_out"]  # [B, T, D]

    # Length-masked global average pooling (example for speaker verification).
    T = frame_embeddings.size(1)
    mask = (torch.arange(T, device=device).unsqueeze(0) < x_lens.unsqueeze(1)).unsqueeze(-1).float()
    utterance_embedding = (frame_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

    print(f"🎵 Audio {i + 1}:")
    print(f"Frame embeddings shape: {frame_embeddings.shape}")
    print(f"Utterance embedding shape: {utterance_embedding.shape}")
    print()

    embeddings_list.append(utterance_embedding)

embeddings = torch.cat(embeddings_list, dim=0)  # [N, D]
embeddings = F.normalize(embeddings, p=2, dim=-1)

# Cosine similarity between the two L2-normalized utterance embeddings.
similarity = torch.matmul(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.4f}")

# 0.5 is a placeholder decision threshold; calibrate it on held-out pairs.
print(f"Same speaker: {'YES' if similarity > 0.5 else 'NO'}")
```

### Expected Output

```
🎵 Audio 1:
Frame embeddings shape: torch.Size([1, 97, 768])
Utterance embedding shape: torch.Size([1, 768])

🎵 Audio 2:
Frame embeddings shape: torch.Size([1, 138, 768])
Utterance embedding shape: torch.Size([1, 768])

Cosine similarity: 0.7234
Same speaker: YES
```
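Beyond a single pair, pairwise scores for a whole set of utterances reduce to one matrix product once the embeddings are L2-normalized. A minimal sketch continuing from the `embeddings` tensor above:

```python
# Pairwise cosine similarities for all N utterances ([N, N]); valid because
# `embeddings` was L2-normalized above.
sim_matrix = embeddings @ embeddings.T

# Mask the diagonal (self-similarity) before picking the closest pair.
sim_matrix.fill_diagonal_(-1.0)
i, j = divmod(sim_matrix.argmax().item(), sim_matrix.size(1))
print(f"Most similar pair: audio {i + 1} and audio {j + 1} "
      f"(cosine {sim_matrix[i, j].item():.4f})")
```

The same matrix is a natural starting point for clustering-based diarization or embedding retrieval.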
|
|
|
|
|
---

## Performance
|
|
|
|
|
| Task | Dataset | Metric | Score |
|------|---------|--------|-------|
| Speaker Identification | VoxCeleb2 | Accuracy | 95.25% |
| Speaker Verification | VoxCeleb1-O | EER | 3% |
| Speaker Diarization | VoxConverse | DER | 17% |
| Age Classification | CREMA-D | Accuracy | 93.91% |
| Gender Classification | CREMA-D | Accuracy | 99.72% |
| Gender Classification | RAVDESS | Accuracy | 100% |
| Emotion Classification | CREMA-D | Accuracy | 83.99% |
| Emotion Classification | RAVDESS | Accuracy | 89.71% |
| Audio → Text Retrieval | ParaspeechCaps | R@1 | 63.31 |
| Text → Audio Retrieval | ParaspeechCaps | R@1 | 61.69 |
| LLM-QA Emotion | AirBench-MELD | Accuracy | 27.23% |
| LLM-QA Emotion | AirBench-IEMOCAP | Accuracy | 84.70% |
| LLM-QA Gender | AirBench-MELD | Accuracy | 81.58% |
| LLM-QA Gender | AirBench-CommonVoice | Accuracy | 93.15% |
| LLM-QA Age | AirBench-CommonVoice | Accuracy | 58.27% |
|
|
|
|
|
|
|
|
---

## Limitations
|
|
|
|
|
- The model is trained primarily on English speech data and may not generalize well to other languages. |
|
|
- The model is not evaluated on generative tasks such as speech synthesis or voice conversion. |
|
|
- Utterance-level representations depend on the pooling strategy selected by the user. |
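For instance, a learned attentive pooling (hypothetical, not shipped with the model) can replace mean pooling when a downstream task benefits from weighting informative frames:

```python
import torch
import torch.nn as nn

# Hypothetical attention-based pooling over [B, T, D] frame embeddings, as an
# alternative to the mean pooling used in the examples above.
class AttentivePooling(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(frames), dim=1)  # [B, T, 1]
        return (weights * frames).sum(dim=1)                # [B, D]
```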
|
|
|
|
|
---

## Citation
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{huo2025auden, |
|
|
title={Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding}, |
|
|
author={Huo, Mingyue and Tseng, Wei-Cheng and Shao, Yiwen and Zhang, Hao and Yu, Dong}, |
|
|
journal={arXiv preprint arXiv:2511.15145}, |
|
|
year={2025} |
|
|
}
```