	---
	license: apache-2.0
	language:
	- en
	- zh
	- ja
	- ko
	- fr
	- es
	- pt
	- ru
	- vi
	- id
	pipeline_tag: automatic-speech-recognition
	tags:
	- tta
	- speech
	- translation
	- alignment
	- multilingual
	- retrieval
	---

	# TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation

	TTA is a multilingual speech model that jointly supports transcription, translation,
	and audio-text alignment. It delivers strong multilingual ASR and speech translation (ST)
	performance, along with cross-lingual speech retrieval.

	🔗 Paper: https://arxiv.org/abs/2511.14410
	🔗 Model: https://huggingface.co/AudenAI/auden-tta-m10
	🔗 Encoder: https://huggingface.co/AudenAI/auden-encoder-tta-m10
	🔗 Code: https://github.com/AudenAI/Auden/tree/main/examples/tta

	## 🔍 What Can This Model Do?

	- 🎙️ Multilingual ASR (transcribe)
	- 🌍 Speech translation (translate)
	- 🧩 Audio–text alignment (align)
	- 🔎 Cross-lingual speech retrieval

	## Quick Start

	### TTA model
	```python
	from auden.auto.auto_model import AutoModel

	# 1) Load a model checkpoint directory (contains config.json + weights)
	model_dir = "AudenAI/auden-tta-m10" # or any exported directory / HF repo id
	model = AutoModel.from_pretrained(model_dir)
	model = model.to("cuda")
	model.eval()

	# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
	# model.speech_encoder.extract_feature(wav) to get (x, x_lens).
	x, x_lens = ... # Tensor shapes: (B, T, F), (B,)

	inputs = (x, x_lens)
	# Alternatively, you can pass WAV inputs directly:
	# - List of WAV paths (str):
	# inputs = ["/abs/a.wav", "/abs/b.wav"]
	# - List of mono waveforms (Tensor/ndarray), 16 kHz:
	# inputs = [torch.randn(160005), torch.randn(160003)]

	# 3a) Transcribe (RNNT greedy)
	out = model.generate(inputs, task="transcribe", blank_penalty=0.0, return_timestamps=False)
	print(out["hypotheses"]) # list[str]

	# 3b) Translate (attention beam search). Language can be a single str or a list[str] per utterance
	out = model.generate(
	    inputs,
	    task="translate",
	    beam_size=5,
	    source_language=["zh"] * x.size(0),
	    target_language=["en"] * x.size(0),
	)
	print(out["hypotheses"]) # list[str]
	print(out["source_language"]) # list[str], model-predicted or provided
	print(out["target_language"]) # list[str], model-predicted or provided

	# 3c) Align (audio-text similarity)
	texts = ["hello world", "good morning"]
	out = model.generate(inputs, task="align", texts=texts)
	print(out["similarities"]) # (B, len(texts))
	print(out["audio_emb"]) # (B, emb_dim)
	print(out["text_emb"]) # (B, emb_dim)
	```
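
The align task returns a similarity matrix of shape `(B, len(texts))`, which is what cross-lingual retrieval builds on. The sketch below shows one way to turn such a matrix into ranked matches. `rank_texts` is a hypothetical helper, not part of Auden. It assumes that higher scores mean closer matches and that the matrix has already been converted to nested lists (for example with `.tolist()`, if `out["similarities"]` is a tensor).

```python
from typing import List, Sequence, Tuple


def rank_texts(
    similarities: Sequence[Sequence[float]],
    texts: Sequence[str],
    top_k: int = 1,
) -> List[List[Tuple[str, float]]]:
    """Return the top_k (text, score) pairs for each audio row.

    Hypothetical helper: assumes similarities[i][j] scores audio i against
    texts[j], and that a higher score means a better match.
    """
    ranked = []
    for row in similarities:
        if len(row) != len(texts):
            raise ValueError("each row needs one score per candidate text")
        # Sort candidate indices by descending score, then keep the best top_k.
        order = sorted(range(len(texts)), key=lambda j: row[j], reverse=True)
        ranked.append([(texts[j], row[j]) for j in order[:top_k]])
    return ranked


# Made-up scores for two utterances against two candidate texts.
example_scores = [[0.82, 0.10], [0.05, 0.77]]
example_texts = ["hello world", "good morning"]
best = rank_texts(example_scores, example_texts, top_k=1)
print(best)
```

With real model output, the call would look like `rank_texts(out["similarities"].tolist(), texts, top_k=3)`, assuming the returned object supports `.tolist()`.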

	### TTA encoder
	```python
	from auden.auto.auto_model import AutoModel

	# 1) Load the encoder checkpoint (HF repo id or exported directory)
	encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-tta-m10")
	encoder = encoder.to("cuda")
	encoder.eval()

	# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
	# encoder.extract_feature(wav) to get (x, x_lens).
	x, x_lens = ... # Tensor shapes: (B, T, F), (B,)

	# 3) Run the encoder
	encoder_output = encoder(x, x_lens)
	print(encoder_output["encoder_out"]) # (B, T//4, D)
	print(encoder_output["encoder_out_lens"]) # (B,)
	```

	## 📌 Model Characteristics

	- Input: Raw audio waveform (16 kHz recommended)
	- Output: Transcription, translation, or alignment scores
	- Encoder: TTA encoder (`AudenAI/auden-encoder-tta-m10`)
	- Tasks: transcribe / translate / align

	## 📊 Evaluation

	### Multilingual ASR & ST

	| Model | #Params | AISHELL1/2 (CER↓) | Wenet (CER↓) | LibriSpeech (WER↓) | CommonVoice (WER↓) | MLS (WER↓) | VoxPopuli (WER↓) | FLEURS (WER↓) | CoVoSTv2 (BLEU↑) |
	| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
	| Whisper Medium | 762M | 6.74 / 6.23 | 11.00 / 22.68 | 2.88 / 6.08 | 11.86 | 7.27 | 12.08 | 6.62 | 35.12 |
	| Whisper Large-v2 | 1.54B | 5.90 / 5.24 | 9.47 / 22.77 | 2.64 / 5.14 | 9.70 | 5.65 | 11.90 | 5.20 | 38.80 |
	| Whisper Large-v3 | 1.54B | 5.33 / 4.76 | 9.00 / 15.68 | 2.01 / 3.89 | 8.30 | 4.48 | 13.78 | 4.51 | 37.60 |
	| ZT (ASR) | 199M | 1.89 / 3.14 | 6.91 / 6.08 | 1.58 / 3.62 | 6.92 | 5.82 | 11.12 | 6.35 | – |
	| ZT-AED (ASR) | 246M | 1.82 / 3.07 | 6.89 / 6.18 | 1.54 / 3.59 | 6.70 | 5.71 | 10.78 | 6.18 | – |
	| ZT-AED (Full) | 246M | 1.80 / 3.03 | 6.96 / 5.94 | 1.56 / 3.76 | 6.69 | 5.72 | 10.88 | 6.17 | 34.72 |
	| 🔥 TTA (Ours) | 247M | 1.85 / 3.09 | 7.06 / 6.44 | 1.58 / 3.85 | 6.76 | 5.74 | 10.87 | 6.19 | 35.28 |
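
For readers less familiar with the metrics in this table, the sketch below shows how WER and CER are conventionally computed: the edit distance between reference and hypothesis, divided by the reference length, expressed as a percentage. It illustrates the standard definition only. The text normalization and scoring tools used to produce the numbers above are not specified here, and `edit_distance`, `wer`, and `cer` are illustrative names.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> float:
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring spaces."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


# One deleted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
# One substituted character out of four reference characters.
print(cer("你好世界", "你好世届"))
```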

	### TTA Encoder (LLM-ASR Encoder Evaluation)

	| Encoder | Aishell CER↓ | LibriSpeech WER↓ |
	| :--- | :--- | :--- |
	| Whisper-Medium | 5.47 | 4.66 |
	| Whisper-Large | 4.87 | 3.64 |
	| ZT-AED | 2.92 | 2.30 |
	| TTA (Ours) | 1.92 | 1.95 |

	## Training Data

	Full data composition (open-source links + in-house aggregation):

	| Language | Data Source | Type | Hours | Total Hours | Share |
	| :--- | :--- | :--- | :--- | :--- | :--- |
	| Chinese (Zh) | [WenetSpeech](https://github.com/wenet-e2e/WenetSpeech) | Open Source | 10,005 | 129,265 | 37.1% |
	| | [AISHELL-2](https://www.aishelltech.com/aishell_2) | Open Source | 1,000 | | |
	| | [AISHELL-1](https://huggingface.co/datasets/AISHELL/AISHELL-1) | Open Source | 150 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 237 | | |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 222 | | |
	| | In-house Data | In-house | 117,651 | | |
	| Code-Switch | [TALCS](https://github.com/SpeechClub/TALCS) | Open Source | 555 | 8,924 | 2.6% |
	| | In-house Data | In-house | 8,369 | | |
	| English (En) | [Libriheavy](https://huggingface.co/datasets/pkufool/libriheavy) | Open Source | 45,751 | 107,626 | 30.9% |
	| | [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) | Open Source | 44,659 | | |
	| | [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | Open Source | 10,000 | | |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 3,426 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 1,778 | | |
	| | [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | Open Source | 960 | | |
	| | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Open Source | 522 | | |
	| | [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | Open Source | 453 | | |
	| | [AMI Corpus](https://huggingface.co/datasets/edinburgh-cstr/ami) | Open Source | 77 | | |
	| Japanese (Ja) | [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) | Open Source | 35,389 | 40,426 | 11.6% |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 499 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 19 | | |
	| | In-house Data | In-house | 4,519 | | |
	| Korean (Ko) | [KsponSpeech (AIHub)](https://huggingface.co/datasets/cheulyop/ksponspeech) | Open Source | 965 | 20,095 | 5.8% |
	| | [KrespSpeech (AIHub)](https://aihub.or.kr/) | Open Source | 2,906 | | |
	| | [KconfSpeech (AIHub)](https://aihub.or.kr/) | Open Source | 2,928 | | |
	| | [MeetingSpeech (AIHub)](https://aihub.or.kr/) | Open Source | 4,962 | | |
	| | [GyeongsangSpeech (AIHub)](https://aihub.or.kr/) | Open Source | 2,481 | | |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 1,528 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 1 | | |
	| | In-house Data (Aggregated) | In-house | 4,324 | | |
	| Russian (Ru) | [Golos](https://huggingface.co/datasets/SberDevices/Golos) | Open Source | 1,221 | 15,246 | 4.4% |
	| | [Public Speech & Radio](https://huggingface.co/datasets/bond005/sberdevices_golos_10h) | Open Source | 1,651 | | |
	| | [Buriy Audiobook](https://huggingface.co/datasets/bond005/audio_books_russian) | Open Source | 874 | | |
	| | Public Youtube Dataset | Open Source | 809 | | |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 2,606 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 37 | | |
	| | In-house Data | In-house | 8,048 | | |
	| Vietnamese (Vi) | [GigaSpeech 2](https://huggingface.co/datasets/speechcolab/gigaspeech2) | Open Source | 6,048 | 8,390 | 2.4% |
	| | [Bud500](https://huggingface.co/datasets/linhtran92/viet_bud500) | Open Source | 324 | | |
	| | [VLSP 2020](https://vlsp.org.vn/vlsp2020) | Open Source | 101 | | |
	| | [ViMD](https://github.com/NhutP/ViMD) | Open Source | 81 | | |
	| | [LSVSC](https://huggingface.co/datasets/doof-ferb/LSVSC) | Open Source | 80 | | |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 140 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 2 | | |
	| | In-house Data | In-house | 1,614 | | |
	| Indonesian (Id) | [GigaSpeech 2](https://huggingface.co/datasets/speechcolab/gigaspeech2) | Open Source | 6,352 | 8,238 | 2.4% |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 442 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 7 | | |
	| | In-house Data | In-house | 1,437 | | |
	| French (Fr) | [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) | Open Source | 1,076 | 4,124 | 1.2% |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 1,423 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 831 | | |
	| | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Open Source | 205 | | |
	| | In-house Data | In-house | 589 | | |
	| Spanish (Es) | [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) | Open Source | 917 | 4,596 | 1.3% |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 2,399 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 502 | | |
	| | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Open Source | 151 | | |
	| | In-house Data | In-house | 627 | | |
	| Portuguese (Pt) | [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) | Open Source | 160 | 1,602 | 0.5% |
	| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 852 | | |
	| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 25 | | |
	| | In-house Data | In-house | 565 | | |

	Language totals from the same table:

	| Language | Total Hours | Share |
	| :--- | ---: | ---: |
	| Chinese (Zh) | 129,265 | 37.1% |
	| English (En) | 107,626 | 30.9% |
	| Japanese (Ja) | 40,426 | 11.6% |
	| Korean (Ko) | 20,095 | 5.8% |
	| Russian (Ru) | 15,246 | 4.4% |
	| Code-Switch | 8,924 | 2.6% |
	| Vietnamese (Vi) | 8,390 | 2.4% |
	| Indonesian (Id) | 8,238 | 2.4% |
	| Spanish (Es) | 4,596 | 1.3% |
	| French (Fr) | 4,124 | 1.2% |
	| Portuguese (Pt) | 1,602 | 0.5% |
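
The Share column follows directly from the Total Hours column. As a quick consistency check, the sketch below recomputes each share from the per-language totals listed above. The grand total is derived by summing those totals and is not a figure stated elsewhere in this card.

```python
# Per-language totals (hours), copied from the table above.
total_hours = {
    "Chinese (Zh)": 129_265,
    "English (En)": 107_626,
    "Japanese (Ja)": 40_426,
    "Korean (Ko)": 20_095,
    "Russian (Ru)": 15_246,
    "Code-Switch": 8_924,
    "Vietnamese (Vi)": 8_390,
    "Indonesian (Id)": 8_238,
    "Spanish (Es)": 4_596,
    "French (Fr)": 4_124,
    "Portuguese (Pt)": 1_602,
}

# Derived value: the sum of all per-language totals.
grand_total = sum(total_hours.values())

# Share of each language as a percentage, rounded to one decimal place.
shares = {
    language: round(100 * hours / grand_total, 1)
    for language, hours in total_hours.items()
}

print(f"grand total: {grand_total:,} hours")
for language, share in shares.items():
    print(f"{language}: {share}%")
```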

	## ⚠️ Limitations

	- Performance depends on audio quality and recording conditions.
	- For long-form audio, chunking and post-processing might be required for optimal performance.
	- Not designed for safety-critical applications.
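
One possible way to chunk long-form audio, as mentioned in the limitations above, is to split the waveform into fixed-length windows with a small overlap. The sketch below only computes sample boundaries. `chunk_spans` is a hypothetical helper, and the 30 s window and 1 s overlap are illustrative assumptions, not values recommended by the authors.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # matches the 16 kHz input rate recommended in this card


def chunk_spans(
    num_samples: int,
    chunk_seconds: float = 30.0,   # assumed window length
    overlap_seconds: float = 1.0,  # assumed overlap between windows
    sample_rate: int = SAMPLE_RATE,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices covering the whole waveform."""
    chunk = int(chunk_seconds * sample_rate)
    overlap = int(overlap_seconds * sample_rate)
    if chunk <= 0 or not 0 <= overlap < chunk:
        raise ValueError("need chunk > 0 and 0 <= overlap < chunk")
    step = chunk - overlap
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return spans


# A 70-second recording at 16 kHz splits into three overlapping windows.
spans = chunk_spans(70 * SAMPLE_RATE)
print(spans)
```

Given a mono waveform `wav`, the chunks `[wav[s:e] for s, e in chunk_spans(len(wav))]` could then be passed to `model.generate(...)` as a list of waveforms, as in the Quick Start. Merging the text from overlapping regions is left to post-processing.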

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@article{liu2025tta,
	title={TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation},
	author={Liu, Wei and Li, Jiahong and Shao, Yiwen and Yu, Dong},
	journal={arXiv preprint arXiv:2511.14410},
	year={2025}
	}
	```