---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- fr
- es
- pt
- ru
- vi
- id
pipeline_tag: automatic-speech-recognition
tags:
- tta
- speech
- translation
- alignment
- multilingual
- retrieval
---

# TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation

**TTA** is a multilingual model that jointly supports the **transcribe**, **translate**, and
**align** tasks, delivering strong multilingual ASR/ST performance together with cross-lingual
speech retrieval capability.

🔗 **Paper**: https://arxiv.org/abs/2511.14410
🔗 **Model**: https://huggingface.co/AudenAI/auden-tta-m10
🔗 **Encoder**: https://huggingface.co/AudenAI/auden-encoder-tta-m10
🔗 **Code**: https://github.com/AudenAI/Auden/tree/main/examples/tta

## 🔍 What Can This Model Do?

- 🎙️ **Multilingual ASR** (transcribe)
- 🌍 **Speech translation** (translate)
- 🧩 **Audio–text alignment** (align)
- 🔎 **Cross-lingual speech retrieval**

## Quick Start

### TTA model
```python
from auden.auto.auto_model import AutoModel

# 1) Load a model checkpoint directory (contains config.json + weights)
model_dir = "AudenAI/auden-tta-m10"  # or any exported directory / HF repo id
model = AutoModel.from_pretrained(model_dir)
model = model.to("cuda")
model.eval()

# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
#    model.speech_encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)

inputs = (x, x_lens)
# Alternatively, you can pass WAV inputs directly:
# - List of WAV paths (str):
#     inputs = ["/abs/a.wav", "/abs/b.wav"]
# - List of mono waveforms (Tensor/ndarray), 16 kHz:
#     inputs = [torch.randn(16000*5), torch.randn(16000*3)]

# 3a) Transcribe (RNNT greedy decoding)
out = model.generate(inputs, task="transcribe", blank_penalty=0.0, return_timestamps=False)
print(out["hypotheses"])  # list[str]

# 3b) Translate (attention beam search). Languages can be a single str or a list[str] per utterance
out = model.generate(
    inputs,
    task="translate",
    beam_size=5,
    source_language=["zh"] * x.size(0),
    target_language=["en"] * x.size(0),
)
print(out["hypotheses"])       # list[str]
print(out["source_language"])  # list[str], model-predicted or provided
print(out["target_language"])  # list[str], model-predicted or provided

# 3c) Align (audio-text similarity)
texts = ["hello world", "good morning"]
out = model.generate(inputs, task="align", texts=texts)
print(out["similarities"])  # (B, len(texts))
print(out["audio_emb"])     # (B, emb_dim)
print(out["text_emb"])      # (B, emb_dim)
```
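The embeddings returned by the `align` task can back cross-lingual speech retrieval: score each audio embedding against a bank of candidate text embeddings and take the top match. A minimal cosine-similarity sketch in NumPy (the embeddings below are toy placeholders; in practice feed in `out["audio_emb"]` and `out["text_emb"]`):

```python
import numpy as np

def retrieve(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """For each audio row, return the index of the best-matching text row
    under cosine similarity (embeddings are L2-normalized first)."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = a @ t.T              # (B_audio, B_text) cosine scores
    return sims.argmax(axis=1)  # top-1 text index per audio clip

# Toy check: audio 0 should match text 1, audio 1 should match text 0.
audio = np.array([[0.0, 1.0], [1.0, 0.0]])
texts = np.array([[0.9, 0.1], [0.1, 0.9]])
print(retrieve(audio, texts))  # [1 0]
```

For a large text bank, the same matrix product scales to top-k retrieval via `np.argsort` over the similarity rows.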

### TTA encoder
```python
from auden.auto.auto_model import AutoModel

# 1) Load the encoder
encoder = AutoModel.from_pretrained("AudenAI/auden-encoder-tta-m10")
encoder = encoder.to("cuda")

# 2) Prepare input features (x, x_lens). If you have raw audio, you can use
#    encoder.extract_feature(wav) to get (x, x_lens).
x, x_lens = ...  # Tensor shapes: (B, T, F), (B,)

# 3) Run the encoder
encoder_output = encoder(x, x_lens)
print(encoder_output["encoder_out"])       # (B, T//4, D)
print(encoder_output["encoder_out_lens"])  # (B,)
```
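If you need one fixed-size vector per utterance from the encoder outputs (e.g. for downstream classifiers), a common recipe, not an official Auden API, is length-masked mean pooling over `encoder_out` using `encoder_out_lens` so padding frames are ignored. A NumPy sketch with dummy shapes:

```python
import numpy as np

def masked_mean_pool(encoder_out: np.ndarray, lens: np.ndarray) -> np.ndarray:
    """Average each (T, D) sequence over only its first lens[i] valid frames."""
    B, T, D = encoder_out.shape
    mask = np.arange(T)[None, :] < lens[:, None]           # (B, T) validity mask
    summed = (encoder_out * mask[:, :, None]).sum(axis=1)  # zero out padding
    return summed / lens[:, None]                          # (B, D)

# Dummy batch: 2 utterances, 5 frames max, 3-dim features.
x = np.ones((2, 5, 3))
x[1, 3:] = 100.0        # padding frames that must not leak into the mean
lens = np.array([5, 3])
print(masked_mean_pool(x, lens))  # both rows are all ones
```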

## 📌 Model Characteristics

- Input: Raw audio waveform (16 kHz recommended)
- Output: Transcription, translation, or alignment scores
- Encoder: TTA encoder (`AudenAI/auden-encoder-tta-m10`)
- Tasks: transcribe / translate / align
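Since 16 kHz input is recommended, audio at other sample rates should be resampled first. In practice a library such as torchaudio or librosa is the right tool; purely as a dependency-free illustration of the idea, here is a crude linear-interpolation resampler in NumPy (fine for a quick test, not for production-quality audio):

```python
import numpy as np

def resample_linear(wav: np.ndarray, sr_in: int, sr_out: int = 16000) -> np.ndarray:
    """Crude linear-interpolation resampler; prefer torchaudio/librosa for quality."""
    n_out = int(round(len(wav) * sr_out / sr_in))
    t_in = np.arange(len(wav)) / sr_in    # timestamps of input samples
    t_out = np.arange(n_out) / sr_out     # timestamps of output samples
    return np.interp(t_out, t_in, wav)

wav_8k = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s of 440 Hz at 8 kHz
wav_16k = resample_linear(wav_8k, 8000)
print(len(wav_16k))  # 16000
```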

## 📊 Evaluation

### Multilingual ASR & ST

| Model | #Params | AISHELL1/2 (CER↓) | Wenet (CER↓) | LibriSpeech (WER↓) | CommonVoice (WER↓) | MLS (WER↓) | VoxPopuli (WER↓) | FLEURS (WER↓) | CoVoSTv2 (BLEU↑) |
|--------|----------|------------------|---------------|---------------------|--------------------|-------------|-------------------|----------------|-------------------|
| **Whisper Medium** | 762M | 6.74 / 6.23 | 11.00 / 22.68 | 2.88 / 6.08 | 11.86 | 7.27 | 12.08 | 6.62 | 35.12 |
| **Whisper Large-v2** | 1.54B | 5.90 / 5.24 | 9.47 / 22.77 | 2.64 / 5.14 | 9.70 | 5.65 | 11.90 | 5.20 | **38.80** |
| **Whisper Large-v3** | 1.54B | 5.33 / 4.76 | 9.00 / 15.68 | 2.01 / 3.89 | 8.30 | 4.48 | 13.78 | 4.51 | 37.60 |
| **ZT (ASR)** | 199M | 1.89 / 3.14 | 6.91 / 6.08 | 1.58 / 3.62 | 6.92 | 5.82 | 11.12 | 6.35 | – |
| **ZT-AED (ASR)** | 246M | 1.82 / 3.07 | 6.89 / 6.18 | 1.54 / 3.59 | 6.70 | 5.71 | 10.78 | 6.18 | – |
| **ZT-AED (Full)** | 246M | 1.80 / 3.03 | 6.96 / 5.94 | 1.56 / 3.76 | 6.69 | 5.72 | 10.88 | 6.17 | 34.72 |
| **🔥 TTA (Ours)** | **247M** | **1.85 / 3.09** | **7.06 / 6.44** | **1.58 / 3.85** | **6.76** | **5.74** | **10.87** | **6.19** | **35.28** |

### TTA Encoder (LLM-ASR Encoder Evaluation)

| Encoder | Aishell CER↓ | LibriSpeech WER↓ |
|----------|---------------|------------------|
| Whisper-Medium | 5.47 | 4.66 |
| Whisper-Large | 4.87 | 3.64 |
| ZT-AED | 2.92 | 2.30 |
| **TTA (Ours)** | **1.92** | **1.95** |

## Training Data

Full data composition (open-source links + in-house aggregation):

| Language | Data Source | Type | Hours | Total Hours | Share |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Chinese (Zh)** | [WenetSpeech](https://github.com/wenet-e2e/WenetSpeech) | Open Source | 10,005 | 129,265 | 37.1% |
| | [AISHELL-2](https://www.aishelltech.com/aishell_2) | Open Source | 1,000 | | |
| | [AISHELL-1](https://huggingface.co/datasets/AISHELL/AISHELL-1) | Open Source | 150 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 237 | | |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 222 | | |
| | *In-house Data* | In-house | 117,651 | | |
| **Code-Switch** | [TALCS](https://github.com/SpeechClub/TALCS) | Open Source | 555 | 8,924 | 2.6% |
| | *In-house Data* | In-house | 8,369 | | |
| **English (En)** | [Libriheavy](https://huggingface.co/datasets/pkufool/libriheavy) | Open Source | 45,751 | 107,626 | 30.9% |
| | [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) | Open Source | 44,659 | | |
| | [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | Open Source | 10,000 | | |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 3,426 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 1,778 | | |
| | [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | Open Source | 960 | | |
| | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Open Source | 522 | | |
| | [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | Open Source | 453 | | |
| | [AMI Corpus](https://huggingface.co/datasets/edinburgh-cstr/ami) | Open Source | 77 | | |
| **Japanese (Ja)** | [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) | Open Source | 35,389 | 40,426 | 11.6% |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 499 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 19 | | |
| | *In-house Data* | In-house | 4,519 | | |
| **Korean (Ko)** | [KsponSpeech (AIHub)](https://huggingface.co/datasets/cheulyop/ksponspeech) | Open Source | 965 | 20,095 | 5.8% |
| | [KrespSpeech (AIHub)](https://aihub.or.kr/) | Open Source | 2,906 | | |
| | [KconfSpeech (AIHub)](https://aihub.or.kr/) | Open Source | 2,928 | | |
| | [MeetingSpeech (AIHub)](https://aihub.or.kr/) | Open Source | 4,962 | | |
| | [GyeongsangSpeech (AIHub)](https://aihub.or.kr/) | Open Source | 2,481 | | |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 1,528 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 1 | | |
| | *In-house Data (Aggregated)* | In-house | 4,324 | | |
| **Russian (Ru)** | [Golos](https://huggingface.co/datasets/SberDevices/Golos) | Open Source | 1,221 | 15,246 | 4.4% |
| | [Public Speech & Radio](https://huggingface.co/datasets/bond005/sberdevices_golos_10h) | Open Source | 1,651 | | |
| | [Buriy Audiobook](https://huggingface.co/datasets/bond005/audio_books_russian) | Open Source | 874 | | |
| | Public Youtube Dataset | Open Source | 809 | | |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 2,606 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 37 | | |
| | *In-house Data* | In-house | 8,048 | | |
| **Vietnamese (Vi)** | [GigaSpeech 2](https://huggingface.co/datasets/speechcolab/gigaspeech2) | Open Source | 6,048 | 8,390 | 2.4% |
| | [Bud500](https://huggingface.co/datasets/linhtran92/viet_bud500) | Open Source | 324 | | |
| | [VLSP 2020](https://vlsp.org.vn/vlsp2020) | Open Source | 101 | | |
| | [ViMD](https://github.com/NhutP/ViMD) | Open Source | 81 | | |
| | [LSVSC](https://huggingface.co/datasets/doof-ferb/LSVSC) | Open Source | 80 | | |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 140 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 2 | | |
| | *In-house Data* | In-house | 1,614 | | |
| **Indonesian (Id)** | [GigaSpeech 2](https://huggingface.co/datasets/speechcolab/gigaspeech2) | Open Source | 6,352 | 8,238 | 2.4% |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 442 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 7 | | |
| | *In-house Data* | In-house | 1,437 | | |
| **French (Fr)** | [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) | Open Source | 1,076 | 4,124 | 1.2% |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 1,423 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 831 | | |
| | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Open Source | 205 | | |
| | *In-house Data* | In-house | 589 | | |
| **Spanish (Es)** | [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) | Open Source | 917 | 4,596 | 1.3% |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 2,399 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 502 | | |
| | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | Open Source | 151 | | |
| | *In-house Data* | In-house | 627 | | |
| **Portuguese (Pt)** | [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) | Open Source | 160 | 1,602 | 0.5% |
| | [Yodas](https://huggingface.co/datasets/espnet/yodas) | Open Source | 852 | | |
| | [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Open Source | 25 | | |
| | *In-house Data* | In-house | 565 | | |

Language totals from the same table:

| Language | Total Hours | Share |
| :--- | ---: | ---: |
| Chinese (Zh) | 129,265 | 37.1% |
| English (En) | 107,626 | 30.9% |
| Japanese (Ja) | 40,426 | 11.6% |
| Korean (Ko) | 20,095 | 5.8% |
| Russian (Ru) | 15,246 | 4.4% |
| Code-Switch | 8,924 | 2.6% |
| Vietnamese (Vi) | 8,390 | 2.4% |
| Indonesian (Id) | 8,238 | 2.4% |
| Spanish (Es) | 4,596 | 1.3% |
| French (Fr) | 4,124 | 1.2% |
| Portuguese (Pt) | 1,602 | 0.5% |

## ⚠️ Limitations

- Performance depends on audio quality and recording conditions.
- For long-form audio, chunking and post-processing might be required for optimal performance.
- Not designed for safety-critical applications.
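The long-form caveat above can be handled by splitting the waveform into fixed windows with some overlap before calling `generate` on each chunk, then stitching the hypotheses. A minimal sketch (the 30 s window / 5 s overlap are illustrative defaults, not tuned values):

```python
import numpy as np

def chunk_waveform(wav: np.ndarray, sr: int = 16000,
                   win_s: float = 30.0, hop_s: float = 25.0) -> list:
    """Split wav into win_s-second windows every hop_s seconds (5 s overlap)."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return [wav[start:start + win] for start in range(0, len(wav), hop)]

wav = np.zeros(16000 * 70)  # 70 s of (silent) placeholder audio
chunks = chunk_waveform(wav)
print([len(c) / 16000 for c in chunks])  # chunk lengths in seconds: [30.0, 30.0, 20.0]
```

Each chunk can then be passed as one element of the waveform-list input shown in Quick Start; de-duplicating text in the overlapped regions is left to post-processing.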

## Citation

If you use this model in your research, please cite:

```bibtex
@article{liu2025tta,
  title={TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation},
  author={Liu, Wei and Li, Jiahong and Shao, Yiwen and Yu, Dong},
  journal={arXiv preprint arXiv:2511.14410},
  year={2025}
}
```