Approximetal committed on
Commit 8ef7026 · verified · 1 Parent(s): af250da

Update README.md

  1. README.md +12 -2
README.md CHANGED
@@ -7,6 +7,16 @@ sdk: static
  pinned: false
  ---
 
- LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models
+ # LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models
 
- We present LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via a efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth-boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
+ [LEMAS](https://lemas-project.github.io/LEMAS-Project/) is a large-scale extensible multilingual audio suite, providing the largest open-source multilingual speech corpus with word-level timestamps to our knowledge, covering over 150,000 hours across 10 major languages. Built with a rigorous alignment and confidence-based filtering pipeline, LEMAS supports diverse generative paradigms including zero-shot multilingual synthesis (LEMAS-TTS) and seamless speech editing (LEMAS-Edit).
+
+
+ <details><summary>Citation</summary>
+
+ @article{zhao2026lemas,
+ title={LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models},
+ author={Zhao, Zhiyuan and Lin, Lijian and Zhu, Ye and Xie, Kai and Liu, Yunfei and Li, Yu},
+ journal={arXiv preprint arXiv:2601.04233},
+ year={2026}
+ }