| license: apache-2.0 | |
| language: | |
| - en | |
| - es | |
| - pt | |
| - ru | |
| - fr | |
| - ja | |
| - ko | |
| - de | |
| - multilingual | |
| tags: | |
| - audio-generation | |
| - text-to-audio | |
| - text-to-speech | |
| - text-to-music | |
| - sound-effects | |
| - diffusion | |
| - multilingual | |
| library_name: transformers | |
| pipeline_tag: text-to-audio | |
| # Dasheng-AudioGen-Multilingual | |
| [arXiv](https://arxiv.org/abs/2605.27838) | |
| [GitHub](https://github.com/xiaomi-research/dasheng-audiogen) | |
| [Hugging Face model](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | |
| [Hugging Face Space](https://huggingface.co/spaces/mispeech/Dasheng-AudioGen) | |
| [Demo page](https://nieeim.github.io/Dasheng-AudioGen-Web/) | |
| <!-- [](https://colab.research.google.com/#fileId=https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual/resolve/main/notebook.ipynb) --> | |
| [**English**](./README.md) | [**中文**](./README_zh.md) | |
| **Dasheng-AudioGen-Multilingual** is the multilingual variant of Dasheng-AudioGen, a unified audio generation model that can jointly synthesize **intelligible speech, music, sound effects, and environmental acoustics** from text descriptions. | |
| <p align="center"> | |
| <video | |
| src="https://github.com/user-attachments/assets/497f5688-8731-4830-8ee7-b9cf4234d900" | |
| controls | |
| autoplay | |
| muted | |
| loop | |
| playsinline | |
| width="85%"> | |
| </video> | |
| </p> | |
| ## Models | |
| | Model | HuggingFace | Text Encoder | Language | | |
| |-------|-------------|-------------|:--------:| | |
| | Dasheng-AudioGen | [mispeech/Dasheng-AudioGen](https://huggingface.co/mispeech/Dasheng-AudioGen) | `google/flan-t5-large` | English | | |
| | Dasheng-AudioGen-Multilingual | [mispeech/Dasheng-AudioGen-Multilingual](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | `google/mt5-large` | Multilingual | | |
| ### Language Support | |
| | Language | Duration (h) | Proportion | | |
| |----------|------------:|----------:| | |
| | English | 15,367.80 | 58.86% | | |
| | Spanish | 2,740.96 | 10.50% | | |
| | Portuguese | 1,916.24 | 7.34% | | |
| | Russian | 1,217.39 | 4.66% | | |
| | French | 933.91 | 3.58% | | |
| | Japanese | 874.51 | 3.35% | | |
| | Korean | 848.15 | 3.25% | | |
| | German | 842.29 | 3.23% | | |
| | Other | 1,369.16 | 5.24% | | |
| > **Note:** The current multilingual model has notably higher synthesis error rates for every non-English language than for English. Languages outside the table above are even less reliable. For English-only use cases, the base model (`mispeech/Dasheng-AudioGen`) is recommended. | |
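The Proportion column can be reproduced from the Duration column. The sketch below assumes each proportion is simply that language's share of the summed durations; the README does not state how the column was computed, and the names `DURATION_HOURS` and `proportions` are illustrative.

```python
# Durations in hours, copied from the Language Support table.
DURATION_HOURS = {
    "English": 15367.80,
    "Spanish": 2740.96,
    "Portuguese": 1916.24,
    "Russian": 1217.39,
    "French": 933.91,
    "Japanese": 874.51,
    "Korean": 848.15,
    "German": 842.29,
    "Other": 1369.16,
}


def proportions(durations):
    """Return each entry's share of the total duration, in percent (2 decimals)."""
    total = sum(durations.values())
    return {lang: round(100.0 * hours / total, 2) for lang, hours in durations.items()}


if __name__ == "__main__":
    for lang, pct in proportions(DURATION_HOURS).items():
        print(f"{lang:<12}{pct:6.2f}%")
```

Because each value is rounded independently, the rounded percentages need not sum to exactly 100.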
| ## Installation | |
| ```bash | |
| pip install torch torchaudio "transformers<5" einops | |
| ``` | |
| > Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x. | |
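Since the model is reported as incompatible with transformers 5.x, a quick check of the installed version before loading can save a confusing failure later. This is a minimal sketch using only the standard library; the helper name `is_supported_transformers` is hypothetical and only inspects the major version number.

```python
from importlib import metadata


def is_supported_transformers(version: str) -> bool:
    """Return True if the version string has a major version below 5."""
    major = int(version.split(".")[0])
    return major < 5


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if is_supported_transformers(installed) else "unsupported (needs <5)"
        print(f"transformers {installed}: {status}")
```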
| ## Prompt Format | |
| Dasheng-AudioGen uses structured tags to describe different audio aspects. A valid prompt **must start with the `<|caption|>` tag**, which provides the overall scene description. Other tags are optional and can be included as needed. | |
| | Tag | Description | Required | | |
| |-----|-------------|:--------:| | |
| | `<\|caption\|>` | Overall audio scene description | Yes | | |
| | `<\|speech\|>` | Speaker identity and speaking style | No | | |
| | `<\|asr\|>` | Spoken transcript / dialogue | No | | |
| | `<\|sfx\|>` | Sound effects | No | | |
| | `<\|music\|>` | Background music | No | | |
| | `<\|env\|>` | Environmental ambience | No | | |
| **Rules:** | |
| - The prompt must begin with `<|caption|>` — prompts without it will be rejected. | |
| - Only include tags that are relevant; omit tags with no content (e.g., skip `<|music|>` if there is no music). | |
| > **Multilingual prompt convention:** All descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) should be written in **English**. Only the `<|asr|>` field (the actual spoken content to be synthesized) should use the target language. | |
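To make the tag rules concrete, here is a small standalone sketch that assembles a tagged prompt string from optional fields. It is not the model's own `compose_prompt` method: the function `build_tagged_prompt` and the constant `TAG_ORDER` are hypothetical, and emitting tags in the order of the table above is an assumption (the README only requires that `<|caption|>` comes first).

```python
# Tag order follows the Prompt Format table; only "caption first" is
# required by the README, the rest of the ordering is an assumption.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_tagged_prompt(caption, **aspects):
    """Join non-empty aspects into one string, starting with <|caption|>."""
    if not caption or not caption.strip():
        raise ValueError("caption is required")
    unknown = set(aspects) - set(TAG_ORDER[1:])
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    fields = {"caption": caption, **aspects}
    parts = []
    for tag in TAG_ORDER:
        text = fields.get(tag)
        if text and text.strip():  # omit tags that have no content
            parts.append(f"<|{tag}|> {text.strip()}")
    return " ".join(parts)


if __name__ == "__main__":
    # Descriptive tags in English; only the asr field uses the target language.
    print(build_tagged_prompt(
        "A conversation scene on a busy city street.",
        speech="A young woman speaking softly in Spanish.",
        asr="Creo que deberíamos irnos ya.",
        env="Rain and distant traffic noise.",
    ))
```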
| ## Quick Start | |
| ### Usage 1: Aspect-wise Composition | |
| Pass each aspect as a named argument. The `caption` field is required; all other fields are optional. | |
| ```python | |
| import torchaudio | |
| from transformers import AutoModel | |
| model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda() | |
| prompt = model.compose_prompt( | |
|     caption="A conversation scene on a busy city street.", | |
|     speech="A young woman speaking softly in Spanish.", | |
|     env="Rain and distant traffic noise.", | |
|     asr="Creo que deberíamos irnos ya.", | |
| ) | |
| audio = model.generate(prompt) | |
| torchaudio.save("output.wav", audio.cpu(), 16000) | |
| ``` | |
| ### Usage 2: Pre-formatted Prompt String | |
| Pass a complete tagged string via the `prompt` parameter. The string must start with `<|caption|>`. | |
| ```python | |
| import torchaudio | |
| from transformers import AutoModel | |
| model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda() | |
| prompt = model.compose_prompt( | |
|     prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise." | |
| ) | |
| audio = model.generate(prompt) | |
| torchaudio.save("output.wav", audio.cpu(), 16000) | |
| ``` | |
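When prompts arrive as pre-formatted strings, it can help to validate them before handing them to the model. The following sketch splits a tagged string back into its fields and enforces the leading `<|caption|>` rule. `parse_tagged_prompt` is a hypothetical helper, not part of the model's API, and it assumes the six tags from the Prompt Format table are the only ones in use.

```python
import re

# Matches any of the six documented tags and captures the tag name.
TAG_PATTERN = re.compile(r"<\|(caption|speech|asr|sfx|music|env)\|>")


def parse_tagged_prompt(prompt: str) -> dict:
    """Split a tagged prompt into {tag: text}; require a leading <|caption|>."""
    text = prompt.strip()
    if not text.startswith("<|caption|>"):
        raise ValueError("prompt must start with <|caption|>")
    # With one capture group, re.split yields ['', tag1, body1, tag2, body2, ...]
    pieces = TAG_PATTERN.split(text)
    fields = {}
    for tag, body in zip(pieces[1::2], pieces[2::2]):
        fields[tag] = body.strip()  # a repeated tag overwrites the earlier one
    return fields


if __name__ == "__main__":
    example = "<|caption|> A dog barking in a park. <|sfx|> Loud dog bark."
    print(parse_tagged_prompt(example))
```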
| ### Batch Inference | |
| ```python | |
| import torchaudio | |
| from transformers import AutoModel | |
| model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda() | |
| prompts = [ | |
|     model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."), | |
|     model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."), | |
|     model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."), | |
| ] | |
| audios = model.generate(prompts) | |
| for i, audio in enumerate(audios): | |
|     # torchaudio.save expects a (channels, time) tensor, hence unsqueeze(0) | |
|     torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000) | |
| ``` | |
| ### Generation Parameters | |
| ```python | |
| import torchaudio | |
| from transformers import AutoModel | |
| model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda() | |
| prompt = model.compose_prompt(caption="A dog barking in a park") | |
| audio = model.generate( | |
|     prompts=prompt, | |
|     num_steps=25,             # number of denoising steps (default: 25) | |
|     guidance_scale=5.0,       # classifier-free guidance scale (default: 5.0) | |
|     sway_sampling_coef=-1.0,  # sway sampling coefficient (default: -1.0, 0 for linear) | |
| ) | |
| torchaudio.save("output.wav", audio.cpu(), 16000) | |
| ``` | |
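To give a feel for what `sway_sampling_coef` might do, the sketch below computes a time schedule using the sway sampling map from F5-TTS, `t' = t + s * (cos(pi/2 * t) - 1 + t)`. Whether Dasheng-AudioGen uses exactly this formula is an assumption; the README only states that a coefficient of 0 gives a linear schedule, which this map satisfies. The function name `sway_timesteps` and the choice of `num_steps + 1` time points are illustrative.

```python
import math


def sway_timesteps(num_steps, sway_sampling_coef=-1.0):
    """Return num_steps + 1 time points in [0, 1] after applying the sway map.

    Assumed formula (from F5-TTS): t' = t + s * (cos(pi/2 * t) - 1 + t).
    """
    times = []
    for i in range(num_steps + 1):
        t = i / num_steps  # uniform grid before warping
        t = t + sway_sampling_coef * (math.cos(math.pi / 2 * t) - 1 + t)
        times.append(t)
    return times


if __name__ == "__main__":
    print("linear:", [round(t, 3) for t in sway_timesteps(5, 0.0)])
    print("sway  :", [round(t, 3) for t in sway_timesteps(5, -1.0)])
```

Under this assumed formula, a coefficient of -1.0 reduces the map to `1 - cos(pi/2 * t)`, which places more of the steps near `t = 0` than a uniform grid does.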
| ## Acknowledgments | |
| Dasheng-AudioGen was developed with contributions from **XIAOMI LLM PLUS** and **SJTU X-LANCE**. | |
| ## Citation | |
| ```bibtex | |
| @article{mei2026dashengaudiogen, | |
| title = {Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text}, | |
| author = {Jiahao Mei and Heinrich Dinkel and Yadong Niu and Xingwei Sun and Gang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan and Mengyue Wu}, | |
| journal = {arXiv preprint arXiv:2605.27838}, | |
| year = {2026} | |
| } | |
| ``` | |
| ## License | |
| This project is released under the [Apache License 2.0](LICENSE). | |