	---
	license: apache-2.0
	language:
	- en
	- es
	- pt
	- ru
	- fr
	- ja
	- ko
	- de
	- multilingual
	tags:
	- audio-generation
	- text-to-audio
	- text-to-speech
	- text-to-music
	- sound-effects
	- diffusion
	- multilingual
	library_name: transformers
	pipeline_tag: text-to-audio
	---

	# Dasheng-AudioGen-Multilingual

	[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv)](https://arxiv.org/abs/2605.27838)
	[![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/xiaomi-research/dasheng-audiogen)
	[![Hugging Face Model](https://img.shields.io/badge/HuggingFace-Model-orange?logo=huggingface)](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual)
	[![Hugging Face Demo](https://img.shields.io/badge/HuggingFace-Demo-orange?logo=huggingface)](https://huggingface.co/spaces/mispeech/Dasheng-AudioGen)
	[![Web Demo](https://img.shields.io/badge/Website-Demo-181717?logo=google-chrome)](https://nieeim.github.io/Dasheng-AudioGen-Web/)
	<!-- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual/resolve/main/notebook.ipynb) -->

	[English](./README.md) | [中文](./README_zh.md)

	Dasheng-AudioGen-Multilingual is the multilingual variant of Dasheng-AudioGen, a unified audio generation model that can jointly synthesize intelligible speech, music, sound effects, and environmental acoustics from text descriptions.

	<p align="center">
	<video
	src="https://github.com/user-attachments/assets/497f5688-8731-4830-8ee7-b9cf4234d900"
	controls
	autoplay
	muted
	loop
	playsinline
	width="85%">
	</video>
	</p>

	## Models

	| Model | Hugging Face | Text Encoder | Language |
	|-------|--------------|--------------|:--------:|
	| Dasheng-AudioGen | [mispeech/Dasheng-AudioGen](https://huggingface.co/mispeech/Dasheng-AudioGen) | `google/flan-t5-large` | English |
	| Dasheng-AudioGen-Multilingual | [mispeech/Dasheng-AudioGen-Multilingual](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | `google/mt5-large` | Multilingual |

	### Language Support

	| Language | Duration (h) | Proportion |
	|----------|-------------:|-----------:|
	| English | 15,367.80 | 58.86% |
	| Spanish | 2,740.96 | 10.50% |
	| Portuguese | 1,916.24 | 7.34% |
	| Russian | 1,217.39 | 4.66% |
	| French | 933.91 | 3.58% |
	| Japanese | 874.51 | 3.35% |
	| Korean | 848.15 | 3.25% |
	| German | 842.29 | 3.23% |
	| Other | 1,369.16 | 5.24% |

	> Note: The current multilingual model has notably higher synthesis error rates for all non-English languages than for English, and languages outside the table above are even less reliable. For English-only use cases, the base model (`mispeech/Dasheng-AudioGen`) is recommended.
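
As a quick consistency check, the proportions in the table can be recomputed from the listed durations, which sum to roughly 26,110 hours. The snippet below is only an illustrative sketch: the numbers are copied from the table, and the variable names are made up for this example.

```python
# Durations in hours, copied from the Language Support table.
hours = {
    "English": 15367.80,
    "Spanish": 2740.96,
    "Portuguese": 1916.24,
    "Russian": 1217.39,
    "French": 933.91,
    "Japanese": 874.51,
    "Korean": 848.15,
    "German": 842.29,
    "Other": 1369.16,
}

total_hours = sum(hours.values())

# Share of each language as a percentage of the total, rounded to two decimals.
proportions = {lang: round(100 * h / total_hours, 2) for lang, h in hours.items()}

for lang, pct in proportions.items():
    print(f"{lang:<11}{hours[lang]:>10,.2f} h {pct:>6.2f}%")
```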

	## Installation

	```bash
	pip install torch torchaudio "transformers<5" einops
	```

	> Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x.
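
To confirm that an environment respects the `transformers<5` constraint before loading the model, a small check along these lines may help. It is a sketch using only the standard library; `major_version` and `transformers_is_supported` are hypothetical helper names, not part of any package.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '4.57.0'."""
    return int(version_string.split(".")[0])


def transformers_is_supported(version_string):
    """True when the major version is below 5, per the note above."""
    return major_version(version_string) < 5


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if transformers_is_supported(installed) else "unsupported (need <5)"
        print(f"transformers {installed}: {status}")
```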

	## Prompt Format

	Dasheng-AudioGen uses structured tags to describe different audio aspects. A valid prompt must start with the `<|caption|>` tag, which provides the overall scene description. Other tags are optional and can be included as needed.

	| Tag | Description | Required |
	|-----|-------------|:--------:|
	| `<\|caption\|>` | Overall audio scene description | Yes |
	| `<\|speech\|>` | Speaker identity and speaking style | No |
	| `<\|asr\|>` | Spoken transcript / dialogue | No |
	| `<\|sfx\|>` | Sound effects | No |
	| `<\|music\|>` | Background music | No |
	| `<\|env\|>` | Environmental ambience | No |

	Rules:
	- The prompt must begin with `<|caption|>`; prompts without it will be rejected.
	- Only include tags that are relevant; omit tags with no content (e.g., skip `<|music|>` if there is no music).

	> Multilingual prompt convention: All descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) should be written in English. Only the `<|asr|>` field (the actual spoken content to be synthesized) should use the target language.
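
The rules above can be sketched in plain Python. This is an illustrative stand-in, not the model's own `compose_prompt`: the helper names `build_prompt` and `is_valid_prompt` are hypothetical, and the tag ordering after `<|caption|>` simply follows the table and is an assumption.

```python
# Tag order follows the table above. Only "caption first" is stated in the
# README; the order of the remaining tags is an assumption.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_prompt(caption, **aspects):
    """Assemble a tagged prompt string (hypothetical helper for illustration)."""
    if not caption or not caption.strip():
        raise ValueError("caption is required")
    unknown = set(aspects) - set(TAG_ORDER[1:])
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    fields = {"caption": caption, **aspects}
    parts = []
    for tag in TAG_ORDER:
        text = fields.get(tag)
        if text and text.strip():  # omit tags with no content
            parts.append(f"<|{tag}|> {text.strip()}")
    return " ".join(parts)


def is_valid_prompt(prompt):
    """Check the one hard rule: the prompt must begin with <|caption|>."""
    return prompt.startswith("<|caption|>")


example = build_prompt(
    "A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",  # descriptive tag: English
    asr="Creo que deberíamos irnos ya.",                 # spoken content: target language
    music="",                                            # empty, so the tag is skipped
)
print(example)
```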

	## Quick Start

	### Usage 1: Aspect-wise Composition

	Pass each aspect as a named argument. The `caption` field is required; all other fields are optional.

	```python
	import torchaudio
	from transformers import AutoModel

	model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

	prompt = model.compose_prompt(
	    caption="A conversation scene on a busy city street.",
	    speech="A young woman speaking softly in Spanish.",
	    env="Rain and distant traffic noise.",
	    asr="Creo que deberíamos irnos ya.",
	)
	audio = model.generate(prompt)
	torchaudio.save("output.wav", audio.cpu(), 16000)
	```

	### Usage 2: Pre-formatted Prompt String

	Pass a complete tagged string via the `prompt` parameter. The string must start with `<|caption|>`.

	```python
	import torchaudio
	from transformers import AutoModel

	model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

	prompt = model.compose_prompt(
	    prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise."
	)
	audio = model.generate(prompt)
	torchaudio.save("output.wav", audio.cpu(), 16000)
	```

	### Batch Inference

	```python
	import torchaudio
	from transformers import AutoModel

	model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

	prompts = [
	    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
	    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
	    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
	]
	audios = model.generate(prompts)

	# Save each generated clip to its own file.
	for i, audio in enumerate(audios):
	    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
	```
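
For longer prompt lists, it may be safer to feed the model in smaller groups rather than all at once. The following is a minimal sketch of that idea; `chunked` is a hypothetical helper, the batch size is an arbitrary illustrative value, and memory behavior on any particular GPU has not been measured here.

```python
def chunked(items, batch_size):
    """Split a list into consecutive sub-lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Intended usage with the model from the example above (not executed here):
#
#   for batch in chunked(prompts, 2):
#       audios = model.generate(batch)
#
# Plain strings stand in for composed prompts in this demonstration.
demo_prompts = ["p0", "p1", "p2", "p3", "p4"]
for batch_index, batch in enumerate(chunked(demo_prompts, 2)):
    print(batch_index, batch)
```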

	### Generation Parameters

	```python
	import torchaudio
	from transformers import AutoModel

	model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

	prompt = model.compose_prompt(caption="A dog barking in a park")
	audio = model.generate(
	    prompts=prompt,
	    num_steps=25,             # number of denoising steps (default: 25)
	    guidance_scale=5.0,       # classifier-free guidance scale (default: 5.0)
	    sway_sampling_coef=-1.0,  # sway sampling coefficient (default: -1.0, 0 for linear)
	)
	torchaudio.save("output.wav", audio.cpu(), 16000)
	```
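
This README names the parameters but does not define them. To give a rough intuition, the sketch below assumes the sway sampling formula used in F5-TTS, `t = u + s * (cos(pi/2 * u) - 1 + u)`, and one common form of classifier-free guidance, `uncond + scale * (cond - uncond)`. Both are assumptions about how this model behaves; its remote code may differ. All function names are hypothetical.

```python
import math


def sway_timestep(u, coef=-1.0):
    """Map a uniform step position u in [0, 1] to a warped timestep (assumed formula)."""
    return u + coef * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(num_steps=25, coef=-1.0):
    """Timesteps for num_steps denoising steps; coef=0 gives linear spacing."""
    return [sway_timestep(i / num_steps, coef) for i in range(num_steps + 1)]


def apply_guidance(cond, uncond, guidance_scale=5.0):
    """Blend conditional and unconditional predictions (one common CFG form)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


linear = sway_schedule(num_steps=4, coef=0.0)
swayed = sway_schedule(num_steps=4, coef=-1.0)
print("linear:", [round(t, 3) for t in linear])
print("swayed:", [round(t, 3) for t in swayed])
```

Under these assumptions, a negative coefficient places more of the steps near the start of the trajectory, while a coefficient of 0 leaves the spacing linear.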

	## Acknowledgments

	Dasheng-AudioGen was developed with contributions from XIAOMI LLM PLUS and SJTU X-LANCE.

	## Citation

	```bibtex
	@article{mei2026dashengaudiogen,
	  title   = {Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text},
	  author  = {Jiahao Mei and Heinrich Dinkel and Yadong Niu and Xingwei Sun and Gang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan and Mengyue Wu},
	  journal = {arXiv preprint arXiv:2605.27838},
	  year    = {2026}
	}
	```

	## License

	This project is released under the [Apache License 2.0](LICENSE).