# Dasheng-AudioGen-Multilingual

[arXiv](https://arxiv.org/abs/2605.27838) | [Model](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | [Demo Space](https://huggingface.co/spaces/mispeech/Dasheng-AudioGen) | [Project page](https://nieeim.github.io/Dasheng-AudioGen-Web/) | [Colab notebook](https://colab.research.google.com/#fileId=https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual/resolve/main/notebook.ipynb)

[**English**](./README.md) | [**中文**](./README_zh.md)
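
**Dasheng-AudioGen-Multilingual** is the multilingual version of Dasheng-AudioGen, a unified audio generation model that synthesizes **speech, music, sound effects, and ambient sound** together from a text description.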

<p align="center">
  <video
    src="https://github.com/user-attachments/assets/497f5688-8731-4830-8ee7-b9cf4234d900"
    controls
    autoplay
    muted
    loop
    playsinline
    width="85%">
  </video>
</p>

## Models

| Model | HuggingFace | Text encoder | Language support |
|------|-------------|-----------|:--------:|
| Dasheng-AudioGen | [mispeech/Dasheng-AudioGen](https://huggingface.co/mispeech/Dasheng-AudioGen) | `google/flan-t5-large` | English |
| Dasheng-AudioGen-Multilingual | [mispeech/Dasheng-AudioGen-Multilingual](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | `google/mt5-large` | Multilingual |

### Multilingual Support

| Language | Duration (h) | Share |
|------|--------:|-----:|
| English | 15,367.80 | 58.86% |
| Spanish | 2,740.96 | 10.50% |
| Portuguese | 1,916.24 | 7.34% |
| Russian | 1,217.39 | 4.66% |
| French | 933.91 | 3.58% |
| Japanese | 874.51 | 3.35% |
| Korean | 848.15 | 3.25% |
| German | 842.29 | 3.23% |
| Other | 1,369.16 | 5.24% |

> **Note:** The current multilingual model has a noticeably higher synthesis error rate on every non-English language, and languages not listed in the table are less stable still. If you only need English generation, use the base model (`mispeech/Dasheng-AudioGen`).
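
The recommendation in the note can be written down as a small selection helper. This is only a sketch: the two repository IDs come from the model table, while `pick_checkpoint` and its language matching are hypothetical and not part of the released code.

```python
# Repository IDs are taken from the model table.
# The helper itself is hypothetical and only illustrates the recommendation.
ENGLISH_MODEL = "mispeech/Dasheng-AudioGen"
MULTILINGUAL_MODEL = "mispeech/Dasheng-AudioGen-Multilingual"


def pick_checkpoint(speech_language: str) -> str:
    """Return the repo ID suggested for the language of the spoken content."""
    normalized = speech_language.strip().lower()
    if normalized in {"en", "english"}:
        # English-only generation: the base model is the recommended choice.
        return ENGLISH_MODEL
    return MULTILINGUAL_MODEL


print(pick_checkpoint("English"))
print(pick_checkpoint("Spanish"))
```

The returned string can be passed to `AutoModel.from_pretrained` exactly as in the usage examples.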

## Installation

```bash
pip install torch torchaudio "transformers<5" einops
```

> Tested with Python 3.10, torch 2.8.0+cu128, and transformers 4.57. Known to be incompatible with transformers 5.x.
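
Because of the transformers 5.x incompatibility, it can help to check the installed version before loading the model. The following is a minimal sketch using only the standard library; `is_supported_transformers` is a hypothetical helper name, and it inspects only the major version number.

```python
from importlib.metadata import PackageNotFoundError, version


def is_supported_transformers(version_string: str) -> bool:
    """Return True when the major version is below 5 (e.g. "4.57.0")."""
    major = int(version_string.split(".")[0])
    return major < 5


try:
    installed = version("transformers")
    if is_supported_transformers(installed):
        print(f"transformers {installed} is in the supported range")
    else:
        print(f"transformers {installed} is not supported; install 'transformers<5'")
except PackageNotFoundError:
    print("transformers is not installed")
```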
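
## Prompt Format

Dasheng-AudioGen uses structured tags to describe the different dimensions of the audio. A valid prompt **must begin with the `<|caption|>` tag**, which describes the overall audio scene. The other tags are optional; use them as needed.

| Tag | Description | Required |
|------|------|:--------:|
| `<\|caption\|>` | Overall audio scene description | Yes |
| `<\|speech\|>` | Speaker identity and speaking style | No |
| `<\|asr\|>` | Speech transcript / dialogue text | No |
| `<\|sfx\|>` | Sound effects | No |
| `<\|music\|>` | Background music | No |
| `<\|env\|>` | Ambient sound | No |

**Rules:**

- The prompt must begin with `<|caption|>`; otherwise an error is raised.
- Include only tags that have actual content, and omit tags with nothing to describe (for example, do not pass `<|music|>` when there is no music).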
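
Both rules can be checked on a tagged string before it reaches the model. The sketch below is a standalone illustration: `parse_prompt` is a hypothetical helper, not part of the model's remote code, and the model's own validation may behave differently.

```python
import re

# Tag names come from the table above; everything else here is illustrative.
KNOWN_TAGS = ("caption", "speech", "asr", "sfx", "music", "env")
TAG_PATTERN = re.compile(r"<\|(\w+)\|>")


def parse_prompt(prompt: str) -> dict[str, str]:
    """Split a tagged prompt into {tag: text}, enforcing the two rules."""
    stripped = prompt.strip()
    if not stripped.startswith("<|caption|>"):
        raise ValueError("prompt must begin with <|caption|>")

    # With one capturing group, re.split returns ["", tag1, text1, tag2, text2, ...].
    parts = TAG_PATTERN.split(stripped)
    fields: dict[str, str] = {}
    for tag, text in zip(parts[1::2], parts[2::2]):
        if tag not in KNOWN_TAGS:
            raise ValueError(f"unknown tag: <|{tag}|>")
        if not text.strip():
            raise ValueError(f"<|{tag}|> has no content; omit the tag instead")
        fields[tag] = text.strip()
    return fields


print(parse_prompt("<|caption|> A dog barking in a park. <|env|> Light wind."))
```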
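
> **Multilingual prompt convention:** When using the multilingual model, write all descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) in **English**. Only the `<|asr|>` field (the speech content to be synthesized) uses the target language.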
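
As an illustration of this convention, the sketch below assembles a tagged string with English descriptions and a Japanese transcript. `build_prompt` is a hypothetical stand-in written for this example; the tag order and spacing follow the tag table and are an assumption, so the model's own `compose_prompt` may format the string differently.

```python
from typing import Optional

# Order follows the tag table; this ordering is an assumption.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_prompt(
    caption: str,
    speech: Optional[str] = None,
    asr: Optional[str] = None,
    sfx: Optional[str] = None,
    music: Optional[str] = None,
    env: Optional[str] = None,
) -> str:
    """Join the non-empty fields into one tagged string, caption first."""
    if not caption or not caption.strip():
        raise ValueError("caption is required")
    fields = {"caption": caption, "speech": speech, "asr": asr,
              "sfx": sfx, "music": music, "env": env}
    segments = [
        f"<|{tag}|> {fields[tag].strip()}"
        for tag in TAG_ORDER
        if fields[tag] and fields[tag].strip()  # empty fields are omitted
    ]
    return " ".join(segments)


prompt = build_prompt(
    caption="A woman greets a friend in a quiet park.",  # English description
    speech="A young woman speaking cheerfully in Japanese.",  # English description
    asr="こんにちは、今日はいい天気ですね。",  # target-language transcript
    env="Birdsong and a light breeze.",  # English description
)
print(prompt)
```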

## Quick Start

### Usage 1: Compose by Dimension

Pass a description for each dimension as a named argument. `caption` is required; the other fields are optional.

```python
import torchaudio
from transformers import AutoModel

# trust_remote_code is required because the model ships custom code.
model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

# Descriptive fields are in English; only `asr` is in the target language.
prompt = model.compose_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",
    env="Rain and distant traffic noise.",
    asr="Creo que deberíamos irnos ya.",
)

audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)  # saved at 16 kHz
```

### Usage 2: Pass a Full Prompt String

Pass a pre-formatted tagged string through the `prompt` argument. The string must begin with `<|caption|>`.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

# The full tagged string is passed through the `prompt` argument.
prompt = model.compose_prompt(
    prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise."
)

audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Batch Inference

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]

# Passing a list of prompts returns one audio per prompt.
audios = model.generate(prompts)
for i, audio in enumerate(audios):
    # Add a channel dimension before saving each clip.
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
```
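
When there are more prompts than fit comfortably in one call, one option is to split them into smaller batches and call `model.generate` once per batch. The helper below is a hypothetical, model-independent sketch; whether batching is needed, and which batch size works, depends on the available GPU memory and is not stated by the authors.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


example_prompts = [f"<|caption|> Example scene number {i}." for i in range(5)]
for batch in chunked(example_prompts, batch_size=2):
    # With a loaded model, each batch would go through model.generate(batch).
    print(len(batch), "prompts in this batch")
```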

### Generation Parameters

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(caption="A dog barking in a park")
audio = model.generate(
    prompts=prompt,
    num_steps=25,             # number of denoising steps (default: 25)
    guidance_scale=5.0,       # classifier-free guidance strength (default: 5.0)
    sway_sampling_coef=-1.0,  # sway sampling coefficient (default: -1.0; set to 0 for a linear schedule)
)
torchaudio.save("output.wav", audio.cpu(), 16000)
```
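
To give a feel for what `sway_sampling_coef` does, the sketch below computes a time-step schedule under the assumption that the model uses the sway sampling function introduced with F5-TTS, `t' = t + s * (cos(pi/2 * t) - 1 + t)`. This is an assumption about the implementation, not something confirmed here; it is consistent only with the documented fact that a coefficient of 0 gives a linear schedule. `sway_timesteps` is a hypothetical name.

```python
import math


def sway_timesteps(num_steps: int, sway_sampling_coef: float = -1.0) -> list[float]:
    """Return num_steps + 1 time points in [0, 1] after sway warping."""
    linear = [i / num_steps for i in range(num_steps + 1)]
    if sway_sampling_coef == 0:
        return linear  # coefficient 0 leaves the linear schedule unchanged
    return [
        t + sway_sampling_coef * (math.cos(math.pi / 2 * t) - 1 + t)
        for t in linear
    ]


linear = sway_timesteps(25, 0.0)
warped = sway_timesteps(25, -1.0)
# Under this assumed formula, a negative coefficient makes the first steps
# smaller than in the linear schedule.
print(f"first linear step: {linear[1]:.4f}, first warped step: {warped[1]:.4f}")
```

Under this formula the schedule still starts at 0 and ends at 1; only the spacing between steps changes.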

## Acknowledgements

Dasheng-AudioGen is jointly developed by **Xiaomi LLM PLUS** and **X-LANCE, Shanghai Jiao Tong University**.

## Citation

```bibtex
@article{mei2026dashengaudiogen,
  title   = {Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text},
  author  = {Jiahao Mei and Heinrich Dinkel and Yadong Niu and Xingwei Sun and Gang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan and Mengyue Wu},
  journal = {arXiv preprint arXiv:2605.27838},
  year    = {2026}
}
```

## License

This project is released under the [Apache License 2.0](LICENSE).