---
license: apache-2.0
language:
  - en
  - es
  - pt
  - ru
  - fr
  - ja
  - ko
  - de
  - multilingual
tags:
  - audio-generation
  - text-to-audio
  - text-to-speech
  - text-to-music
  - sound-effects
  - diffusion
  - multilingual
library_name: transformers
pipeline_tag: text-to-audio
---

# Dasheng-AudioGen-Multilingual

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv)](https://arxiv.org/abs/2605.27838)
[![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/xiaomi-research/dasheng-audiogen) 
[![Hugging Face Model](https://img.shields.io/badge/HuggingFace-Model-orange?logo=huggingface)](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual)
[![Hugging Face Demo](https://img.shields.io/badge/HuggingFace-Demo-orange?logo=huggingface)](https://huggingface.co/spaces/mispeech/Dasheng-AudioGen)
[![Web Demo](https://img.shields.io/badge/Website-Demo-181717?logo=google-chrome)](https://nieeim.github.io/Dasheng-AudioGen-Web/)
<!-- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual/resolve/main/notebook.ipynb) -->

[**English**](./README.md) | [**中文**](./README_zh.md)

**Dasheng-AudioGen-Multilingual** is the multilingual variant of Dasheng-AudioGen, a unified audio generation model that can jointly synthesize **intelligible speech, music, sound effects, and environmental acoustics** from text descriptions.

<p align="center">
  <video
    src="https://github.com/user-attachments/assets/497f5688-8731-4830-8ee7-b9cf4234d900"
    controls
    autoplay
    muted
    loop
    playsinline
    width="85%">
  </video>
</p>

## Models

| Model | HuggingFace | Text Encoder | Language |
|-------|-------------|-------------|:--------:|
| Dasheng-AudioGen | [mispeech/Dasheng-AudioGen](https://huggingface.co/mispeech/Dasheng-AudioGen) | `google/flan-t5-large` | English |
| Dasheng-AudioGen-Multilingual | [mispeech/Dasheng-AudioGen-Multilingual](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | `google/mt5-large` | Multilingual |

### Language Support

| Language | Duration (h) | Proportion |
|----------|------------:|----------:|
| English | 15,367.80 | 58.86% |
| Spanish | 2,740.96 | 10.50% |
| Portuguese | 1,916.24 | 7.34% |
| Russian | 1,217.39 | 4.66% |
| French | 933.91 | 3.58% |
| Japanese | 874.51 | 3.35% |
| Korean | 848.15 | 3.25% |
| German | 842.29 | 3.23% |
| Other | 1,369.16 | 5.24% |

> **Note:** The current multilingual model has notably higher synthesis error rates for every non-English language than for English, and languages not listed in the table above are even less reliable. For English-only use cases, the base model (`mispeech/Dasheng-AudioGen`) is recommended.
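
The proportions in the table can be reproduced from the durations alone. The following is a small sketch that recomputes them; the dictionary simply restates the table, and the variable names are illustrative rather than part of the model's API.

```python
# Durations in hours, copied from the Language Support table.
hours = {
    "English": 15367.80,
    "Spanish": 2740.96,
    "Portuguese": 1916.24,
    "Russian": 1217.39,
    "French": 933.91,
    "Japanese": 874.51,
    "Korean": 848.15,
    "German": 842.29,
    "Other": 1369.16,
}

# Total duration across all listed languages.
total = sum(hours.values())

# Share of each language as a percentage, rounded to two decimals.
proportions = {lang: round(100 * h / total, 2) for lang, h in hours.items()}

for lang, pct in proportions.items():
    print(f"{lang:<12}{hours[lang]:>12,.2f} h{pct:>8.2f}%")
print(f"{'Total':<12}{total:>12,.2f} h")
```

The total comes to about 26,110 hours, of which English alone accounts for well over half.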

## Installation

```bash
pip install torch torchaudio "transformers<5" einops
```

> Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x.
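
Because the model is not compatible with transformers 5.x, it may help to check the installed version before loading it. The sketch below is one possible check using only the standard library; `is_supported_transformers` is a hypothetical helper, not something shipped with the model.

```python
from importlib import metadata


def is_supported_transformers(version: str) -> bool:
    """Return True when the major version is below 5, matching the pin above."""
    major = int(version.split(".")[0])
    return major < 5


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if is_supported_transformers(installed) else "unsupported (needs <5)"
        print(f"transformers {installed}: {status}")
```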

## Prompt Format

Dasheng-AudioGen uses structured tags to describe different audio aspects. A valid prompt **must start with the `<|caption|>` tag**, which provides the overall scene description. Other tags are optional and can be included as needed.

| Tag | Description | Required |
|-----|-------------|:--------:|
| `<\|caption\|>` | Overall audio scene description | Yes |
| `<\|speech\|>` | Speaker identity and speaking style | No |
| `<\|asr\|>` | Spoken transcript / dialogue | No |
| `<\|sfx\|>` | Sound effects | No |
| `<\|music\|>` | Background music | No |
| `<\|env\|>` | Environmental ambience | No |

**Rules:**
- The prompt must begin with `<|caption|>` — prompts without it will be rejected.
- Only include tags that are relevant; omit tags with no content (e.g., skip `<|music|>` if there is no music).

> **Multilingual prompt convention:** All descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) should be written in **English**. Only the `<|asr|>` field (the actual spoken content to be synthesized) should use the target language.
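
To make the format concrete, here is a minimal sketch of how a tagged prompt string could be assembled and checked by hand. `build_prompt` and `is_valid_prompt` are hypothetical helpers written for illustration; in practice the model's own `compose_prompt` (shown under Quick Start) should be used. The tag order follows the table above and the single-space separator follows the pre-formatted example string in this README; whether the model requires this exact order is an assumption.

```python
# Tag order taken from the table above (assumed, not confirmed, to be canonical).
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_prompt(caption: str, **aspects: str) -> str:
    """Join the non-empty aspects into one tagged string, caption first."""
    if not caption or not caption.strip():
        raise ValueError("caption is required")
    unknown = set(aspects) - set(TAG_ORDER)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    fields = {"caption": caption, **aspects}
    parts = []
    for tag in TAG_ORDER:
        text = (fields.get(tag) or "").strip()
        if text:  # omit tags with no content
            parts.append(f"<|{tag}|> {text}")
    return " ".join(parts)


def is_valid_prompt(prompt: str) -> bool:
    """A prompt is only accepted if it begins with the caption tag."""
    return prompt.lstrip().startswith("<|caption|>")


# Descriptive tags in English, only the asr field in the target language.
example = build_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",
    env="Rain and distant traffic noise.",
    asr="Creo que deberíamos irnos ya.",
)
print(example)
```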

## Quick Start

### Usage 1: Aspect-wise Composition

Pass each aspect as a named argument. The `caption` field is required; all other fields are optional.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",
    env="Rain and distant traffic noise.",
    asr="Creo que deberíamos irnos ya.",
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Usage 2: Pre-formatted Prompt String

Pass a complete tagged string to `compose_prompt` via its `prompt` parameter. The string must start with `<|caption|>`.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise."
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Batch Inference

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]
audios = model.generate(prompts)

for i, audio in enumerate(audios):
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
```

### Generation Parameters

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(caption="A dog barking in a park")
audio = model.generate(
    prompts=prompt,
    num_steps=25,              # number of denoising steps (default: 25)
    guidance_scale=5.0,        # classifier-free guidance scale (default: 5.0)
    sway_sampling_coef=-1.0,   # sway sampling coefficient (default: -1.0, 0 for linear)
)
torchaudio.save("output.wav", audio.cpu(), 16000)
```
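
This README does not define `sway_sampling_coef` beyond noting that 0 gives a linear schedule. As a rough illustration, the sketch below assumes the model follows the sway sampling formulation introduced in F5-TTS, where a uniform grid `u` on [0, 1] is warped to `t = u + s * (cos(pi/2 * u) - 1 + u)`. Both that assumption and the helper name `sway_timesteps` are ours; the model's actual implementation may differ.

```python
import math


def sway_timesteps(num_steps: int, sway_sampling_coef: float = -1.0) -> list[float]:
    """Warp a uniform grid on [0, 1] with t = u + s * (cos(pi/2 * u) - 1 + u)."""
    uniform = [i / num_steps for i in range(num_steps + 1)]
    return [
        u + sway_sampling_coef * (math.cos(math.pi / 2 * u) - 1 + u)
        for u in uniform
    ]


linear = sway_timesteps(4, sway_sampling_coef=0.0)   # coefficient 0 keeps the grid uniform
swayed = sway_timesteps(4, sway_sampling_coef=-1.0)  # default coefficient

print("linear:", [round(t, 3) for t in linear])
print("swayed:", [round(t, 3) for t in swayed])
```

Under this assumption, a coefficient of 0 leaves the grid unchanged, while the default of -1.0 reduces to `t = 1 - cos(pi/2 * u)`, which places more of the steps near the start of the schedule.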

## Acknowledgments

Dasheng-AudioGen was developed with contributions from **XIAOMI LLM PLUS** and **SJTU X-LANCE**.

## Citation

```bibtex
@article{mei2026dashengaudiogen,
  title   = {Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text},
  author  = {Jiahao Mei and Heinrich Dinkel and Yadong Niu and Xingwei Sun and Gang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan and Mengyue Wu},
  journal = {arXiv preprint arXiv:2605.27838},
  year    = {2026}
}
```

## License

This project is released under the [Apache License 2.0](LICENSE).