---
license: apache-2.0
language:
  - en
  - es
  - pt
  - ru
  - fr
  - ja
  - ko
  - de
  - multilingual
tags:
  - audio-generation
  - text-to-audio
  - text-to-speech
  - text-to-music
  - sound-effects
  - diffusion
  - multilingual
library_name: transformers
pipeline_tag: text-to-audio
---

# Dasheng-AudioGen-Multilingual

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv)](https://arxiv.org/abs/2605.27838)
[![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/xiaomi-research/dasheng-audiogen) 
[![Hugging Face Model](https://img.shields.io/badge/HuggingFace-Model-orange?logo=huggingface)](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual)
[![Hugging Face Demo](https://img.shields.io/badge/HuggingFace-Demo-orange?logo=huggingface)](https://huggingface.co/spaces/mispeech/Dasheng-AudioGen)
[![Web Demo](https://img.shields.io/badge/Website-Demo-181717?logo=google-chrome)](https://nieeim.github.io/Dasheng-AudioGen-Web/)
<!-- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual/resolve/main/notebook.ipynb) -->

[**English**](./README.md) | [**中文**](./README_zh.md)

**Dasheng-AudioGen-Multilingual** is the multilingual variant of Dasheng-AudioGen, a unified audio generation model that can jointly synthesize **intelligible speech, music, sound effects, and environmental acoustics** from text descriptions.

<p align="center">
  <video
    src="https://github.com/user-attachments/assets/497f5688-8731-4830-8ee7-b9cf4234d900"
    controls
    autoplay
    muted
    loop
    playsinline
    width="85%">
  </video>
</p>

## Models

| Model | HuggingFace | Text Encoder | Language |
|-------|-------------|-------------|:--------:|
| Dasheng-AudioGen | [mispeech/Dasheng-AudioGen](https://huggingface.co/mispeech/Dasheng-AudioGen) | `google/flan-t5-large` | English |
| Dasheng-AudioGen-Multilingual | [mispeech/Dasheng-AudioGen-Multilingual](https://huggingface.co/mispeech/Dasheng-AudioGen-Multilingual) | `google/mt5-large` | Multilingual |

### Language Support

| Language | Duration (h) | Proportion |
|----------|------------:|----------:|
| English | 15,367.80 | 58.86% |
| Spanish | 2,740.96 | 10.50% |
| Portuguese | 1,916.24 | 7.34% |
| Russian | 1,217.39 | 4.66% |
| French | 933.91 | 3.58% |
| Japanese | 874.51 | 3.35% |
| Korean | 848.15 | 3.25% |
| German | 842.29 | 3.23% |
| Other | 1,369.16 | 5.24% |

> **Note:** The current multilingual model has notably higher synthesis error rates for every non-English language than for English, and languages not listed in the table above are less reliable still. For English-only use cases, we recommend the base model (`mispeech/Dasheng-AudioGen`).
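
The proportions in the table follow directly from the durations. A minimal sketch that recomputes them is shown here; the figures are copied from the table, and the variable names are illustrative only.

```python
# Duration per language in hours, copied from the Language Support table.
hours = {
    "English": 15367.80,
    "Spanish": 2740.96,
    "Portuguese": 1916.24,
    "Russian": 1217.39,
    "French": 933.91,
    "Japanese": 874.51,
    "Korean": 848.15,
    "German": 842.29,
    "Other": 1369.16,
}

total = sum(hours.values())

# Share of each language as a percentage of the total duration.
proportion = {lang: round(100 * h / total, 2) for lang, h in hours.items()}

for lang, pct in proportion.items():
    print(f"{lang:<10} {hours[lang]:>10,.2f} h  {pct:5.2f}%")
```

English alone accounts for well over half of the total, which is consistent with the reliability gap described in the note.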

## Installation

```bash
pip install torch torchaudio "transformers<5" einops
```

> Tested with Python 3.10, torch 2.8.0+cu128, transformers 4.57. Not compatible with transformers 5.x.

## Prompt Format

Dasheng-AudioGen uses structured tags to describe different audio aspects. A valid prompt **must start with the `<|caption|>` tag**, which provides the overall scene description. Other tags are optional and can be included as needed.

| Tag | Description | Required |
|-----|-------------|:--------:|
| `<\|caption\|>` | Overall audio scene description | Yes |
| `<\|speech\|>` | Speaker identity and speaking style | No |
| `<\|asr\|>` | Spoken transcript / dialogue | No |
| `<\|sfx\|>` | Sound effects | No |
| `<\|music\|>` | Background music | No |
| `<\|env\|>` | Environmental ambience | No |

**Rules:**
- The prompt must begin with `<|caption|>`; prompts without it will be rejected.
- Only include tags that are relevant; omit tags with no content (e.g., skip `<|music|>` if there is no music).

> **Multilingual prompt convention:** All descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) should be written in **English**. Only the `<|asr|>` field (the actual spoken content to be synthesized) should use the target language.
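
The tag rules above can be sketched in plain Python, without loading the model. This is an illustration only: `build_prompt` and `is_valid_prompt` are hypothetical helpers, not part of the model's API, and the output spacing simply mirrors the pre-formatted string shown in this README. Emitting tags in the table's order is also an assumption; the rules only require that `<|caption|>` comes first.

```python
# Tag names taken from the table above; the helpers themselves are hypothetical.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_prompt(caption: str, **aspects: str) -> str:
    """Assemble a tagged prompt string; `caption` is mandatory."""
    if not caption or not caption.strip():
        raise ValueError("caption is required")
    unknown = set(aspects) - set(TAG_ORDER[1:])
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    fields = {"caption": caption, **aspects}
    # Emit only the tags that have content, so empty aspects are omitted.
    parts = [
        f"<|{tag}|> {fields[tag].strip()}"
        for tag in TAG_ORDER
        if fields.get(tag) and fields[tag].strip()
    ]
    return " ".join(parts)


def is_valid_prompt(prompt: str) -> bool:
    """Check the one hard rule: the prompt starts with <|caption|>."""
    return prompt.startswith("<|caption|>")


example = build_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",  # descriptive tags in English
    asr="Creo que deberíamos irnos ya.",                 # transcript in the target language
    env="Rain and distant traffic noise.",
    music="",                                            # empty aspects are dropped
)
print(example)
print(is_valid_prompt(example))
```

In practice the model's own `compose_prompt` method, shown in the Quick Start section, should be preferred; the sketch only makes the format explicit.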

## Quick Start

### Usage 1: Aspect-wise Composition

Pass each aspect as a named argument. The `caption` field is required; all other fields are optional.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    caption="A conversation scene on a busy city street.",
    speech="A young woman speaking softly in Spanish.",
    env="Rain and distant traffic noise.",
    asr="Creo que deberíamos irnos ya.",
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Usage 2: Pre-formatted Prompt String

Pass a complete tagged string via the `prompt` parameter of `compose_prompt`. The string must start with `<|caption|>`.

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(
    prompt="<|caption|> A conversation scene on a busy city street. <|speech|> A young woman speaking softly in Spanish. <|asr|> Creo que deberíamos irnos ya. <|env|> Rain and distant traffic noise."
)
audio = model.generate(prompt)
torchaudio.save("output.wav", audio.cpu(), 16000)
```

### Batch Inference

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompts = [
    model.compose_prompt(caption="A cat meowing softly.", sfx="Soft cat meow."),
    model.compose_prompt(caption="Thunder rolling in the distance.", env="Stormy night ambience."),
    model.compose_prompt(caption="A piano playing a gentle melody.", music="Soft piano ballad."),
]
audios = model.generate(prompts)

for i, audio in enumerate(audios):
    torchaudio.save(f"output_{i}.wav", audio.unsqueeze(0).cpu(), 16000)
```

### Generation Parameters

```python
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("mispeech/Dasheng-AudioGen-Multilingual", trust_remote_code=True).cuda()

prompt = model.compose_prompt(caption="A dog barking in a park")
audio = model.generate(
    prompts=prompt,
    num_steps=25,              # number of denoising steps (default: 25)
    guidance_scale=5.0,        # classifier-free guidance scale (default: 5.0)
    sway_sampling_coef=-1.0,   # sway sampling coefficient (default: -1.0, 0 for linear)
)
torchaudio.save("output.wav", audio.cpu(), 16000)
```
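
For intuition about `sway_sampling_coef`, the sketch below computes a warped time schedule. This README only names the parameter, so the formula is an assumption: it is the sway sampling warp as published for F5-TTS, and the model's actual implementation may differ. The function name and the grid of `num_steps + 1` points are likewise illustrative.

```python
import math


def sway_schedule(num_steps: int, sway_sampling_coef: float) -> list[float]:
    """Return num_steps + 1 flow times in [0, 1], warped by sway sampling."""
    times = []
    for i in range(num_steps + 1):
        u = i / num_steps  # uniform (linear) time grid
        # Assumed F5-TTS form: t = u + s * (cos(pi/2 * u) - 1 + u)
        t = u + sway_sampling_coef * (math.cos(math.pi / 2 * u) - 1 + u)
        times.append(t)
    return times


linear = sway_schedule(25, 0.0)   # coefficient 0 leaves the grid uniform
swayed = sway_schedule(25, -1.0)  # negative values place more steps near t = 0

for u, t in list(zip(linear, swayed))[:4]:
    print(f"uniform {u:.2f} -> swayed {t:.4f}")
```

Under this assumption, a coefficient of `0` reproduces the linear schedule mentioned in the comment above, while the default of `-1.0` concentrates denoising steps at the start of the trajectory.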

## Acknowledgments

Dasheng-AudioGen was developed with contributions from **XIAOMI LLM PLUS** and **SJTU X-LANCE**.

## Citation

```bibtex
@article{mei2026dashengaudiogen,
  title   = {Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text},
  author  = {Jiahao Mei and Heinrich Dinkel and Yadong Niu and Xingwei Sun and Gang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan and Mengyue Wu},
  journal = {arXiv preprint arXiv:2605.27838},
  year    = {2026}
}
```

## License

This project is released under the [Apache License 2.0](LICENSE).