---
license: apache-2.0
license_name: apache-2.0-non-commercial
license_link: https://github.com/lizhaoqing/UNISON/blob/main/LICENSE
language:
- en
- zh
tags:
- audio
- text-to-audio
- text-to-speech
- zero-shot-tts
- audio-editing
- speech-editing
- flow-matching
- diffusion
- mm-dit
- llm-fusion
library_name: custom
pipeline_tag: text-to-audio
arxiv: 2605.31530
---

# UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

<p align="center">
<a href="https://arxiv.org/abs/2605.31530"><img src="https://img.shields.io/badge/arXiv-Paper-B31B1B.svg" alt="arXiv Paper"></a>
<a href="https://github.com/lizhaoqing/UNISON"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=GitHub&style=flat-square" alt="GitHub Code"></a>
<a href="https://lizhaoqing.github.io/UNISON-demo/"><img src="https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=GitHub&style=flat-square" alt="Demo Page"></a>
<a href="https://huggingface.co/jac22/UNISON"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-FFD21E" alt="Hugging Face Model"></a>
<a href="https://github.com/lizhaoqing/UNISON/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0_NC-4285F4" alt="License"></a>
</p>

---

UNISON is a unified latent flow-matching framework for audio and speech generation and editing.
A **single set of weights** covers text-to-audio, text-to-speech, zero-shot speaker cloning,
mixed speech-and-sound scene generation, and audio/speech-in-scene editing, with one architecture
and one forward pass shared by every task.

![]()

---

## Model variants in this repository

This repository hosts **two checkpoint variants**:

| Directory | VAE | DiT depth | Channels | Config |
|-----------|-----|-----------|----------|--------|
| `unison_D20S0_O_40ch/` | MMAudio **44 kHz** | 20 double + 0 single | 40 | `D20S0_O_40ch.yaml` |
| `unison_D24S0_O_20ch/` | MMAudio **16 kHz** | 24 double + 0 single | 20 | `D24S0_O_20ch.yaml` |

Both variants share the same Qwen2.5-Omni-7B text encoder and the same inference pipeline.

---

## Supported tasks

| Task | Prompt format |
|------|--------------|
| Text-to-Audio (T2A) | `[Audio] {caption}` |
| Text-to-Speech (TTS) | `[Speech] A {female/male} voice saying "{text}"` |
| Mixed Speech + Sound | `[Speech] A {gender} voice saying "{text}" [Audio] {background}` |
| Zero-shot Speaker Cloning | `[Speech with voice] {ref_text}, {target_text}` |
| Audio Scene Editing (add / remove / replace / denoise) | `[Edit] [Audio] {instruction}` |
| Speech-in-Scene Editing (content / insert / delete) | `[Edit] [Speech] {instruction}` |
| Timed Temporal Composition | `[Audio] From {t1}s to {t2}s, {event1}. From {t2}s to {t3}s, {event2}. ...` |

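The prompt templates in the table above can be assembled programmatically. The following is a minimal sketch, not part of the UNISON codebase: the helper names (`t2a_prompt`, `tts_prompt`, and so on) are hypothetical, and only the string formats are taken from the table.

```python
# Hypothetical helpers that reproduce the prompt formats listed in the table.
# Only the string templates come from the model card; the function names do not.

def t2a_prompt(caption: str) -> str:
    """Text-to-audio: `[Audio] {caption}`."""
    return f"[Audio] {caption}"


def tts_prompt(text: str, gender: str = "female") -> str:
    """Text-to-speech: `[Speech] A {female/male} voice saying "{text}"`."""
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    return f'[Speech] A {gender} voice saying "{text}"'


def mixed_prompt(text: str, gender: str, background: str) -> str:
    """Mixed scene: the speech prompt followed by the audio prompt."""
    return f"{tts_prompt(text, gender)} {t2a_prompt(background)}"


def clone_prompt(ref_text: str, target_text: str) -> str:
    """Zero-shot speaker cloning: reference transcript, then target text."""
    return f"[Speech with voice] {ref_text}, {target_text}"


def edit_prompt(instruction: str, domain: str = "Audio") -> str:
    """Editing: `[Edit] [Audio] ...` or `[Edit] [Speech] ...`."""
    if domain not in ("Audio", "Speech"):
        raise ValueError("domain must be 'Audio' or 'Speech'")
    return f"[Edit] [{domain}] {instruction}"


def timed_prompt(events) -> str:
    """Timed composition from (start_s, end_s, description) tuples."""
    parts = [f"From {s:g}s to {e:g}s, {d}." for s, e, d in events]
    return "[Audio] " + " ".join(parts)


print(timed_prompt([(0, 5, "a dog barks"), (5, 10, "a car passes")]))
```

Keeping the templates in one place makes it harder to mistype a task tag such as `[Speech with voice]` when scripting many prompts.
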
Task identity is encoded via a **mask channel**, and source or reference audio is injected through
**VAE-encoded channel concatenation**, so no separate encoders or task-specific heads are needed.

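To make the idea of channel concatenation concrete, here is a rough NumPy sketch. It is an illustration only: the function name `build_dit_input`, the channel ordering, the zero-filled source for pure generation, and the mask convention (1 = frames to generate) are all assumptions, not details confirmed by the model card.

```python
import numpy as np


def build_dit_input(noisy_latent, source_latent=None, mask=None):
    """Stack [noisy latent | source latent | mask] along the channel axis.

    Latents are shaped (channels, frames); mask is shaped (frames,).
    ASSUMPTION: layout and mask semantics are illustrative, not UNISON's.
    """
    channels, frames = noisy_latent.shape
    if source_latent is None:
        # Pure generation: no source or reference audio is available.
        source_latent = np.zeros_like(noisy_latent)
    if mask is None:
        # Generate every frame.
        mask = np.ones(frames, dtype=noisy_latent.dtype)
    return np.concatenate([noisy_latent, source_latent, mask[None, :]], axis=0)


# 40 latent channels, as in the 44 kHz variant's "Channels" column.
x = build_dit_input(np.random.randn(40, 100))
print(x.shape)  # 40 noisy + 40 source + 1 mask channels
```

Under these assumptions, switching between generation and editing only changes what is placed in the source and mask channels, which is consistent with the claim that no task-specific heads are needed.
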
---

## Architecture

All tasks share the same VAE encoder/decoder, MM-DiT backbone, and forward pass.
Text conditioning uses **layer-wise deep LLM fusion**: hidden states from uniformly sampled layers
of the frozen Qwen2.5-Omni-7B backbone are injected into corresponding MM-DiT double-stream blocks
via learned linear projections.

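One plausible reading of "uniformly sampled layers" is an evenly spaced mapping from LLM layers to DiT blocks. The sketch below shows that mapping; the function name `sample_llm_layers` and the layer count of 28 are placeholders chosen for illustration (the real count should be read from the Qwen2.5-Omni-7B configuration), and the actual sampling rule used by UNISON may differ.

```python
def sample_llm_layers(num_llm_layers: int, num_dit_blocks: int) -> list:
    """Pick one LLM layer index per DiT block, evenly spaced from first to last.

    ASSUMPTION: endpoints are included and indices are rounded to the nearest
    layer; the released code may use a different rule.
    """
    if num_llm_layers < 1 or num_dit_blocks < 1:
        raise ValueError("both counts must be positive")
    if num_dit_blocks == 1:
        return [num_llm_layers - 1]
    step = (num_llm_layers - 1) / (num_dit_blocks - 1)
    return [round(i * step) for i in range(num_dit_blocks)]


# 20 double-stream blocks (D20S0 variant) fed from a hypothetical 28-layer LLM.
layer_ids = sample_llm_layers(num_llm_layers=28, num_dit_blocks=20)
print(layer_ids)
```

Each selected hidden state would then pass through its own learned linear projection before entering the matching double-stream block.
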
|  |
|
|
| --- |
|
|
| ## Quick start |
|
|
| ### 1. Clone repo and install dependencies |
|
|
| ```bash |
| git clone https://github.com/lizhaoqing/UNISON |
| cd UNISON |
| pip install -r requirements.txt |
| ``` |
|
|
| `flash-attn` is optional but strongly recommended (automatic fallback to PyTorch SDPA): |
|
|
| ```bash |
| pip install flash-attn --no-build-isolation |
| ``` |
|
|
### 2. MMAudio VAE weights

Download the weights from the [MMAudio release](https://github.com/hkchengrex/MMAudio) and place them at:

```
unison/models/mmaudio/data/ext_weights/
  v1-44.pth        # 44 kHz VAE (for D20S0 / 44k variant)
  v1-16.pth        # 16 kHz VAE (for D24S0 / 16k variant)
  best_netG.pt     # BigVGAN vocoder (16 kHz VAE only)
```

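Before running inference it can help to confirm that the weights for the chosen variant are in place. The check below is a small sketch: the function `missing_vae_weights` and the variant keys `"44k"` and `"16k"` are hypothetical, while the file names and directory come from the listing above.

```python
from pathlib import Path

# File names taken from the directory listing; the variant keys are illustrative.
REQUIRED_WEIGHTS = {
    "44k": ["v1-44.pth"],
    "16k": ["v1-16.pth", "best_netG.pt"],  # the 16 kHz VAE also needs the vocoder
}


def missing_vae_weights(variant, weights_dir="unison/models/mmaudio/data/ext_weights"):
    """Return the required weight files that are not present for a variant."""
    root = Path(weights_dir)
    return [name for name in REQUIRED_WEIGHTS[variant] if not (root / name).is_file()]


missing = missing_vae_weights("16k")
if missing:
    print("Missing MMAudio weights:", ", ".join(missing))
```
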
### 3. Qwen2.5-Omni-7B

Set `QWEN_OMNI_MODEL_PATH` to either the Hugging Face model ID or a local directory:

```bash
export QWEN_OMNI_MODEL_PATH=Qwen/Qwen2.5-Omni-7B
# or point to a local download instead:
# export QWEN_OMNI_MODEL_PATH=/path/to/Qwen2.5-Omni-7B
```

### 4. Download checkpoints (this repo)

```bash
hf download jac22/UNISON --local-dir checkpoints
```

This produces:

```
checkpoints/
  unison_D20S0_O_40ch/model.safetensors   # 44 kHz
  unison_D24S0_O_20ch/model.safetensors   # 16 kHz
```

### 5. Run inference

```bash
cd UNISON  # skip if you are already in the repository root

# 44 kHz variant (D20S0)
bash scripts/infer.sh \
    --checkpoint_dir checkpoints/unison_D20S0_O_40ch \
    --model_config unison/config/D20S0_O_40ch.yaml \
    --vae_config unison/models/mmaudio/vae_config_44k.yaml \
    --task_mode all

# 16 kHz variant (D24S0)
bash scripts/infer.sh \
    --checkpoint_dir checkpoints/unison_D24S0_O_20ch \
    --model_config unison/config/D24S0_O_20ch.yaml \
    --vae_config unison/models/mmaudio/vae_config_16k.yaml \
    --task_mode all
```

Outputs are written to `<checkpoint_dir>/infer_<N>steps/<ckpt_name>/`.

### Single-prompt example

```bash
python unison/pipelines/infer.py \
    --model_ckpt checkpoints/unison_D20S0_O_40ch \
    --model_config unison/config/D20S0_O_40ch.yaml \
    --vae_config unison/models/mmaudio/vae_config_44k.yaml \
    --omni_model_path "$QWEN_OMNI_MODEL_PATH" \
    --task_mode generation \
    --gen_prompt "[Audio] Rain falling on a tin roof with distant thunder" \
    --gen_duration 10.0 \
    --output_dir outputs/demo
```

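When several prompts need to be rendered, the same command can be assembled from Python. This is a sketch under assumptions: `build_infer_command` is a hypothetical helper, and it uses only the flags shown in the single-prompt example; it has not been checked against the script's full argument list.

```python
import os


def build_infer_command(prompt, output_dir, duration=10.0, omni_model_path=None):
    """Return the argv list for one 44 kHz generation run (flags from the example)."""
    if omni_model_path is None:
        # Fall back to the environment variable set in step 3.
        omni_model_path = os.environ.get("QWEN_OMNI_MODEL_PATH", "Qwen/Qwen2.5-Omni-7B")
    return [
        "python", "unison/pipelines/infer.py",
        "--model_ckpt", "checkpoints/unison_D20S0_O_40ch",
        "--model_config", "unison/config/D20S0_O_40ch.yaml",
        "--vae_config", "unison/models/mmaudio/vae_config_44k.yaml",
        "--omni_model_path", omni_model_path,
        "--task_mode", "generation",
        "--gen_prompt", prompt,
        "--gen_duration", str(duration),
        "--output_dir", output_dir,
    ]


prompts = ["[Audio] Rain falling on a tin roof", "[Audio] A dog barking in a park"]
for i, prompt in enumerate(prompts):
    cmd = build_infer_command(prompt, f"outputs/demo_{i}")
    print(" ".join(cmd))
    # To actually launch it from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain quotation marks, such as the TTS format.
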
---

## Key inference parameters

| Argument | Default | Description |
|----------|---------|-------------|
| `--num_inference_steps` | 100 | Number of ODE solver steps (50 for faster sampling, 100 for paper-quality results) |
| `--guidance_scale` | 4.5 | Classifier-free guidance scale |
| `--seed` | 42 | Random seed |
| `--gen_duration` | 10.0 | Output length in seconds (generation tasks) |
| `--ref_duration` | 3.0 | Reference clip length in seconds (zero-shot TTS) |

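For intuition about how `--num_inference_steps` and `--guidance_scale` typically interact in a flow-matching sampler, here is a generic sketch. It is not UNISON's sampler: the fixed-step Euler solver, the time convention (noise at t = 0, data at t = 1), and the toy velocity field are assumptions made for illustration.

```python
import numpy as np


def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Standard classifier-free guidance: extrapolate from unconditional to conditional."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


# Toy check: a constant velocity field that points from the start to a target.
rng = np.random.default_rng(42)
noise = rng.standard_normal(8)
target = np.linspace(-1.0, 1.0, 8)


def toy_velocity(x, t):
    v_cond = target - noise          # stands in for the text-conditioned prediction
    v_uncond = np.zeros_like(noise)  # stands in for the unconditional prediction
    return cfg_velocity(v_cond, v_uncond, guidance_scale=1.0)


sample = euler_sample(toy_velocity, noise, num_steps=100)
print(np.allclose(sample, target))
```

In a real model the velocity field is not constant, so more steps reduce discretization error at the cost of more forward passes, and each guided step usually needs both a conditional and an unconditional prediction.
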
---

## Checkpoint format

Each checkpoint is a single `model.safetensors` file containing weights already unwrapped from EMA.
The inference pipeline also accepts:

- A **directory**, which is searched for `ema_model.pt` → `model.safetensors` → `pytorch_model.bin`, in that order
- A **direct file path** to any of the three formats

EMA wrappers are unwrapped automatically at load time.

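The lookup described above can be sketched as follows. This is an illustration rather than the pipeline's own loader: `resolve_checkpoint` is a hypothetical name, and treating the arrows as a strict priority order is an assumption.

```python
from pathlib import Path

# Candidate file names in the order listed in the model card.
CANDIDATES = ("ema_model.pt", "model.safetensors", "pytorch_model.bin")


def resolve_checkpoint(path):
    """Return the checkpoint file for a direct file path or a checkpoint directory."""
    p = Path(path)
    if p.is_file():
        return p  # a direct file path is used as given
    if p.is_dir():
        for name in CANDIDATES:
            candidate = p / name
            if candidate.is_file():
                return candidate  # first match wins
    raise FileNotFoundError(f"no checkpoint found at {p}")


try:
    print(resolve_checkpoint("checkpoints/unison_D20S0_O_40ch"))
except FileNotFoundError as err:
    print(err)
```
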
---

## License

This project is released under the **Apache 2.0 License** with additional non-commercial use
restrictions inherited from upstream dependencies:

- The backbone architecture derives from [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo/blob/main/LICENSE)
  (Tencent), which prohibits commercial use without a separate license.
- Text/audio conditioning uses [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B/blob/main/LICENSE)
  (Alibaba Cloud), subject to its own license terms.

**This model is intended for research and non-commercial use only.**

---

## Citation

```bibtex
@article{li2026unison,
  title   = {UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion},
  author  = {Li, Zhaoqing and Xu, Haoning and Su, Jingran and Liu, Yaofang and Rao, Zhefan and
             Wang, Huimeng and Deng, Jiajun and Wang, Tianzi and Jin, Zengrui and Liu, Rui and
             Che, Haoxuan and Liu, Xunying},
  journal = {arXiv preprint arXiv:2605.31530},
  year    = {2026}
}
```

---

## Acknowledgements

We thank the authors of the following works for their excellent open-source contributions:

- [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5): MM-DiT backbone architecture
- [MMAudio](https://github.com/hkchengrex/MMAudio): audio VAE and feature utilities
- [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B): text/audio LLM used for deep conditioning
- [Ovi](https://github.com/character-ai/Ovi) (Character.AI): inspiring cross-modal fusion design for joint audio-video generation