Upload folder using huggingface_hub
- .gitattributes +2 -0
- README.md +230 -0
- fig1.png +3 -0
- fig2.png +3 -0
- unison_D20S0_O_40ch/model.safetensors +3 -0
- unison_D24S0_O_20ch/model.safetensors +3 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
fig1.png filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
fig2.png filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -1,3 +1,233 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
---
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
+
license_name: apache-2.0-non-commercial
|
| 4 |
+
license_link: https://github.com/lizhaoqing/UNISON/blob/main/LICENSE
|
| 5 |
+
language:
|
| 6 |
+
- en
|
| 7 |
+
- zh
|
| 8 |
+
tags:
|
| 9 |
+
- audio
|
| 10 |
+
- text-to-audio
|
| 11 |
+
- text-to-speech
|
| 12 |
+
- zero-shot-tts
|
| 13 |
+
- audio-editing
|
| 14 |
+
- speech-editing
|
| 15 |
+
- flow-matching
|
| 16 |
+
- diffusion
|
| 17 |
+
- mm-dit
|
| 18 |
+
- llm-fusion
|
| 19 |
+
library_name: custom
|
| 20 |
+
pipeline_tag: text-to-audio
|
| 21 |
+
arxiv: 2605.31530
|
| 22 |
---
|
| 23 |
+
|
| 24 |
+
# UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
|
| 25 |
+
|
| 26 |
+
**Paper:** [arXiv:2605.31530](https://arxiv.org/abs/2605.31530) |
|
| 27 |
+
**Code:** [github.com/lizhaoqing/UNISON](https://github.com/lizhaoqing/UNISON) |
|
| 28 |
+
**Demo:** [Project Page](https://yourusername.github.io/unison)
|
| 29 |
+
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
UNISON is a unified latent flow-matching framework for audio and speech generation and editing.
|
| 37 |
+
Using a **single set of weights**, it integrates text-to-audio, text-to-speech, zero-shot speaker cloning,
|
| 38 |
+
mixed speech-and-sound scene generation, and audio/speech-in-scene editing — all in one model, one architecture, one forward pass.
|
| 39 |
+
|
| 40 |
+

|
| 41 |
+
|
| 42 |
+
---
|
| 43 |
+
|
| 44 |
+
## Model variants in this repository
|
| 45 |
+
|
| 46 |
+
This repository hosts **two checkpoint variants**:
|
| 47 |
+
|
| 48 |
+
| Directory | VAE | DiT depth | Channels | Config |
|
| 49 |
+
|-----------|-----|-----------|----------|--------|
|
| 50 |
+
| `unison_D20S0_O_40ch/` | MMAudio **44 kHz** | 20 double + 0 single | 40 | `D20S0_O_40ch.yaml` |
|
| 51 |
+
| `unison_D24S0_O_20ch/` | MMAudio **16 kHz** | 24 double + 0 single | 20 | `D24S0_O_20ch.yaml` |
|
| 52 |
+
|
| 53 |
+
Both variants share the same Qwen2.5-Omni-7B text encoder and the same inference pipeline.
|
| 54 |
+
|
| 55 |
+
---
|
| 56 |
+
|
| 57 |
+
## Supported tasks
|
| 58 |
+
|
| 59 |
+
| Task | Prompt format |
|
| 60 |
+
|------|--------------|
|
| 61 |
+
| Text-to-Audio (T2A) | `[Audio] {caption}` |
|
| 62 |
+
| Text-to-Speech (TTS) | `[Speech] A {female/male} voice saying "{text}"` |
|
| 63 |
+
| Mixed Speech + Sound | `[Speech] A {gender} voice saying "{text}" [Audio] {background}` |
|
| 64 |
+
| Zero-shot Speaker Cloning | `[Speech with voice] {ref_text}, {target_text}` |
|
| 65 |
+
| Audio Scene Editing (add / remove / replace / denoise) | `[Edit] [Audio] {instruction}` |
|
| 66 |
+
| Speech-in-Scene Editing (content / insert / delete) | `[Edit] [Speech] {instruction}` |
|
| 67 |
+
| Timed Temporal Composition | `[Audio] From {t1}s to {t2}s, {event1}. From {t2}s to {t3}s, {event2}. ...` |
|
| 68 |
+
|
| 69 |
+
Task identity is encoded via a **mask channel**; source/reference audio is injected through
|
| 70 |
+
**VAE-encoded channel concatenation** — no separate encoders or task-specific heads needed.
|
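The prompt formats in the table above can be assembled with small helpers. The following is a minimal sketch, not code from the UNISON repository: the function names are hypothetical, and the number formatting in `timed_prompt` (for example `3s` versus `3.0s`) is an assumption, since the table only gives the `{t1}s` placeholder.

```python
def t2a_prompt(caption: str) -> str:
    """Text-to-Audio prompt: `[Audio] {caption}`."""
    return f"[Audio] {caption}"


def tts_prompt(text: str, gender: str) -> str:
    """Text-to-Speech prompt: `[Speech] A {female/male} voice saying "{text}"`."""
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    return f'[Speech] A {gender} voice saying "{text}"'


def mixed_prompt(text: str, gender: str, background: str) -> str:
    """Mixed speech + sound prompt: speech part followed by `[Audio] {background}`."""
    return f"{tts_prompt(text, gender)} [Audio] {background}"


def timed_prompt(events) -> str:
    """Timed composition prompt from (start_s, end_s, description) tuples.

    The `:g` format drops trailing zeros (3.0 -> "3"); this is an assumption.
    """
    parts = [f"From {start:g}s to {end:g}s, {desc}." for start, end, desc in events]
    return "[Audio] " + " ".join(parts)
```

For example, `timed_prompt([(0, 3, "a dog barks"), (3, 6, "a car passes")])` yields a single `[Audio]` prompt with two timed clauses.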
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
|
| 74 |
+
## Architecture
|
| 75 |
+
|
| 76 |
+
All tasks share the same VAE encoder/decoder, MM-DiT backbone, and forward pass.
|
| 77 |
+
Text conditioning uses **layer-wise deep LLM fusion**: hidden states from uniformly sampled layers
|
| 78 |
+
of the frozen Qwen2.5-Omni-7B backbone are injected into corresponding MM-DiT double-stream blocks
|
| 79 |
+
via learned linear projections.
|
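To make "uniformly sampled layers" concrete, the sketch below maps each MM-DiT double-stream block to one LLM layer index by even spacing. It is an illustration only: the function name is hypothetical, the LLM depth of 28 is an assumption about the Qwen2.5-Omni-7B text backbone, and the actual selection rule and projection layers are defined in the UNISON code, which may differ.

```python
def uniform_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> list[int]:
    """Pick one LLM layer index per DiT block, evenly spaced from first to last layer."""
    if num_llm_layers < 1 or num_dit_blocks < 1:
        raise ValueError("both counts must be positive")
    if num_dit_blocks == 1:
        return [num_llm_layers - 1]
    step = (num_llm_layers - 1) / (num_dit_blocks - 1)
    return [round(i * step) for i in range(num_dit_blocks)]


# Assumed LLM depth of 28; 20 and 24 are the double-stream block counts
# of the two variants listed in this repository.
layers_for_d20 = uniform_layer_indices(28, 20)
layers_for_d24 = uniform_layer_indices(28, 24)
```

Each selected hidden state would then pass through its own learned linear projection before entering the matching block.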
| 80 |
+
|
| 81 |
+

|
| 82 |
+
|
| 83 |
+
---
|
| 84 |
+
|
| 85 |
+
## Quick start
|
| 86 |
+
|
| 87 |
+
### 1. Clone repo and install dependencies
|
| 88 |
+
|
| 89 |
+
```bash
|
| 90 |
+
git clone https://github.com/lizhaoqing/UNISON
|
| 91 |
+
cd UNISON
|
| 92 |
+
pip install -r requirements.txt
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
`flash-attn` is optional but strongly recommended; without it, attention falls back automatically to PyTorch SDPA:
|
| 96 |
+
|
| 97 |
+
```bash
|
| 98 |
+
pip install flash-attn --no-build-isolation
|
| 99 |
+
```
|
| 100 |
+
|
| 101 |
+
### 2. MMAudio VAE weights
|
| 102 |
+
|
| 103 |
+
Download the VAE weights from the [MMAudio release](https://github.com/hkchengrex/MMAudio) and place them at:
|
| 104 |
+
|
| 105 |
+
```
|
| 106 |
+
unison/models/mmaudio/data/ext_weights/
|
| 107 |
+
v1-44.pth # 44 kHz VAE (for D20S0 / 44k variant)
|
| 108 |
+
v1-16.pth # 16 kHz VAE (for D24S0 / 16k variant)
|
| 109 |
+
best_netG.pt # BigVGAN vocoder (16 kHz VAE only)
|
| 110 |
+
```
|
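Before running inference it can help to confirm that the weight files are where the layout above expects them. This is a minimal sketch under two assumptions: the `ext_weights` path is relative to the root of the cloned UNISON repository, and the `"44k"` / `"16k"` keys are hypothetical labels for the two variants.

```python
from pathlib import Path

# Files required per variant, taken from the layout listed above.
REQUIRED_WEIGHTS = {
    "44k": ["v1-44.pth"],
    "16k": ["v1-16.pth", "best_netG.pt"],  # the BigVGAN vocoder is 16 kHz only
}


def missing_vae_weights(repo_root, variant: str) -> list[str]:
    """Return the names of required weight files that are not present."""
    weights_dir = Path(repo_root) / "unison" / "models" / "mmaudio" / "data" / "ext_weights"
    return [
        name
        for name in REQUIRED_WEIGHTS[variant]
        if not (weights_dir / name).is_file()
    ]
```

An empty list means every file needed for that variant was found.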
| 111 |
+
|
| 112 |
+
### 3. Qwen2.5-Omni-7B
|
| 113 |
+
|
| 114 |
+
```bash
|
| 115 |
+
export QWEN_OMNI_MODEL_PATH=Qwen/Qwen2.5-Omni-7B
|
| 116 |
+
# or point to a local download:
|
| 117 |
+
export QWEN_OMNI_MODEL_PATH=/path/to/Qwen2.5-Omni-7B
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
### 4. Download checkpoints (this repo)
|
| 121 |
+
|
| 122 |
+
```python
|
| 123 |
+
from huggingface_hub import snapshot_download
|
| 124 |
+
snapshot_download(repo_id="jac22/UNISON", local_dir="checkpoints")
|
| 125 |
+
```
|
| 126 |
+
|
| 127 |
+
This produces:
|
| 128 |
+
|
| 129 |
+
```
|
| 130 |
+
checkpoints/
|
| 131 |
+
unison_D20S0_O_40ch/model.safetensors # 44 kHz
|
| 132 |
+
unison_D24S0_O_20ch/model.safetensors # 16 kHz
|
| 133 |
+
```
|
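Each checkpoint is several gigabytes, so fetching only one variant may be preferable. The sketch below relies on the `allow_patterns` argument of `snapshot_download`; the helper names and the `"44k"` / `"16k"` keys are hypothetical labels introduced here.

```python
VARIANT_DIRS = {
    "44k": "unison_D20S0_O_40ch",
    "16k": "unison_D24S0_O_20ch",
}


def allow_patterns_for(variant: str) -> list[str]:
    """Glob patterns that match only the files of one variant directory."""
    return [f"{VARIANT_DIRS[variant]}/*"]


def download_variant(variant: str, local_dir: str = "checkpoints") -> str:
    """Download a single variant from this repo; returns the local snapshot path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="jac22/UNISON",
        local_dir=local_dir,
        allow_patterns=allow_patterns_for(variant),
    )
```

Calling `download_variant("16k")` would then leave only `checkpoints/unison_D24S0_O_20ch/` on disk.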
| 134 |
+
|
| 135 |
+
### 5. Run inference
|
| 136 |
+
|
| 137 |
+
```bash
|
| 138 |
+
cd UNISON
|
| 139 |
+
|
| 140 |
+
# 44 kHz variant (D20S0)
|
| 141 |
+
bash scripts/infer.sh \
|
| 142 |
+
--checkpoint_dir checkpoints/unison_D20S0_O_40ch \
|
| 143 |
+
--model_config unison/config/D20S0_O_40ch.yaml \
|
| 144 |
+
--vae_config unison/models/mmaudio/vae_config_44k.yaml \
|
| 145 |
+
--task_mode all
|
| 146 |
+
|
| 147 |
+
# 16 kHz variant (D24S0)
|
| 148 |
+
bash scripts/infer.sh \
|
| 149 |
+
--checkpoint_dir checkpoints/unison_D24S0_O_20ch \
|
| 150 |
+
--model_config unison/config/D24S0_O_20ch.yaml \
|
| 151 |
+
--vae_config unison/models/mmaudio/vae_config_16k.yaml \
|
| 152 |
+
--task_mode all
|
| 153 |
+
```
|
| 154 |
+
|
| 155 |
+
Outputs are written to `<checkpoint_dir>/infer_<N>steps/<ckpt_name>/`.
|
| 156 |
+
|
| 157 |
+
### Single-prompt example
|
| 158 |
+
|
| 159 |
+
```bash
|
| 160 |
+
python unison/pipelines/infer.py \
|
| 161 |
+
--model_ckpt checkpoints/unison_D20S0_O_40ch \
|
| 162 |
+
--model_config unison/config/D20S0_O_40ch.yaml \
|
| 163 |
+
--vae_config unison/models/mmaudio/vae_config_44k.yaml \
|
| 164 |
+
--omni_model_path $QWEN_OMNI_MODEL_PATH \
|
| 165 |
+
--task_mode generation \
|
| 166 |
+
--gen_prompt "[Audio] Rain falling on a tin roof with distant thunder" \
|
| 167 |
+
--gen_duration 10.0 \
|
| 168 |
+
--output_dir outputs/demo
|
| 169 |
+
```
|
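To run several prompts in a row, the command above can be built programmatically. This is a rough sketch: the flags are copied from the single-prompt example, the helper and the `"44k"` / `"16k"` keys are hypothetical, and using the 16 kHz paths with `infer.py` is an assumption extrapolated from the `infer.sh` example.

```python
import os

VARIANTS = {
    "44k": {
        "ckpt": "checkpoints/unison_D20S0_O_40ch",
        "model_config": "unison/config/D20S0_O_40ch.yaml",
        "vae_config": "unison/models/mmaudio/vae_config_44k.yaml",
    },
    "16k": {
        "ckpt": "checkpoints/unison_D24S0_O_20ch",
        "model_config": "unison/config/D24S0_O_20ch.yaml",
        "vae_config": "unison/models/mmaudio/vae_config_16k.yaml",
    },
}


def build_infer_command(prompt, variant="44k", duration=10.0,
                        output_dir="outputs/demo", omni_model_path=None):
    """Return the argv list for one generation run of unison/pipelines/infer.py."""
    cfg = VARIANTS[variant]
    # Fall back to the environment variable, then to the Hub model id.
    omni = omni_model_path or os.environ.get(
        "QWEN_OMNI_MODEL_PATH", "Qwen/Qwen2.5-Omni-7B"
    )
    return [
        "python", "unison/pipelines/infer.py",
        "--model_ckpt", cfg["ckpt"],
        "--model_config", cfg["model_config"],
        "--vae_config", cfg["vae_config"],
        "--omni_model_path", omni,
        "--task_mode", "generation",
        "--gen_prompt", prompt,
        "--gen_duration", str(duration),
        "--output_dir", output_dir,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root; passing a list avoids shell-quoting problems with prompts that contain quotes.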
| 170 |
+
|
| 171 |
+
---
|
| 172 |
+
|
| 173 |
+
## Key inference parameters
|
| 174 |
+
|
| 175 |
+
| Argument | Default | Description |
|
| 176 |
+
|----------|---------|-------------|
|
| 177 |
+
| `--num_inference_steps` | 100 | ODE solver steps (50 for fast, 100 for paper quality) |
|
| 178 |
+
| `--guidance_scale` | 4.5 | Classifier-free guidance scale |
|
| 179 |
+
| `--seed` | 42 | Random seed |
|
| 180 |
+
| `--gen_duration` | 10.0 | Output length in seconds (generation tasks) |
|
| 181 |
+
| `--ref_duration` | 3.0 | Reference clip length in seconds (zero-shot TTS) |
|
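For scripts that wrap the pipeline, the defaults in the table can be mirrored in a small `argparse` parser. This is a sketch that restates the table, not the repository's actual parser; the help strings paraphrase the descriptions above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented inference arguments and their defaults."""
    parser = argparse.ArgumentParser(description="UNISON inference options (sketch)")
    parser.add_argument("--num_inference_steps", type=int, default=100,
                        help="ODE solver steps")
    parser.add_argument("--guidance_scale", type=float, default=4.5,
                        help="classifier-free guidance scale")
    parser.add_argument("--seed", type=int, default=42, help="random seed")
    parser.add_argument("--gen_duration", type=float, default=10.0,
                        help="output length in seconds (generation tasks)")
    parser.add_argument("--ref_duration", type=float, default=3.0,
                        help="reference clip length in seconds (zero-shot TTS)")
    return parser
```

Parsing `["--num_inference_steps", "50"]` selects the faster setting while the other options keep their defaults.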
| 182 |
+
|
| 183 |
+
---
|
| 184 |
+
|
| 185 |
+
## Checkpoint format
|
| 186 |
+
|
| 187 |
+
Each checkpoint is a single `model.safetensors` file (unwrapped from EMA).
|
| 188 |
+
The inference pipeline also accepts:
|
| 189 |
+
|
| 190 |
+
- A **directory** — auto-detects `ema_model.pt` → `model.safetensors` → `pytorch_model.bin`
|
| 191 |
+
- A **direct file path** to any of the three formats
|
| 192 |
+
|
| 193 |
+
EMA wrappers are unwrapped automatically at load time.
|
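The directory auto-detection described above can be pictured as a simple ordered lookup. The sketch below is not the pipeline's own loader: it assumes the arrows in the list give the order of preference, and the function name is hypothetical.

```python
from pathlib import Path

# Assumed preference order when a directory is given.
CANDIDATE_NAMES = ("ema_model.pt", "model.safetensors", "pytorch_model.bin")


def resolve_checkpoint(path) -> Path:
    """Return the checkpoint file for a direct file path or a directory."""
    p = Path(path)
    if p.is_file():
        return p  # direct file path: use it as given
    if p.is_dir():
        for name in CANDIDATE_NAMES:
            candidate = p / name
            if candidate.is_file():
                return candidate
    raise FileNotFoundError(f"no checkpoint found at {p}")
```

With the directories in this repository, `resolve_checkpoint("checkpoints/unison_D20S0_O_40ch")` would resolve to the `model.safetensors` file inside it.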
| 194 |
+
|
| 195 |
+
---
|
| 196 |
+
|
| 197 |
+
## License
|
| 198 |
+
|
| 199 |
+
This project is released under the **Apache 2.0 License** with additional non-commercial use
|
| 200 |
+
restrictions inherited from upstream dependencies:
|
| 201 |
+
|
| 202 |
+
- The backbone architecture derives from [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo/blob/main/LICENSE)
|
| 203 |
+
(Tencent), which prohibits commercial use without a separate license.
|
| 204 |
+
- Text/audio conditioning uses [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B/blob/main/LICENSE)
|
| 205 |
+
(Alibaba Cloud), subject to its own license terms.
|
| 206 |
+
|
| 207 |
+
**This model is intended for research and non-commercial use only.**
|
| 208 |
+
|
| 209 |
+
---
|
| 210 |
+
|
| 211 |
+
## Citation
|
| 212 |
+
|
| 213 |
+
```bibtex
|
| 214 |
+
@article{li2026unison,
|
| 215 |
+
title = {UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion},
|
| 216 |
+
author = {Li, Zhaoqing and Xu, Haoning and Su, Jingran and Liu, Yaofang and Rao, Zhefan and
|
| 217 |
+
Wang, Huimeng and Deng, Jiajun and Wang, Tianzi and Jin, Zengrui and Liu, Rui and
|
| 218 |
+
Che, Haoxuan and Liu, Xunying},
|
| 219 |
+
journal = {arXiv preprint arXiv:2605.31530},
|
| 220 |
+
year = {2026}
|
| 221 |
+
}
|
| 222 |
+
```
|
| 223 |
+
|
| 224 |
+
---
|
| 225 |
+
|
| 226 |
+
## Acknowledgements
|
| 227 |
+
|
| 228 |
+
We thank the authors of the following works for their excellent open-source contributions:
|
| 229 |
+
|
| 230 |
+
- [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5) — MM-DiT backbone architecture
|
| 231 |
+
- [MMAudio](https://github.com/hkchengrex/MMAudio) — audio VAE and feature utilities
|
| 232 |
+
- [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) — text/audio LLM used for deep conditioning
|
| 233 |
+
- [Ovi](https://github.com/character-ai/Ovi) (Character.AI) — inspiring cross-modal fusion design for joint audio-video generation
|
fig1.png
ADDED
|
fig2.png
ADDED
|
unison_D20S0_O_40ch/model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9af8f170d11dea3f6e316d0236c68a1ecab206a8e64a725fd9256e7f6b5b9c3c
|
| 3 |
+
size 2483163600
|
unison_D24S0_O_20ch/model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:26d2a7099f831a7f53429eabf98f2b85cf593e348f19f49af34be17098694b52
|
| 3 |
+
size 2926895464
|
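The two `model.safetensors` entries above are Git LFS pointer files: each records a SHA-256 digest and a byte size for the real weights. A downloaded file can be checked against its pointer as in the sketch below; the helper names are hypothetical, and the parsing assumes the three-line `version` / `oid` / `size` layout shown in this diff.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the `version`, `oid`, and `size` lines of a Git LFS pointer."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(file_path, pointer: dict) -> bool:
    """Check a local file's size and hash against a parsed pointer."""
    path = Path(file_path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap size check before hashing gigabytes
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large checkpoints are not loaded into memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

A mismatch usually points to an interrupted download rather than a problem with the model itself.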