Commit fdbea64 by jac22 · verified · 1 parent: 7bdbb3e

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ fig1.png filter=lfs diff=lfs merge=lfs -text
+ fig2.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,233 @@
  ---
  license: apache-2.0
+ license_name: apache-2.0-non-commercial
+ license_link: https://github.com/lizhaoqing/UNISON/blob/main/LICENSE
+ language:
+ - en
+ - zh
+ tags:
+ - audio
+ - text-to-audio
+ - text-to-speech
+ - zero-shot-tts
+ - audio-editing
+ - speech-editing
+ - flow-matching
+ - diffusion
+ - mm-dit
+ - llm-fusion
+ library_name: custom
+ pipeline_tag: text-to-audio
+ arxiv: 2605.31530
  ---
+
+ # UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
+
+ **Paper:** [arXiv:2605.31530](https://arxiv.org/abs/2605.31530) |
+ **Code:** [github.com/lizhaoqing/UNISON](https://github.com/lizhaoqing/UNISON) |
+ **Demo:** [Project Page](https://yourusername.github.io/unison)
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2605.31530-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2605.31530)
+ [![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github&logoColor=white)](https://github.com/lizhaoqing/UNISON)
+ [![License](https://img.shields.io/badge/License-Apache%202.0%20(Non--Commercial)-blue.svg)](https://github.com/lizhaoqing/UNISON/blob/main/LICENSE)
+
+ ---
+
+ UNISON is a unified latent flow-matching framework for audio and speech generation and editing.
+ Using a **single set of weights**, it integrates text-to-audio, text-to-speech, zero-shot speaker cloning,
+ mixed speech-and-sound scene generation, and audio/speech-in-scene editing, all with one model, one architecture, and one shared forward pass.
+
+ ![UNISON Overview](fig1.png)
+
+ ---
+
+ ## Model variants in this repository
+
+ This repository hosts **two checkpoint variants**:
+
+ | Directory | VAE | DiT depth | Channels | Config |
+ |-----------|-----|-----------|----------|--------|
+ | `unison_D20S0_O_40ch/` | MMAudio **44 kHz** | 20 double + 0 single | 40 | `D20S0_O_40ch.yaml` |
+ | `unison_D24S0_O_20ch/` | MMAudio **16 kHz** | 24 double + 0 single | 20 | `D24S0_O_20ch.yaml` |
+
+ Both variants share the same Qwen2.5-Omni-7B text encoder and the same inference pipeline.
+
+ ---
+
+ ## Supported tasks
+
+ | Task | Prompt format |
+ |------|--------------|
+ | Text-to-Audio (T2A) | `[Audio] {caption}` |
+ | Text-to-Speech (TTS) | `[Speech] A {female/male} voice saying "{text}"` |
+ | Mixed Speech + Sound | `[Speech] A {gender} voice saying "{text}" [Audio] {background}` |
+ | Zero-shot Speaker Cloning | `[Speech with voice] {ref_text}, {target_text}` |
+ | Audio Scene Editing (add / remove / replace / denoise) | `[Edit] [Audio] {instruction}` |
+ | Speech-in-Scene Editing (content / insert / delete) | `[Edit] [Speech] {instruction}` |
+ | Timed Temporal Composition | `[Audio] From {t1}s to {t2}s, {event1}. From {t2}s to {t3}s, {event2}. ...` |
+
+ Task identity is encoded via a **mask channel**, and source/reference audio is injected through
+ **VAE-encoded channel concatenation**, so no separate encoders or task-specific heads are needed.
+
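The prompt formats in the table above can be assembled programmatically. The following is a minimal sketch: the helper names (`t2a_prompt`, `tts_prompt`, `mixed_prompt`, `timed_prompt`) are hypothetical and not part of the UNISON codebase, and the way timestamps are printed (`5` rather than `5.0`) is an assumption. Only the bracketed tags and the templates themselves come from the table.

```python
from typing import Sequence, Tuple


def t2a_prompt(caption: str) -> str:
    """Text-to-audio: `[Audio] {caption}`."""
    return f"[Audio] {caption}"


def tts_prompt(text: str, gender: str = "female") -> str:
    """Text-to-speech: `[Speech] A {female/male} voice saying "{text}"`."""
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    return f'[Speech] A {gender} voice saying "{text}"'


def mixed_prompt(text: str, gender: str, background: str) -> str:
    """Mixed speech + sound: the TTS prompt followed by an `[Audio]` part."""
    return f"{tts_prompt(text, gender)} [Audio] {background}"


def timed_prompt(events: Sequence[Tuple[float, float, str]]) -> str:
    """Timed composition: one `From {t1}s to {t2}s, {event}.` clause per event."""
    # ASSUMPTION: timestamps are rendered compactly with the `g` format.
    clauses = [f"From {start:g}s to {end:g}s, {event}." for start, end, event in events]
    return "[Audio] " + " ".join(clauses)
```

For example, `timed_prompt([(0, 5, "a dog barks"), (5, 10, "a car passes")])` yields a single `[Audio]` prompt with two timed clauses, which could then be passed to `--gen_prompt`.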
+ ---
+
+ ## Architecture
+
+ All tasks share the same VAE encoder/decoder, MM-DiT backbone, and forward pass.
+ Text conditioning uses **layer-wise deep LLM fusion**: hidden states from uniformly sampled layers
+ of the frozen Qwen2.5-Omni-7B backbone are injected into corresponding MM-DiT double-stream blocks
+ via learned linear projections.
+
+ ![UNISON Architecture](fig2.png)
+
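One plausible reading of "uniformly sampled layers" is a set of evenly spaced layer indices, one per double-stream block. The sketch below illustrates that reading only. The function name, the rounding rule, and the layer count used in the usage note are assumptions for illustration and are not taken from the UNISON code or paper.

```python
from typing import List


def sample_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> List[int]:
    """Pick one LLM layer index per DiT block, evenly spaced from first to last.

    ASSUMPTION: endpoints are included and fractional positions are rounded
    to the nearest layer; the actual sampling rule may differ.
    """
    if num_llm_layers < 1 or num_dit_blocks < 1:
        raise ValueError("both counts must be positive")
    if num_dit_blocks == 1:
        return [num_llm_layers - 1]
    step = (num_llm_layers - 1) / (num_dit_blocks - 1)
    return [round(i * step) for i in range(num_dit_blocks)]
```

Assuming, hypothetically, an LLM with 28 layers, `sample_layer_indices(28, 20)` would give 20 non-decreasing indices starting at 0 and ending at 27, one for each double-stream block of the D20S0 variant.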
+ ---
+
+ ## Quick start
+
+ ### 1. Clone repo and install dependencies
+
+ ```bash
+ git clone https://github.com/lizhaoqing/UNISON
+ cd UNISON
+ pip install -r requirements.txt
+ ```
+
+ `flash-attn` is optional but strongly recommended; without it, the code falls back automatically to PyTorch SDPA:
+
+ ```bash
+ pip install flash-attn --no-build-isolation
+ ```
+
+ ### 2. MMAudio VAE weights
+
+ Download the weights from the [MMAudio release](https://github.com/hkchengrex/MMAudio) and place them under:
+
+ ```
+ unison/models/mmaudio/data/ext_weights/
+   v1-44.pth # 44 kHz VAE (for D20S0 / 44k variant)
+   v1-16.pth # 16 kHz VAE (for D24S0 / 16k variant)
+   best_netG.pt # BigVGAN vocoder (16 kHz VAE only)
+ ```
+
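Before running inference, it can help to confirm that the weight files listed above are in place. This is a small sketch, not part of the UNISON repository: `missing_weights` and the `"44k"` / `"16k"` variant keys are hypothetical names, while the directory and file names are copied from the listing.

```python
from pathlib import Path
from typing import List

# Directory and file names as given in the listing above.
EXT_WEIGHTS_DIR = Path("unison/models/mmaudio/data/ext_weights")
REQUIRED_WEIGHTS = {
    "44k": ["v1-44.pth"],                  # 44 kHz VAE
    "16k": ["v1-16.pth", "best_netG.pt"],  # 16 kHz VAE plus BigVGAN vocoder
}


def missing_weights(variant: str, weights_dir: Path = EXT_WEIGHTS_DIR) -> List[str]:
    """Return the required weight files that are not present for a variant."""
    return [
        name
        for name in REQUIRED_WEIGHTS[variant]
        if not (Path(weights_dir) / name).is_file()
    ]
```

Run from the repository root, an empty list from `missing_weights("16k")` would indicate that both the 16 kHz VAE and the vocoder are available.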
+ ### 3. Qwen2.5-Omni-7B
+
+ ```bash
+ export QWEN_OMNI_MODEL_PATH=Qwen/Qwen2.5-Omni-7B
+ # or point to a local download:
+ export QWEN_OMNI_MODEL_PATH=/path/to/Qwen2.5-Omni-7B
+ ```
+
+ ### 4. Download checkpoints (this repo)
+
+ ```python
+ from huggingface_hub import snapshot_download
+ snapshot_download(repo_id="jac22/UNISON", local_dir="checkpoints")
+ ```
+
+ This produces:
+
+ ```
+ checkpoints/
+   unison_D20S0_O_40ch/model.safetensors # 44 kHz
+   unison_D24S0_O_20ch/model.safetensors # 16 kHz
+ ```
+
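Each checkpoint is a multi-gigabyte file, so it may be preferable to fetch only one variant. The sketch below uses the `allow_patterns` argument of `huggingface_hub.snapshot_download` for that. The `VARIANTS` mapping and the helper names are hypothetical, and the glob pattern assumes the directory layout shown above.

```python
from typing import List

# Directory names as listed in this repository; the short keys are illustrative.
VARIANTS = {
    "44k": "unison_D20S0_O_40ch",
    "16k": "unison_D24S0_O_20ch",
}


def variant_patterns(variant: str) -> List[str]:
    """Glob patterns that match only the files of one checkpoint variant."""
    return [f"{VARIANTS[variant]}/*"]


def download_variant(variant: str, local_dir: str = "checkpoints") -> str:
    """Download a single variant and return the local snapshot path."""
    # Imported lazily so the pattern helper works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="jac22/UNISON",
        local_dir=local_dir,
        allow_patterns=variant_patterns(variant),
    )
```

Calling `download_variant("16k")` would then populate only `checkpoints/unison_D24S0_O_20ch/`.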
+ ### 5. Run inference
+
+ ```bash
+ cd UNISON
+
+ # 44 kHz variant (D20S0)
+ bash scripts/infer.sh \
+   --checkpoint_dir checkpoints/unison_D20S0_O_40ch \
+   --model_config unison/config/D20S0_O_40ch.yaml \
+   --vae_config unison/models/mmaudio/vae_config_44k.yaml \
+   --task_mode all
+
+ # 16 kHz variant (D24S0)
+ bash scripts/infer.sh \
+   --checkpoint_dir checkpoints/unison_D24S0_O_20ch \
+   --model_config unison/config/D24S0_O_20ch.yaml \
+   --vae_config unison/models/mmaudio/vae_config_16k.yaml \
+   --task_mode all
+ ```
+
+ Outputs are written to `<checkpoint_dir>/infer_<N>steps/<ckpt_name>/`.
+
+ ### Single-prompt example
+
+ ```bash
+ python unison/pipelines/infer.py \
+   --model_ckpt checkpoints/unison_D20S0_O_40ch \
+   --model_config unison/config/D20S0_O_40ch.yaml \
+   --vae_config unison/models/mmaudio/vae_config_44k.yaml \
+   --omni_model_path "$QWEN_OMNI_MODEL_PATH" \
+   --task_mode generation \
+   --gen_prompt "[Audio] Rain falling on a tin roof with distant thunder" \
+   --gen_duration 10.0 \
+   --output_dir outputs/demo
+ ```
+
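To run the single-prompt command for several prompts from Python, the argument list can be built in code and handed to `subprocess.run`. This is a sketch under assumptions: `build_infer_command` is a hypothetical helper, and it assumes that `infer.py` also accepts the `--num_inference_steps`, `--guidance_scale`, and `--seed` flags listed under "Key inference parameters". The remaining flags and default paths are copied from the example above.

```python
import os
from typing import List, Optional


def build_infer_command(
    prompt: str,
    output_dir: str = "outputs/demo",
    model_ckpt: str = "checkpoints/unison_D20S0_O_40ch",
    model_config: str = "unison/config/D20S0_O_40ch.yaml",
    vae_config: str = "unison/models/mmaudio/vae_config_44k.yaml",
    omni_model_path: Optional[str] = None,
    gen_duration: float = 10.0,
    num_inference_steps: int = 100,
    guidance_scale: float = 4.5,
    seed: int = 42,
) -> List[str]:
    """Build the argv list for a single-prompt generation run."""
    # Fall back to the environment variable set in step 3, then to the hub ID.
    omni = omni_model_path or os.environ.get(
        "QWEN_OMNI_MODEL_PATH", "Qwen/Qwen2.5-Omni-7B"
    )
    return [
        "python", "unison/pipelines/infer.py",
        "--model_ckpt", model_ckpt,
        "--model_config", model_config,
        "--vae_config", vae_config,
        "--omni_model_path", omni,
        "--task_mode", "generation",
        "--gen_prompt", prompt,
        "--gen_duration", str(gen_duration),
        # ASSUMPTION: these three flags are accepted by infer.py.
        "--num_inference_steps", str(num_inference_steps),
        "--guidance_scale", str(guidance_scale),
        "--seed", str(seed),
        "--output_dir", output_dir,
    ]
```

From the repository root, `subprocess.run(build_infer_command("[Audio] Rain on a tin roof"), check=True)` would launch one run. Passing the arguments as a list avoids shell quoting issues with prompts that contain brackets or quotation marks.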
+ ---
+
+ ## Key inference parameters
+
+ | Argument | Default | Description |
+ |----------|---------|-------------|
+ | `--num_inference_steps` | 100 | Number of ODE solver steps (50 for faster sampling, 100 for paper-quality results) |
+ | `--guidance_scale` | 4.5 | Classifier-free guidance scale |
+ | `--seed` | 42 | Random seed |
+ | `--gen_duration` | 10.0 | Output length in seconds (generation tasks) |
+ | `--ref_duration` | 3.0 | Reference clip length in seconds (zero-shot TTS) |
+
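To make the roles of `--num_inference_steps` and `--guidance_scale` concrete, here is a toy sketch of flow-matching sampling with classifier-free guidance. It is not the UNISON sampler: the fixed-step Euler solver, the time range from 0 to 1, the standard guidance formula, and all names below are assumptions chosen for illustration, and the README does not state which ODE solver is used.

```python
from typing import Callable, List

Vector = List[float]
# velocity_fn(x, t, conditional) stands in for the model's predicted velocity.
VelocityFn = Callable[[Vector, float, bool], Vector]


def cfg_velocity(v_cond: Vector, v_uncond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(
    x0: Vector,
    velocity_fn: VelocityFn,
    num_steps: int = 100,
    guidance_scale: float = 4.5,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # Two model evaluations per step: with and without the text condition.
        v = cfg_velocity(velocity_fn(x, t, True), velocity_fn(x, t, False), guidance_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this picture, halving the step count roughly halves the number of model evaluations, which is consistent with the table's suggestion of 50 steps for faster sampling. A guidance scale of 1.0 reduces to the purely conditional velocity.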
+ ---
+
+ ## Checkpoint format
+
+ Each checkpoint is a single `model.safetensors` file (unwrapped from EMA).
+ The inference pipeline also accepts:
+
+ - A **directory**: the loader auto-detects `ema_model.pt`, then `model.safetensors`, then `pytorch_model.bin`
+ - A **direct file path** to any of the three formats
+
+ EMA wrappers are unwrapped automatically at load time.
+
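The lookup order described above can be sketched as a small resolver. This is an illustration of the stated order only: `resolve_checkpoint` is a hypothetical name, and the real loader in the UNISON pipeline may behave differently in edge cases such as missing files.

```python
from pathlib import Path
from typing import Union

# Detection order as stated above.
CANDIDATES = ("ema_model.pt", "model.safetensors", "pytorch_model.bin")


def resolve_checkpoint(path: Union[str, Path]) -> Path:
    """Return the checkpoint file for a direct file path or a directory."""
    p = Path(path)
    if p.is_file():
        # A direct file path is used as given.
        return p
    if p.is_dir():
        # In a directory, the first candidate that exists wins.
        for name in CANDIDATES:
            candidate = p / name
            if candidate.is_file():
                return candidate
    raise FileNotFoundError(f"no checkpoint found at {p}")
```

With the layout from step 4, `resolve_checkpoint("checkpoints/unison_D20S0_O_40ch")` would return the `model.safetensors` file, since no `ema_model.pt` is shipped in this repository.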
+ ---
+
+ ## License
+
+ This project is released under the **Apache 2.0 License** with additional non-commercial use
+ restrictions inherited from upstream dependencies:
+
+ - The backbone architecture derives from [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo/blob/main/LICENSE)
+ (Tencent), which prohibits commercial use without a separate license.
+ - Text/audio conditioning uses [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B/blob/main/LICENSE)
+ (Alibaba Cloud), subject to its own license terms.
+
+ **This model is intended for research and non-commercial use only.**
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @article{li2026unison,
+   title = {UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion},
+   author = {Li, Zhaoqing and Xu, Haoning and Su, Jingran and Liu, Yaofang and Rao, Zhefan and
+             Wang, Huimeng and Deng, Jiajun and Wang, Tianzi and Jin, Zengrui and Liu, Rui and
+             Che, Haoxuan and Liu, Xunying},
+   journal = {arXiv preprint arXiv:2605.31530},
+   year = {2026}
+ }
+ ```
+
+ ---
+
+ ## Acknowledgements
+
+ We thank the authors of the following works for their excellent open-source contributions:
+
+ - [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5): MM-DiT backbone architecture
+ - [MMAudio](https://github.com/hkchengrex/MMAudio): audio VAE and feature utilities
+ - [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B): text/audio LLM used for deep conditioning
+ - [Ovi](https://github.com/character-ai/Ovi) (Character.AI): inspiring cross-modal fusion design for joint audio-video generation
fig1.png ADDED

Git LFS Details

  • SHA256: 1ceb89b16273ac29fa8f02faf9a183bbcd6f45b49f2ef4b2ac65e44d52b06f42
  • Pointer size: 132 Bytes
  • Size of remote file: 3.15 MB
fig2.png ADDED

Git LFS Details

  • SHA256: c34b5d5b358a7099ddd5efa6e84cca497aab5d140eee860c4debef0ff8eb440a
  • Pointer size: 131 Bytes
  • Size of remote file: 168 kB
unison_D20S0_O_40ch/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9af8f170d11dea3f6e316d0236c68a1ecab206a8e64a725fd9256e7f6b5b9c3c
+ size 2483163600
unison_D24S0_O_20ch/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:26d2a7099f831a7f53429eabf98f2b85cf593e348f19f49af34be17098694b52
+ size 2926895464