Upload folder using huggingface_hub

Files changed (6)

.gitattributes +2 -0
README.md +230 -0
fig1.png +3 -0
fig2.png +3 -0
unison_D20S0_O_40ch/model.safetensors +3 -0
unison_D24S0_O_20ch/model.safetensors +3 -0

.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+fig1.png filter=lfs diff=lfs merge=lfs -text
+fig2.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,3 +1,233 @@
 ---
 license: apache-2.0
+license_name: apache-2.0-non-commercial
+license_link: https://github.com/lizhaoqing/UNISON/blob/main/LICENSE
+language:
+  - en
+  - zh
+tags:
+  - audio
+  - text-to-audio
+  - text-to-speech
+  - zero-shot-tts
+  - audio-editing
+  - speech-editing
+  - flow-matching
+  - diffusion
+  - mm-dit
+  - llm-fusion
+library_name: custom
+pipeline_tag: text-to-audio
+arxiv: 2605.31530
 ---
+# UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
+**Paper:** [arXiv:2605.31530](https://arxiv.org/abs/2605.31530) &nbsp;|&nbsp;
+**Code:** [github.com/lizhaoqing/UNISON](https://github.com/lizhaoqing/UNISON) &nbsp;|&nbsp;
+**Demo:** [Project Page](https://yourusername.github.io/unison)
+[![arXiv](https://img.shields.io/badge/arXiv-2605.31530-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2605.31530)
+[![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github&logoColor=white)](https://github.com/lizhaoqing/UNISON)
+[![License](https://img.shields.io/badge/License-Apache%202.0%20(Non--Commercial)-blue.svg)](https://github.com/lizhaoqing/UNISON/blob/main/LICENSE)
+---
+UNISON is a unified latent flow-matching framework for audio and speech generation and editing.
+A **single set of weights** handles text-to-audio, text-to-speech, zero-shot speaker cloning,
+mixed speech-and-sound scene generation, and audio/speech-in-scene editing, all with one architecture and the same forward pass.
+![UNISON Overview](fig1.png)
+---
+## Model variants in this repository
+This repository hosts **two checkpoint variants**:
+| Directory | VAE | DiT depth | Channels | Config |
+|-----------|-----|-----------|----------|--------|
+| `unison_D20S0_O_40ch/` | MMAudio **44 kHz** | 20 double + 0 single | 40 | `D20S0_O_40ch.yaml` |
+| `unison_D24S0_O_20ch/` | MMAudio **16 kHz** | 24 double + 0 single | 20 | `D24S0_O_20ch.yaml` |
+Both variants share the same Qwen2.5-Omni-7B text encoder and the same inference pipeline.
+---
+## Supported tasks
+| Task | Prompt format |
+|------|--------------|
+| Text-to-Audio (T2A) | `[Audio] {caption}` |
+| Text-to-Speech (TTS) | `[Speech] A {female/male} voice saying "{text}"` |
+| Mixed Speech + Sound | `[Speech] A {gender} voice saying "{text}" [Audio] {background}` |
+| Zero-shot Speaker Cloning | `[Speech with voice] {ref_text}, {target_text}` |
+| Audio Scene Editing (add / remove / replace / denoise) | `[Edit] [Audio] {instruction}` |
+| Speech-in-Scene Editing (content / insert / delete) | `[Edit] [Speech] {instruction}` |
+| Timed Temporal Composition | `[Audio] From {t1}s to {t2}s, {event1}. From {t2}s to {t3}s, {event2}. ...` |
+Task identity is encoded via a **mask channel**; source/reference audio is injected through
+**VAE-encoded channel concatenation** — no separate encoders or task-specific heads needed.
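
The prompt formats in the table above can be assembled with a few small string helpers. This is a minimal sketch, not code from the UNISON repository: the function names are hypothetical, and the number formatting used for timed prompts (`{:g}`) is an assumption, since the README only shows the `From {t1}s to {t2}s` template.

```python
def t2a_prompt(caption: str) -> str:
    """Text-to-Audio: `[Audio] {caption}`."""
    return f"[Audio] {caption}"


def tts_prompt(text: str, gender: str = "female") -> str:
    """Text-to-Speech: `[Speech] A {female/male} voice saying "{text}"`."""
    if gender not in ("female", "male"):
        raise ValueError("gender must be 'female' or 'male'")
    return f'[Speech] A {gender} voice saying "{text}"'


def mixed_prompt(text: str, background: str, gender: str = "female") -> str:
    """Mixed speech + sound: a TTS prompt followed by an `[Audio]` clause."""
    return f"{tts_prompt(text, gender)} [Audio] {background}"


def clone_prompt(ref_text: str, target_text: str) -> str:
    """Zero-shot cloning: `[Speech with voice] {ref_text}, {target_text}`."""
    return f"[Speech with voice] {ref_text}, {target_text}"


def edit_prompt(instruction: str, domain: str = "Audio") -> str:
    """Editing: `[Edit] [Audio] ...` or `[Edit] [Speech] ...`."""
    if domain not in ("Audio", "Speech"):
        raise ValueError("domain must be 'Audio' or 'Speech'")
    return f"[Edit] [{domain}] {instruction}"


def timed_prompt(events: list[tuple[float, float, str]]) -> str:
    """Timed composition from (start_s, end_s, description) triples.

    Assumption: times are rendered with `:g` (0 -> "0", 2.5 -> "2.5").
    """
    parts = [f"From {s:g}s to {e:g}s, {desc}." for s, e, desc in events]
    return "[Audio] " + " ".join(parts)
```

For example, `timed_prompt([(0, 4, "a dog barks"), (4, 10, "rain falls")])` yields a two-segment `[Audio]` prompt in the table's format.
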
+---
+## Architecture
+All tasks share the same VAE encoder/decoder, MM-DiT backbone, and forward pass.
+Text conditioning uses **layer-wise deep LLM fusion**: hidden states from uniformly sampled layers
+of the frozen Qwen2.5-Omni-7B backbone are injected into corresponding MM-DiT double-stream blocks
+via learned linear projections.
+![UNISON Architecture](fig2.png)
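
The README does not spell out how layers are "uniformly sampled", so the following is only one plausible reading: spread one LLM layer index per MM-DiT double-stream block evenly across the LLM's depth. The function name is hypothetical, and the layer count of 28 in the usage note is an illustrative assumption rather than a value taken from the UNISON code.

```python
def uniform_layer_indices(num_llm_layers: int, num_blocks: int) -> list[int]:
    """Return `num_blocks` layer indices spread evenly over [0, num_llm_layers - 1].

    Index i in the result is the LLM layer whose hidden state would feed
    double-stream block i under this (assumed) sampling rule.
    """
    if num_llm_layers < 1 or num_blocks < 1:
        raise ValueError("both counts must be positive")
    if num_blocks == 1:
        # With a single block, use the deepest layer.
        return [num_llm_layers - 1]
    step = (num_llm_layers - 1) / (num_blocks - 1)
    return [round(i * step) for i in range(num_blocks)]
```

Calling `uniform_layer_indices(28, 20)` would give one index per double block of the 20-block variant, starting at layer 0 and ending at layer 27; the actual mapping used by the released checkpoints may differ.
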
+---
+## Quick start
+### 1. Clone repo and install dependencies
+```bash
+git clone https://github.com/lizhaoqing/UNISON
+cd UNISON
+pip install -r requirements.txt
+```
+`flash-attn` is optional but strongly recommended; without it, the code falls back to PyTorch SDPA automatically:
+```bash
+pip install flash-attn --no-build-isolation
+```
+### 2. MMAudio VAE weights
+Download the VAE and vocoder weights from the [MMAudio release](https://github.com/hkchengrex/MMAudio) and place them under:
+```
+unison/models/mmaudio/data/ext_weights/
+    v1-44.pth       # 44 kHz VAE  (for D20S0 / 44k variant)
+    v1-16.pth       # 16 kHz VAE  (for D24S0 / 16k variant)
+    best_netG.pt    # BigVGAN vocoder  (16 kHz VAE only)
+```
+### 3. Qwen2.5-Omni-7B
+```bash
+export QWEN_OMNI_MODEL_PATH=Qwen/Qwen2.5-Omni-7B
+# or point to a local download:
+export QWEN_OMNI_MODEL_PATH=/path/to/Qwen2.5-Omni-7B
+```
+### 4. Download checkpoints (this repo)
+```python
+from huggingface_hub import snapshot_download
+snapshot_download(repo_id="jac22/UNISON", local_dir="checkpoints")
+```
+This produces:
+```
+checkpoints/
+    unison_D20S0_O_40ch/model.safetensors   # 44 kHz
+    unison_D24S0_O_20ch/model.safetensors   # 16 kHz
+```
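
Each checkpoint is several gigabytes, so it may be worth fetching only one variant. A minimal sketch using the `allow_patterns` argument of `snapshot_download`; the helper names `variant_patterns` and `download_variant` are hypothetical, and the import is deferred so the pattern helper can be used without `huggingface_hub` installed.

```python
def variant_patterns(variant: str) -> list[str]:
    """Glob patterns that match only the files of one checkpoint directory."""
    return [f"{variant}/*"]


def download_variant(variant: str, local_dir: str = "checkpoints") -> str:
    """Download a single variant directory from the repo and return the local path."""
    # Imported lazily: only needed when a download is actually requested.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="jac22/UNISON",
        local_dir=local_dir,
        allow_patterns=variant_patterns(variant),
    )
```

For instance, `download_variant("unison_D24S0_O_20ch")` would fetch only the 16 kHz checkpoint into `checkpoints/`.
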
+### 5. Run inference
+```bash
+# Run from the root of the cloned UNISON repository
+# 44 kHz variant (D20S0)
+bash scripts/infer.sh \
+    --checkpoint_dir checkpoints/unison_D20S0_O_40ch \
+    --model_config   unison/config/D20S0_O_40ch.yaml \
+    --vae_config     unison/models/mmaudio/vae_config_44k.yaml \
+    --task_mode      all
+# 16 kHz variant (D24S0)
+bash scripts/infer.sh \
+    --checkpoint_dir checkpoints/unison_D24S0_O_20ch \
+    --model_config   unison/config/D24S0_O_20ch.yaml \
+    --vae_config     unison/models/mmaudio/vae_config_16k.yaml \
+    --task_mode      all
+```
+Outputs are written to `<checkpoint_dir>/infer_<N>steps/<ckpt_name>/`.
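
Since each variant needs a matching checkpoint directory, model config, and VAE config, the pairing can be kept in one place. The sketch below only rebuilds the two `scripts/infer.sh` invocations shown above; the `"44k"` / `"16k"` keys and the `infer_command` helper are hypothetical names introduced for illustration.

```python
# Paths copied from the two invocations above; the dictionary keys are made up.
VARIANTS = {
    "44k": {
        "checkpoint_dir": "checkpoints/unison_D20S0_O_40ch",
        "model_config": "unison/config/D20S0_O_40ch.yaml",
        "vae_config": "unison/models/mmaudio/vae_config_44k.yaml",
    },
    "16k": {
        "checkpoint_dir": "checkpoints/unison_D24S0_O_20ch",
        "model_config": "unison/config/D24S0_O_20ch.yaml",
        "vae_config": "unison/models/mmaudio/vae_config_16k.yaml",
    },
}


def infer_command(variant: str, task_mode: str = "all") -> list[str]:
    """Build the argument list for `scripts/infer.sh` for one variant."""
    cfg = VARIANTS[variant]  # raises KeyError for an unknown variant
    return [
        "bash", "scripts/infer.sh",
        "--checkpoint_dir", cfg["checkpoint_dir"],
        "--model_config", cfg["model_config"],
        "--vae_config", cfg["vae_config"],
        "--task_mode", task_mode,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the repository root.
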
+### Single-prompt example
+```bash
+python unison/pipelines/infer.py \
+  --model_ckpt      checkpoints/unison_D20S0_O_40ch \
+  --model_config    unison/config/D20S0_O_40ch.yaml \
+  --vae_config      unison/models/mmaudio/vae_config_44k.yaml \
+  --omni_model_path $QWEN_OMNI_MODEL_PATH \
+  --task_mode       generation \
+  --gen_prompt      "[Audio] Rain falling on a tin roof with distant thunder" \
+  --gen_duration    10.0 \
+  --output_dir      outputs/demo
+```
+---
+## Key inference parameters
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--num_inference_steps` | 100 | ODE solver steps (50 for faster sampling, 100 for paper quality) |
+| `--guidance_scale` | 4.5 | Classifier-free guidance scale |
+| `--seed` | 42 | Random seed |
+| `--gen_duration` | 10.0 | Output length in seconds (generation tasks) |
+| `--ref_duration` | 3.0 | Reference clip length in seconds (zero-shot TTS) |
+---
+## Checkpoint format
+Each checkpoint is a single `model.safetensors` file (unwrapped from EMA).
+The inference pipeline also accepts:
+- A **directory** — auto-detects `ema_model.pt` → `model.safetensors` → `pytorch_model.bin`
+- A **direct file path** to any of the three formats
+EMA wrappers are unwrapped automatically at load time.
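
The lookup order described above (`ema_model.pt`, then `model.safetensors`, then `pytorch_model.bin`) can be sketched as follows. This illustrates the stated priority only; it is not the repository's loader, and `resolve_checkpoint` is a hypothetical name.

```python
from pathlib import Path

# Priority order as stated in the README.
CANDIDATES = ("ema_model.pt", "model.safetensors", "pytorch_model.bin")


def resolve_checkpoint(path) -> Path:
    """Return the checkpoint file for a directory or a direct file path."""
    p = Path(path)
    if p.is_file():
        # A direct file path is used as given.
        return p
    if p.is_dir():
        # Pick the first candidate that exists, in priority order.
        for name in CANDIDATES:
            candidate = p / name
            if candidate.is_file():
                return candidate
    raise FileNotFoundError(f"no checkpoint found at {p}")
```

With this repository's layout, `resolve_checkpoint("checkpoints/unison_D20S0_O_40ch")` would resolve to the `model.safetensors` file, because no `ema_model.pt` is shipped.
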
+---
+## License
+This project is released under the **Apache 2.0 License** with additional non-commercial use
+restrictions inherited from upstream dependencies:
+- The backbone architecture derives from [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo/blob/main/LICENSE)
+  (Tencent), which prohibits commercial use without a separate license.
+- Text/audio conditioning uses [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B/blob/main/LICENSE)
+  (Alibaba Cloud), subject to its own license terms.
+**This model is intended for research and non-commercial use only.**
+---
+## Citation
+```bibtex
+@article{li2026unison,
+  title   = {UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion},
+  author  = {Li, Zhaoqing and Xu, Haoning and Su, Jingran and Liu, Yaofang and Rao, Zhefan and
+             Wang, Huimeng and Deng, Jiajun and Wang, Tianzi and Jin, Zengrui and Liu, Rui and
+             Che, Haoxuan and Liu, Xunying},
+  journal = {arXiv preprint arXiv:2605.31530},
+  year    = {2026}
+}
+```
+---
+## Acknowledgements
+We thank the authors of the following works for their excellent open-source contributions:
+- [HunyuanVideo](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5) — MM-DiT backbone architecture
+- [MMAudio](https://github.com/hkchengrex/MMAudio) — audio VAE and feature utilities
+- [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) — text/audio LLM used for deep conditioning
+- [Ovi](https://github.com/character-ai/Ovi) (Character.AI) — inspiring cross-modal fusion design for joint audio-video generation

fig1.png ADDED

Git LFS Details

SHA256: 1ceb89b16273ac29fa8f02faf9a183bbcd6f45b49f2ef4b2ac65e44d52b06f42
Pointer size: 132 Bytes
Size of remote file: 3.15 MB

fig2.png ADDED

Git LFS Details

SHA256: c34b5d5b358a7099ddd5efa6e84cca497aab5d140eee860c4debef0ff8eb440a
Pointer size: 131 Bytes
Size of remote file: 168 kB

unison_D20S0_O_40ch/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9af8f170d11dea3f6e316d0236c68a1ecab206a8e64a725fd9256e7f6b5b9c3c
+size 2483163600

unison_D24S0_O_20ch/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:26d2a7099f831a7f53429eabf98f2b85cf593e348f19f49af34be17098694b52
+size 2926895464
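
The Git LFS pointers above record each checkpoint's SHA-256 digest and byte size, which makes it possible to check a downloaded `model.safetensors` for corruption. A minimal sketch; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and only the three pointer fields shown in this commit (`version`, `oid`, `size`) are handled.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the `version` / `oid` / `size` lines of a Git LFS pointer file."""
    fields = dict(
        line.split(" ", 1) for line in text.strip().splitlines() if line.strip()
    )
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(file_path, pointer: dict) -> bool:
    """Check a local file's size and digest against a parsed pointer."""
    p = Path(file_path)
    if p.stat().st_size != pointer["size"]:
        return False  # cheap size check before hashing gigabytes
    h = hashlib.new(pointer["algo"])
    with p.open("rb") as f:
        # Hash in 1 MiB chunks to keep memory use flat.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]
```

Parsing the pointer text for `unison_D24S0_O_20ch/model.safetensors` and passing the result to `matches_pointer` along with the downloaded file should return `True` for an intact download.
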