---
license: apache-2.0
license_name: apache-2.0-non-commercial
license_link: https://github.com/lizhaoqing/UNISON/blob/main/LICENSE
language:
- en
- zh
tags:
- audio
- text-to-audio
- text-to-speech
- zero-shot-tts
- audio-editing
- speech-editing
- flow-matching
- diffusion
- mm-dit
- llm-fusion
library_name: custom
pipeline_tag: text-to-audio
arxiv: 2605.31530
---

# UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

---

UNISON is a unified latent flow-matching framework for audio and speech generation and editing. Using a **single set of weights**, it integrates text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound scene generation, and audio/speech-in-scene editing: all in one model, one architecture, one forward pass.

---

## Model variants in this repository

This repository hosts **two checkpoint variants**:

| Directory | VAE | DiT depth | Channels | Config |
|-----------|-----|-----------|----------|--------|
| `unison_D20S0_O_40ch/` | MMAudio **44 kHz** | 20 double + 0 single | 40 | `D20S0_O_40ch.yaml` |
| `unison_D24S0_O_20ch/` | MMAudio **16 kHz** | 24 double + 0 single | 20 | `D24S0_O_20ch.yaml` |

Both variants share the same Qwen2.5-Omni-7B text encoder and the same inference pipeline.

---

## Supported tasks

| Task | Prompt format |
|------|--------------|
| Text-to-Audio (T2A) | `[Audio] {caption}` |
| Text-to-Speech (TTS) | `[Speech] A {female/male} voice saying "{text}"` |
| Mixed Speech + Sound | `[Speech] A {gender} voice saying "{text}" [Audio] {background}` |
| Zero-shot Speaker Cloning | `[Speech with voice] {ref_text}, {target_text}` |
| Audio Scene Editing (add / remove / replace / denoise) | `[Edit] [Audio] {instruction}` |
| Speech-in-Scene Editing (content / insert / delete) | `[Edit] [Speech] {instruction}` |
| Timed Temporal Composition | `[Audio] From {t1}s to {t2}s, {event1}. From {t2}s to {t3}s, {event2}. ...` |

Task identity is encoded via a **mask channel**; source/reference audio is injected through **VAE-encoded channel concatenation**, so no separate encoders or task-specific heads are needed.

---

## Architecture

All tasks share the same VAE encoder/decoder, MM-DiT backbone, and forward pass. Text conditioning uses **layer-wise deep LLM fusion**: hidden states from uniformly sampled layers of the frozen Qwen2.5-Omni-7B backbone are injected into corresponding MM-DiT double-stream blocks via learned linear projections.
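
The text above does not spell out how the LLM layers are "uniformly sampled". As a minimal sketch, assuming evenly spaced layer indices running from the first to the last LLM layer, the pairing of LLM layers with MM-DiT double-stream blocks could look like the following. `uniform_layer_indices` is a hypothetical helper written for illustration, not a function from the UNISON codebase, and the actual selection rule may differ.

```python
def uniform_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> list[int]:
    """Return one LLM layer index (0-based) per DiT double-stream block.

    Assumption: indices are spread evenly over the LLM depth, starting at
    layer 0 and ending at the last layer. This is an illustrative rule only.
    """
    if num_llm_layers < 1 or num_dit_blocks < 1:
        raise ValueError("both counts must be positive")
    if num_dit_blocks == 1:
        # With a single block, condition on the deepest layer.
        return [num_llm_layers - 1]
    last = num_llm_layers - 1
    # Integer arithmetic keeps the mapping deterministic and monotonic.
    return [(i * last) // (num_dit_blocks - 1) for i in range(num_dit_blocks)]


# Example: 20 double-stream blocks (the D20S0 variant) paired with an LLM
# whose depth is ASSUMED to be 28 layers; read the real value from the
# model config rather than hard-coding it.
mapping = uniform_layer_indices(num_llm_layers=28, num_dit_blocks=20)
for block_id, llm_layer in enumerate(mapping):
    print(f"DiT block {block_id:2d} <- LLM layer {llm_layer:2d}")
```

In a full model, each selected hidden state would then pass through its own learned linear projection before entering the matching block, as described above; that part is omitted here because it depends on the hidden sizes of both networks.
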

---

## Quick start

### 1. Clone repo and install dependencies

```bash
git clone https://github.com/lizhaoqing/UNISON
cd UNISON
pip install -r requirements.txt
```

`flash-attn` is optional but strongly recommended (automatic fallback to PyTorch SDPA):

```bash
pip install flash-attn --no-build-isolation
```

### 2. MMAudio VAE weights

Download from the [MMAudio release](https://github.com/hkchengrex/MMAudio) and place at:

```
unison/models/mmaudio/data/ext_weights/
  v1-44.pth        # 44 kHz VAE (for D20S0 / 44k variant)
  v1-16.pth        # 16 kHz VAE (for D24S0 / 16k variant)
  best_netG.pt     # BigVGAN vocoder (16 kHz VAE only)
```

### 3. Qwen2.5-Omni-7B

```bash
export QWEN_OMNI_MODEL_PATH=Qwen/Qwen2.5-Omni-7B
# or point to a local download:
export QWEN_OMNI_MODEL_PATH=/path/to/Qwen2.5-Omni-7B
```

### 4. Download checkpoints (this repo)

```bash
hf download jac22/UNISON --local-dir checkpoints
```

This produces:

```
checkpoints/
  unison_D20S0_O_40ch/model.safetensors   # 44 kHz
  unison_D24S0_O_20ch/model.safetensors   # 16 kHz
```

### 5. Run inference

```bash
cd UNISON

# 44 kHz variant (D20S0)
bash scripts/infer.sh \
  --checkpoint_dir checkpoints/unison_D20S0_O_40ch \
  --model_config unison/config/D20S0_O_40ch.yaml \
  --vae_config unison/models/mmaudio/vae_config_44k.yaml \
  --task_mode all

# 16 kHz variant (D24S0)
bash scripts/infer.sh \
  --checkpoint_dir checkpoints/unison_D24S0_O_20ch \
  --model_config unison/config/D24S0_O_20ch.yaml \
  --vae_config unison/models/mmaudio/vae_config_16k.yaml \
  --task_mode all
```

Outputs are written to `