- speech
- diffusion
- multimodal
- audio-generation
- audio-editing
- video-to-audio
- text-to-speech
---

# 🎙️ Audio-Omni

**Unified Audio Understanding, Generation, and Editing** (SIGGRAPH 2026)

[Code](https://github.com/ZeyueT/Audio-Omni) · [Project Page](https://zeyuet.github.io/Audio-Omni/) · [Paper](https://arxiv.org/abs/XXXX.XXXXX)

## Overview

Audio-Omni is the first end-to-end framework that unifies **understanding**, **generation**, and **editing** across general sound, music, and speech domains. It combines a frozen Multimodal Large Language Model (Qwen2.5-Omni) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis.

## 🎯 Capabilities

- **Understanding**: Audio/video captioning, question answering
- **Generation**: Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, Text-to-Speech, Voice Conversion
- **Editing**: Add, Remove, Extract, Style Transfer

## 📦 Model Files

- `Audio-Omni.json` – Model configuration
- `model.ckpt` – Model checkpoint (~21 GB)
- `synchformer_state_dict.pth` – Synchformer checkpoint for video conditioning
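
The checkpoint's Git LFS pointer records a size of 22,220,882,720 bytes; a quick conversion shows why this is quoted as roughly 21 GB:

```python
# Size of model.ckpt as recorded in its Git LFS pointer (bytes).
CKPT_BYTES = 22_220_882_720

# Decimal gigabytes (1 GB = 10**9 bytes) vs. binary gibibytes (1 GiB = 2**30 bytes).
size_gb = CKPT_BYTES / 10**9
size_gib = CKPT_BYTES / 2**30

print(f"{size_gb:.1f} GB / {size_gib:.1f} GiB")  # ~22.2 GB decimal, ~20.7 GiB binary
```

Make sure the target disk has at least this much free space before downloading.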

## Quick Start

### Installation

```bash
# Clone the GitHub repository
git clone https://github.com/ZeyueT/Audio-Omni.git
cd Audio-Omni

# Install dependencies
pip install -e .
conda install -c conda-forge ffmpeg libsndfile

# Download the model from Hugging Face
huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
```
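
Large downloads occasionally stop partway, so it can help to confirm all files landed before loading. A minimal sketch; the file names come from the Model Files list above, but the helper itself is illustrative and not part of the released API:

```python
from pathlib import Path

# Files listed in the "Model Files" section above.
REQUIRED = ["Audio-Omni.json", "model.ckpt", "synchformer_state_dict.pth"]

def missing_files(model_dir):
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]

missing = missing_files("model/")
if missing:
    print("Incomplete download, missing:", ", ".join(missing))
```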

### Python API

```python
from audio_omni import AudioOmni
import torchaudio

# Load the model
model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")
```

### 1️⃣ Understanding

```python
# Audio understanding
response = model.understand(
    "Describe the sounds in this audio.",
    audio="example.wav"
)

# Video understanding
response = model.understand(
    "What is happening in this video?",
    video="example.mp4"
)

# Audio + video understanding
response = model.understand(
    "Does the audio match the video?",
    audio="example.wav",
    video="example.mp4"
)
```
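
The `understand` call above composes naturally into a loop for captioning a whole folder of clips. A sketch; only `model.understand` comes from the API shown above, the helper itself is hypothetical:

```python
from pathlib import Path

def caption_directory(model, audio_dir, pattern="*.wav"):
    """Map each matching audio file name to the model's description of it."""
    captions = {}
    for path in sorted(Path(audio_dir).glob(pattern)):
        captions[path.name] = model.understand(
            "Describe the sounds in this audio.",
            audio=str(path),
        )
    return captions
```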

### 2️⃣ Generation

```python
# Text-to-Audio
audio = model.generate("T2A", prompt="A clock ticking.")
torchaudio.save("output.wav", audio, model.sample_rate)

# Text-to-Music
audio = model.generate(
    "T2M",
    prompt="Compose a bright jazz swing instrumental with walking bass."
)
torchaudio.save("music.wav", audio, model.sample_rate)

# Video-to-Audio
audio = model.generate("V2A", video_path="example.mp4")
torchaudio.save("v2a_output.wav", audio, model.sample_rate)

# Text-to-Speech
audio = model.generate("TTS", prompt="Hello, welcome to Audio-Omni.")
torchaudio.save("tts_output.wav", audio, model.sample_rate)

# Text-to-Speech with voice cloning
audio = model.generate(
    "TTS",
    prompt="Hello, welcome to Audio-Omni.",
    voice_prompt_path="ref_voice.wav",
    voice_ref_text="This is the reference transcript."
)
torchaudio.save("tts_cloned.wav", audio, model.sample_rate)
```
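
Generated waveforms can clip when written to fixed-point files, so a peak-normalization pass before `torchaudio.save` is a common safeguard. This helper is a standard DSP idiom written over a plain list of samples for illustration, not part of the Audio-Omni API; a tensor version would use `audio.abs().max()` instead:

```python
def peak_normalize(samples, peak=0.95):
    """Scale a waveform so its largest absolute sample equals `peak`."""
    max_abs = max(abs(s) for s in samples)
    if max_abs == 0:
        return list(samples)  # silence: nothing to scale
    gain = peak / max_abs
    return [s * gain for s in samples]
```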

### 3️⃣ Editing

```python
# Add a sound
audio = model.edit("Add", "input.wav", desc="skateboarding")
torchaudio.save("output_add.wav", audio, model.sample_rate)

# Remove a sound
audio = model.edit("Remove", "input.wav", desc="female singing")
torchaudio.save("output_remove.wav", audio, model.sample_rate)

# Extract a sound
audio = model.edit("Extract", "input.wav", desc="wood thrush calling")
torchaudio.save("output_extract.wav", audio, model.sample_rate)

# Style transfer
audio = model.edit(
    "Style Transfer",
    "input.wav",
    source_category="playing electric guitar",
    target_category="playing saxophone"
)
torchaudio.save("output_transfer.wav", audio, model.sample_rate)
```

## 🖥️ Gradio Demo

```bash
# Launch the interactive demo
python run_gradio.py \
    --model-config model/Audio-Omni.json \
    --ckpt-path model/model.ckpt \
    --server-port 7777
```

Visit `http://localhost:7777` to access the web interface.

## Documentation

For detailed documentation, training instructions, and more examples, visit the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).

## Citation

```bibtex
@inproceedings{tian2026audioomni,
  title={Audio-Omni: Unified Audio Understanding, Generation, and Editing},
  author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
  booktitle={ACM Transactions on Graphics (SIGGRAPH 2026)},
  year={2026}
}
```

## License

- **Code**: Apache-2.0 License
- **Model Weights**: CC-BY-NC-4.0 (non-commercial use only)

Commercial use of the model weights requires explicit written authorization from the authors. For commercial licensing inquiries, contact: ztianad@connect.ust.hk

## Contact

- **Zeyue Tian**: ztianad@connect.ust.hk

---

**For full installation guide, API reference, and advanced usage, see the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).**