Zeyue7 committed on
Commit 9a89c98 · 1 Parent(s): 45f5634
Files changed (2)
  1. README.md +160 -10
  2. model.ckpt +0 -3
README.md CHANGED
@@ -6,32 +6,182 @@ tags:
  - speech
  - diffusion
  - multimodal
  ---

- # Audio-Omni

  **Unified Audio Understanding, Generation, and Editing** (SIGGRAPH 2026)

- ## License

- The model weights are released under **CC-BY-NC-4.0** (non-commercial use only).

- **Commercial use requires explicit written authorization.** Contact: ztianad@connect.ust.hk

- ## Links

- - [GitHub Repository](https://github.com/Audio-Omni/Audio-Omni)
- - [Project Page](https://zeyuet.github.io/Audio-Omni/)

- ## Files

  - `Audio-Omni.json` — Model configuration
  - `model.ckpt` — Model checkpoint (~21 GB)

- ## Quick Start

  ```bash
  huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
  ```

- See the [GitHub repository](https://github.com/Audio-Omni/Audio-Omni) for full usage instructions.
  - speech
  - diffusion
  - multimodal
+ - audio-generation
+ - audio-editing
+ - video-to-audio
+ - text-to-speech
  ---

+ # 🎛️ Audio-Omni

  **Unified Audio Understanding, Generation, and Editing** (SIGGRAPH 2026)

+ [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/ZeyueT/Audio-Omni)
+ [![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://zeyuet.github.io/Audio-Omni/)
+ [![arXiv](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/XXXX.XXXXX)

+ ## 📖 Overview

+ Audio-Omni is the first end-to-end framework that unifies **understanding**, **generation**, and **editing** across general sound, music, and speech domains. It combines a frozen Multimodal Large Language Model (Qwen2.5-Omni) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis.
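The model card does not spell out how the two components are wired together, so the sketch below is only a rough illustration of the frozen-MLLM-plus-trainable-DiT pattern described above. Every name in it (`UnifiedAudioSketch`, `encode`, `denoise`, the `mllm`/`dit` modules) is a hypothetical placeholder, not part of the released Audio-Omni package.

```python
# Illustrative sketch only -- NOT the released Audio-Omni code or API.
# A frozen multimodal LLM produces conditioning features from the prompt;
# only the diffusion transformer that denoises audio latents is trained.
import torch
import torch.nn as nn

class UnifiedAudioSketch(nn.Module):
    def __init__(self, mllm: nn.Module, dit: nn.Module):
        super().__init__()
        self.mllm = mllm   # hypothetical frozen Qwen2.5-Omni-style encoder
        self.dit = dit     # hypothetical trainable Diffusion Transformer
        for p in self.mllm.parameters():
            p.requires_grad_(False)   # understanding backbone stays frozen

    @torch.no_grad()
    def encode(self, text=None, audio=None, video=None):
        # High-level reasoning: fuse the prompt modalities into conditioning tokens.
        return self.mllm(text=text, audio=audio, video=video)

    def denoise(self, latents, timestep, cond):
        # High-fidelity synthesis: one reverse-diffusion step on audio latents.
        return self.dit(latents, timestep, cond)
```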

+ ## 🎯 Capabilities

+ - **Understanding**: Audio/video captioning, question answering
+ - **Generation**: Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, Text-to-Speech, Voice Conversion
+ - **Editing**: Add, Remove, Extract, Style Transfer

+ ## 📦 Model Files

  - `Audio-Omni.json` — Model configuration
  - `model.ckpt` — Model checkpoint (~21 GB)
+ - `synchformer_state_dict.pth` — Synchformer checkpoint for video conditioning
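After downloading (see Quick Start below), a quick sanity check of these files can save debugging time. The snippet is a minimal sketch that only assumes the file names listed above; the schema of `Audio-Omni.json` is not documented on this card, so it just lists the top-level keys.

```python
# Minimal sanity check: verify the downloaded files and peek at the config.
import json
from pathlib import Path

model_dir = Path("model")
for name in ["Audio-Omni.json", "model.ckpt", "synchformer_state_dict.pth"]:
    status = "found" if (model_dir / name).exists() else "MISSING"
    print(f"{name}: {status}")

with open(model_dir / "Audio-Omni.json") as f:
    config = json.load(f)
print("config keys:", sorted(config))   # schema not documented here
```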

+ ## 🚀 Quick Start
+
+ ### Installation

  ```bash
+ # Clone the GitHub repository
+ git clone https://github.com/ZeyueT/Audio-Omni.git
+ cd Audio-Omni
+
+ # Install dependencies
+ pip install -e .
+ conda install -c conda-forge ffmpeg libsndfile
+
+ # Download model from Hugging Face
  huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
  ```
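If you prefer downloading from Python rather than the CLI, the same repository can be fetched with `huggingface_hub`; this is simply the Python equivalent of the `huggingface-cli` command above.

```python
# Python equivalent of the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="HKUSTAudio/Audio-Omni", local_dir="model/")
```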

+ ### Python API
+
+ ```python
+ from audio_omni import AudioOmni
+ import torchaudio
+
+ # Load model
+ model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")
+ ```
+
+ ### 1️⃣ Understanding
+
+ ```python
+ # Audio understanding
+ response = model.understand(
+     "Describe the sounds in this audio.",
+     audio="example.wav"
+ )
+
+ # Video understanding
+ response = model.understand(
+     "What is happening in this video?",
+     video="example.mp4"
+ )
+
+ # Audio + Video understanding
+ response = model.understand(
+     "Does the audio match the video?",
+     audio="example.wav",
+     video="example.mp4"
+ )
+ ```
+
+ ### 2️⃣ Generation
+
+ ```python
+ # Text-to-Audio
+ audio = model.generate("T2A", prompt="A clock ticking.")
+ torchaudio.save("output.wav", audio, model.sample_rate)
+
+ # Text-to-Music
+ audio = model.generate(
+     "T2M",
+     prompt="Compose a bright jazz swing instrumental with walking bass."
+ )
+ torchaudio.save("music.wav", audio, model.sample_rate)
+
+ # Video-to-Audio
+ audio = model.generate("V2A", video_path="example.mp4")
+ torchaudio.save("v2a_output.wav", audio, model.sample_rate)
+
+ # Text-to-Speech
+ audio = model.generate("TTS", prompt="Hello, welcome to Audio-Omni.")
+ torchaudio.save("tts_output.wav", audio, model.sample_rate)
+
+ # Text-to-Speech with voice cloning
+ audio = model.generate(
+     "TTS",
+     prompt="Hello, welcome to Audio-Omni.",
+     voice_prompt_path="ref_voice.wav",
+     voice_ref_text="This is the reference transcript."
+ )
+ torchaudio.save("tts_cloned.wav", audio, model.sample_rate)
+ ```
+
+ ### 3️⃣ Editing
+
+ ```python
+ # Add a sound
+ audio = model.edit("Add", "input.wav", desc="skateboarding")
+ torchaudio.save("output_add.wav", audio, model.sample_rate)
+
+ # Remove a sound
+ audio = model.edit("Remove", "input.wav", desc="female singing")
+ torchaudio.save("output_remove.wav", audio, model.sample_rate)
+
+ # Extract a sound
+ audio = model.edit("Extract", "input.wav", desc="wood thrush calling")
+ torchaudio.save("output_extract.wav", audio, model.sample_rate)
+
+ # Style transfer
+ audio = model.edit(
+     "Style Transfer",
+     "input.wav",
+     source_category="playing electric guitar",
+     target_category="playing saxophone"
+ )
+ torchaudio.save("output_transfer.wav", audio, model.sample_rate)
+ ```
+
+ ## 🖥️ Gradio Demo
+
+ ```bash
+ # Launch interactive demo
+ python run_gradio.py \
+     --model-config model/Audio-Omni.json \
+     --ckpt-path model/model.ckpt \
+     --server-port 7777
+ ```
+
+ Visit `http://localhost:7777` to access the web interface.
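Once the server is running, the demo can also be queried programmatically. The sketch below only connects and lists the exposed endpoints via the `gradio_client` package; the demo's endpoint names are not documented on this card, so they are left for `view_api()` to report.

```python
# Connect to the running demo and list its callable endpoints.
from gradio_client import Client

client = Client("http://localhost:7777")
client.view_api()   # prints endpoint names and expected parameters
```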
+
+ ## 📚 Documentation
+
+ For detailed documentation, training instructions, and more examples, visit the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).
+
+ ## 📝 Citation
+
+ ```bibtex
+ @inproceedings{tian2026audioomni,
+     title={Audio-Omni: Unified Audio Understanding, Generation, and Editing},
+     author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
+     booktitle={ACM Transactions on Graphics (SIGGRAPH 2026)},
+     year={2026}
+ }
+ ```
+
+ ## 📄 License
+
+ **Code**: Apache-2.0 License
+ **Model Weights**: CC-BY-NC-4.0 (Non-commercial use only)
+
+ Commercial use of the model weights requires explicit written authorization from the authors.
+ For commercial licensing inquiries, contact: ztianad@connect.ust.hk
+
+ ## 📭 Contact
+
+ - **Zeyue Tian**: ztianad@connect.ust.hk
+
+ ---
+
+ **For full installation guide, API reference, and advanced usage, see the [GitHub repository](https://github.com/ZeyueT/Audio-Omni).**
model.ckpt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:8ddfdd6c965b7b1fc5c5af01e306cdfd78e44f56fd584789b8987233438e65b3
- size 22220882720