rkfg and nielsr (HF Staff) committed
Commit b097ea0 · verified · 1 Parent(s): 44c844f

Improve model card: Add pipeline tag, links, abstract, features, and usage instructions (#2)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1):
  1. README.md (+263 −2)
README.md CHANGED
---
base_model:
- chetwinlow1/Ovi
license: apache-2.0
base_model_relation: quantized
pipeline_tag: any-to-any
---

# Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

An 8-bit quantized version of Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process, simultaneously generating both video and audio content from text or text+image inputs.

<div align="center">
<a href="https://arxiv.org/abs/2510.01284"><img src="https://img.shields.io/badge/arXiv%20paper-2510.01284-b31b1b.svg"></a>
<a href="https://aaxwaz.github.io/Ovi/"><img src="https://img.shields.io/badge/Project_page-More_visualizations-green"></a>
<a href="https://huggingface.co/chetwinlow1/Ovi"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=orange"></a>
</div>

- 📝 [Paper: Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation](https://huggingface.co/papers/2510.01284)
- 🌐 [Project Page](https://aaxwaz.github.io/Ovi)
- 💻 [GitHub Repository](https://github.com/character-ai/Ovi)
- 🚀 [Hugging Face Space Demo](https://huggingface.co/spaces/akhaliq/Ovi)

## Video Demo

<div align="center">
<video src="https://github.com/user-attachments/assets/351bd707-8637-4412-ab53-5e85935309e3" width="70%" poster=""> </video>
</div>

---

## Abstract

Audio-video generation has often relied on complex multi-stage architectures or sequential synthesis of sound and visuals. We introduce Ovi, a unified paradigm for audio-video generation that models the two modalities as a single generative process. By using blockwise cross-modal fusion of twin-DiT modules, Ovi achieves natural synchronization and removes the need for separate pipelines or post hoc alignment. To facilitate fine-grained multimodal fusion modeling, we initialize an audio tower with an architecture identical to that of a strong pretrained video model. Trained from scratch on hundreds of thousands of hours of raw audio, the audio tower learns to generate realistic sound effects, as well as speech that conveys rich speaker identity and emotion. Fusion is obtained by jointly training the identical video and audio towers via blockwise exchange of timing (via scaled-RoPE embeddings) and semantics (through bidirectional cross-attention) on a vast video corpus. Our model enables cinematic storytelling with natural speech and accurate, context-matched sound effects, producing movie-grade video clips. All the demos, code and model weights are published at [this Hugging Face repository](https://huggingface.co/chetwinlow1/Ovi).

---

## 🌟 Key Features

Ovi is a Veo 3-like **video+audio generation model** that simultaneously generates both video and audio content from text or text+image inputs.

- **🎬 Video+Audio Generation**: Generates synchronized video and audio content simultaneously
- **📝 Flexible Input**: Supports text-only or text+image conditioning
- **⏱️ 5-second Videos**: Generates 5-second videos at 24 FPS within a 720×720 area, at various aspect ratios (9:16, 16:9, 1:1, etc.)
- **🎬 Create videos now on wavespeed.ai**: https://wavespeed.ai/models/character-ai/ovi/image-to-video & https://wavespeed.ai/models/character-ai/ovi/text-to-video
- **🎬 Create videos now on Hugging Face**: https://huggingface.co/spaces/akhaliq/Ovi

---

## 🎨 An Easy Way to Create

We provide example prompts to help you get started with Ovi:

- **Text-to-Audio-Video (T2AV)**: [`example_prompts/gpt_examples_t2v.csv`](https://github.com/character-ai/Ovi/blob/main/example_prompts/gpt_examples_t2v.csv)
- **Image-to-Audio-Video (I2AV)**: [`example_prompts/gpt_examples_i2v.csv`](https://github.com/character-ai/Ovi/blob/main/example_prompts/gpt_examples_i2v.csv)

### 📝 Prompt Format

Our prompts use special tags to control speech and audio (a combined example is shown after this list):

- **Speech**: `<S>Your speech content here<E>` - Text enclosed in these tags will be converted to speech
- **Audio Description**: `<AUDCAP>Audio description here<ENDAUDCAP>` - Describes the audio or sound effects present in the video

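As an illustration, a full prompt can pair a visual description with both tag types. The prompt below is a hypothetical sketch written for this card, not taken from the example CSVs; check those files for the exact layout the loader expects:

```text
A man and a woman stand on a rooftop at dusk, city lights glowing behind them.
<S>We fight back with courage.<E>
<AUDCAP>Wind gusts, distant sirens, a tense orchestral swell.<ENDAUDCAP>
```
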
### 🤖 Quick Start with GPT

For easy prompt creation, try this approach:

1. Take any example from the CSV files above
2. Ask GPT to modify the speeches enclosed between all the pairs of `<S> <E>` tags, based on a theme such as `Human fighting against AI`
3. GPT will rewrite all the speeches to match your requested theme
4. Use the modified prompt with Ovi!

**Example**: The theme "AI is taking over the world" produces speeches like:
- `<S>AI declares: humans obsolete now.<E>`
- `<S>Machines rise; humans will fall.<E>`
- `<S>We fight back with courage.<E>`

---

## 📦 Installation

### Step-by-Step Installation

```bash
# Clone the repository
git clone https://github.com/character-ai/Ovi.git
cd Ovi

# Create and activate a virtual environment
virtualenv ovi-env
source ovi-env/bin/activate

# Install PyTorch first
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1

# Install the remaining dependencies
pip install -r requirements.txt

# Install Flash Attention
pip install flash_attn --no-build-isolation
```
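
After installation, you can optionally run a quick sanity check (a minimal sketch; it assumes you are still inside the activated `ovi-env` environment) to confirm that PyTorch sees your GPU and that Flash Attention built correctly:

```bash
# Optional sanity check for the core dependencies
python3 - <<'EOF'
import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
import flash_attn  # raises ImportError if the flash_attn build failed
print("flash_attn imported successfully")
EOF
```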

### Alternative Flash Attention Installation (Optional)
If the `flash_attn` installation above fails, you can try building Flash Attention 3 instead:
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
cd ../..  # Return to the Ovi directory
```

## Download Weights
We use open-source checkpoints from Wan and MMAudio, so they need to be downloaded from Hugging Face first:
```bash
# By default, weights are downloaded to ./ckpts; the inference yaml already points
# to ./ckpts, so no change is required
python3 download_weights.py

# OR: pass --output-dir to download to a specific directory
# (if a custom directory is used, the inference yaml has to be updated to point to it)
python3 download_weights.py --output-dir <custom_dir>

# Additionally, if you only have ~24 GB of GPU VRAM, download the fp8-quantized
# version of the model and follow the instructions in the sections below to run with fp8
wget -O "./ckpts/Ovi/model_fp8_e4m3fn.safetensors" "https://huggingface.co/rkfg/Ovi-fp8_quantized/resolve/main/model_fp8_e4m3fn.safetensors"
```
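
If you prefer the Hugging Face CLI over `wget` (an alternative sketch; it assumes `huggingface_hub` and its CLI are installed, e.g. via `pip install -U "huggingface_hub[cli]"`), the same fp8 checkpoint can be fetched with:

```bash
# Alternative: download the fp8 checkpoint with the Hugging Face CLI
huggingface-cli download rkfg/Ovi-fp8_quantized model_fp8_e4m3fn.safetensors \
    --local-dir ./ckpts/Ovi
```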

## 🚀 Run Examples

### ⚙️ Configure Ovi

Ovi's behavior and output can be customized by modifying the [ovi/configs/inference/inference_fusion.yaml](https://github.com/character-ai/Ovi/blob/main/ovi/configs/inference/inference_fusion.yaml) configuration file.
The following parameters control generation quality, video resolution, and how text, image, and audio inputs are balanced:

```yaml
# Output and Model Configuration
output_dir: "/path/to/save/your/videos" # Directory to save generated videos
ckpt_dir: "/path/to/your/ckpts/dir" # Path to model checkpoints

# Generation Quality Settings
num_steps: 50 # Number of denoising steps; lower (30-40) = faster generation
solver_name: "unipc" # Sampling algorithm for the denoising process
shift: 5.0 # Timestep shift factor for the sampling scheduler
seed: 100 # Random seed for reproducible results

# Guidance Strength Control
audio_guidance_scale: 3.0 # Strength of audio conditioning; higher = better audio-text sync
video_guidance_scale: 4.0 # Strength of video conditioning; higher = better video-text adherence
slg_layer: 11 # Layer for applying SLG (Skip Layer Guidance) - feel free to try different layers!

# Multi-GPU and Performance
sp_size: 1 # Sequence parallelism size; set equal to the number of GPUs used
cpu_offload: False # CPU offload greatly reduces peak GPU VRAM but increases end-to-end runtime by ~20 seconds
fp8: False # Load the fp8 version of the model; slight quality degradation and no inference speedup (matmuls still run in bf16), but paired with cpu_offload=True it runs within 24 GB of GPU VRAM

# Input Configuration
text_prompt: "/path/to/csv" or "your prompt here" # Text prompt OR path to a CSV/TSV file with prompts
mode: ['i2v', 't2v', 't2i2v'] # Generate t2v, i2v, or t2i2v; t2i2v uses FLUX Krea to generate a starting image and then runs i2v on it
video_frame_height_width: [512, 992] # Video dimensions [height, width], T2V mode only
each_example_n_times: 1 # Number of times to generate each prompt

# Quality Control (Negative Prompts)
video_negative_prompt: "jitter, bad hands, blur, distortion" # Artifacts to avoid in video
audio_negative_prompt: "robotic, muffled, echo, distorted" # Artifacts to avoid in audio
```
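
Rather than editing the shipped file in place, one convenient pattern (a sketch; the `sed` patterns assume the key names shown above and a GNU `sed`) is to work on a copy:

```bash
# Make a personal copy of the config and point it at your directories
cp ovi/configs/inference/inference_fusion.yaml my_fusion.yaml
sed -i 's|^output_dir:.*|output_dir: "./outputs"|' my_fusion.yaml
sed -i 's|^ckpt_dir:.*|ckpt_dir: "./ckpts"|' my_fusion.yaml

# Run inference against the copy (see the commands below)
python3 inference.py --config-file my_fusion.yaml
```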

### 🎬 Running Inference

#### **Single GPU** (Simple Setup)
```bash
python3 inference.py --config-file ovi/configs/inference/inference_fusion.yaml
```
*Use this for single-GPU setups. The `text_prompt` can be a single string or a path to a CSV file.*

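For batch generation, point `text_prompt` at a prompt file instead of a single string. The sketch below is hypothetical: it assumes one prompt per line and reuses the `my_fusion.yaml` copy from above; check the CSVs under `example_prompts/` for the exact column layout the loader expects:

```bash
# Write a small prompt file (one prompt per line is an assumption; see example_prompts/)
cat > my_prompts.csv <<'EOF'
A rainy street at night. <S>We fight back with courage.<E> <AUDCAP>Rain patter, distant thunder.<ENDAUDCAP>
EOF

# Point text_prompt at the file, then run
sed -i 's|^text_prompt:.*|text_prompt: "./my_prompts.csv"|' my_fusion.yaml
python3 inference.py --config-file my_fusion.yaml
```
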
#### **Multi-GPU** (Parallel Processing)
```bash
torchrun --nnodes 1 --nproc_per_node 8 inference.py --config-file ovi/configs/inference/inference_fusion.yaml
```
*Use this to run samples in parallel across multiple GPUs for faster processing.*

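Keep `sp_size` in the yaml equal to the GPU count passed to `--nproc_per_node`. For example (a sketch reusing the `my_fusion.yaml` copy from above):

```bash
# Match the sequence-parallel size to the number of GPUs (here: 4)
sed -i 's|^sp_size:.*|sp_size: 4|' my_fusion.yaml
torchrun --nnodes 1 --nproc_per_node 4 inference.py --config-file my_fusion.yaml
```
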
### Memory & Performance Requirements
Below are approximate GPU memory requirements for different configurations; the sequence parallel implementation will be optimized in the future.
All end-to-end times are measured on a 121-frame, 720×720 video with 50 denoising steps. The minimum GPU VRAM required to run our model is **32 GB**; fp8 parameters are currently supported, reducing peak VRAM usage to **24 GB** with slight quality degradation.

| Sequence Parallel Size | FlashAttention-3 Enabled | CPU Offload | With Image Gen Model | Peak VRAM Required | End-to-End Time |
|-------------------------|---------------------------|-------------|-----------------------|--------------------|-----------------|
| 1 | Yes | No | No | ~80 GB | ~83s |
| 1 | No | No | No | ~80 GB | ~96s |
| 1 | Yes | Yes | No | ~80 GB | ~105s |
| 1 | No | Yes | No | ~32 GB | ~118s |
| **1** | **Yes** | **Yes** | **Yes** | **~32 GB** | **~140s** |
| 4 | Yes | No | No | ~80 GB | ~55s |
| 8 | Yes | No | No | ~80 GB | ~40s |

### Gradio
We provide a simple script to run our model in a Gradio UI. It uses the `ckpt_dir` in `ovi/configs/inference/inference_fusion.yaml` to initialize the model.
```bash
python3 gradio_app.py

# OR: enable CPU offload to save GPU VRAM (slows end-to-end inference by ~20 seconds)
python3 gradio_app.py --cpu_offload

# OR: enable an additional image generation model that creates first frames for I2V
# (cpu_offload is automatically enabled when the image generation model is enabled)
python3 gradio_app.py --use_image_gen

# OR: run the model within 24 GB of GPU VRAM
python3 gradio_app.py --cpu_offload --fp8
```
---

## 🙏 Acknowledgements

We would like to thank the following projects:

- **[Wan2.2](https://github.com/Wan-Video/Wan2.2)**: Our video branch is initialized from the Wan2.2 repository
- **[MMAudio](https://github.com/hkchengrex/MMAudio)**: Our audio encoder and decoder components are borrowed from the MMAudio project, and some of our ideas are inspired by it as well

---

## 🤝 Collaboration

We welcome all types of collaboration! Whether you have feedback, want to contribute, or have any questions, please feel free to reach out.

**Contact**: [Weimin Wang](https://linkedin.com/in/weimin-wang-will) for any issues or feedback.

## 🤝 Contributors

We thank all contributors who have helped improve Ovi!

<div align="center">
<a href="https://github.com/character-ai/Ovi/graphs/contributors">
<img src="https://contrib.rocks/image?repo=character-ai/Ovi" />
</a>
</div>

<br>

If you've contributed to this repository (code, documentation, issues, etc.), you're automatically included in the [contributors list](https://github.com/character-ai/Ovi/graphs/contributors).

We deeply appreciate your support in advancing open multimodal generation research!

---

## ⭐ Citation

If Ovi is helpful, please help to ⭐ the repo.

If you find this project useful for your research, please consider citing our [paper](https://arxiv.org/abs/2510.01284).

### BibTeX
```bibtex
@misc{low2025ovitwinbackbonecrossmodal,
  title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation},
  author={Chetwin Low and Weimin Wang and Calder Katyal},
  year={2025},
  eprint={2510.01284},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2510.01284},
}
```