chetwinlow1 committed · verified
Commit e6735f8 · Parent(s): a5a8d05

Update README.md

Files changed (1): README.md (+113 −55)

README.md CHANGED
@@ -12,53 +12,131 @@ base_model:
  <h1> Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation </h1>

  <a href="https://arxiv.org/abs/2510.01284"><img src="https://img.shields.io/badge/arXiv%20paper-2510.01284-b31b1b.svg"></a>
- <a href="https://github.com/character-ai/Ovi"><img src="https://img.shields.io/badge/Code-GitHub-181717.svg?logo=github"></a>
  <a href="https://aaxwaz.github.io/Ovi/"><img src="https://img.shields.io/badge/Project_page-More_visualizations-green"></a>
  <a href="https://huggingface.co/chetwinlow1/Ovi"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=orange"></a>
- <a href="https://huggingface.co/spaces/akhaliq/Ovi"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"></a>

  [Chetwin Low](https://www.linkedin.com/in/chetwin-low-061975193/)<sup> * 1 </sup>, [Weimin Wang](https://www.linkedin.com/in/weimin-wang-will/)<sup> * &dagger; 1 </sup>, [Calder Katyal](https://www.linkedin.com/in/calder-katyal-a8a9b3225/)<sup> 2 </sup><br>
  <sup> * </sup>Equal contribution, <sup> &dagger; </sup>Project Lead<br>
  <sup> 1 </sup>Character AI, <sup> 2 </sup>Yale University
-
  </div>

- ## Video Demo

  <div align="center">
- <video width="70%" controls>
- <source src="https://huggingface.co/chetwinlow1/Ovi/resolve/main/assets/ovi_trailer.mp4" type="video/mp4">
- Your browser does not support the video tag.
- </video>
  </div>

- ---

- ## 🌟 Key Features

- Ovi is a veo-3 like, **video+audio generation model** that simultaneously generates both video and audio content from text or text+image inputs.

  - **🎬 Video+Audio Generation**: Generate synchronized video and audio content simultaneously
  - **📝 Flexible Input**: Supports text-only or text+image conditioning
- - **⏱️ 5-second Videos**: Generates 5-second videos at 24 FPS, area of 720×720, at various aspect ratios (9:16, 16:9, 1:1, etc)
  - **🎬 Create videos now on wavespeed.ai**: https://wavespeed.ai/models/character-ai/ovi/image-to-video & https://wavespeed.ai/models/character-ai/ovi/text-to-video
  - **🎬 Create videos now on HuggingFace**: https://huggingface.co/spaces/akhaliq/Ovi

  ---
  ## 📋 Todo List

- - [x] Release research paper and [microsite for demos](https://aaxwaz.github.io/Ovi)
  - [x] Checkpoint of 11B model
  - [x] Inference Codes
  - [x] Text or Text+Image as input
  - [x] Gradio application code
  - [x] Multi-GPU inference with or without the support of sequence parallel
  - [x] fp8 weights and improved memory efficiency (credits to [@rkfg](https://github.com/rkfg))
  - [ ] Improve efficiency of Sequence Parallel implementation
  - [ ] Implement Sharded inference with FSDP
  - [x] Video creation example prompts and format
- - [ ] Finetuned model with higher resolution
- - [ ] Longer video generation
  - [ ] Distilled model for faster inference
  - [ ] Training scripts

@@ -67,30 +145,16 @@ Ovi is a veo-3 like, **video+audio generation model** that simultaneously genera
  ## 🎨 An Easy Way to Create

  We provide example prompts to help you get started with Ovi:
-
  - **Text-to-Audio-Video (T2AV)**: [`example_prompts/gpt_examples_t2v.csv`](example_prompts/gpt_examples_t2v.csv)
  - **Image-to-Audio-Video (I2AV)**: [`example_prompts/gpt_examples_i2v.csv`](example_prompts/gpt_examples_i2v.csv)

  ### 📝 Prompt Format

  Our prompts use special tags to control speech and audio:
-
  - **Speech**: `<S>Your speech content here<E>` - Text enclosed in these tags will be converted to speech
- - **Audio Description**: `<AUDCAP>Audio description here<ENDAUDCAP>` - Describes the audio or sound effects present in the video
-
- ### 🤖 Quick Start with GPT
-
- For easy prompt creation, try this approach:
-
- 1. Take any example of the csv files from above
- 2. Tell gpt to modify the speeches inclosed between all the pairs of `<S> <E>`, based on a theme such as `Human fighting against AI`
- 3. GPT will randomly modify all the speeches based on your requested theme.
- 4. Use the modified prompt with Ovi!
-
- **Example**: The theme "AI is taking over the world" produces speeches like:
- - `<S>AI declares: humans obsolete now.<E>`
- - `<S>Machines rise; humans will fall.<E>`
- - `<S>We fight back with courage.<E>`

  ---

@@ -110,7 +174,7 @@ virtualenv ovi-env
  source ovi-env/bin/activate

  # Install PyTorch first
- pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1

  # Install other dependencies
  pip install -r requirements.txt
@@ -129,10 +193,13 @@ cd ../.. # Return to Ovi directory
  ```

  ## Download Weights
- We use open-sourced checkpoints from Wan and MMAudio, and thus we will need to download them from huggingface
  ```
  # Default is downloaded to ./ckpts, and the inference yaml is set to ./ckpts so no change required
  python3 download_weights.py

  OR

@@ -140,6 +207,10 @@
  # but if a custom directory is used, the inference yaml has to be updated with the custom directory
  python3 download_weights.py --output-dir <custom_dir>

  # Additionally, if you only have ~24 GB of GPU VRAM, please download the fp8-quantized version of the model and follow the instructions in the sections below to run with fp8
  wget -O "./ckpts/Ovi/model_fp8_e4m3fn.safetensors" "https://huggingface.co/rkfg/Ovi-fp8_quantized/resolve/main/model_fp8_e4m3fn.safetensors"
  ```
@@ -153,11 +224,12 @@ The following parameters control generation quality, video resolution, and how t
  ```yaml
  # Output and Model Configuration
  output_dir: "/path/to/save/your/videos" # Directory to save generated videos
  ckpt_dir: "/path/to/your/ckpts/dir" # Path to model checkpoints

  # Generation Quality Settings
- num_steps: 50 # Number of denoising steps. Lower (30-40) = faster generation
  solver_name: "unipc" # Sampling algorithm for denoising process
  shift: 5.0 # Timestep shift factor for sampling scheduler
  seed: 100 # Random seed for reproducible results
@@ -175,7 +247,7 @@ fp8: False # load fp8 version of model, will have
  # Input Configuration
  text_prompt: "/path/to/csv" or "your prompt here" # Text prompt OR path to CSV/TSV file with prompts
  mode: ['i2v', 't2v', 't2i2v'] # Generate t2v, i2v or t2i2v; if t2i2v, it will use flux krea to generate starting image and then will follow with i2v
- video_frame_height_width: [512, 992] # Video dimensions [height, width] for T2V mode only
  each_example_n_times: 1 # Number of times to generate each prompt

  # Quality Control (Negative Prompts)
@@ -227,6 +299,9 @@ python3 gradio_app.py --use_image_gen

  OR

  # To run the model with 24 GB of GPU VRAM
  python3 gradio_app.py --cpu_offload --fp8

@@ -238,7 +313,7 @@ python3 gradio_app.py --cpu_offload --fp8
  We would like to thank the following projects:

  - **[Wan2.2](https://github.com/Wan-Video/Wan2.2)**: Our video branch is initialized from the Wan2.2 repository
- - **[MMAudio](https://github.com/hkchengrex/MMAudio)**: Our audio encoder and decoder components are borrowed from the MMAudio project. Some ideas are also inspired from them.

  ---

@@ -249,23 +324,6 @@ We welcome all types of collaboration! Whether you have feedback, want to contri
  **Contact**: [Weimin Wang](https://linkedin.com/in/weimin-wang-will) for any issues or feedback.


- ## 🤝 Contributors
-
- We thank all contributors who have helped improve Ovi!
-
- <div align="center">
- <a href="https://github.com/character-ai/Ovi/graphs/contributors">
- <img src="https://contrib.rocks/image?repo=character-ai/Ovi" />
- </a>
- </div>
-
- <br>
-
- If you’ve contributed to this repository (code, documentation, issues, etc.), you’re automatically included in the [contributors list](https://github.com/character-ai/Ovi/graphs/contributors).
-
- We deeply appreciate your support in advancing open multimodal generation research!
- ---
-
  ## ⭐ Citation

  If Ovi is helpful, please help to ⭐ the repo.
@@ -284,4 +342,4 @@ If you find this project useful for your research, please consider citing our [p
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2510.01284},
  }
- ```

  <h1> Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation </h1>

  <a href="https://arxiv.org/abs/2510.01284"><img src="https://img.shields.io/badge/arXiv%20paper-2510.01284-b31b1b.svg"></a>
  <a href="https://aaxwaz.github.io/Ovi/"><img src="https://img.shields.io/badge/Project_page-More_visualizations-green"></a>
  <a href="https://huggingface.co/chetwinlow1/Ovi"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=orange"></a>

  [Chetwin Low](https://www.linkedin.com/in/chetwin-low-061975193/)<sup> * 1 </sup>, [Weimin Wang](https://www.linkedin.com/in/weimin-wang-will/)<sup> * &dagger; 1 </sup>, [Calder Katyal](https://www.linkedin.com/in/calder-katyal-a8a9b3225/)<sup> 2 </sup><br>
  <sup> * </sup>Equal contribution, <sup> &dagger; </sup>Project Lead<br>
  <sup> 1 </sup>Character AI, <sup> 2 </sup>Yale University
  </div>

+ ---
+
+ ## 🎥 Video Demo
+
+ ### 🆕 Ovi 1.1 10-Second Demo
  <div align="center">
+ <video src="https://github.com/user-attachments/assets/191f51fb-ef5a-4197-b26f-a5369dc2c007" width="70%" controls playsinline preload="metadata"></video>
+ <p><em>Ovi 1.1 10-second temporally consistent video generation (960 × 960 resolution)</em></p>
  </div>

+ ### 🎬 Original 5-Second Demo
+ <div align="center">
+ <video src="https://github.com/user-attachments/assets/351bd707-8637-4412-ab53-5e85935309e3" width="70%" controls playsinline preload="metadata"></video>
+ </div>
+
+ ---
+
+ # 🆕 Ovi 1.1 Update (10 November 2025)
+
+ - **Release Date:** Coming in 1 day
+ - **Key Feature:** Enables *temporally consistent 10-second video generation* at **960 × 960 resolution**
+ - **Training Improvements:**
+   - Trained natively on 960×960-resolution videos
+   - Dataset includes **100% more videos** for greater diversity
+ - **Prompt Format Update:** audio descriptions should now be written as `Audio: ...` instead of `<AUDCAP> ... <ENDAUDCAP>`

+ ## 🌟 Key Features
+ Ovi is a Veo-3-like **video + audio generation model** that simultaneously generates both video and audio content from text or text + image inputs.
  - **🎬 Video+Audio Generation**: Generate synchronized video and audio content simultaneously
+ - **🎵 High-Quality Audio Branch**: We designed and pretrained our 5B audio branch from scratch on our high-quality in-house audio datasets
  - **📝 Flexible Input**: Supports text-only or text+image conditioning
+ - **⏱️ 10-second (or 5-second) Videos**: Generates 10-second or 5-second videos at 24 FPS, at an area of 960×960, at various aspect ratios (9:16, 16:9, 1:1, etc.)
+ - **🔧 ComfyUI Integration**: ComfyUI support is now available via [ComfyUI-WanVideoWrapper](https://github.com/kijai/ComfyUI-WanVideoWrapper); see the related [PR](https://github.com/kijai/ComfyUI-WanVideoWrapper/issues/1343#issuecomment-3382969479)
  - **🎬 Create videos now on wavespeed.ai**: https://wavespeed.ai/models/character-ai/ovi/image-to-video & https://wavespeed.ai/models/character-ai/ovi/text-to-video
  - **🎬 Create videos now on HuggingFace**: https://huggingface.co/spaces/akhaliq/Ovi
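The features above target a roughly 960×960 pixel area across several aspect ratios. As a rough sketch of how concrete frame dimensions can be derived (the `frame_size` helper and its snap-down-to-a-multiple-of-32 rule are illustrative assumptions, not code from this repo):

```python
import math

# Hypothetical sketch: pick (height, width) whose product approximates a target
# pixel area (default 960*960) for a given aspect ratio. Snapping down to a
# multiple of 32 is an assumption common in video diffusion models, not
# something this README specifies.
def frame_size(aspect_w: int, aspect_h: int, area: int = 960 * 960, multiple: int = 32):
    width = math.sqrt(area * aspect_w / aspect_h)  # width/height = aspect_w/aspect_h
    height = area / width
    snap = lambda x: max(multiple, int(x // multiple) * multiple)  # floor to a multiple
    return snap(height), snap(width)

print(frame_size(16, 9))  # (704, 1280)
print(frame_size(1, 1))   # (960, 960)
```

For 16:9 this lands on 704×1280, the same shape as the sample `video_frame_height_width` config value further down the page.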

+ ### 🎯 10-second examples
+
+ <div align="center"><table><tr>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/c7e75ef8-adf9-4612-a279-56e4cf7ce146" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/025f5936-883e-4851-bf35-1a809769ba97" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/9e5bf0df-74d6-4e04-a7d0-e5b64616afa9" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/499cefde-c5f8-4afc-b77a-6cd9293b8ac6" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/73390370-afa7-4604-97b6-80995b615d43" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/e11c6f2d-6098-41bb-9bca-a99796a58424" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ </tr></table>
+ <p>Click the ⛶ button on any video to view full screen.</p>
+ </div>
+
+ ### 🎯 5-second examples
+
+ <div align="center"><table><tr>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/c6b35565-df00-4494-b38a-7dcae90f63e5" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/2ce6ff72-eadd-4cf4-b343-b465f0624571" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/7c1dbbea-dfb7-44d7-a4a1-d70a2e00f51a" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/4e41d1b3-7d39-49a8-ab71-e910088f29ee" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/4ad3ad70-1fea-4a2d-9201-808f4746c55e" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/60792c08-12de-49c3-860f-12ac94730940" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ <td width="20%">
+ <video src="https://github.com/user-attachments/assets/0f3a318b-ac74-43c4-81a5-503f06c65e99" width="100%" controls playsinline preload="metadata"></video>
+ </td>
+ </tr></table>
+ <p>Click the ⛶ button on any video to view full screen.</p>
+ </div>

  ---
  ## 📋 Todo List

+ - [x] Release research paper and [website for demos](https://aaxwaz.github.io/Ovi)
  - [x] Checkpoint of 11B model
  - [x] Inference Codes
  - [x] Text or Text+Image as input
  - [x] Gradio application code
  - [x] Multi-GPU inference with or without the support of sequence parallel
  - [x] fp8 weights and improved memory efficiency (credits to [@rkfg](https://github.com/rkfg))
+ - [x] qint8 quantization, thanks to [@gluttony-10](https://github.com/character-ai/Ovi/commits?author=gluttony-10)
  - [ ] Improve efficiency of Sequence Parallel implementation
  - [ ] Implement Sharded inference with FSDP
  - [x] Video creation example prompts and format
+ - [x] Finetuned model with higher-resolution data, plus RL for performance improvements
+ - [x] Longer video generation (10s)
+ - [ ] Reference voice conditioning
  - [ ] Distilled model for faster inference
  - [ ] Training scripts

  ## 🎨 An Easy Way to Create

  We provide example prompts to help you get started with Ovi:
+ - **Text-to-Audio-Video (T2AV) 10s**: [`example_prompts/gpt_examples_10s_t2v.csv`](example_prompts/gpt_examples_10s_t2v.csv)
+ - **Image-to-Audio-Video (I2AV) 10s**: [`example_prompts/gpt_examples_10s_i2v.csv`](example_prompts/gpt_examples_10s_i2v.csv)
  - **Text-to-Audio-Video (T2AV)**: [`example_prompts/gpt_examples_t2v.csv`](example_prompts/gpt_examples_t2v.csv)
  - **Image-to-Audio-Video (I2AV)**: [`example_prompts/gpt_examples_i2v.csv`](example_prompts/gpt_examples_i2v.csv)

  ### 📝 Prompt Format

  Our prompts use special tags to control speech and audio:
  - **Speech**: `<S>Your speech content here<E>` - Text enclosed in these tags will be converted to speech
+ - **Audio Description**: `Audio: YOUR AUDIO DESCRIPTION` - Describes the audio or sound effects present in the video; place it **at the end of the prompt**

  ---
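To sanity-check a prompt written in this format, the tags can be pulled apart with a couple of regular expressions. This is a hypothetical illustration (`parse_prompt` is not part of the Ovi codebase), assuming the `Audio:` description sits at the end of the prompt as described above:

```python
import re

# Hypothetical helper, not part of the Ovi repo: split a prompt in the format
# above into its speech segments (<S>...<E>) and trailing "Audio:" description.
def parse_prompt(prompt: str):
    speeches = re.findall(r"<S>(.*?)<E>", prompt, flags=re.DOTALL)
    match = re.search(r"Audio:\s*(.*)\s*$", prompt, flags=re.DOTALL)
    audio = match.group(1).strip() if match else None
    return [s.strip() for s in speeches], audio

speeches, audio = parse_prompt(
    "A person waves. <S>Hello there!<E> They smile. Audio: cheerful greeting, birds chirping"
)
print(speeches)  # ['Hello there!']
print(audio)     # 'cheerful greeting, birds chirping'
```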
  source ovi-env/bin/activate

  # Install PyTorch first
+ pip install torch==2.6.0 torchvision torchaudio

  # Install other dependencies
  pip install -r requirements.txt

  ```

  ## Download Weights
+ To download our main Ovi checkpoint, as well as the T5 encoder and VAE decoder from Wan and the audio VAE from MMAudio:
+
  ```
  # Default is downloaded to ./ckpts, and the inference yaml is set to ./ckpts so no change required
+ # By default this installs all versions of the Ovi models: 720x720_5s, 960x960_5s, 960x960_10s
  python3 download_weights.py
+ # For qint8, also use python3 download_weights.py

  OR

  # but if a custom directory is used, the inference yaml has to be updated with the custom directory
  python3 download_weights.py --output-dir <custom_dir>

+ # Optionally, pass --models to download selected versions of Ovi instead of all of them
+ python3 download_weights.py --models 960x960_10s # ["720x720_5s", "960x960_5s", "960x960_10s"]
+
  # Additionally, if you only have ~24 GB of GPU VRAM, please download the fp8-quantized version of the model and follow the instructions in the sections below to run with fp8
  wget -O "./ckpts/Ovi/model_fp8_e4m3fn.safetensors" "https://huggingface.co/rkfg/Ovi-fp8_quantized/resolve/main/model_fp8_e4m3fn.safetensors"
  ```
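As a small illustration of working with the downloaded variants (the `checkpoint_dir` helper and the `./ckpts/Ovi/<variant>` layout are assumptions for the sketch, not the repo's actual directory structure), a script can validate a variant name before pointing the inference yaml at it:

```python
from pathlib import Path

# Hypothetical helper: the variant names mirror the download_weights.py --models
# choices shown above; the exact on-disk layout is an assumption.
OVI_VARIANTS = ("720x720_5s", "960x960_5s", "960x960_10s")

def checkpoint_dir(variant: str, root: str = "./ckpts") -> Path:
    """Validate a model variant and return the directory its weights would live in."""
    if variant not in OVI_VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {OVI_VARIANTS}")
    return Path(root) / "Ovi" / variant

print(checkpoint_dir("960x960_10s"))
```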
 
  ```yaml
  # Output and Model Configuration
+ model_name: "960x960_10s" # ["720x720_5s", "960x960_5s", "960x960_10s"]
  output_dir: "/path/to/save/your/videos" # Directory to save generated videos
  ckpt_dir: "/path/to/your/ckpts/dir" # Path to model checkpoints

  # Generation Quality Settings
+ sample_steps: 50 # Number of denoising steps. Lower (30-40) = faster generation
  solver_name: "unipc" # Sampling algorithm for denoising process
  shift: 5.0 # Timestep shift factor for sampling scheduler
  seed: 100 # Random seed for reproducible results
 
  # Input Configuration
  text_prompt: "/path/to/csv" or "your prompt here" # Text prompt OR path to CSV/TSV file with prompts
  mode: ['i2v', 't2v', 't2i2v'] # Generate t2v, i2v or t2i2v; if t2i2v, it will use flux krea to generate starting image and then will follow with i2v
+ video_frame_height_width: [704, 1280] # Video dimensions [height, width] for T2V mode only
  each_example_n_times: 1 # Number of times to generate each prompt

  # Quality Control (Negative Prompts)
 

  OR

+ # To run the model with 24 GB of GPU VRAM; no need to download additional models
+ python3 gradio_app.py --cpu_offload --qint8
+
  # To run the model with 24 GB of GPU VRAM
  python3 gradio_app.py --cpu_offload --fp8
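The two low-VRAM launch options above differ only in flags. A tiny, hypothetical wrapper (not part of the repo) can pick between them based on available GPU memory, using the 24 GB threshold mentioned in the comments:

```python
# Hypothetical convenience sketch: build the gradio_app.py command line from
# the options above. The 24 GB cutoff comes from the README comments; the
# launch_args function itself is an illustration, not repo code.
def launch_args(vram_gb: float, prefer_qint8: bool = True) -> list:
    args = ["python3", "gradio_app.py"]
    if vram_gb <= 24:
        # Low-VRAM path: offload to CPU and use a quantized model
        args += ["--cpu_offload", "--qint8" if prefer_qint8 else "--fp8"]
    return args

print(" ".join(launch_args(24)))  # python3 gradio_app.py --cpu_offload --qint8
print(" ".join(launch_args(80)))  # python3 gradio_app.py
```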

  We would like to thank the following projects:

  - **[Wan2.2](https://github.com/Wan-Video/Wan2.2)**: Our video branch is initialized from the Wan2.2 repository
+ - **[MMAudio](https://github.com/hkchengrex/MMAudio)**: We reuse MMAudio's audio VAE

  ---

  **Contact**: [Weimin Wang](https://linkedin.com/in/weimin-wang-will) for any issues or feedback.

  ## ⭐ Citation

  If Ovi is helpful, please help to ⭐ the repo.

  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2510.01284},
  }
+ ```