NullVoider committed 56db29d (verified) · 1 parent: 9f4f277

Update README.md

Files changed (1): README.md (+58 -3)

README.md CHANGED

---
language:
- en
base_model:
- tencent/HunyuanVideo
- stabilityai/stable-video-diffusion-img2vid-xt
tags:
- text-to-video
- image-to-video
- video-to-video
---

# V1: Human-Centric Video Foundation Model

<p align="center">
  <a href="https://github.com/immortalshadow007/V1" target="_blank">🌐 GitHub</a>
</p>

---
This repo contains Diffusers-format model weights for the V1 Text-to-Video, Image-to-Video, and Video-to-Video models. You can find the inference code in our GitHub repository, [V1](https://github.com/immortalshadow007/V1).
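
Since the weights are in Diffusers format and the base model is HunyuanVideo, loading the Text-to-Video checkpoint should follow the standard `HunyuanVideoPipeline` pattern. A minimal sketch, where the repo id is a placeholder (substitute the real Hub id of this repository):

```python
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

# "NullVoider/V1" is a placeholder repo id, not a confirmed location.
pipe = HunyuanVideoPipeline.from_pretrained(
    "NullVoider/V1", torch_dtype=torch.bfloat16
)
pipe.vae.enable_tiling()  # lowers VRAM use when decoding long clips
pipe.to("cuda")

frames = pipe(
    prompt="A woman smiles and waves at the camera, cinematic lighting",
    height=544,
    width=960,
    num_frames=97,          # matches the model table below
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "output.mp4", fps=24)
```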

## Introduction
V1 is an open-source human-centric video foundation model. By fine-tuning <a href="https://huggingface.co/tencent/HunyuanVideo">HunyuanVideo</a> on O(10M) high-quality film and television clips, V1 offers the three key advantages described below.

## 🔑 Key Features

### 1. Advanced Model Capabilities

1. **Open-Source Leadership**: The Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models such as Kling and Hailuo.
2. **Advanced Facial Animation**: Captures 33 distinct facial expressions and over 400 natural movement combinations, accurately reflecting human emotions.
3. **Cinematic Lighting and Aesthetics**: Trained on high-quality, Hollywood-level film and television data, so each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.

### 2. Self-Developed Data Cleaning and Annotation Pipeline

Our model is built on a self-developed data cleaning and annotation pipeline that yields a vast dataset of high-quality film, television, and documentary content. The pipeline annotates clips along four axes (a sketch of the resulting record follows the list):

- **Expression Classification**: Categorizes human facial expressions into 33 distinct types.
- **Character Spatial Awareness**: Uses 3D human reconstruction to understand spatial relationships between multiple people in a video, enabling film-level character positioning.
- **Action Recognition**: Constructs over 400 action semantic units for a precise understanding of human actions.
- **Scene Understanding**: Performs cross-modal correlation analysis of clothing, scenes, and plots.
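
The annotation schema itself is not published, so purely as an illustration, here is a hypothetical record a pipeline like this might emit; every field name below is invented for the sketch:

```python
from dataclasses import dataclass, field

@dataclass
class ClipAnnotation:
    """Hypothetical per-clip annotation record (field names are illustrative)."""
    clip_id: str
    expression: int  # index into the 33 expression classes
    action_units: list[int] = field(default_factory=list)  # drawn from the ~400 action semantic units
    positions_3d: list[tuple[float, float, float]] = field(default_factory=list)  # per-person locations from 3D human reconstruction
    scene_tags: list[str] = field(default_factory=list)  # clothing / scene / plot correlations
```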

### 3. Multi-Stage Image-to-Video Pretraining

Our multi-stage pretraining pipeline, inspired by the <a href="https://huggingface.co/tencent/HunyuanVideo">HunyuanVideo</a> design, consists of the following stages:

- **Stage 1: Model Domain Transfer Pretraining**: We use a large dataset (O(10M) film and television clips) to adapt the text-to-video model to the human-centric video domain.
- **Stage 2: Image-to-Video Model Pretraining**: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting its conv-in parameters (see the sketch after this list), then pretrain the new model on the same dataset used in Stage 1.
- **Stage 3: High-Quality Fine-Tuning**: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.
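
The README does not spell out how the conv-in parameters are adjusted. A common recipe for this conversion (used, for example, in SVD-style image conditioning) is to widen the input convolution so it accepts the extra conditioning channels and zero-initialize the new weights, so the converted model initially behaves exactly like the Stage 1 checkpoint. A minimal PyTorch sketch, assuming the input layer is an `nn.Conv3d`:

```python
import torch
import torch.nn as nn

def expand_conv_in(old: nn.Conv3d, extra_channels: int) -> nn.Conv3d:
    """Widen a conv-in layer to accept image-conditioning channels.

    The new input channels start at zero, so the expanded model's output
    is identical to the original's until training updates them.
    """
    new = nn.Conv3d(
        old.in_channels + extra_channels,
        old.out_channels,
        kernel_size=old.kernel_size,
        stride=old.stride,
        padding=old.padding,
        bias=old.bias is not None,
    )
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, : old.in_channels] = old.weight  # keep pretrained weights
        if old.bias is not None:
            new.bias.copy_(old.bias)
    return new
```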

## 📦 Model Introduction

All three models generate 97-frame clips at 24 FPS (about 4 seconds of video) at 544 × 960 resolution.

| Model Name | Resolution | Video Length (frames) | FPS |
|-----------------|---------------|-----------------------|-----|
| V1-Hunyuan-I2V | 544px × 960px | 97 | 24 |
| V1-Hunyuan-T2V | 544px × 960px | 97 | 24 |
| V1-SVD-V2V | 544px × 960px | 97 | 24 |

## Usage
**See the [Guide](https://github.com/immortalshadow007/V1) for details.**
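
Pending that guide, image-to-video inference will presumably mirror the standard Diffusers HunyuanVideo I2V flow. A hedged sketch (the pipeline class and placeholder repo id below are assumptions, not confirmed entry points for V1):

```python
import torch
from diffusers import HunyuanVideoImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# "NullVoider/V1-Hunyuan-I2V" is a placeholder repo id.
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    "NullVoider/V1-Hunyuan-I2V", torch_dtype=torch.bfloat16
)
pipe.vae.enable_tiling()
pipe.to("cuda")

image = load_image("first_frame.jpg")  # conditioning frame
frames = pipe(
    image=image,
    prompt="The actor turns toward the camera and smiles",
    height=544,
    width=960,
    num_frames=97,
).frames[0]
export_to_video(frames, "i2v_output.mp4", fps=24)
```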