NullVoider committed
Commit d13af22 · verified · 1 Parent(s): 69077d8

Update README.md


Updated the order of key features.

Files changed (1)
  1. README.md +10 -9

README.md CHANGED
@@ -24,13 +24,20 @@ V1 is an open-source human-centric video foundation model. By fine-tuning <a hre
 
 ## 🔑 Key Features
 
-### 1. Advanced Model Capabilities
+### 1. Video-to-Video Generation Pipeline
+
+The V1 model is a hybrid architecture combining the HunyuanVideo model by Tencent and Stable Video Diffusion (SVD) by Stability AI. During inference, the model accepts a user prompt and an optional video input, which are processed before video generation. For Video-to-Video (V2V) generation, the system employs video interpolation techniques to extract frames from the input video. These frames are organized by timestamp and used as image inputs for the Stable Video Diffusion (SVD) model, alongside the user prompt, to generate the final video.
+
+At the inference stage, the backend dynamically switches between the HunyuanVideo and Stable Video Diffusion (SVD) models based on the input file type. By default, V1 uses a fine-tuned version of the HunyuanVideo model. However, when a video file is detected in the user input, the system automatically switches to the Stable Video Diffusion (SVD) model, enabling a "Video-to-Video" generation workflow.
+
+
+### 2. Advanced Model Capabilities
 
 1. **Open-Source Leadership**: The Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.
 2. **Advanced Facial Animation**: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
 3. **Cinematic Lighting and Aesthetics**: Trained on high-quality Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.
 
-### 2. Self-Developed Data Cleaning and Annotation Pipeline
+### 3. Self-Developed Data Cleaning and Annotation Pipeline
 
 Our model is built on a self-developed data cleaning and annotation pipeline, creating a vast dataset of high-quality film, television, and documentary content.
 
@@ -39,7 +46,7 @@ Our model is built on a self-developed data cleaning and annotation pipeline, cr
 - **Action Recognition**: Constructs over 400 action semantic units to achieve a precise understanding of human actions.
 - **Scene Understanding**: Conducts cross-modal correlation analysis of clothing, scenes, and plots.
 
-### 3. Multi-Stage Image-to-Video Pretraining
+### 4. Multi-Stage Image-to-Video Pretraining
 
 Our multi-stage pretraining pipeline, inspired by the <a href="https://huggingface.co/tencent/HunyuanVideo">HunyuanVideo</a> design, consists of the following stages:
 
@@ -47,12 +54,6 @@ Our multi-stage pretraining pipeline, inspired by the <a href="https://huggingfa
 - **Stage 2: Image-to-Video Model Pretraining**: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters. This new model is then pretrained on the same dataset used in Stage 1.
 - **Stage 3: High-Quality Fine-Tuning**: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.
 
-### 4. Video-to-Video Generation Pipeline
-
-The V1 model is a hybrid architecture combining the HunyuanVideo model by Tencent and Stable Video Diffusion (SVD) by Stability AI. During inference, the model accepts a user prompt and an optional video input, which are processed before video generation. For Video-to-Video (V2V) generation, the system employs video interpolation techniques to extract frames from the input video. These frames are organized by timestamp and used as image inputs for the Stable Video Diffusion (SVD) model, alongside the user prompt, to generate the final video.
-
-At the inference stage, the backend dynamically switches between the HunyuanVideo and Stable Video Diffusion (SVD) models based on the input file type. By default, V1 uses a fine-tuned version of the HunyuanVideo model. However, when a video file is detected in the user input, the system automatically switches to the Stable Video Diffusion (SVD) model, enabling a "Video-to-Video" generation workflow.
-
 ## 📦 Model Introduction
 | Model Name | Resolution | Video Length | FPS |
 |-----------------|------------|--------------|-----|
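The V2V paragraph in the diff says input-video frames are "organized by timestamp" before being fed to SVD. The commit does not include that code; a minimal sketch of the timestamp bookkeeping (function names here are hypothetical, not from the repo) might look like:

```python
def sample_frame_timestamps(duration_s: float, n_frames: int) -> list[float]:
    """Evenly spaced midpoint timestamps across a clip of duration_s seconds."""
    step = duration_s / n_frames
    return [(i + 0.5) * step for i in range(n_frames)]


def timestamps_to_indices(timestamps: list[float], fps: float) -> list[int]:
    """Map each timestamp to the nearest source-frame index at the given fps."""
    return [round(t * fps) for t in timestamps]
```

For example, four frames from a 2-second clip at 24 fps would come from source frames 6, 18, 30, and 42; the actual pipeline may sample differently.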
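The README text also says the backend "dynamically switches" from HunyuanVideo to SVD whenever a video file is detected in the user input. A minimal sketch of such file-type dispatch, assuming a hypothetical `select_backend` helper and extension list (neither appears in the released code):

```python
from pathlib import Path
from typing import Optional

# Extensions the backend would treat as video input (assumed list).
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv", ".webm"}


def select_backend(prompt: str, input_path: Optional[str] = None) -> str:
    """Pick the generation backend as the README describes:
    fine-tuned HunyuanVideo by default, SVD when the input is a video file."""
    if input_path is not None and Path(input_path).suffix.lower() in VIDEO_EXTS:
        return "svd"          # Video-to-Video workflow
    return "hunyuanvideo"     # default Text/Image-to-Video workflow
```

Keying the switch on file extension is only one plausible reading of "input file type"; a real backend might inspect MIME types or container headers instead.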
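Stage 2 of the pretraining section converts the text-to-video model to image-to-video "by adjusting the conv-in parameters." A common way to do that, which is an assumption here and not stated in the README, is to widen the conv-in kernel with zero-initialized input channels so the converted model initially ignores the new image conditioning:

```python
import numpy as np


def inflate_conv_in(weight: np.ndarray, extra_in: int) -> np.ndarray:
    """Widen a conv-in kernel of shape (out_ch, in_ch, kh, kw) by extra_in
    input channels. New channels start at zero, so right after conversion
    the layer reproduces the pretrained T2V output exactly."""
    out_ch, in_ch, kh, kw = weight.shape
    inflated = np.zeros((out_ch, in_ch + extra_in, kh, kw), dtype=weight.dtype)
    inflated[:, :in_ch] = weight  # copy the pretrained T2V weights
    return inflated
```

The same pattern works for 3D (spatio-temporal) conv kernels with one more axis; only the zero-init and copy steps matter.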