Update README.md
Updated the order of key features.
README.md
CHANGED
```diff
@@ -24,13 +24,20 @@ V1 is an open-source human-centric video foundation model. By fine-tuning <a hre
 
 ## 🔑 Key Features
 
-### 1. Advanced Model Capabilities
+### 1. Video-to-Video Generation Pipeline
+
+The V1 model is a hybrid architecture combining the HunyuanVideo model by Tencent and Stable Video Diffusion (SVD) by Stability AI. During inference, the model accepts a user prompt and an optional video input, which are processed before video generation. For Video-to-Video (V2V) generation, the system employs video interpolation techniques to extract frames from the input video. These frames are organized by timestamp and used as image inputs for the Stable Video Diffusion (SVD) model, alongside the user prompt, to generate the final video.
+
+At the inference stage, the backend dynamically switches between the HunyuanVideo and Stable Video Diffusion (SVD) models based on the input file type. By default, V1 uses a fine-tuned version of the HunyuanVideo model. However, when a video file is detected in the user input, the system automatically switches to the Stable Video Diffusion (SVD) model, enabling a "Video-to-Video" generation workflow.
+
+
+### 2. Advanced Model Capabilities
 
 1. **Open-Source Leadership**: The Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.
 2. **Advanced Facial Animation**: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
 3. **Cinematic Lighting and Aesthetics**: Trained on high-quality Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.
 
-### 2. Self-Developed Data Cleaning and Annotation Pipeline
+### 3. Self-Developed Data Cleaning and Annotation Pipeline
 
 Our model is built on a self-developed data cleaning and annotation pipeline, creating a vast dataset of high-quality film, television, and documentary content.
 
@@ -39,7 +46,7 @@ Our model is built on a self-developed data cleaning and annotation pipeline, cr
 - **Action Recognition**: Constructs over 400 action semantic units to achieve a precise understanding of human actions.
 - **Scene Understanding**: Conducts cross-modal correlation analysis of clothing, scenes, and plots.
 
-### 3. Multi-Stage Image-to-Video Pretraining
+### 4. Multi-Stage Image-to-Video Pretraining
 
 Our multi-stage pretraining pipeline, inspired by the <a href="https://huggingface.co/tencent/HunyuanVideo">HunyuanVideo</a> design, consists of the following stages:
 
@@ -47,12 +54,6 @@ Our multi-stage pretraining pipeline, inspired by the <a href="https://huggingfa
 - **Stage 2: Image-to-Video Model Pretraining**: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters. This new model is then pretrained on the same dataset used in Stage 1.
 - **Stage 3: High-Quality Fine-Tuning**: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.
 
-### 4. Video-to-Video Generation Pipeline
-
-The V1 model is a hybrid architecture combining the HunyuanVideo model by Tencent and Stable Video Diffusion (SVD) by Stability AI. During inference, the model accepts a user prompt and an optional video input, which are processed before video generation. For Video-to-Video (V2V) generation, the system employs video interpolation techniques to extract frames from the input video. These frames are organized by timestamp and used as image inputs for the Stable Video Diffusion (SVD) model, alongside the user prompt, to generate the final video.
-
-At the inference stage, the backend dynamically switches between the HunyuanVideo and Stable Video Diffusion (SVD) models based on the input file type. By default, V1 uses a fine-tuned version of the HunyuanVideo model. However, when a video file is detected in the user input, the system automatically switches to the Stable Video Diffusion (SVD) model, enabling a "Video-to-Video" generation workflow.
-
 ## 📦 Model Introduction
 | Model Name | Resolution | Video Length | FPS |
 |-----------------|------------|--------------|-----|
```
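The relocated "Video-to-Video Generation Pipeline" section says frames are extracted from the input video and organized by timestamp, but the commit carries no code for that step, and the "video interpolation techniques" it mentions are unspecified. The sketch below is a minimal stand-in that samples evenly spaced, timestamped frames with OpenCV; the function name and the sampling strategy are illustrative assumptions, not the project's implementation.

```python
# Minimal sketch of the frame-extraction step described in
# "### 1. Video-to-Video Generation Pipeline". The even-spacing strategy
# and the num_frames default are assumptions; the README's "video
# interpolation techniques" are not specified and may work differently.
import cv2  # pip install opencv-python


def extract_frames_by_timestamp(video_path: str, num_frames: int = 16):
    """Sample evenly spaced frames and pair each with its timestamp (seconds)."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Cannot open video: {video_path}")

    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing

    frames = []  # (timestamp_seconds, rgb_frame) tuples, ordered by time
    for i in range(num_frames):
        index = round(i * (total - 1) / max(num_frames - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR; diffusion pipelines generally expect RGB.
        frames.append((index / fps, cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)))

    cap.release()
    return frames
```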
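The same section also describes the inference-time switch: default to the fine-tuned HunyuanVideo model, but route to Stable Video Diffusion (SVD) when a video file is detected in the user input. Below is a minimal sketch of that routing, assuming detection by file extension; `generate_with_hunyuan` and `generate_with_svd` are hypothetical placeholders for the two backends (neither is an API from the HunyuanVideo or SVD projects), and `extract_frames_by_timestamp` is the helper sketched above.

```python
# Hypothetical sketch of the file-type dispatch described in the README.
# Extension-based detection and both generate_* backends are assumptions.
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".webm", ".mkv"}


def generate_with_hunyuan(prompt: str):
    """Placeholder for the fine-tuned HunyuanVideo text-to-video backend."""
    raise NotImplementedError


def generate_with_svd(prompt: str, frames):
    """Placeholder for the Stable Video Diffusion image-to-video backend."""
    raise NotImplementedError


def generate(prompt: str, input_path: str | None = None):
    """Route to SVD for video inputs, otherwise use HunyuanVideo."""
    if input_path is not None and Path(input_path).suffix.lower() in VIDEO_EXTENSIONS:
        # Video detected: condition SVD on timestamped frames plus the prompt,
        # i.e. the "Video-to-Video" workflow described above.
        frames = extract_frames_by_timestamp(input_path)
        return generate_with_svd(prompt, frames)
    # Default path: a text-only prompt goes to the fine-tuned HunyuanVideo model.
    return generate_with_hunyuan(prompt)
```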
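Stage 2 of the pretraining section converts the text-to-video model into an image-to-video model "by adjusting the conv-in parameters," with no further detail. A common recipe for that kind of adjustment, assumed here rather than taken from this repository, is to widen the input convolution to accept the extra image-conditioning channels, copy the pretrained weights, and zero-initialize the new channels:

```python
# Sketch of one common "adjust the conv-in parameters" recipe (an assumption,
# not this project's confirmed code): widen the input convolution so it also
# accepts image-conditioning channels, keeping pretrained behavior at init.
import torch
import torch.nn as nn


def expand_conv_in(conv: nn.Conv3d, extra_in_channels: int) -> nn.Conv3d:
    new_conv = nn.Conv3d(
        conv.in_channels + extra_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()  # new channels contribute nothing at step 0
        new_conv.weight[:, : conv.in_channels] = conv.weight  # keep pretrained weights
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


# Illustrative sizes only: a 16-channel latent input widened by 16
# image-latent channels.
conv_in = nn.Conv3d(16, 3072, kernel_size=(1, 2, 2), stride=(1, 2, 2))
conv_in = expand_conv_in(conv_in, extra_in_channels=16)
```

Because the added channels start at zero, the converted model reproduces the Stage 1 text-to-video model exactly at initialization, so Stage 2 pretraining on the same dataset starts from the Stage 1 optimum rather than from scratch.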