---
language:
- en
base_model:
- tencent/HunyuanVideo
- stabilityai/stable-video-diffusion-img2vid-xt
tags:
- text-to-video
- image-to-video
- video-to-video
---

# V1: Human-Centric Video Foundation Model

<p align="center">
  <a href="https://github.com/immortalshadow007/V1" target="_blank">🌐 Github</a>
</p>

---

This repo contains Diffusers-format model weights for the V1 Text-to-Video, Image-to-Video, and Video-to-Video models. You can find the inference code in our GitHub repository, [V1](https://github.com/immortalshadow007/V1).

## Introduction

V1 is an open-source, human-centric video foundation model. By fine-tuning <a href="https://huggingface.co/tencent/HunyuanVideo">HunyuanVideo</a> on O(10M) high-quality film and television clips, V1 offers three key advantages:

## 🔑 Key Features

### 1. Advanced Model Capabilities

1. **Open-Source Leadership**: The Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models such as Kling and Hailuo.
2. **Advanced Facial Animation**: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
3. **Cinematic Lighting and Aesthetics**: Because the model is trained on high-quality, Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.

### 2. Self-Developed Data Cleaning and Annotation Pipeline

Our model is built on a self-developed data cleaning and annotation pipeline, which produced a vast dataset of high-quality film, television, and documentary content.

- **Expression Classification**: Categorizes human facial expressions into 33 distinct types.
- **Character Spatial Awareness**: Uses 3D human reconstruction to understand the spatial relationships between multiple people in a video, enabling film-level character positioning.
- **Action Recognition**: Constructs over 400 action semantic units for a precise understanding of human actions.
- **Scene Understanding**: Conducts cross-modal correlation analysis of clothing, scenes, and plots.
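
As a concrete picture of what the annotation pipeline emits per clip, here is a minimal sketch of one record. The `ClipAnnotation` type and all field names are hypothetical illustrations; only the category counts (33 expression types, 400+ action semantic units) come from the list above.

```python
from dataclasses import dataclass, field

@dataclass
class ClipAnnotation:
    """One hypothetical record from the annotation pipeline."""
    clip_id: str
    expression_id: int  # one of the 33 expression classes
    action_units: list[int] = field(default_factory=list)  # drawn from 400+ semantic units
    positions_3d: list[tuple[float, float, float]] = field(default_factory=list)  # per-person 3D placement
    scene_tags: list[str] = field(default_factory=list)  # clothing / scene / plot labels

    def validate(self) -> bool:
        # Expression ids must fall in [0, 33); action-unit ids must be non-negative.
        return 0 <= self.expression_id < 33 and all(a >= 0 for a in self.action_units)

ann = ClipAnnotation("clip_0001", expression_id=7,
                     action_units=[12, 88], scene_tags=["interior", "dialogue"])
assert ann.validate()
```

A record like this makes the cross-modal analysis concrete: expression, action, spatial, and scene labels live side by side, so downstream filtering can query any combination.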

### 3. Multi-Stage Image-to-Video Pretraining

Our multi-stage pretraining pipeline, inspired by the <a href="https://huggingface.co/tencent/HunyuanVideo">HunyuanVideo</a> design, consists of the following stages:

- **Stage 1: Model Domain Transfer Pretraining**: We use a large dataset (O(10M) film and television clips) to adapt the text-to-video model to the human-centric video domain.
- **Stage 2: Image-to-Video Model Pretraining**: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters, then pretrain this new model on the same dataset used in Stage 1.
- **Stage 3: High-Quality Fine-Tuning**: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.
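
The Stage 2 conv-in adjustment can be sketched in PyTorch: widening the input convolution with zero-initialized channels lets the new image-to-video model start out reproducing the text-to-video behavior exactly. The layer shapes and channel counts below are illustrative, not the model's actual configuration.

```python
import torch
from torch import nn

def expand_conv_in(conv: nn.Conv3d, extra_in_channels: int) -> nn.Conv3d:
    """Widen a conv-in layer to accept extra conditioning channels.

    The new weight slices are zero-initialized, so the widened layer
    initially behaves exactly like the original text-to-video layer.
    """
    new_conv = nn.Conv3d(
        conv.in_channels + extra_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, : conv.in_channels] = conv.weight  # copy T2V weights
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Illustrative shapes: a 16-channel video latent plus 16 image-condition channels.
t2v_conv = nn.Conv3d(16, 128, kernel_size=(1, 2, 2), stride=(1, 2, 2))
i2v_conv = expand_conv_in(t2v_conv, extra_in_channels=16)

latents = torch.randn(1, 16, 8, 32, 32)
image_cond = torch.randn(1, 16, 8, 32, 32)
out_t2v = t2v_conv(latents)
out_i2v = i2v_conv(torch.cat([latents, image_cond], dim=1))
# Zero-initialized extra channels contribute nothing, so the outputs match.
assert torch.allclose(out_t2v, out_i2v, atol=1e-5)
```

Starting from an output-preserving initialization is what makes it safe to continue pretraining on the same Stage 1 data: the model only has to learn how to use the new conditioning channels, not relearn video generation.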

## 📦 Model Introduction

| Model Name     | Resolution (H x W) | Video Length (frames) | FPS |
|----------------|--------------------|-----------------------|-----|
| V1-Hunyuan-I2V | 544 x 960          | 97                    | 24  |
| V1-Hunyuan-T2V | 544 x 960          | 97                    | 24  |
| V1-SVD-V2V     | 544 x 960          | 97                    | 24  |

## Usage

**See the [Guide](https://github.com/immortalshadow007/V1) for details.**
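
Until you reach the Guide, the loading step can be sketched with Diffusers. This assumes the text-to-video weights follow the standard Diffusers layout and can be loaded with the base model's `HunyuanVideoPipeline` class; the repo id is an assumption, and only the resolution and frame count come from the table above.

```python
def load_v1_t2v(repo_id: str = "immortalshadow007/V1"):
    """Load the T2V weights; the repo id and pipeline class are assumptions."""
    import torch
    from diffusers import HunyuanVideoPipeline

    pipe = HunyuanVideoPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    pipe.vae.enable_tiling()         # reduce VAE memory at 544x960
    pipe.enable_model_cpu_offload()  # stream weights to a single GPU as needed
    return pipe

def generate(pipe, prompt: str):
    """Generate 97 frames at 544x960, matching the table above (~4 s at 24 fps)."""
    return pipe(prompt=prompt, height=544, width=960, num_frames=97).frames[0]
```

The returned frame sequence can be written to disk with `diffusers.utils.export_to_video(frames, "sample.mp4", fps=24)`.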