---
language:
- en
base_model:
- tencent/HunyuanVideo
- stabilityai/stable-video-diffusion-img2vid-xt
tags:
- text-to-video
- image-to-video
- video-to-video
---
# V1: Human-Centric Video Foundation Model
This repo contains Diffusers-format model weights for the V1 Text-to-Video, Image-to-Video, and Video-to-Video models. You can find the inference code in our GitHub repository, V1.
## Introduction
V1 is an open-source human-centric video foundation model. Fine-tuned from HunyuanVideo on O(10M) high-quality film and television clips, it offers three key advantages:
## Key Features
### 1. Advanced Model Capabilities
- Open-Source Leadership: The Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.
- Advanced Facial Animation: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
- Cinematic Lighting and Aesthetics: Trained on high-quality Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.
### 2. Self-Developed Data Cleaning and Annotation Pipeline
Our model is built on a self-developed data cleaning and annotation pipeline, creating a vast dataset of high-quality film, television, and documentary content.
- Expression Classification: Categorizes human facial expressions into 33 distinct types.
- Character Spatial Awareness: Utilizes 3D human reconstruction technology to understand spatial relationships between multiple people in a video, enabling film-level character positioning.
- Action Recognition: Constructs over 400 action semantic units to achieve a precise understanding of human actions.
- Scene Understanding: Conducts cross-modal correlation analysis of clothing, scenes, and plots.
### 3. Multi-Stage Image-to-Video Pretraining
Our multi-stage pretraining pipeline, inspired by the HunyuanVideo design, consists of the following stages:
- Stage 1 (Model Domain Transfer Pretraining): We use a large dataset of O(10M) film and television clips to adapt the text-to-video model to the human-centric video domain.
- Stage 2 (Image-to-Video Model Pretraining): We convert the Stage 1 text-to-video model into an image-to-video model by adjusting the conv-in parameters, then pretrain this new model on the same dataset used in Stage 1.
- Stage 3 (High-Quality Fine-Tuning): We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.
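The conv-in adjustment in Stage 2 can be illustrated with a common recipe: widen the input projection so it accepts the concatenated image-conditioning latents, and zero-initialize the new channels so the expanded model initially reproduces the text-to-video behavior. The sketch below uses PyTorch with illustrative shapes and names; it is not V1's actual implementation.

```python
import torch
import torch.nn as nn

def expand_conv_in(conv: nn.Conv3d, extra_in_channels: int) -> nn.Conv3d:
    """Widen a conv-in layer to accept extra conditioning channels.

    New input channels are zero-initialized, so the expanded layer
    ignores the conditioning at first and matches the original output.
    """
    new_conv = nn.Conv3d(
        conv.in_channels + extra_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        # Copy the pretrained weights into the original input slots.
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Example: a 16-channel latent conv-in widened for 16 image-latent channels.
t2v_conv = nn.Conv3d(16, 128, kernel_size=2, stride=2)
i2v_conv = expand_conv_in(t2v_conv, extra_in_channels=16)

x = torch.randn(1, 16, 8, 64, 64)      # noisy video latents
cond = torch.zeros(1, 16, 8, 64, 64)   # image-conditioning latents
out_t2v = t2v_conv(x)
out_i2v = i2v_conv(torch.cat([x, cond], dim=1))
assert torch.allclose(out_t2v, out_i2v)  # identical while conditioning is zero
```

Zero-initializing the new channels means pretraining in Stage 2 can start from the Stage 1 checkpoint's behavior instead of from a disrupted input projection.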
## Model Introduction
| Model Name | Resolution | Video Length (frames) | FPS |
|---|---|---|---|
| V1-Hunyuan-I2V | 544px × 960px | 97 | 24 |
| V1-Hunyuan-T2V | 544px × 960px | 97 | 24 |
| V1-SVD-V2V | 544px × 960px | 97 | 24 |
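The length column above is a frame count, so at 24 FPS each generated clip runs roughly four seconds:

```python
frames, fps = 97, 24
duration_s = frames / fps
print(f"{duration_s:.2f} s")  # prints "4.04 s"
```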
## Usage
See the Guide for details.