|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- diffusion-single-file |
|
|
- comfyui |
|
|
- distillation |
|
|
- video |
|
|
- video-generation
|
|
base_model: |
|
|
- tencent/HunyuanVideo-1.5 |
|
|
|
|
library_name: diffusers |
|
|
pipeline_tag: text-to-video |
|
|
--- |
|
|
|
|
|
# 🎬 Hy1.5-Distill-Models
|
|
|
|
|
<img src="https://raw.githubusercontent.com/ModelTC/LightX2V/main/assets/img_lightx2v.png" width="75%" /> |
|
|
|
|
|
--- |
|
|
|
|
|
🤗 [HuggingFace](https://huggingface.co/lightx2v/Hy1.5-Distill-Models) | [GitHub](https://github.com/ModelTC/LightX2V) | [License](https://opensource.org/licenses/Apache-2.0)
|
|
|
|
|
--- |
|
|
|
|
|
This repository contains 4-step distilled models for HunyuanVideo-1.5, optimized for use with LightX2V. These distilled models enable **ultra-fast 4-step inference** without CFG (Classifier-Free Guidance), significantly reducing generation time while maintaining high-quality video output.
|
|
|
|
|
## 📋 Model List
|
|
|
|
|
### 4-Step Distilled Models |
|
|
|
|
|
* **`hy1.5_t2v_480p_lightx2v_4step.safetensors`** - 480p Text-to-Video 4-step distilled model (16.7 GB) |
|
|
* **`hy1.5_t2v_480p_scaled_fp8_e4m3_lightx2v_4step.safetensors`** - 480p Text-to-Video 4-step distilled model with FP8 quantization (8.85 GB) |
|
|
|
|
|
## 🚀 Quick Start
|
|
|
|
|
### Installation |
|
|
|
|
|
First, install LightX2V: |
|
|
|
|
|
```bash |
|
|
pip install -v git+https://github.com/ModelTC/LightX2V.git |
|
|
``` |
|
|
|
|
|
Or build from source: |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/ModelTC/LightX2V.git |
|
|
cd LightX2V |
|
|
pip install -v -e . |
|
|
``` |
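Either way, a quick import check confirms the installation (it only assumes that `LightX2VPipeline` is importable, as in the examples below):

```bash
# Sanity check: the package should import cleanly
python -c "from lightx2v import LightX2VPipeline; print('LightX2V OK')"
```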
|
|
|
|
|
### Download Models |
|
|
|
|
|
Download the distilled models from this repository: |
|
|
|
|
|
```bash |
|
|
# Using git-lfs |
|
|
git lfs install |
|
|
git clone https://huggingface.co/lightx2v/Hy1.5-Distill-Models |
|
|
|
|
|
# Or download individual files using huggingface-hub |
|
|
pip install huggingface-hub |
|
|
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='lightx2v/Hy1.5-Distill-Models', filename='hy1.5_t2v_480p_lightx2v_4step.safetensors', local_dir='./models')" |
|
|
``` |
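The same download can also be scripted directly with `huggingface_hub`; this mirrors the one-liner above, extended to the FP8 checkpoint:

```python
from huggingface_hub import hf_hub_download

# Fetch the FP8-quantized 4-step checkpoint into ./models
path = hf_hub_download(
    repo_id="lightx2v/Hy1.5-Distill-Models",
    filename="hy1.5_t2v_480p_scaled_fp8_e4m3_lightx2v_4step.safetensors",
    local_dir="./models",
)
print(f"Downloaded to {path}")
```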
|
|
|
|
|
## 💻 Usage in LightX2V
|
|
|
|
|
### 4-Step Distilled Model (Base Version) |
|
|
|
|
|
```python |
|
|
""" |
|
|
HunyuanVideo-1.5 text-to-video generation example. |
|
|
This example demonstrates how to use LightX2V with the HunyuanVideo-1.5 4-step distilled model for T2V generation.
|
|
""" |
|
|
|
|
|
from lightx2v import LightX2VPipeline |
|
|
|
|
|
# Initialize pipeline for HunyuanVideo-1.5 |
|
|
pipe = LightX2VPipeline( |
|
|
model_path="/path/to/hunyuanvideo-1.5/", # Original model path |
|
|
model_cls="hunyuan_video_1.5", |
|
|
transformer_model_name="480p_t2v", |
|
|
task="t2v", |
|
|
# 4-step distilled model ckpt |
|
|
dit_original_ckpt="/path/to/hy1.5_t2v_480p_lightx2v_4step.safetensors" |
|
|
) |
|
|
|
|
|
# Alternative: create generator from config JSON file |
|
|
# pipe.create_generator(config_json="../configs/hunyuan_video_15/hunyuan_video_t2v_480p.json") |
|
|
|
|
|
# Enable offloading to significantly reduce VRAM usage with minimal speed impact |
|
|
# Suitable for RTX 30/40/50 consumer GPUs |
|
|
pipe.enable_offload( |
|
|
cpu_offload=True, |
|
|
offload_granularity="block", # For HunyuanVideo-1.5, only "block" is supported |
|
|
text_encoder_offload=True, |
|
|
image_encoder_offload=False, |
|
|
vae_offload=False, |
|
|
) |
|
|
|
|
|
# Optional: Use lighttae |
|
|
# pipe.enable_lightvae( |
|
|
# use_tae=True, |
|
|
# tae_path="/path/to/lighttaehy1_5.safetensors", |
|
|
# use_lightvae=False, |
|
|
# vae_path=None, |
|
|
# ) |
|
|
|
|
|
# Create generator with specified parameters |
|
|
# Note: 4-step distillation requires infer_steps=4, guidance_scale=1, and denoising_step_list |
|
|
pipe.create_generator( |
|
|
attn_mode="sage_attn2", |
|
|
infer_steps=4, # 4-step inference |
|
|
num_frames=81, |
|
|
guidance_scale=1, # No CFG needed for distilled models |
|
|
sample_shift=9.0, |
|
|
aspect_ratio="16:9", |
|
|
fps=16, |
|
|
denoising_step_list=[1000, 750, 500, 250] # Required for 4-step distillation |
|
|
) |
|
|
|
|
|
# Generation parameters |
|
|
seed = 123 |
|
|
prompt = "A close-up shot captures a scene on a polished, light-colored granite kitchen counter, illuminated by soft natural light from an unseen window. Initially, the frame focuses on a tall, clear glass filled with golden, translucent apple juice standing next to a single, shiny red apple with a green leaf still attached to its stem. The camera moves horizontally to the right. As the shot progresses, a white ceramic plate smoothly enters the frame, revealing a fresh arrangement of about seven or eight more apples, a mix of vibrant reds and greens, piled neatly upon it. A shallow depth of field keeps the focus sharply on the fruit and glass, while the kitchen backsplash in the background remains softly blurred. The scene is in a realistic style." |
|
|
negative_prompt = "" |
|
|
save_result_path = "/path/to/save_results/output.mp4" |
|
|
|
|
|
# Generate video |
|
|
pipe.generate( |
|
|
seed=seed, |
|
|
prompt=prompt, |
|
|
negative_prompt=negative_prompt, |
|
|
save_result_path=save_result_path, |
|
|
) |
|
|
``` |
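Note that `denoising_step_list=[1000, 750, 500, 250]` must be passed as-is, together with `infer_steps=4` and `guidance_scale=1`; deviating from this configuration is not expected to work with the 4-step checkpoints (see the Important Notes section below).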
|
|
|
|
|
### 4-Step Distilled Model with FP8 Quantization |
|
|
|
|
|
For even lower memory usage, use the FP8 quantized version: |
|
|
|
|
|
```python |
|
|
from lightx2v import LightX2VPipeline |
|
|
|
|
|
# Initialize pipeline |
|
|
pipe = LightX2VPipeline( |
|
|
model_path="/path/to/hunyuanvideo-1.5/", # Original model path |
|
|
model_cls="hunyuan_video_1.5", |
|
|
transformer_model_name="480p_t2v", |
|
|
task="t2v", |
|
|
# 4-step distilled model ckpt |
|
|
dit_original_ckpt="/path/to/hy1.5_t2v_480p_lightx2v_4step.safetensors" |
|
|
) |
|
|
|
|
|
# Enable FP8 quantization for the distilled model |
|
|
pipe.enable_quantize( |
|
|
quant_scheme='fp8-sgl', |
|
|
dit_quantized=True, |
|
|
dit_quantized_ckpt="/path/to/hy1.5_t2v_480p_scaled_fp8_e4m3_lightx2v_4step.safetensors", |
|
|
text_encoder_quantized=False, # Optional: can also quantize text encoder |
|
|
text_encoder_quantized_ckpt="/path/to/hy15_qwen25vl_llm_encoder_fp8_e4m3_lightx2v.safetensors", # Optional |
|
|
image_encoder_quantized=False, |
|
|
) |
|
|
|
|
|
# Enable offloading for lower VRAM usage |
|
|
pipe.enable_offload( |
|
|
cpu_offload=True, |
|
|
offload_granularity="block", |
|
|
text_encoder_offload=True, |
|
|
image_encoder_offload=False, |
|
|
vae_offload=False, |
|
|
) |
|
|
|
|
|
# Create generator |
|
|
pipe.create_generator( |
|
|
attn_mode="sage_attn2", |
|
|
infer_steps=4, |
|
|
num_frames=81, |
|
|
guidance_scale=1, |
|
|
sample_shift=9.0, |
|
|
aspect_ratio="16:9", |
|
|
fps=16, |
|
|
denoising_step_list=[1000, 750, 500, 250] |
|
|
) |
|
|
|
|
|
# Generate video |
|
|
pipe.generate( |
|
|
seed=123, |
|
|
prompt="Your prompt here", |
|
|
negative_prompt="", |
|
|
save_result_path="/path/to/output.mp4", |
|
|
) |
|
|
``` |
|
|
|
|
|
## ⚙️ Key Features
|
|
|
|
|
### 4-Step Distillation |
|
|
|
|
|
These models use **step distillation** to compress the original 50-step inference process into just **4 steps** (a minimal configuration sketch follows the list below), providing:
|
|
|
|
|
* **🚀 Ultra-Fast Inference**: Generate videos in a fraction of the time
|
|
* **💡 No CFG Required**: Set `guidance_scale=1` (no classifier-free guidance needed)
|
|
* **🌟 Quality Preservation**: Maintains high visual quality despite fewer steps
|
|
* **💾 Lower Memory**: Reduced computational requirements
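To make the distillation-specific settings easy to spot, here is a minimal sketch (reusing the `pipe` object from the usage examples above) that isolates the values these checkpoints require:

```python
# Settings that distinguish 4-step distilled inference from the 50-step default;
# merge them into the create_generator() call shown in the full examples above.
distill_overrides = dict(
    infer_steps=4,                              # 4 steps instead of 50
    guidance_scale=1,                           # CFG disabled
    denoising_step_list=[1000, 750, 500, 250],  # distillation timestep schedule
)

pipe.create_generator(
    attn_mode="sage_attn2",
    num_frames=81,
    sample_shift=9.0,
    aspect_ratio="16:9",
    fps=16,
    **distill_overrides,
)
```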
|
|
|
|
|
### FP8 Quantization (Optional) |
|
|
|
|
|
The FP8 quantized version (`hy1.5_t2v_480p_scaled_fp8_e4m3_lightx2v_4step.safetensors`) provides additional benefits (a GPU capability check is sketched after the list below):
|
|
|
|
|
* **~50% Memory Reduction**: Roughly halves the checkpoint size (16.7 GB → 8.85 GB) and further reduces VRAM usage
|
|
* **Faster Computation**: Optimized quantized kernels |
|
|
* **Maintained Quality**: FP8 quantization preserves visual quality |
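Before opting into the FP8 checkpoint, you may want to confirm that your GPU has native FP8 support. As a rule of thumb (our assumption, not a requirement stated by LightX2V), FP8 E4M3 kernels target compute capability 8.9 or newer, e.g. RTX 40-series or Hopper GPUs:

```python
import torch

# Assumption: native FP8 E4M3 kernels need compute capability >= 8.9
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 9):
    print("GPU likely supports native FP8 kernels")
else:
    print("No native FP8 support detected; consider the non-quantized 4-step checkpoint")
```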
|
|
|
|
|
### Requirements |
|
|
|
|
|
For FP8 quantized models, you need to install the SGL kernel: |
|
|
|
|
|
```bash |
|
|
# Requires torch == 2.8.0 |
|
|
pip install sgl-kernel --upgrade |
|
|
``` |
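Since `sgl-kernel` assumes `torch == 2.8.0` (per the comment above), confirm your installed version first:

```bash
python -c "import torch; print(torch.__version__)"
```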
|
|
|
|
|
Alternatively, you can use vLLM kernels:
|
|
|
|
|
```bash |
|
|
pip install vllm |
|
|
``` |
|
|
|
|
|
## 📊 Performance Benefits
|
|
|
|
|
Using the 4-step distilled models provides the following benefits (a timing sketch follows the list below):
|
|
|
|
|
* **~25x Speedup**: Compared to standard 50-step inference with CFG (≈100 model evaluations reduced to 4)
|
|
* **Lower VRAM Requirements**: Enables running on GPUs with less memory |
|
|
* **No CFG Overhead**: Eliminates the need for classifier-free guidance computation |
|
|
* **Production Ready**: Fast enough for real-time or near-real-time applications |
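To verify the speedup on your own hardware, a simple timing sketch (reusing the `pipe` object from the usage examples above; the prompt and output path are placeholders):

```python
import time

start = time.perf_counter()
pipe.generate(
    seed=123,
    prompt="A red apple on a polished kitchen counter",  # placeholder
    negative_prompt="",
    save_result_path="/path/to/output.mp4",              # placeholder
)
print(f"4-step generation took {time.perf_counter() - start:.1f} s")
```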
|
|
|
|
|
## 🔗 Related Resources
|
|
|
|
|
* [LightX2V GitHub Repository](https://github.com/ModelTC/LightX2V) |
|
|
* [LightX2V Documentation](https://lightx2v-en.readthedocs.io/en/latest/) |
|
|
* [HunyuanVideo-1.5 Original Model](https://huggingface.co/tencent/HunyuanVideo-1.5) |
|
|
* [Hy1.5-Quantized-Models](https://huggingface.co/lightx2v/Hy1.5-Quantized-Models) - For quantized inference without distillation |
|
|
* [LightX2V Examples](https://github.com/ModelTC/LightX2V/tree/main/examples) |
|
|
* [Step Distillation Documentation](https://lightx2v-en.readthedocs.io/en/latest/method_tutorials/step_distill.html) |
|
|
|
|
|
## 📝 Important Notes
|
|
|
|
|
* **Critical Configuration**: |
|
|
- Must set `infer_steps=4` (not the default 50) |
|
|
- Must set `guidance_scale=1` (CFG is not used in distilled models) |
|
|
- Must provide `denoising_step_list=[1000, 750, 500, 250]` |
|
|
|
|
|
* **Model Loading**: All advanced configurations (including `enable_quantize()` and `enable_offload()`) must be called **before** `create_generator()`, otherwise they will not take effect. |
|
|
|
|
|
* **Original Model Required**: The original HunyuanVideo-1.5 model weights are still required. The distilled model is used in conjunction with the original model structure. |
|
|
|
|
|
* **Attention Mode**: For best performance, we recommend using SageAttention 2 (`sage_attn2`) as the attention mode. |
|
|
|
|
|
* **Resolution**: Currently supports 480p resolution. Higher resolutions may be available in future releases. |
|
|
|
|
|
## 🤗 Citation
|
|
|
|
|
If you use these distilled models in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{lightx2v, |
|
|
author = {LightX2V Contributors}, |
|
|
title = {LightX2V: Light Video Generation Inference Framework}, |
|
|
year = {2025}, |
|
|
publisher = {GitHub}, |
|
|
journal = {GitHub repository}, |
|
|
howpublished = {\url{https://github.com/ModelTC/lightx2v}}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## 📄 License
|
|
|
|
|
This model is released under the Apache 2.0 License, same as the original HunyuanVideo-1.5 model. |
|
|
|
|
|
|