LongLive2.0 5B NVFP4 4-Step Checkpoint
This repository hosts the LongLive2.0 5B NVFP4 4-step checkpoint for inference with the LongLive2.0 release code:
https://github.com/wileewang/LongLive2.0
LongLive2.0 inference loads the Wan2.2-TI2V-5B generator, applies the few-step DMD adapter when a separate LoRA checkpoint is provided, and runs the generator with NVFP4 weight quantization plus optional FP4 KV-cache quantization.
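To make "NVFP4 weight quantization" a little more concrete, here is a toy sketch of block-scaled 4-bit quantization. It assumes the commonly described NVFP4 layout (E2M1 values, 16 elements sharing one scale) and keeps the scale as a plain Python float. The names `quantize_block`, `dequantize_block`, and `fake_quantize` are hypothetical, and this is not the FourOverSix or LongLive2.0 implementation.

```python
# Toy illustration of block-scaled FP4 (E2M1) quantization.
# Assumptions: block size 16, one float scale per block, plain Python floats.
# This is NOT the FourOverSix or LongLive2.0 code path.

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes an E2M1 code can hold
BLOCK_SIZE = 16


def quantize_block(block):
    """Return (scale, codes) such that scale * code approximates each input value."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_MAGNITUDES[-1]  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Pick the closest representable magnitude (ties go to the smaller one here).
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Undo the scaling to get approximate original values back."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize then dequantize a flat list, one block of 16 at a time."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

Values that already sit on the scaled grid survive the round trip unchanged, while everything else is rounded to the nearest grid point. Real kernels also pack two codes per byte and store the scales in a low-precision format, which this sketch skips.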
Installation
The NVFP4 path uses a stricter environment than the default BF16 release path. We recommend keeping it in a separate conda environment.
git clone https://github.com/wileewang/LongLive2.0.git
cd LongLive2.0
conda create -n longlive2_nvfp4 python=3.12 -y
conda activate longlive2_nvfp4
pip install -r requirements.txt
pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
torch==2.10.0 torchvision==0.25.0
Build the NVFP4 / FP4 extensions:
cd fouroversix
pip install ninja packaging psutil "setuptools>=77.0.3"
# B200 / GB200 / GB300
export CUDA_ARCHS=100
# RTX 50/60 series, if needed
# export CUDA_ARCHS=120
pip install --no-build-isolation -e .
cd ..
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.8.3
pip install -U pip setuptools wheel ninja packaging
pip install --no-build-isolation -e .
cd ..
cd utils/kernel
python setup.py build_ext --inplace
cd ../..
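If it is unclear which `CUDA_ARCHS` value applies, a small script along these lines can suggest one. It is a sketch only: the mapping covers just the two values named in the build steps, it assumes data-center Blackwell parts report compute capability 10.x and RTX 50-series parts report 12.x, and the helper names `suggest_cuda_archs` and `detect_cuda_archs` are hypothetical.

```python
# Hypothetical helper: suggest a CUDA_ARCHS value from the GPU compute capability.
# Only the two values mentioned in the build instructions are covered.
ARCH_BY_MAJOR = {10: "100", 12: "120"}


def suggest_cuda_archs(major):
    """Return the CUDA_ARCHS string for a compute-capability major version, or None."""
    return ARCH_BY_MAJOR.get(major)


def detect_cuda_archs():
    """Query the first visible GPU through PyTorch; return None if that is not possible."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, _minor = torch.cuda.get_device_capability(0)
    return suggest_cuda_archs(major)


if __name__ == "__main__":
    value = detect_cuda_archs()
    if value:
        print(f"export CUDA_ARCHS={value}")
    else:
        print("No matching GPU detected; set CUDA_ARCHS manually.")
```

Run it before building the `fouroversix` extension. If it prints no suggestion, fall back to the values listed in the build steps.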
Quick environment check:
python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda); print(torchvision.__version__)"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import fouroversix; from utils.quant import LongLiveQuantizationConfig, quantize_to_fp4"
python -c "from utils.kernel.kv_dequant import dequantize_kv_cache_fp4"
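The one-liners above stop at the first failing import. As a complement, the following sketch reports every missing top-level package in one pass. It only checks that each module can be located, not that its CUDA extension loads, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Top-level packages used by the environment check above.
REQUIRED_MODULES = ("torch", "torchvision", "flash_attn", "fouroversix")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located on the current import path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All modules found.")
```

A clean result here does not replace the import checks above, since a package can be installed and still fail to import if it was built against a different CUDA or PyTorch version.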
The released LongLive2.0 checkpoint is sufficient for standard inference. You only need to download the original Wan2.2-TI2V-5B components if you want to run training, initialize from the original Wan weights, or use code paths that explicitly load the base Wan model files:
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
--local-dir wan_models/Wan2.2-TI2V-5B
Download this checkpoint repository:
huggingface-cli download Perflow-Shuai/LongLive-2.0-5B-NVFP4-4Step \
--local-dir checkpoints/longlive2_5b_nvfp4_4step
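After the download finishes, it helps to see which weight files actually landed in the local directory before editing the config. The sketch below lists them. The set of file extensions is an assumption, and `list_weight_files` is a hypothetical helper, not part of the LongLive2.0 code.

```python
from pathlib import Path

# Assumed extensions for weight files; adjust to whatever the repository actually ships.
WEIGHT_SUFFIXES = {".pt", ".pth", ".safetensors", ".bin"}


def list_weight_files(root):
    """Return weight-like files under root as sorted paths relative to root."""
    root = Path(root)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file() and p.suffix in WEIGHT_SUFFIXES
    )


if __name__ == "__main__":
    for name in list_weight_files("checkpoints/longlive2_5b_nvfp4_4step"):
        print(name)
```

The printed relative paths can then be appended to `checkpoints/longlive2_5b_nvfp4_4step/` when filling in the checkpoint entries of the config.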
Configure Inference
Edit configs/nvfp4/inference_nvfp4.yaml.
For the released 4-step NVFP4 checkpoint, keep inference.sampling_steps set to 4:
checkpoints:
generator_ckpt: checkpoints/longlive2_5b_nvfp4_4step/path/to/generator.pt
lora_ckpt: null
merge_lora: false
data:
data_path: /path/to/inference_prompts
image_or_video_shape:
- 1
- 384
- 48
- 44
- 80
output_folder: videos/longlive2_nvfp4_4step
num_samples: 1
num_output_frames: 384
inference:
sampling_steps: 4
sink_size: 8
guidance_scale: 1.0
multi_shot_sink: true
multi_shot_rope_offset: 8
kv_quant: true
kv_quant_scale_rule: mse
kv_quant_backend: cuda
streaming_vae: false
async_vae: false
vae_type: wan
model_quant: true
model_quant_use_transformer_engine: false
model_quant_scale_rule: mse
model_quant_activation_scale_rule: mse
model_quant_weight_scale_rule: mse
model_quant_gradient_scale_rule: mse
Replace the checkpoint filename above with the actual file in this repository.
If this repository contains a separate DMD LoRA checkpoint instead of a merged
generator, set checkpoints.lora_ckpt to that LoRA file and set
merge_lora: true, then add the LoRA adapter config:
adapter:
type: lora
rank: 128
alpha: 128
dropout: 0.0
dtype: bfloat16
apply_to_critic: true
verbose: true
If checkpoints.lora_ckpt is null, remove the adapter section.
Do not set model_quant_use_transformer_engine: true when loading a FourOverSix
materialized NVFP4 checkpoint. FourOverSix checkpoints store
quantized_weight_* buffers and should be loaded through the FourOverSix path.
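The rules in this section can be checked mechanically before launching a run. The sketch below works on an already-parsed config dict (for example, the result of `yaml.safe_load` if PyYAML is installed). Because the exact nesting of some keys is not spelled out here, it searches for keys at any depth. `find_key` and `check_config` are hypothetical helpers, not part of the release code.

```python
def find_key(node, key):
    """Depth-first search for key in nested dicts; returns (found, value)."""
    if isinstance(node, dict):
        if key in node:
            return True, node[key]
        for child in node.values():
            found, value = find_key(child, key)
            if found:
                return True, value
    return False, None


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means none were spotted."""
    problems = []

    _, steps = find_key(cfg, "sampling_steps")
    if steps != 4:
        problems.append("sampling_steps should be 4 for this checkpoint")

    _, lora = find_key(cfg, "lora_ckpt")
    has_adapter = "adapter" in cfg  # assumes the adapter section is top-level
    if lora is None and has_adapter:
        problems.append("lora_ckpt is null, so the adapter section should be removed")
    if lora is not None:
        if not has_adapter:
            problems.append("lora_ckpt is set, so an adapter section is expected")
        _, merge = find_key(cfg, "merge_lora")
        if merge is not True:
            problems.append("lora_ckpt is set, so merge_lora should be true")

    _, use_te = find_key(cfg, "model_quant_use_transformer_engine")
    if use_te:
        problems.append(
            "model_quant_use_transformer_engine should stay false for FourOverSix checkpoints"
        )
    return problems
```

An empty list only means these particular rules passed. It does not validate paths or any other setting.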
Prompt Folder
data.data_path can be either:
- a .txt file, where each line is one single-shot prompt; or
- a directory of multi-shot prompt folders.
Example multi-shot prompt folder:
inference_prompts/
  robot_lab_demo/
    0.json
    1.json
    2.json
    shot_durations.txt
Each JSON file contains:
{
  "caption": "A compact silver robot with one blue optic explores a clean robotics lab."
}
shot_durations.txt is optional. If provided, each number is the number of
temporal chunks assigned to the corresponding caption, for example:
2 2 4
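A multi-shot prompt folder in this layout can be generated with a few lines of Python. This is a minimal sketch: `write_multishot_prompt` is a hypothetical helper, and writing the durations as one space-separated line simply follows the example above.

```python
import json
from pathlib import Path


def write_multishot_prompt(folder, captions, durations=None):
    """Write 0.json, 1.json, ... and, optionally, shot_durations.txt into folder."""
    if durations is not None and len(durations) != len(captions):
        raise ValueError("need one duration per caption")
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for index, caption in enumerate(captions):
        # One JSON file per shot, numbered in playback order.
        (folder / f"{index}.json").write_text(
            json.dumps({"caption": caption}, indent=2) + "\n", encoding="utf-8"
        )
    if durations is not None:
        # Space-separated chunk counts, one per caption.
        (folder / "shot_durations.txt").write_text(
            " ".join(str(d) for d in durations) + "\n", encoding="utf-8"
        )
    return folder
```

For example, `write_multishot_prompt("inference_prompts/robot_lab_demo", captions, [2, 2, 4])` with three captions would reproduce the folder shown above.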
Run
Single node, 4 GPUs:
torchrun --standalone --nnodes=1 --nproc_per_node=4 inference.py \
--config_path configs/nvfp4/inference_nvfp4.yaml
Single GPU:
python inference.py --config_path configs/nvfp4/inference_nvfp4.yaml
Alternatively, use the helper script, which reads NUM_GPUS / num_gpus when set:
scripts/inference_nvfp4.sh configs/nvfp4/inference_nvfp4.yaml
Outputs are written to output_folder.
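For launching from Python rather than a shell, the choice between the two commands above can be sketched as follows. This is not the helper script itself: `build_command` is a hypothetical function, and falling back to one GPU when `NUM_GPUS` is unset is an assumption.

```python
import os
import shlex


def build_command(config_path, num_gpus=None):
    """Return the argv list for a single-GPU or single-node multi-GPU run."""
    if num_gpus is None:
        # Assumed default: one GPU when NUM_GPUS is not set.
        num_gpus = int(os.environ.get("NUM_GPUS", "1"))
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        return ["python", "inference.py", "--config_path", config_path]
    return [
        "torchrun", "--standalone", "--nnodes=1", f"--nproc_per_node={num_gpus}",
        "inference.py", "--config_path", config_path,
    ]


if __name__ == "__main__":
    print(shlex.join(build_command("configs/nvfp4/inference_nvfp4.yaml")))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root.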
Notes
- This model card is for the 4-step NVFP4 checkpoint. Use inference.sampling_steps: 4.
- model_quant enables NVFP4 generator inference.
- inference.kv_quant enables FP4 KV-cache storage and requires the utils/kernel extension.
- inference.multi_shot_sink enables the multi-shot attention sink.
- inference.multi_shot_rope_offset controls the multi-shot RoPE offset.
- inference.streaming_vae, inference.async_vae, inference.vae_type, and inference.vae_device control streaming or asynchronous VAE decode.