LongLive2.0 5B NVFP4 4-Step Checkpoint

This repository hosts the LongLive2.0 5B NVFP4 4-step checkpoint for inference with the LongLive2.0 release code:

https://github.com/wileewang/LongLive2.0

LongLive2.0 inference loads the Wan2.2-TI2V-5B generator, applies the few-step DMD adapter when a separate LoRA checkpoint is provided, and runs the generator with NVFP4 weight quantization plus optional FP4 KV-cache quantization.

Installation

The NVFP4 path has stricter environment requirements than the default BF16 release path, so we recommend keeping it in a separate conda environment.

git clone https://github.com/wileewang/LongLive2.0.git
cd LongLive2.0

conda create -n longlive2_nvfp4 python=3.12 -y
conda activate longlive2_nvfp4

pip install -r requirements.txt
pip install --upgrade --index-url https://download.pytorch.org/whl/cu128 \
  torch==2.10.0 torchvision==0.25.0

Build the NVFP4 / FP4 extensions:

cd fouroversix
pip install ninja packaging psutil "setuptools>=77.0.3"

# B200 / GB200 / GB300
export CUDA_ARCHS=100

# RTX 50/60 series, if needed
# export CUDA_ARCHS=120

pip install --no-build-isolation -e .
cd ..

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.8.3
pip install -U pip setuptools wheel ninja packaging
pip install --no-build-isolation -e .
cd ..

cd utils/kernel
python setup.py build_ext --inplace
cd ../..

Quick environment check:

python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda); print(torchvision.__version__)"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import fouroversix; from utils.quant import LongLiveQuantizationConfig, quantize_to_fp4"
python -c "from utils.kernel.kv_dequant import dequantize_kv_cache_fp4"
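The same check can be run as one script that reports every failure instead of stopping at the first. This is a minimal sketch, not part of the release code: the module names are copied from the one-liners above, and `check_imports` is a hypothetical helper name. Run it from the LongLive2.0 repository root so that `utils` resolves to the repository package.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map name -> None on success, or an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # broken builds can raise more than ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Module names taken from the one-line checks above.
REQUIRED = [
    "torch",
    "torchvision",
    "flash_attn",
    "fouroversix",
    "utils.quant",
    "utils.kernel.kv_dequant",
]

if __name__ == "__main__":
    for name, error in check_imports(REQUIRED).items():
        status = "ok  " if error is None else "FAIL"
        detail = "" if error is None else f"  ({error})"
        print(f"{status} {name}{detail}")
```

A FAIL line for `fouroversix`, `flash_attn`, or `utils.kernel.kv_dequant` most likely points at the corresponding build step above.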

The released LongLive2.0 checkpoint is sufficient for standard inference. You only need to download the original Wan2.2-TI2V-5B components if you want to run training, initialize from the original Wan weights, or use code paths that explicitly load the base Wan model files:

huggingface-cli download Wan-AI/Wan2.2-TI2V-5B \
  --local-dir wan_models/Wan2.2-TI2V-5B

Download this checkpoint repository:

huggingface-cli download Perflow-Shuai/LongLive-2.0-5B-NVFP4-4Step \
  --local-dir checkpoints/longlive2_5b_nvfp4_4step

Configure Inference

Edit configs/nvfp4/inference_nvfp4.yaml.

For the released 4-step NVFP4 checkpoint, keep inference.sampling_steps set to 4:

checkpoints:
  generator_ckpt: checkpoints/longlive2_5b_nvfp4_4step/path/to/generator.pt
  lora_ckpt: null

merge_lora: false

data:
  data_path: /path/to/inference_prompts
  image_or_video_shape:
  - 1
  - 384
  - 48
  - 44
  - 80

output_folder: videos/longlive2_nvfp4_4step
num_samples: 1
num_output_frames: 384

inference:
  sampling_steps: 4
  sink_size: 8
  guidance_scale: 1.0
  multi_shot_sink: true
  multi_shot_rope_offset: 8
  kv_quant: true
  kv_quant_scale_rule: mse
  kv_quant_backend: cuda
  streaming_vae: false
  async_vae: false
  vae_type: wan

model_quant: true
model_quant_use_transformer_engine: false
model_quant_scale_rule: mse
model_quant_activation_scale_rule: mse
model_quant_weight_scale_rule: mse
model_quant_gradient_scale_rule: mse
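As a rough reading aid for data.image_or_video_shape, the sketch below converts the five entries to a pixel resolution. It rests on two assumptions that this README does not state: that the entries are ordered [batch, frames, channels, height, width] in latent space, and that the Wan2.2 VAE downsamples each spatial side by 16. `describe_latent_shape` is a hypothetical helper, not a function from the release code.

```python
# Assumption: the Wan2.2 VAE downsamples height and width by 16 (not stated in this README).
SPATIAL_FACTOR = 16


def describe_latent_shape(shape, spatial_factor=SPATIAL_FACTOR):
    """Interpret a 5-entry shape as [batch, frames, channels, height, width] latents."""
    batch, frames, channels, height, width = shape
    return {
        "batch": batch,
        "latent_frames": frames,
        "latent_channels": channels,
        "pixel_height": height * spatial_factor,
        "pixel_width": width * spatial_factor,
    }


# Values copied from the example config.
info = describe_latent_shape([1, 384, 48, 44, 80])
print(info["pixel_width"], "x", info["pixel_height"])  # 1280 x 704
```

Under these assumptions the second entry (384) matches num_output_frames in the same config, so the two values probably need to be changed together.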

Replace the checkpoints.generator_ckpt path with the actual generator file in this repository. If this repository contains a separate DMD LoRA checkpoint instead of a merged generator, set checkpoints.lora_ckpt to that LoRA file, set merge_lora: true, and add the LoRA adapter config:

adapter:
  type: lora
  rank: 128
  alpha: 128
  dropout: 0.0
  dtype: bfloat16
  apply_to_critic: true
  verbose: true

If checkpoints.lora_ckpt is null, remove the adapter section.

Do not set model_quant_use_transformer_engine: true when loading a FourOverSix materialized NVFP4 checkpoint. FourOverSix checkpoints store quantized_weight_* buffers and should be loaded through the FourOverSix path.
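The rules in this section can be checked before launching a long run. The following is a minimal sketch that works on an already-parsed config dictionary; `check_nvfp4_config` is a hypothetical helper, and the Transformer Engine check assumes the checkpoint being loaded is a FourOverSix materialized one.

```python
def check_nvfp4_config(cfg):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    inference = cfg.get("inference") or {}
    checkpoints = cfg.get("checkpoints") or {}

    # The released checkpoint is distilled for 4 sampling steps.
    if inference.get("sampling_steps") != 4:
        problems.append("inference.sampling_steps should be 4")

    has_lora = checkpoints.get("lora_ckpt") is not None
    has_adapter = "adapter" in cfg

    # A separate LoRA checkpoint needs merge_lora: true and an adapter section.
    if has_lora and not cfg.get("merge_lora"):
        problems.append("lora_ckpt is set but merge_lora is not true")
    if has_lora and not has_adapter:
        problems.append("lora_ckpt is set but the adapter section is missing")

    # Without a LoRA checkpoint the adapter section should be removed.
    if not has_lora and has_adapter:
        problems.append("adapter section present but lora_ckpt is null")

    # Assumes a FourOverSix materialized NVFP4 checkpoint is being loaded.
    if cfg.get("model_quant_use_transformer_engine"):
        problems.append("model_quant_use_transformer_engine should be false")

    return problems


# Example: the merged-generator setup described above passes all checks.
example = {
    "checkpoints": {"generator_ckpt": "generator.pt", "lora_ckpt": None},
    "merge_lora": False,
    "inference": {"sampling_steps": 4},
    "model_quant_use_transformer_engine": False,
}
print(check_nvfp4_config(example))  # []
```

To run it against the real file, the dictionary could be produced with PyYAML's `yaml.safe_load`, assuming PyYAML is installed in the environment.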

Prompt Folder

data.data_path can be either:

  • a .txt file, where each line is one single-shot prompt; or
  • a directory of multi-shot prompt folders.
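The two forms of data.data_path can be told apart as follows. This is an illustrative sketch, not the loader used by inference.py; `list_prompt_sources` is a hypothetical name, and skipping blank lines in the .txt file is an assumption.

```python
from pathlib import Path


def list_prompt_sources(data_path):
    """Classify data_path and list its single-shot prompts or multi-shot folders."""
    path = Path(data_path)
    if path.is_file() and path.suffix == ".txt":
        lines = path.read_text(encoding="utf-8").splitlines()
        # Assumption: blank lines are ignored; each remaining line is one prompt.
        return "single_shot", [line.strip() for line in lines if line.strip()]
    if path.is_dir():
        # Each sub-directory is treated as one multi-shot prompt folder.
        return "multi_shot", sorted(p.name for p in path.iterdir() if p.is_dir())
    raise ValueError(f"data_path must be a .txt file or a directory: {data_path}")
```

Calling it on the configured path before a run gives a quick count of how many prompts or prompt folders will be picked up.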

Example multi-shot prompt folder:

inference_prompts/
  robot_lab_demo/
    0.json
    1.json
    2.json
    shot_durations.txt

Each JSON file contains:

{
  "caption": "A compact silver robot with one blue optic explores a clean robotics lab."
}

shot_durations.txt is optional. If provided, it lists one number per caption, in caption order, and each number is the count of temporal chunks assigned to that caption. For the three captions above, for example:

2 2 4
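A prompt folder in this layout can be generated with a short script. This is a sketch under the layout described above; `write_multishot_prompt` is a hypothetical helper, and the second and third captions in the usage example are placeholders rather than prompts from the release.

```python
import json
from pathlib import Path


def write_multishot_prompt(root, name, captions, durations=None):
    """Write 0.json, 1.json, ... and an optional shot_durations.txt under root/name."""
    if durations is not None and len(durations) != len(captions):
        raise ValueError("durations must have one entry per caption")

    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)

    # One JSON file per shot, numbered from 0, each holding a single "caption" key.
    for index, caption in enumerate(captions):
        payload = json.dumps({"caption": caption}, indent=2)
        (folder / f"{index}.json").write_text(payload + "\n", encoding="utf-8")

    # Space-separated chunk counts, in the same order as the captions.
    if durations is not None:
        line = " ".join(str(d) for d in durations)
        (folder / "shot_durations.txt").write_text(line + "\n", encoding="utf-8")

    return folder


if __name__ == "__main__":
    write_multishot_prompt(
        "inference_prompts",
        "robot_lab_demo",
        [
            "A compact silver robot with one blue optic explores a clean robotics lab.",
            "Placeholder caption for the second shot.",
            "Placeholder caption for the third shot.",
        ],
        durations=[2, 2, 4],
    )
```

Point data.data_path at the parent directory (inference_prompts in this example), not at the individual prompt folder.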

Run

Single node, 4 GPUs:

torchrun --standalone --nnodes=1 --nproc_per_node=4 inference.py \
  --config_path configs/nvfp4/inference_nvfp4.yaml

Single GPU:

python inference.py --config_path configs/nvfp4/inference_nvfp4.yaml

Or use the helper script, which reads NUM_GPUS / num_gpus when provided:

scripts/inference_nvfp4.sh configs/nvfp4/inference_nvfp4.yaml

Outputs are written to output_folder.
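When launching from a Python driver, for example to sweep over several configs, the two launch forms in this section can be assembled programmatically. This sketch only mirrors the commands shown above and is not what scripts/inference_nvfp4.sh does internally; `build_inference_command` is a hypothetical helper.

```python
import shlex


def build_inference_command(config_path, num_gpus=1):
    """Return the argv list for a single-GPU or single-node multi-GPU run."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        return ["python", "inference.py", "--config_path", config_path]
    return [
        "torchrun",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={num_gpus}",
        "inference.py",
        "--config_path",
        config_path,
    ]


command = build_inference_command("configs/nvfp4/inference_nvfp4.yaml", num_gpus=4)
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run(command, check=True)` from the repository root.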

Notes

  • This model card is for the 4-step NVFP4 checkpoint. Use inference.sampling_steps: 4.
  • model_quant enables NVFP4 generator inference.
  • inference.kv_quant enables FP4 KV-cache storage and requires the utils/kernel extension.
  • inference.multi_shot_sink enables the multi-shot attention sink.
  • inference.multi_shot_rope_offset controls the multi-shot RoPE offset.
  • inference.streaming_vae, inference.async_vae, inference.vae_type, and inference.vae_device control VAE decoding, including the streaming and asynchronous decode paths. inference.vae_device is not set in the example config above.