# Standalone Inference Helper
This folder contains a portable inference helper for:
`sfp4_v4_sparse09_hpo_on_ours_p_init2050_1n_interactive/checkpoint-700`
It is not a full vendored copy of Wan or FastVideo. Instead, it bundles the
sparse FP4 attention backend overlay and a runner that can be applied to an
existing FastVideo checkout or installation so that the uploaded checkpoint can
be used for normal inference.
## Contents
- `run_inference.py`: downloads/loads `transformer/diffusion_pytorch_model.safetensors` from `yitongl/sparse_quant_exp` and runs `VideoGenerator`.
- `run.sh`: convenience wrapper that installs the overlay into `FASTVIDEO_ROOT` and then runs `run_inference.py`.
- `install_overlay.py`: copies the bundled sparse FP4 backend files into a FastVideo checkout/install.
- `overlay_files/`: exact runtime source files needed by `SPARSE_FP4_OURS_P_ATTN`.
- `training_attention_settings.json`: structured settings for the uploaded checkpoint.
## Expected Environment
- A working FastVideo Python environment.
- FastVideo dependencies installed, including PyTorch, Triton, safetensors, and
  Hugging Face Hub (a quick import check is sketched after this list).
- Access to the base model `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- A CUDA GPU supported by the custom Triton kernels.
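A rough pre-flight check that the core dependencies above are importable; the exact version pins a given FastVideo checkout expects may differ:
```python
# Rough pre-flight check for the dependencies listed above.
import torch
import triton
import safetensors
import huggingface_hub

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("triton", triton.__version__)
print("safetensors", safetensors.__version__)
print("huggingface_hub", huggingface_hub.__version__)
```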
## Usage
On a machine where this HF repo has been downloaded:
```bash
export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
    --output-path outputs/sfp4_checkpoint_700 \
    --seed 1000
```
The script sets:
```bash
FASTVIDEO_ATTENTION_BACKEND=SPARSE_FP4_OURS_P_ATTN
FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O=1
```
and downloads the uploaded checkpoint-700 transformer weights from
`yitongl/sparse_quant_exp` unless `--weights` is provided.
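If you want to fetch those weights yourself (for example, to pass them via `--weights`), `huggingface_hub` can do it directly. This mirrors what `run_inference.py` is described as doing, though the exact call it makes internally is an assumption:
```python
from huggingface_hub import hf_hub_download

# Download the checkpoint-700 transformer weights from the HF repo.
# run_inference.py reportedly does the same when --weights is not given;
# the exact arguments it uses internally are an assumption.
weights_path = hf_hub_download(
    repo_id="yitongl/sparse_quant_exp",
    filename="transformer/diffusion_pytorch_model.safetensors",
)
print(weights_path)  # pass this path to run.sh via --weights
```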
To use a local safetensors file:
```bash
export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
    --weights /path/to/diffusion_pytorch_model.safetensors \
    --prompt "your prompt"
```
## Attention Semantics
- Self-attention uses `SPARSE_FP4_OURS_P_ATTN`:
  - Q/K/V use FP4 fake quantization with STE (see the sketch after this list).
  - The VSA tile size is `4 x 4 x 4 = 64` tokens.
  - Selected sparse tiles use group-local P quantization in the Triton kernel.
  - Dropped tiles use tile mean compensation.
- Cross-attention falls back to dense SDPA and is not sparse/FP4.
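The actual quantization lives in the Triton kernels under `overlay_files/`. As a rough illustration of the fake-quantization-with-STE idea only, here is a minimal PyTorch sketch; the E2M1-style value grid, per-tensor absmax scaling, and function name are assumptions, not the checkpoint's confirmed scheme:
```python
import torch

# Illustrative FP4 fake quantization with a straight-through estimator (STE).
# The E2M1-style value grid and per-tensor absmax scaling are assumptions;
# the real scheme is implemented in the Triton kernels under overlay_files/.
_FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4_ste(x: torch.Tensor) -> torch.Tensor:
    grid = _FP4_GRID.to(device=x.device, dtype=x.dtype)
    # Scale so the largest magnitude maps onto the largest representable value.
    scale = x.abs().amax().clamp(min=1e-8) / grid[-1]
    mag = (x.abs() / scale).clamp(max=grid[-1])
    # Snap each magnitude to the nearest grid value via midpoint boundaries.
    midpoints = (grid[:-1] + grid[1:]) / 2
    idx = torch.bucketize(mag, midpoints)
    q = torch.sign(x) * grid[idx] * scale
    # STE: forward pass uses the quantized values, backward pass is identity.
    return x + (q - x).detach()
```
The `x + (q - x).detach()` trick is the standard STE formulation: the forward pass sees the quantized values while gradients flow through as if no quantization had happened.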
## Checkpoint
The transformer file currently on the HF `main` branch is checkpoint-700:
`transformer/diffusion_pytorch_model.safetensors`
SHA256 of the local copy used when preparing this helper:
`4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2`
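To confirm that a downloaded copy matches, you can hash it locally with plain `hashlib`; the path below assumes you are in the repo root:
```python
import hashlib

# Hash the downloaded weights and compare against the checksum above.
def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("transformer/diffusion_pytorch_model.safetensors"))
# Expected: 4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2
```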