# Standalone Inference Helper
This folder contains a portable inference helper for:
`sfp4_v4_sparse09_hpo_on_ours_p_init2050_1n_interactive/checkpoint-700`
It is not a full vendored copy of Wan or FastVideo. Instead, it bundles the
sparse FP4 attention backend overlay and a runner that can be applied to an
existing FastVideo checkout or installation so that the uploaded checkpoint can
be used for normal inference.
## Contents
- `run_inference.py`: downloads/loads `transformer/diffusion_pytorch_model.safetensors` from `yitongl/sparse_quant_exp` and runs `VideoGenerator`.
- `run.sh`: convenience wrapper that installs the overlay into `FASTVIDEO_ROOT` and then runs `run_inference.py`.
- `install_overlay.py`: copies the bundled sparse FP4 backend files into a FastVideo checkout/install.
- `overlay_files/`: exact runtime source files needed by `SPARSE_FP4_OURS_P_ATTN`.
- `training_attention_settings.json`: structured settings for the uploaded checkpoint.
## Expected Environment
- A working FastVideo Python environment.
- FastVideo dependencies installed, including PyTorch, Triton, safetensors, and
  Hugging Face Hub (a quick import check is sketched after this list).
- Access to the base model `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- A CUDA GPU supported by the custom Triton kernels.
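A rough pre-flight check that the core dependencies above are importable; the exact version pins a given FastVideo checkout expects may differ:
```python
# Rough pre-flight check for the dependencies listed above.
import torch
import triton
import safetensors
import huggingface_hub

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("triton", triton.__version__)
print("safetensors", safetensors.__version__)
print("huggingface_hub", huggingface_hub.__version__)
```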
## Usage
On a machine where this HF repo has been downloaded:
```bash
export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
    --output-path outputs/sfp4_checkpoint_700 \
    --seed 1000
```
The script sets:
```bash
FASTVIDEO_ATTENTION_BACKEND=SPARSE_FP4_OURS_P_ATTN
FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O=1
```
and downloads the uploaded checkpoint-700 transformer weights from
`yitongl/sparse_quant_exp` unless `--weights` is provided.
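If you want to fetch those weights yourself (for example, to pass them via `--weights`), `huggingface_hub` can do it directly. This mirrors what `run_inference.py` is described as doing, though the exact call it makes internally is an assumption:
```python
from huggingface_hub import hf_hub_download

# Download the checkpoint-700 transformer weights from the HF repo.
# run_inference.py reportedly does the same when --weights is not given;
# the exact arguments it uses internally are an assumption.
weights_path = hf_hub_download(
    repo_id="yitongl/sparse_quant_exp",
    filename="transformer/diffusion_pytorch_model.safetensors",
)
print(weights_path)  # pass this path to run.sh via --weights
```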
To use a local safetensors file:
```bash
export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
    --weights /path/to/diffusion_pytorch_model.safetensors \
    --prompt "your prompt"
```
## Attention Semantics
- Self-attention uses `SPARSE_FP4_OURS_P_ATTN`:
  - Q/K/V use FP4 fake quantization with STE (see the sketch after this list).
  - The VSA tile size is `4 x 4 x 4 = 64` tokens.
  - Selected sparse tiles use group-local P quantization in the Triton kernel.
  - Dropped tiles use tile mean compensation.
- Cross-attention falls back to dense SDPA and is not sparse/FP4.
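The actual quantization lives in the Triton kernels under `overlay_files/`. As a rough illustration of the fake-quantization-with-STE idea only, here is a minimal PyTorch sketch; the E2M1-style value grid, per-tensor absmax scaling, and function name are assumptions, not the checkpoint's confirmed scheme:
```python
import torch

# Illustrative FP4 fake quantization with a straight-through estimator (STE).
# The E2M1-style value grid and per-tensor absmax scaling are assumptions;
# the real scheme is implemented in the Triton kernels under overlay_files/.
_FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4_ste(x: torch.Tensor) -> torch.Tensor:
    grid = _FP4_GRID.to(device=x.device, dtype=x.dtype)
    # Scale so the largest magnitude maps onto the largest representable value.
    scale = x.abs().amax().clamp(min=1e-8) / grid[-1]
    mag = (x.abs() / scale).clamp(max=grid[-1])
    # Snap each magnitude to the nearest grid value via midpoint boundaries.
    midpoints = (grid[:-1] + grid[1:]) / 2
    idx = torch.bucketize(mag, midpoints)
    q = torch.sign(x) * grid[idx] * scale
    # STE: forward pass uses the quantized values, backward pass is identity.
    return x + (q - x).detach()
```
The `x + (q - x).detach()` trick is the standard STE formulation: the forward pass sees the quantized values while gradients flow through as if no quantization had happened.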
## Checkpoint
The transformer file currently on the HF `main` branch is checkpoint-700:
`transformer/diffusion_pytorch_model.safetensors`
SHA256 of the local copy used when preparing this helper:
`4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2`
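To confirm that a downloaded copy matches, you can hash it locally with plain `hashlib`; the path below assumes you are in the repo root:
```python
import hashlib

# Hash the downloaded weights and compare against the checksum above.
def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("transformer/diffusion_pytorch_model.safetensors"))
# Expected: 4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2
```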