# Standalone Inference Helper

This folder contains a portable inference helper for:
`sfp4_v4_sparse09_hpo_on_ours_p_init2050_1n_interactive/checkpoint-700`

It is not a full vendored copy of Wan or FastVideo. It contains the sparse FP4 backend overlay and a runner that can be applied to a FastVideo checkout or installation, so the uploaded checkpoint can be used for normal inference.

## Contents

- `run_inference.py`: downloads/loads `transformer/diffusion_pytorch_model.safetensors` from `yitongl/sparse_quant_exp` and runs `VideoGenerator`.
- `run.sh`: convenience wrapper that installs the overlay into `FASTVIDEO_ROOT` and then runs `run_inference.py`.
- `install_overlay.py`: copies the bundled sparse FP4 backend files into a FastVideo checkout/install.
- `overlay_files/`: the exact runtime source files needed by `SPARSE_FP4_OURS_P_ATTN`.
- `training_attention_settings.json`: structured settings for the uploaded checkpoint.

## Expected Environment

- A working FastVideo Python environment.
- FastVideo dependencies installed, including PyTorch, Triton, safetensors, and Hugging Face Hub.
- Access to the base model `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- A CUDA GPU supported by the custom Triton kernels.

## Usage

From a machine with this HF repo downloaded:

```bash
export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
  --output-path outputs/sfp4_checkpoint_700 \
  --seed 1000
```

The script sets:

```bash
FASTVIDEO_ATTENTION_BACKEND=SPARSE_FP4_OURS_P_ATTN
FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O=1
```

and downloads the uploaded checkpoint-700 transformer weights unless `--weights` is provided.

To use a local safetensors file:

```bash
export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
  --weights /path/to/diffusion_pytorch_model.safetensors \
  --prompt "your prompt"
```

## Attention Semantics

- Self-attention uses `SPARSE_FP4_OURS_P_ATTN`.
- Q/K/V use FP4 fake quantization with STE (straight-through estimator).
- The VSA tile size is `4 x 4 x 4 = 64` tokens.
- Selected sparse tiles use group-local P quantization in the Triton kernel.
- Dropped tiles use tile-mean compensation.
- Cross-attention falls back to dense SDPA and is not sparse/FP4.

## Checkpoint

The current HF `main` transformer file is checkpoint-700:
`transformer/diffusion_pytorch_model.safetensors`

Local SHA256 used when preparing this helper:
`4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2`
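If you want to confirm that a locally cached transformer file is the same checkpoint-700 artifact before passing it to `run_inference.py` via `--weights`, a minimal sketch along these lines compares its SHA256 against the value above. The repo ID, filename, and hash come from the sections above; the script itself is illustrative and not part of the helper, and it assumes `huggingface_hub` is installed (as listed under Expected Environment).

```python
# Illustrative sketch: verify that the downloaded (or cached) checkpoint-700
# transformer weights match the SHA256 recorded in this README.
import hashlib

from huggingface_hub import hf_hub_download

EXPECTED_SHA256 = "4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2"

# Download (or reuse the cached copy of) the checkpoint-700 transformer weights.
weights_path = hf_hub_download(
    repo_id="yitongl/sparse_quant_exp",
    filename="transformer/diffusion_pytorch_model.safetensors",
)

# Hash the file in 1 MiB chunks to keep memory usage flat for a multi-GB safetensors file.
sha256 = hashlib.sha256()
with open(weights_path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

if sha256.hexdigest() != EXPECTED_SHA256:
    raise RuntimeError(f"Unexpected SHA256 for {weights_path}: {sha256.hexdigest()}")
print(f"OK: {weights_path} matches checkpoint-700")
```

If the check passes, the resulting `weights_path` is the file to hand to `run.sh --weights`.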