| # Standalone Inference Helper |
|
|
| This folder contains a portable inference helper for: |
|
|
| `sfp4_v4_sparse09_hpo_on_ours_p_init2050_1n_interactive/checkpoint-700` |
|
|
It is not a full vendored copy of Wan or FastVideo. It contains the sparse FP4
backend overlay and a runner script; the overlay can be applied to a FastVideo
checkout or installation so that the uploaded checkpoint can be used for normal
inference.
|
|
| ## Contents |
|
|
| - `run_inference.py`: downloads/loads `transformer/diffusion_pytorch_model.safetensors` from `yitongl/sparse_quant_exp` and runs `VideoGenerator`. |
| - `run.sh`: convenience wrapper that installs the overlay into `FASTVIDEO_ROOT` and then runs `run_inference.py`. |
| - `install_overlay.py`: copies the bundled sparse FP4 backend files into a FastVideo checkout/install. |
| - `overlay_files/`: exact runtime source files needed by `SPARSE_FP4_OURS_P_ATTN`. |
| - `training_attention_settings.json`: structured settings for the uploaded checkpoint. |
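
To see what the bundled settings file contains, it can be printed directly. A minimal sketch, assuming it is run from the repository root (the exact keys depend on the checkpoint):

```python
import json
from pprint import pprint

# Print the structured attention settings bundled with the helper.
with open("standalone_inference/training_attention_settings.json") as f:
    pprint(json.load(f))
```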
|
|
| ## Expected Environment |
|
|
| - A working FastVideo Python environment. |
| - FastVideo dependencies installed, including PyTorch, Triton, safetensors, and |
| Hugging Face Hub. |
| - Access to the base model `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`. |
| - A CUDA GPU supported by the custom Triton kernels. |
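
A quick way to sanity-check the environment before running the helper; this is only a minimal sketch, not a substitute for the FastVideo install instructions:

```python
import importlib
import torch

# Confirm the core dependencies the helper relies on are importable.
for module in ("triton", "safetensors", "huggingface_hub"):
    importlib.import_module(module)

# The custom Triton kernels need a CUDA device.
assert torch.cuda.is_available(), "no CUDA GPU visible to PyTorch"
print(torch.__version__, torch.cuda.get_device_name(0))
```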
|
|
| ## Usage |
|
|
On a machine where this HF repo has been downloaded:
|
|
| ```bash |
| export FASTVIDEO_ROOT=/path/to/FastVideo |
| bash standalone_inference/run.sh \ |
| --output-path outputs/sfp4_checkpoint_700 \ |
| --seed 1000 |
| ``` |
|
|
| The script sets: |
|
|
| ```bash |
| FASTVIDEO_ATTENTION_BACKEND=SPARSE_FP4_OURS_P_ATTN |
| FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O=1 |
| ``` |
|
|
| and downloads the uploaded checkpoint-700 transformer weights unless `--weights` |
| is provided. |
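
If you prefer to fetch the weights yourself and pass them in via `--weights`, a minimal sketch using `huggingface_hub` (the repo id and file path are the ones referenced by `run_inference.py`):

```python
from huggingface_hub import hf_hub_download

# Download the checkpoint-700 transformer weights; pass the returned
# path to run.sh via --weights.
weights_path = hf_hub_download(
    repo_id="yitongl/sparse_quant_exp",
    filename="transformer/diffusion_pytorch_model.safetensors",
)
print(weights_path)
```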
|
|
| To use a local safetensors file: |
|
|
| ```bash |
| export FASTVIDEO_ROOT=/path/to/FastVideo |
| bash standalone_inference/run.sh \ |
| --weights /path/to/diffusion_pytorch_model.safetensors \ |
| --prompt "your prompt" |
| ``` |
|
|
| ## Attention Semantics |
|
|
| - Self-attention uses `SPARSE_FP4_OURS_P_ATTN`. |
- Q/K/V use FP4 fake quantization with STE (see the sketch after this list).
| - VSA tile size is `4 x 4 x 4 = 64` tokens. |
| - Selected sparse tiles use group-local P quantization in the Triton kernel. |
| - Dropped tiles use tile mean compensation. |
| - Cross-attention falls back to dense SDPA and is not sparse/FP4. |
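
For orientation, here is a minimal sketch of FP4 (E2M1) fake quantization with a straight-through estimator. It uses a single per-tensor scale for illustration only; the actual Triton kernel applies its own scaling granularity (e.g. per tile/group) and handles the sparse tile selection described above.

```python
import torch

# Representable magnitudes of the FP4 E2M1 format (sign handled separately).
FP4_E2M1_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_fake_quant_ste(x: torch.Tensor) -> torch.Tensor:
    """Round x to the nearest FP4 (E2M1) value after per-tensor scaling;
    gradients pass straight through (STE)."""
    levels = FP4_E2M1_LEVELS.to(x.device, x.dtype)
    # Scale so the largest magnitude maps to the largest FP4 level (6.0).
    scale = x.abs().amax().clamp(min=1e-12) / levels[-1]
    scaled = (x / scale).abs()
    # Snap each element to the nearest representable magnitude.
    idx = (scaled.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    quant = levels[idx] * x.sign() * scale
    # Straight-through estimator: forward uses quant, backward sees identity.
    return x + (quant - x).detach()
```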
|
|
| ## Checkpoint |
|
|
The transformer file currently on the HF `main` branch is checkpoint-700:
|
|
| `transformer/diffusion_pytorch_model.safetensors` |
|
|
SHA256 of the local weights file recorded when preparing this helper:
|
|
| `4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2` |
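
To verify a downloaded copy against this checksum, a minimal sketch:

```python
import hashlib

EXPECTED = "4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Hash the file in chunks to avoid loading it fully into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

digest = sha256_of("transformer/diffusion_pytorch_model.safetensors")
assert digest == EXPECTED, f"checksum mismatch: {digest}"
```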
|
|