# Standalone Inference Helper

This folder contains a portable inference helper for:

`sfp4_v4_sparse09_hpo_on_ours_p_init2050_1n_interactive/checkpoint-700`

This is not a full vendored copy of Wan or FastVideo. It contains only the sparse FP4
backend overlay and a runner that can be applied to an existing FastVideo checkout or
installation, so that the uploaded checkpoint can be used for normal inference.

## Contents

- `run_inference.py`: downloads/loads `transformer/diffusion_pytorch_model.safetensors` from `yitongl/sparse_quant_exp` and runs `VideoGenerator`.
- `run.sh`: convenience wrapper that installs the overlay into `FASTVIDEO_ROOT` and then runs `run_inference.py`.
- `install_overlay.py`: copies the bundled sparse FP4 backend files into a FastVideo checkout/install.
- `overlay_files/`: exact runtime source files needed by `SPARSE_FP4_OURS_P_ATTN`.
- `training_attention_settings.json`: structured settings for the uploaded checkpoint.
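
The core of what `install_overlay.py` does can be sketched as a recursive copy of `overlay_files/` into the FastVideo tree. This is a simplified illustration, not the actual script: the real installer may filter or remap paths, and `install_overlay` is just an illustrative name.

```python
import shutil
from pathlib import Path

def install_overlay(overlay_dir: str, fastvideo_root: str) -> list[str]:
    """Copy every file under overlay_dir into fastvideo_root, preserving
    relative paths, and return the relative paths that were written."""
    src = Path(overlay_dir)
    dst = Path(fastvideo_root)
    written = []
    for f in sorted(src.rglob("*")):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)  # copy2 preserves timestamps/metadata
            written.append(str(target.relative_to(dst)))
    return written
```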

## Expected Environment

- A working FastVideo Python environment.
- FastVideo dependencies installed, including PyTorch, Triton, safetensors, and
  Hugging Face Hub.
- Access to the base model `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
- A CUDA GPU supported by the custom Triton kernels.
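
The dependencies above can be sanity-checked before launching a run. The helper below is purely illustrative (it is not part of this repo) and only reports which of the named packages are importable, plus CUDA availability when PyTorch is present.

```python
import importlib.util

def check_env() -> dict[str, bool]:
    """Report which of the key dependencies named in this README are
    importable. CUDA availability is checked only if torch imports."""
    mods = ["torch", "triton", "safetensors", "huggingface_hub"]
    status = {m: importlib.util.find_spec(m) is not None for m in mods}
    if status["torch"]:
        import torch
        status["cuda"] = torch.cuda.is_available()
    return status
```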

## Usage

On a machine where this HF repo has been downloaded:

```bash
export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
  --output-path outputs/sfp4_checkpoint_700 \
  --seed 1000
```

The script sets:

```bash
FASTVIDEO_ATTENTION_BACKEND=SPARSE_FP4_OURS_P_ATTN
FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O=1
```

and downloads the uploaded checkpoint-700 transformer weights unless `--weights`
is provided.

To use a local safetensors file:

```bash
export FASTVIDEO_ROOT=/path/to/FastVideo
bash standalone_inference/run.sh \
  --weights /path/to/diffusion_pytorch_model.safetensors \
  --prompt "your prompt"
```

## Attention Semantics

- Self-attention uses `SPARSE_FP4_OURS_P_ATTN`.
- Q/K/V use FP4 fake quantization with STE.
- VSA tile size is `4 x 4 x 4 = 64` tokens.
- Selected sparse tiles use group-local P quantization in the Triton kernel.
- Dropped tiles use tile mean compensation.
- Cross-attention falls back to dense SDPA and is not sparse/FP4.
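
The FP4 fake-quantization step above can be illustrated with a NumPy sketch, assuming the common E2M1 value set (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}) and a single absmax scale. The real Triton kernels use group-local scales (per the P-quantization bullet) and may differ in detail; in training, the straight-through estimator (STE) simply passes gradients through the rounding step unchanged.

```python
import numpy as np

# E2M1 FP4 representable magnitudes (sign is handled separately).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_fake_quant(x: np.ndarray) -> np.ndarray:
    """Forward pass of FP4 fake quantization: scale so the tensor's
    absmax maps to the largest FP4 magnitude, snap each value to the
    nearest representable level, then scale back."""
    scale = np.abs(x).max() / FP4_LEVELS[-1]
    if scale == 0:
        return x.copy()
    mag = np.abs(x) / scale
    # Nearest-level lookup by broadcasting against the level table.
    idx = np.abs(mag[..., None] - FP4_LEVELS).argmin(axis=-1)
    return np.sign(x) * FP4_LEVELS[idx] * scale
```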

## Checkpoint

The current HF `main` transformer file is checkpoint-700:

`transformer/diffusion_pytorch_model.safetensors`

Local SHA256 used when preparing this helper:

`4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2`
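
A local download can be checked against this hash with a short streaming helper (illustrative; `sha256_of` is not part of the repo). Streaming in chunks avoids loading a multi-GB safetensors file into memory.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading 1 MiB at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Expected digest from this README for the checkpoint-700 transformer file.
EXPECTED = "4595ca81ea7085c15ccf14b738aa9c0fdf2d2786641f49b55e0bc0e99bf042d2"
```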