# Backend snapshot for checkpoint-700

This directory is the code snapshot for the training backend used by:
`sfp4_v4_sparse09_hpo_on_ours_p_init2050_1n_interactive/checkpoint-700`
Key runtime settings:

- `FASTVIDEO_ATTENTION_BACKEND=SPARSE_FP4_OURS_P_ATTN`
- `FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O=1`
- `VSA_SPARSITY=0.9`
- `VSA_INIT_SPARSITY=0.9`
- `VSA_WARMUP_STEPS=0`
- Tile size: `4 x 4 x 4 = 64` video tokens
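
The launch scripts under `scripts/training/` export these from the shell; a minimal sketch of reproducing them from Python instead, assuming they are read from the process environment as the `KEY=VALUE` form suggests:

```python
import os

# Set the checkpoint's runtime settings before FastVideo initializes;
# values mirror the list above (the tile size is not an env var).
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "SPARSE_FP4_OURS_P_ATTN"
os.environ["FASTVIDEO_SPARSE_FP4_USE_HIGH_PREC_O"] = "1"  # high-precision output path
os.environ["VSA_SPARSITY"] = "0.9"        # target sparsity
os.environ["VSA_INIT_SPARSITY"] = "0.9"   # start already at target sparsity
os.environ["VSA_WARMUP_STEPS"] = "0"      # no sparsity warmup
```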
Training attention semantics:

- Video self-attention uses `SPARSE_FP4_OURS_P_ATTN`.
- Cross-attention is not quantized or sparsified in this backend; it falls back to dense SDPA when `query_length != key_length`.
- `force_dense` paths also use dense SDPA.
- Q/K/V fake quantization uses FP4 with a straight-through estimator (STE) and no q/k mean subtraction (see the sketch after this list).
- Selected sparse tiles use group-local P quantization in the Triton kernel.
- Dropped VSA tiles use a tile-level q_mean/k_mean score plus mean_v compensation.
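
A minimal sketch of FP4 fake quantization with an STE, assuming an E2M1 value grid and a single absmax scale; the real kernels in `quant_utils.py` quantize per group and may use a different scale granularity and rounding mode:

```python
import torch

# Magnitudes representable in FP4 (E2M1); the sign is handled separately.
_FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_fake_quant_ste(x: torch.Tensor) -> torch.Tensor:
    """Fake-quantize x to FP4, passing gradients straight through."""
    # Absmax scaling maps the largest magnitude onto 6.0, the top of the grid.
    scale = x.detach().abs().amax().clamp(min=1e-12) / 6.0
    grid = _FP4_GRID.to(device=x.device, dtype=x.dtype)
    # Snap each scaled magnitude to its nearest representable value.
    mag = (x.detach() / scale).abs()
    idx = (mag.unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = torch.sign(x.detach()) * grid[idx] * scale
    # STE: the forward pass sees q, the backward pass sees the identity.
    return x + (q - x).detach()
```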
Important files:

- `fastvideo/attention/backends/sparse_fp4_ours_p_attn.py`: Python attention backend; Q/K/V fake quantization, top-k block map (illustrated in the sketch after this list), and tile mean setup.
- `fastvideo-kernel/python/fastvideo_kernel/block_sparse_attn_ours_p.py`: PyTorch custom op and autograd wrapper.
- `fastvideo-kernel/python/fastvideo_kernel/triton_kernels/block_sparse_attn_triton_ours_p.py`: Triton forward/backward kernel.
- `fastvideo-kernel/python/fastvideo_kernel/triton_kernels/nvfp4_utils.py`: FP4 quant/dequant utilities used by the kernel.
- `fastvideo-kernel/python/fastvideo_kernel/triton_kernels/quant_utils.py`: Q/K/V fake-quantization kernels.
- `fastvideo/attention/backends/video_sparse_attn.py`: VSA metadata and tile-size helper.
- `fastvideo/platforms/interface.py` and `fastvideo/platforms/cuda.py`: backend enum and CUDA backend-selection wiring.
- `fastvideo/training/training_pipeline.py` and `fastvideo/training/wan_training_pipeline.py`: legacy SFT training path used by the launch script.
- `scripts/training/run_sparse_fp4_train_v4_1n_sparse09_hpo_on_ours_p_init2050_interactive.sh`: exact Slurm wrapper for this run.
- `scripts/training/run_sparse_fp4_train_v4_common.sh`: common SFT launch/resume script.
- `training_attention_settings.json`: structured attention/training settings for this checkpoint.
- `scripts/inference/run_sfp4_ours_p_checkpoint_700.sh`: inference example for the uploaded transformer checkpoint.
- `fastvideo/entrypoints/cli/generate.py`, `fastvideo/entrypoints/video_generator.py`, `fastvideo/pipelines/basic/wan/wan_pipeline.py`, and `fastvideo/pipelines/stages/denoising.py`: the `fastvideo generate` inference path used by the example script.
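
For intuition, a hedged sketch of a top-k block map built from tile means, under the assumptions of a `[batch, heads, seq, dim]` layout and 64-token tiles; the actual backend derives its map from VSA metadata and the `4 x 4 x 4` video tiles, so treat this as illustrative only:

```python
import torch

def topk_block_map(q: torch.Tensor, k: torch.Tensor,
                   tile_tokens: int = 64, sparsity: float = 0.9) -> torch.Tensor:
    """Return a boolean [b, h, t, t] map of query/key tile pairs to compute."""
    b, h, s, d = q.shape
    assert s % tile_tokens == 0, "sequence length must tile evenly"
    t = s // tile_tokens
    # Pool each tile to its mean, mirroring the tile-level q_mean/k_mean
    # scoring described above.
    q_mean = q.reshape(b, h, t, tile_tokens, d).mean(dim=3)
    k_mean = k.reshape(b, h, t, tile_tokens, d).mean(dim=3)
    scores = q_mean @ k_mean.transpose(-1, -2)      # [b, h, t, t]
    keep = max(1, round(t * (1.0 - sparsity)))      # keep densest 10% at 0.9
    topk = scores.topk(keep, dim=-1).indices
    block_map = torch.zeros_like(scores, dtype=torch.bool)
    block_map.scatter_(-1, topk, True)              # True = evaluate this tile pair
    return block_map
```

Tile pairs left `False` would then fall to the mean_v compensation path rather than being computed exactly.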
Example inference flow:

```bash
hf download yitongl/sparse_quant_exp \
    --repo-type model \
    --local-dir checkpoints/hf_download/sparse_quant_exp \
    --include 'transformer/*'
bash backend_snapshot/scripts/inference/run_sfp4_ours_p_checkpoint_700.sh
```
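
The same download can be done from Python via `huggingface_hub`; the local directory simply mirrors the one the example script expects:

```python
from huggingface_hub import snapshot_download

# Fetch only the transformer weights, matching the shell example above.
snapshot_download(
    repo_id="yitongl/sparse_quant_exp",
    repo_type="model",
    local_dir="checkpoints/hf_download/sparse_quant_exp",
    allow_patterns=["transformer/*"],
)
```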
Source repo HEAD when staged: `3f818d0fc532ec6494b465967d5f485150917d0c`

Note: several backend files were uncommitted or locally modified when this snapshot was staged, so the files here are the authoritative copy for this checkpoint rather than the clean git commit alone.