Standalone Source Map

Important files:

  • fastvideo/models/dits/wanvideo.py
    • Defines WanTransformerBlock_VSA.
    • Adds to_gate_compress.
    • Passes gate_compress into self-attention.
  • fastvideo/attention/backends/sparse_fp4_compress_attn.py
    • Backend name: SPARSE_FP4_COMPRESS_ATTN.
    • Sparse FP4 main branch plus high-precision block-mean compress branch.
    • Prints FASTVIDEO_BACKEND_CONFIRM: SPARSE_FP4_COMPRESS_ATTN is running.
  • fastvideo/attention/backends/sparse_fp4_attn.py
    • Base sparse FP4 attention without compress branch.
  • fastvideo/layers/nvfp4_fake_quant_linear.py
    • NVFP4FakeQuantReplicatedLinear.
    • Wan replacement helpers for normal fake-quant QAT and SVD-LoRA variants.
  • fastvideo/train/models/wan/wan.py
    • Training-time switches for enabling fake-quant linear and gate quantization.
  • fastvideo/platforms/interface.py, fastvideo/platforms/cuda.py
    • Registers SPARSE_FP4_ATTN and SPARSE_FP4_COMPRESS_ATTN.
  • fastvideo-kernel/python/fastvideo_kernel/...
    • Block-sparse attention and quantization kernel source used by the backend.

The full source snapshot in ../repo_source/ is preferred when running the included DCP export script. This directory is meant for quick inspection and porting into another inference stack.

2026-05-07 Fake Attention Fix

../FAKE_ATTENTION_V_QUANT_FIX.md documents the update to fake sparse-FP4 attention that aligns fake V quantization with the current real SA3 Vt/PV kernel. Q and K are still quantized across D, while V now uses per-16 scale groups along the token axis and is stored back in the original V layout.
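
The difference between the two grouping axes can be illustrated with a small, dependency-free sketch. It is not the code in this repository: the function names are hypothetical, the tensor is a plain `[token][D]` nested list, the per-group scale is kept in full precision (a real NVFP4 path would also quantize the scale), and values are rounded to the nearest FP4 E2M1 magnitude.

```python
# Illustrative sketch only; names and simplifications are assumptions.
# FP4 (E2M1) representable magnitudes; the largest is 6.0.
FP4_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP = 16  # per-16 scale groups


def fake_quant_group(values):
    """Quantize then dequantize one group that shares a single scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0 for _ in values]
    scale = amax / FP4_LEVELS[-1]  # assumption: scale stays in full precision
    out = []
    for v in values:
        # Nearest representable FP4 magnitude for the scaled value.
        mag = min(FP4_LEVELS, key=lambda level: abs(abs(v) / scale - level))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def fake_quant_along_d(x):
    """Q/K style: each group of 16 runs along D inside one token row."""
    out = []
    for row in x:
        q = []
        for start in range(0, len(row), GROUP):
            q.extend(fake_quant_group(row[start:start + GROUP]))
        out.append(q)
    return out


def fake_quant_along_tokens(v):
    """V style: each group of 16 runs along the token axis of one channel."""
    num_tokens, d = len(v), len(v[0])
    out = [[0.0] * d for _ in range(num_tokens)]
    for col in range(d):
        for start in range(0, num_tokens, GROUP):
            stop = min(start + GROUP, num_tokens)
            group = [v[t][col] for t in range(start, stop)]
            for t, val in zip(range(start, stop), fake_quant_group(group)):
                out[t][col] = val  # written back in the original [token][D] layout
    return out


# One outlier at token 0, channel 0 shows which neighbours share its scale.
x = [[1.0] * 16 for _ in range(16)]
x[0][0] = 600.0
along_d = fake_quant_along_d(x)
along_tokens = fake_quant_along_tokens(x)
```

With grouping along D, the outlier flattens the other channels of token 0 (`along_d[0][1]` becomes 0) and leaves other tokens untouched. With grouping along the token axis, it instead flattens channel 0 of the other tokens (`along_tokens[1][0]` becomes 0), while the output keeps the same `[token][D]` shape as the input.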