5090_test / custom_code /README.md

Add fake attention V quant alignment fix

4cbc8a6 verified 25 days ago

	# Standalone Source Map

	Important files:

	- `fastvideo/models/dits/wanvideo.py`
	  - Defines `WanTransformerBlock_VSA`.
	  - Adds `to_gate_compress`.
	  - Passes `gate_compress` into self-attention.
	- `fastvideo/attention/backends/sparse_fp4_compress_attn.py`
	  - Backend name: `SPARSE_FP4_COMPRESS_ATTN`.
	  - Combines the sparse FP4 main branch with a high-precision block-mean compress branch.
	  - Prints `FASTVIDEO_BACKEND_CONFIRM: SPARSE_FP4_COMPRESS_ATTN is running`.
	- `fastvideo/attention/backends/sparse_fp4_attn.py`
	  - Base sparse FP4 attention without the compress branch.
	- `fastvideo/layers/nvfp4_fake_quant_linear.py`
	  - Defines `NVFP4FakeQuantReplicatedLinear`.
	  - Provides Wan replacement helpers for normal fake-quant QAT and SVD-LoRA variants.
	- `fastvideo/train/models/wan/wan.py`
	  - Training-time switches for enabling fake-quant linear and gate quantization.
	- `fastvideo/platforms/interface.py`, `fastvideo/platforms/cuda.py`
	  - Register `SPARSE_FP4_ATTN` and `SPARSE_FP4_COMPRESS_ATTN`.
	- `fastvideo-kernel/python/fastvideo_kernel/...`
	  - Block-sparse attention and quantization kernel source used by the backend.
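
For readers porting the compress branch, the sketch below illustrates what a block-mean compress path with a gate could look like in plain Python. It is not taken from `sparse_fp4_compress_attn.py`: the function names (`block_mean`, `compress_branch`, `gated_sum`), the pooling of Q as well as K and V, the broadcast of each block output back to its tokens, and the `main + gate * compress` combination are all assumptions made for illustration.

```python
import math


def block_mean(tokens, block_size):
    """Average consecutive groups of `block_size` token vectors."""
    pooled = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(t[d] for t in block) / len(block) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_branch(q, k, v, block_size):
    """Dense attention between block-mean Q, K and V (assumed design).

    Inputs are [tokens][D] lists. Each block-level output row is repeated
    for every token of its block so the result lines up with the main branch.
    """
    q_c = block_mean(q, block_size)
    k_c = block_mean(k, block_size)
    v_c = block_mean(v, block_size)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_blk in enumerate(q_c):
        scores = [scale * sum(a * b for a, b in zip(q_blk, k_blk)) for k_blk in k_c]
        weights = softmax(scores)
        row = [sum(w * v_blk[d] for w, v_blk in zip(weights, v_c))
               for d in range(len(v_c[0]))]
        tokens_in_block = min(block_size, len(q) - i * block_size)
        out.extend(list(row) for _ in range(tokens_in_block))
    return out


def gated_sum(main_out, compress_out, gate):
    """Hypothetical combination: main + gate * compress, elementwise."""
    return [[m + g * c for m, c, g in zip(m_row, c_row, g_row)]
            for m_row, c_row, g_row in zip(main_out, compress_out, gate)]
```

In this sketch the compress path stays in ordinary floating point, which mirrors the "high-precision" wording above. Only the main branch would go through FP4 quantization.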

	The full source snapshot in `../repo_source/` is preferred when running the
	included DCP export script. This directory is meant for quick inspection and
	porting into another inference stack.

	## 2026-05-07 Fake Attention Fix

	`../FAKE_ATTENTION_V_QUANT_FIX.md` documents the fake sparse-FP4 attention
	update that aligns fake V quantization with the current real SA3 Vt/PV kernel:
	Q/K still quantize across `D`, while V now uses token-axis per-16 scale groups
	and is stored back in the original V layout.
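
The difference between the two grouping axes can be sketched in plain Python. This is a simplified illustration, not the code from the fix. It assumes tensors are `[tokens][D]` lists and uses the FP4 E2M1 magnitude grid. Each group of 16 gets one scale, kept in full precision here (real NVFP4 stores scales in FP8). Ties round toward the smaller magnitude. The function names are hypothetical.

```python
import math

FP4_E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP = 16  # elements sharing one scale


def fake_quant_group(values):
    """Quantize then dequantize one group with a shared amax-based scale."""
    amax = max(abs(x) for x in values)
    if amax == 0.0:
        return [0.0] * len(values)
    scale = amax / FP4_E2M1_LEVELS[-1]
    out = []
    for x in values:
        # Snap the scaled magnitude to the nearest representable level.
        level = min(FP4_E2M1_LEVELS, key=lambda lv: abs(abs(x) / scale - lv))
        out.append(math.copysign(level * scale, x))
    return out


def fake_quant_along_d(x):
    """Q/K style: groups are 16 consecutive channels within each token."""
    out = []
    for row in x:
        quantized = []
        for start in range(0, len(row), GROUP):
            quantized.extend(fake_quant_group(row[start:start + GROUP]))
        out.append(quantized)
    return out


def fake_quant_along_tokens(x):
    """V style: groups are 16 consecutive tokens within each channel.

    The tensor is transposed, quantized, then transposed back, so the
    result is returned in the original [tokens][D] layout.
    """
    transposed = [list(col) for col in zip(*x)]
    quantized = fake_quant_along_d(transposed)
    return [list(row) for row in zip(*quantized)]


# One outlier token (all 6.0) followed by 15 small tokens (all 0.2).
x = [[6.0] * 16] + [[0.2] * 16 for _ in range(15)]
qd = fake_quant_along_d(x)       # each token has its own scales
qv = fake_quant_along_tokens(x)  # the outlier shares a scale with 15 other tokens
```

With this input, channel-axis grouping keeps the small tokens close to `0.2`. Token-axis grouping flushes them to `0.0`, because the outlier token sets the scale for its whole group. Fake quantization has to use the same axis as the real kernel for this reason: the rounding error lands on different elements depending on the grouping.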