# Standalone Source Map

Important files:

- `fastvideo/models/dits/wanvideo.py`
  - Defines `WanTransformerBlock_VSA`.
  - Adds `to_gate_compress`.
  - Passes `gate_compress` into self-attention.
- `fastvideo/attention/backends/sparse_fp4_compress_attn.py`
  - Backend name: `SPARSE_FP4_COMPRESS_ATTN`.
  - Sparse FP4 main branch plus high-precision block-mean compress branch.
  - Prints `FASTVIDEO_BACKEND_CONFIRM: SPARSE_FP4_COMPRESS_ATTN is running`.
- `fastvideo/attention/backends/sparse_fp4_attn.py`
  - Base sparse FP4 attention without compress branch.
- `fastvideo/layers/nvfp4_fake_quant_linear.py`
  - `NVFP4FakeQuantReplicatedLinear`.
  - Wan replacement helpers for normal fake-quant QAT and SVD-LoRA variants.
- `fastvideo/train/models/wan/wan.py`
  - Training-time switches for enabling fake-quant linear and gate quantization.
- `fastvideo/platforms/interface.py`, `fastvideo/platforms/cuda.py`
  - Registers `SPARSE_FP4_ATTN` and `SPARSE_FP4_COMPRESS_ATTN`.
- `fastvideo-kernel/python/fastvideo_kernel/...`
  - Block-sparse attention and quantization kernel source used by the backend.

The full source snapshot in `../repo_source/` is preferred when running the
included DCP export script. This directory is meant for quick inspection and
porting into another inference stack.
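
When porting the backend into another stack, it can help to check captured logs for the confirmation line listed above. The helper below is a minimal sketch: `backend_confirmed` is a hypothetical name, not part of the repository, and it assumes the run's stdout has already been collected into a string.

```python
# Confirmation line quoted from the backend entry in the file list above.
MARKER = "FASTVIDEO_BACKEND_CONFIRM: SPARSE_FP4_COMPRESS_ATTN is running"


def backend_confirmed(log_text: str) -> bool:
    """Return True if any line of the captured log contains the marker.

    Hypothetical helper for illustration; it only does a substring match.
    """
    return any(MARKER in line for line in log_text.splitlines())


if __name__ == "__main__":
    sample_log = "loading model\n" + MARKER + "\nstep 0 done\n"
    print(backend_confirmed(sample_log))
```

A plain substring match is used so the check still works if the log line carries a timestamp or rank prefix.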

## 2026-05-07 Fake Attention Fix

`../FAKE_ATTENTION_V_QUANT_FIX.md` documents the fake sparse-FP4 attention
update that aligns fake V quantization with the current real SA3 Vt/PV kernel.
Q/K are still quantized along `D`, while V now uses scale groups of 16 along
the token axis and is stored back in the original V layout.
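
The difference between the two grouping axes can be illustrated with a small pure-Python sketch. This is not the repository's implementation. It assumes a 2D `[tokens, D]` layout, a shared absmax scale per group of 16 kept in full float precision, and rounding to the FP4 E2M1 magnitudes. All function names are hypothetical.

```python
# Representable magnitudes of FP4 E2M1.
FP4_E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP = 16  # assumed scale-group size


def fake_quant_group(values):
    """Quantize one group with a shared absmax scale, then dequantize."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    scale = amax / FP4_E2M1_LEVELS[-1]  # simplification: scale stays a float
    out = []
    for v in values:
        # Nearest representable magnitude for the scaled value.
        mag = min(FP4_E2M1_LEVELS, key=lambda q: abs(abs(v) / scale - q))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out


def fake_quant_along_d(x, group=GROUP):
    """Scale groups run along the last axis (D), as for Q/K."""
    return [
        [y for s in range(0, len(row), group)
         for y in fake_quant_group(row[s:s + group])]
        for row in x
    ]


def transpose(x):
    return [list(col) for col in zip(*x)]


def fake_quant_along_tokens(v, group=GROUP):
    """Scale groups run along the token axis, as for V.

    The result is transposed back, so it keeps the original [tokens, D] layout.
    """
    return transpose(fake_quant_along_d(transpose(v), group))


if __name__ == "__main__":
    v = [[0.0] * 16 for _ in range(16)]  # 16 tokens, D = 16
    v[0][0], v[0][1] = 6.0, 0.375
    print(fake_quant_along_d(v)[0][:2])       # the two values share one scale
    print(fake_quant_along_tokens(v)[0][:2])  # each column has its own scale
```

In this toy input the two nonzero values sit in the same token row but in different `D` columns, so grouping along `D` forces them to share a scale while grouping along tokens does not.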