Is the QK^T result the VRAM bottleneck for video models?

#10
by yunming181920 - opened

For a 1024 * 1024 * 120 video, with 8x spatial VAE downsampling, 4x temporal downsampling, and 2x patching, the latent dimension becomes 64 * 64 * 30, so the sequence length is L = 122,880 tokens. Since the QK^T matrix has L^2 entries, its memory consumption reaches about 30 GB (at 2 bytes per entry, for a single head). I think this is why the VRAM demand is still high even though the model is small.
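The arithmetic behind that estimate can be checked with a short sketch. The function name and its parameters are hypothetical, and the 2 bytes per entry (fp16/bf16), single head, and batch size of 1 are assumptions, not figures confirmed for this model.

```python
def attention_score_bytes(width, height, frames,
                          spatial_down=8, temporal_down=4, patch=2,
                          bytes_per_entry=2):
    """Return (sequence length, bytes for one full L x L score matrix).

    Assumes one head, batch size 1, and bytes_per_entry bytes per score
    (2 for fp16/bf16). These are illustrative assumptions.
    """
    # Spatial size after VAE downsampling and patching: 1024 / (8 * 2) = 64
    latent_w = width // (spatial_down * patch)
    latent_h = height // (spatial_down * patch)
    # Temporal size after VAE downsampling: 120 / 4 = 30
    latent_t = frames // temporal_down
    seq_len = latent_w * latent_h * latent_t
    return seq_len, seq_len * seq_len * bytes_per_entry


seq_len, score_bytes = attention_score_bytes(1024, 1024, 120)
print(seq_len)            # 122880
print(score_bytes / 1e9)  # roughly 30.2 (GB, decimal)
```

This is only the size of the score matrix if it were stored in full; it says nothing about whether a given attention implementation actually stores it.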

Motif Technologies org

Hi @yunming181920 ,

Good observation on the theoretical L^2 cost.
In practice, FlashAttention and the fused SDPA backends use tiling, so the full QK^T matrix is never materialized in memory.
The actual VRAM bottleneck is mostly model weights + intermediate activations rather than the attention score matrix itself.
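To illustrate why tiling avoids the L^2 buffer, here is a minimal pure-Python sketch of the online-softmax idea for a single query row. It is not the FlashAttention kernel itself: the function names and the `tile` parameter are hypothetical, the usual 1/sqrt(d) scaling is omitted for brevity, and real implementations run this blockwise on the GPU.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference version: stores all L scores for this query at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention_row(q, keys, values, tile=2):
    """Online softmax: only `tile` scores exist at any time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.3, -0.7]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = naive_attention_row(q, keys, values)
tiled = tiled_attention_row(q, keys, values, tile=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))
```

Both functions give the same output up to floating-point error, but the tiled one keeps only a running maximum, a running sum, and one output accumulator per query, so its working memory scales with the tile size rather than with L.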

That said, we've found SageAttention helps reduce the memory bandwidth cost of attention by ~1.6x via INT8/FP8 quantization.
Details: GGUF + SageAttention
