Is the QK^T result the VRAM bottleneck for video models?

#10

by yunming181920 - opened Apr 22

yunming181920

Apr 22

For a 1024 * 1024 * 120 video, with 8x spatial VAE downsampling, 4x temporal downsampling, and 2x patching, the latent dimension becomes 64 * 64 * 30, so the sequence length is L = 122,880 tokens. Since QK^T has L^2 entries, storing it would take about 30 GB (at 2 bytes per entry, for a single head). I think this is why the VRAM demand is still high even though the model is small.
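The arithmetic behind that estimate can be checked with a short sketch. It assumes the 2x patching applies only to the spatial axes (which is what gives 64 * 64 * 30) and that each score is stored in 2 bytes (fp16/bf16). The function name and parameters are made up for illustration.

```python
def qkt_bytes(width, height, frames,
              spatial_down=8, temporal_down=4, patch=2, bytes_per_elem=2):
    """Return (sequence length, bytes for one full L x L score matrix)."""
    # Spatial axes shrink by VAE downsampling and then by patching (assumed spatial-only).
    lat_w = width // (spatial_down * patch)
    lat_h = height // (spatial_down * patch)
    # Temporal axis shrinks only by the VAE's temporal downsampling.
    lat_t = frames // temporal_down
    seq_len = lat_w * lat_h * lat_t
    return seq_len, seq_len * seq_len * bytes_per_elem

seq_len, nbytes = qkt_bytes(1024, 1024, 120)
print(seq_len)        # 122880
print(nbytes / 1e9)   # roughly 30.2 (GB), per head, if the matrix were materialized
```

This is the size of one head's score matrix only if it were fully stored, which is the case the estimate above assumes.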

gkalstn0

Motif Technologies org 27 days ago

Hi @yunming181920 ,

Good observation on the theoretical L^2 cost.
In practice, FlashAttention and the fused SDPA kernels tile the computation, so the full QK^T matrix is never materialized in memory.
The actual VRAM bottleneck is mostly model weights + intermediate activations rather than the attention score matrix itself.
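To illustrate why tiling avoids the L^2 buffer, here is a minimal pure-Python sketch of the online-softmax idea for a single query row. It is not FlashAttention's actual kernel; the function names and the block size are assumptions for illustration. The tiled version only ever holds `block` scores at a time, yet gives the same result as the naive version that computes all scores first.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row_naive(q, keys, values):
    """Softmax attention for one query, holding all len(keys) scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / z for i in range(dim)]

def attention_row_tiled(q, keys, values, block=2):
    """Same result, but only `block` scores are alive at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new running maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-1.0, 2.0, 0.0], [0.0, 0.0, 1.0], [3.0, 1.0, -2.0]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [2.0, 2.0], [-3.0, 1.0]]
naive_out = attention_row_naive(q, keys, values)
tiled_out = attention_row_tiled(q, keys, values, block=2)
```

The extra memory for scores scales with the block size rather than with L, which is why the attention score matrix itself does not dominate VRAM when a fused kernel is used.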

That said, we've found SageAttention helps reduce the memory bandwidth cost of attention by ~1.6x via INT8/FP8 quantization.
Details: GGUF + SageAttention
