How to use Motif-Technologies/Motif-Video-2B with Diffusers:

    pip install -U diffusers transformers accelerate

    import torch
    from diffusers import DiffusionPipeline

    # switch to "mps" for apple devices
    pipe = DiffusionPipeline.from_pretrained("Motif-Technologies/Motif-Video-2B", dtype=torch.bfloat16, device_map="cuda")

    prompt = "A vibrant blue jay perches gracefully on a slender branch, its feathers shimmering in the soft morning light. The bird's keen eyes scan the surroundings, capturing the essence of the tranquil forest. It flutters its wings briefly, showcasing the intricate patterns of blue, white, and black on its plumage. The background reveals a lush canopy of green leaves, with rays of sunlight filtering through, creating a dappled effect on the forest floor. The blue jay then tilts its head, emitting a melodious call that echoes through the serene woodland, adding a touch of magic to the peaceful scene."
    image = pipe(prompt).images[0]
Is the QK^T result the VRAM bottleneck for video models?
For a 1024×1024×120 video (height × width × frames), with 8x spatial VAE downsampling, 4x temporal downsampling, and 2x patching, the latent grid becomes 64×64×30, i.e. L = 122,880 tokens. Since QK^T has L^2 entries, storing it would take about 30 GB (assuming 2 bytes per entry). I think this is why the VRAM demand is still high even though the model is small.
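The arithmetic in the question can be checked with a short script. This is only a sketch under stated assumptions: 2 bytes per score (bf16/fp16), a single attention head, batch size 1, and patching applied to the spatial axes only (which is what gives 64×64×30). The function name and parameters are illustrative, not part of any library.

```python
def attention_score_bytes(height, width, frames,
                          spatial_down=8, temporal_down=4,
                          patch=2, bytes_per_element=2):
    """Return (token count L, bytes needed to store a full L x L score matrix).

    Assumes one head, batch size 1, and spatial-only patching.
    """
    lat_h = height // (spatial_down * patch)   # 1024 / 16 = 64
    lat_w = width // (spatial_down * patch)    # 1024 / 16 = 64
    lat_t = frames // temporal_down            # 120 / 4 = 30
    tokens = lat_h * lat_w * lat_t
    return tokens, tokens * tokens * bytes_per_element


tokens, nbytes = attention_score_bytes(1024, 1024, 120)
print(tokens)                 # 122880
print(round(nbytes / 1e9, 1)) # 30.2 (decimal GB)
```

The figure scales linearly with the number of heads and the batch size, so the 30 GB estimate is a per-head lower bound for a fully materialized score matrix.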
Hi @yunming181920 ,
Good observation on the theoretical L^2 cost.
In practice, FlashAttention/SDPA uses tiling so the full QK^T matrix is never materialized in memory.
The actual VRAM bottleneck is mostly model weights + intermediate activations rather than the attention score matrix itself.
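To illustrate why tiling avoids materializing the full score matrix, here is a minimal pure-Python sketch of block-wise attention with an online softmax for a single query. It is not the FlashAttention or SDPA kernel; the function names and the `block` parameter are made up for illustration. Only one block of scores exists at a time, yet the result matches the naive computation.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: builds the full row of scores before applying softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(dim)]


def tiled_attention(q, keys, values, block=4):
    """Processes keys/values block by block, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf          # largest score seen so far
    denom = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0])     # running weighted sum of values
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) * scale for k in k_blk]  # only `block` scores live
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


random.seed(0)
L, d = 37, 8
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]

ref = naive_attention(q, keys, values)
out = tiled_attention(q, keys, values, block=5)
max_diff = max(abs(a - b) for a, b in zip(ref, out))
print(max_diff)  # tiny floating-point difference
```

Peak score storage here is proportional to the block size rather than to L, which is the property that keeps the L^2 term out of VRAM in real kernels.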
That said, we've found SageAttention helps reduce the memory bandwidth cost of attention by ~1.6x via INT8/FP8 quantization.
Details: GGUF + SageAttention
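As a rough illustration of the INT8 quantization idea mentioned above, the sketch below applies symmetric per-tensor quantization to a list of values. This is a simplified assumption for illustration only, not SageAttention's actual kernel (which works on blocks of Q and K on the GPU), and the helper names are hypothetical. It shows that each value is stored in 1 byte instead of 2, at the cost of a bounded rounding error; the ~1.6x figure quoted above is the authors' measurement and is not derived here.

```python
import random


def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [x * scale for x in q]


random.seed(0)
row = [random.gauss(0, 1) for _ in range(64)]   # stand-in for one row of Q or K
q_row, scale = quantize_int8(row)
restored = dequantize(q_row, scale)

max_err = max(abs(a - b) for a, b in zip(row, restored))
print(max_err <= scale / 2 + 1e-12)  # True: rounding error is at most half a step
```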