Motif-Technologies/Motif-Video-2B

@@ -64,6 +64,7 @@ widget:
 ## 🔥 News
+- **[2026-04-29]** **RTX 4090 benchmarks** added: SageAttention achieves a ~3.16× speedup, and all GGUF variants fit in 24 GB. See [GGUF + SageAttention](docs/gguf-sageattention.md#benchmark-rtx-4090).
 - **[2026-04-28]** **ComfyUI custom nodes** released: [ComfyUI-MotifVideo2B](https://github.com/MotifTechnologies/ComfyUI-MotifVideo2B). GGUF workflow support coming soon.
 - **[2026-04-28]** **GGUF quantized weights** now available at [Motif-Video-2B-GGUF](https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF) — up to 2.7 GB VRAM savings with no speed penalty. **SageAttention** support for ~2× faster inference. See [GGUF + SageAttention](#🧊-gguf--sageattention) below.
 - **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report](https://arxiv.org/abs/2604.16503).

docs/gguf-sageattention.md

@@ -90,7 +90,12 @@ Same prompt and seed, 1280x736, 121 frames, 50 steps. Left = SDPA, Right = SageA
 **Install** (build from source — PyPI only has 1.x, need 2.x):
 ```bash
-# Set TORCH_CUDA_ARCH_LIST to match your GPU: "8.0" for A100, "9.0" for H100/H200
 TORCH_CUDA_ARCH_LIST="9.0" pip install git+https://github.com/thu-ml/SageAttention.git --no-build-isolation
 ```
@@ -108,7 +113,7 @@ python inference.py --use-sage-attention --prompt "..."
 - Set `TORCH_CUDA_ARCH_LIST` to match your GPU when building (e.g., `"8.6"` for RTX 3090, `"8.9"` for RTX 4090)
 - No quality degradation observed across all GGUF variants
-## Benchmark
 Measured on NVIDIA H200, 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):
@@ -130,3 +135,32 @@ Peak alloc/rsv columns show SDPA / Sage values. Sage adds ~0.3 GB alloc overhead
 - **~1.59x faster with SageAttention** — consistent across all quantization levels
 - **VRAM unchanged** — sage overhead is negligible (~0.3 GB alloc)
 - **GGUF + Sage stacks** — Q4_K_M + Sage achieves 14.59 s/it at 12.53 GB alloc (vs BF16 SDPA: 23.36 s/it at 14.78 GB)

 **Install** (build from source — PyPI only has 1.x, need 2.x):
 ```bash
+# Set TORCH_CUDA_ARCH_LIST to match your GPU:
+#   "8.0" for A100/A30
+#   "8.6" for RTX 3090/3080/A40
+#   "8.9" for RTX 4090/4080/4070 Ti/L40/L40S (Ada Lovelace)
+#   "12.0" for RTX 5090/5080/5070 Ti (Blackwell)
+#   "9.0" for H100/H200
 TORCH_CUDA_ARCH_LIST="9.0" pip install git+https://github.com/thu-ml/SageAttention.git --no-build-isolation
 ```
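If it is unclear which value applies to a given machine, the compute capability can be read from the installed GPU rather than looked up. The following is a minimal sketch, not part of this repository: `arch_list_value` and `detect_arch` are hypothetical helper names, and the only library call assumed is PyTorch's `torch.cuda.get_device_capability`.

```python
def arch_list_value(major: int, minor: int) -> str:
    """Format a compute capability as a TORCH_CUDA_ARCH_LIST entry, e.g. (8, 9) -> "8.9"."""
    return f"{major}.{minor}"


def detect_arch(device_index: int = 0) -> str:
    """Return the TORCH_CUDA_ARCH_LIST value for one visible GPU.

    Hypothetical helper: requires a CUDA-enabled PyTorch install.
    """
    import torch  # imported lazily so arch_list_value works without torch

    if not torch.cuda.is_available():
        raise RuntimeError("No CUDA device is visible to PyTorch")
    major, minor = torch.cuda.get_device_capability(device_index)
    return arch_list_value(major, minor)
```

Calling `detect_arch()` on the machine that will run inference gives the string to pass as `TORCH_CUDA_ARCH_LIST`; on an RTX 4090, `torch.cuda.get_device_capability` reports `(8, 9)`, which maps to `"8.9"`.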
 - Set `TORCH_CUDA_ARCH_LIST` to match your GPU when building (e.g., `"8.6"` for RTX 3090, `"8.9"` for RTX 4090)
 - No quality degradation observed across all GGUF variants
+## Benchmark (H200)
 Measured on NVIDIA H200, 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):
 - **~1.59x faster with SageAttention** — consistent across all quantization levels
 - **VRAM unchanged** — sage overhead is negligible (~0.3 GB alloc)
 - **GGUF + Sage stacks** — Q4_K_M + Sage achieves 14.59 s/it at 12.53 GB alloc (vs BF16 SDPA: 23.36 s/it at 14.78 GB)
+---
+## Benchmark (RTX 4090)
+Measured on NVIDIA RTX 4090 (24 GB), 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):
+**Environment:** NGC `nvcr.io/nvidia/pytorch:26.01-py3`, Python 3.12.3, PyTorch 2.11.0+cu130, CUDA 13.0.
+SageAttention built from source with `TORCH_CUDA_ARCH_LIST="8.9"`.
+| Variant | SDPA (s/it) | Sage (s/it) | Speedup | Peak alloc (GB) | Total SDPA (s) | Total Sage (s) |
+|---------|------------|------------|---------|-----------------|----------------|----------------|
+| BF16    | 92.54      | 29.17      | 3.17x   | 14.73           | 4665           | 1492           |
+| Q8_0    | 92.51      | 29.18      | 3.17x   | 13.02           | 4658           | 1493           |
+| Q6_K    | 92.81      | 29.41      | 3.16x   | 12.58           | 4673           | 1504           |
+| Q5_K_M  | 92.79      | 29.43      | 3.15x   | 12.36           | 4672           | 1505           |
+| Q5_1    | 92.67      | 29.34      | 3.16x   | 12.45           | 4667           | 1501           |
+| Q5_0    | 92.64      | 29.34      | 3.16x   | 12.34           | 4664           | 1500           |
+| Q4_K_M  | 92.62      | 29.29      | 3.16x   | 12.16           | 4665           | 1502           |
+| Q4_1    | 92.60      | 29.32      | 3.16x   | 12.22           | 4668           | 1499           |
+| Q4_0    | 92.64      | 29.32      | 3.16x   | 12.11           | 4684           | 1500           |
+Peak alloc is identical for SDPA/Sage (SageAttention adds no extra alloc overhead on RTX 4090). Peak reserved is ~14 GB with SDPA and ~16 GB with Sage.
+**Key findings (RTX 4090):**
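+- **~3.16x faster with SageAttention**: SM89 FP16 kernels deliver a larger relative speedup than H200's FP8 kernels (3.16x vs 1.59x) because the SDPA baseline is about 4x slower on the RTX 4090 (92.54 vs 23.36 s/it for BF16), while Sage is only about 2x slower (29.29 vs 14.59 s/it for Q4_K_M)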
+- **All variants fit in 24 GB**: even BF16 peaks at 14.73 GB alloc, and Q4_0 is the smallest at 12.11 GB (~16 GB reserved with Sage)
+- **GGUF + Sage stacks** — Q4_K_M + Sage: 29.29 s/it at 12.16 GB (vs BF16 SDPA: 92.54 s/it at 14.73 GB)
+- **No quality degradation observed**: outputs show no visible difference from SDPA across all variants
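The speedup column can be reproduced from the per-step timings. The sketch below copies the s/it values from the RTX 4090 table and recomputes the ratio, along with the denoising time implied by 50 steps. The names `ROWS`, `STEPS`, and `speedup` are illustrative only. The reported totals are somewhat higher than 50 × s/it; the source does not break down the remainder, so it is not modelled here.

```python
# (variant, SDPA s/it, Sage s/it), copied from the RTX 4090 benchmark table
ROWS = [
    ("BF16", 92.54, 29.17),
    ("Q8_0", 92.51, 29.18),
    ("Q6_K", 92.81, 29.41),
    ("Q5_K_M", 92.79, 29.43),
    ("Q5_1", 92.67, 29.34),
    ("Q5_0", 92.64, 29.34),
    ("Q4_K_M", 92.62, 29.29),
    ("Q4_1", 92.60, 29.32),
    ("Q4_0", 92.64, 29.32),
]

STEPS = 50  # sampling steps used in the benchmark


def speedup(sdpa_s_per_it: float, sage_s_per_it: float) -> float:
    """Relative speedup of SageAttention over SDPA for one variant."""
    return sdpa_s_per_it / sage_s_per_it


for name, sdpa, sage in ROWS:
    # Denoising-only estimate: per-step time multiplied by the step count.
    print(f"{name:7s} {speedup(sdpa, sage):.2f}x  ~{sage * STEPS:.0f} s denoising with Sage")
```

Every row lands between 3.15x and 3.17x, which is where the "~3.16x" summary figure comes from.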