docs: add RTX 4090 benchmark + GPU arch list for SageAttention build

#22
Files changed (2)
  1. README.md +1 -0
  2. docs/gguf-sageattention.md +36 -2
README.md CHANGED
@@ -64,6 +64,7 @@ widget:
64
 
65
  ## 🔥 News
66
 
 
67
  - **[2026-04-28]** **ComfyUI custom nodes** released: [ComfyUI-MotifVideo2B](https://github.com/MotifTechnologies/ComfyUI-MotifVideo2B). GGUF workflow support coming soon.
68
  - **[2026-04-28]** **GGUF quantized weights** now available at [Motif-Video-2B-GGUF](https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF): up to 2.7 GB VRAM savings with no speed penalty. **SageAttention** support for ~2× faster inference. See [GGUF + SageAttention](#🧊-gguf--sageattention) below.
69
  - **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report](https://arxiv.org/abs/2604.16503).
 
64
 
65
  ## 🔥 News
66
 
67
+ - **[2026-04-29]** **RTX 4090 benchmarks** added: SageAttention achieves a ~3.16× speedup, and all GGUF variants fit in 24 GB. See [GGUF + SageAttention](docs/gguf-sageattention.md#benchmark-rtx-4090).
68
  - **[2026-04-28]** **ComfyUI custom nodes** released: [ComfyUI-MotifVideo2B](https://github.com/MotifTechnologies/ComfyUI-MotifVideo2B). GGUF workflow support coming soon.
69
  - **[2026-04-28]** **GGUF quantized weights** now available at [Motif-Video-2B-GGUF](https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF): up to 2.7 GB VRAM savings with no speed penalty. **SageAttention** support for ~2× faster inference. See [GGUF + SageAttention](#🧊-gguf--sageattention) below.
70
  - **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report](https://arxiv.org/abs/2604.16503).
docs/gguf-sageattention.md CHANGED
@@ -90,7 +90,12 @@ Same prompt and seed, 1280x736, 121 frames, 50 steps. Left = SDPA, Right = SageA
90
  **Install** (build from source; PyPI only has 1.x, and 2.x is required):
91
 
92
  ```bash
93
- # Set TORCH_CUDA_ARCH_LIST to match your GPU: "8.0" for A100, "9.0" for H100/H200
94
  TORCH_CUDA_ARCH_LIST="9.0" pip install git+https://github.com/thu-ml/SageAttention.git --no-build-isolation
95
  ```
96
 
@@ -108,7 +113,7 @@ python inference.py --use-sage-attention --prompt "..."
108
  - Set `TORCH_CUDA_ARCH_LIST` to match your GPU when building (e.g., `"8.6"` for RTX 3090, `"8.9"` for RTX 4090)
109
  - No quality degradation observed across all GGUF variants
110
 
111
- ## Benchmark
112
 
113
  Measured on NVIDIA H200, 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):
114
 
@@ -130,3 +135,32 @@ Peak alloc/rsv columns show SDPA / Sage values. Sage adds ~0.3 GB alloc overhead
130
  - **~1.59x faster with SageAttention**: consistent across all quantization levels
131
  - **VRAM unchanged**: Sage overhead is negligible (~0.3 GB alloc)
132
  - **GGUF + Sage stacks**: Q4_K_M + Sage achieves 14.59 s/it at 12.53 GB alloc (vs BF16 SDPA: 23.36 s/it at 14.78 GB)
90
  **Install** (build from source; PyPI only has 1.x, and 2.x is required):
91
 
92
  ```bash
93
+ # Set TORCH_CUDA_ARCH_LIST to match your GPU:
94
+ # "8.0" for A100/A30
95
+ # "8.6" for RTX 3090/3080/A40
96
+ # "8.9" for RTX 4090/4080/4070 Ti/L40/L40S (Ada Lovelace)
97
+ # "12.0" for RTX 5090/5080/5070 Ti (Blackwell)
98
+ # "9.0" for H100/H200
99
  TORCH_CUDA_ARCH_LIST="9.0" pip install git+https://github.com/thu-ml/SageAttention.git --no-build-isolation
100
  ```
101
 
 
113
  - Set `TORCH_CUDA_ARCH_LIST` to match your GPU when building (e.g., `"8.6"` for RTX 3090, `"8.9"` for RTX 4090)
114
  - No quality degradation observed across all GGUF variants
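
If you are unsure which value to pass, the compute capability can be read from the installed GPU instead of looked up by model name. The following is a minimal sketch, assuming PyTorch with CUDA support is already installed; `arch_list_value` and `detect_arch_list` are hypothetical helper names, not part of SageAttention or this repository.

```python
from typing import Optional


def arch_list_value(major: int, minor: int) -> str:
    """Format a CUDA compute capability as a TORCH_CUDA_ARCH_LIST entry."""
    return f"{major}.{minor}"


def detect_arch_list() -> Optional[str]:
    """Return the entry for GPU 0, or None if no CUDA device can be queried."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return arch_list_value(major, minor)


if __name__ == "__main__":
    value = detect_arch_list()
    if value is None:
        print("No CUDA device detected; set TORCH_CUDA_ARCH_LIST by hand.")
    else:
        print(f'TORCH_CUDA_ARCH_LIST="{value}"')
```

The printed assignment can be placed in front of the `pip install` command shown in the install step.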
115
 
116
+ ## Benchmark (H200)
117
 
118
  Measured on NVIDIA H200, 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):
119
 
 
135
  - **~1.59x faster with SageAttention**: consistent across all quantization levels
136
  - **VRAM unchanged**: Sage overhead is negligible (~0.3 GB alloc)
137
  - **GGUF + Sage stacks**: Q4_K_M + Sage achieves 14.59 s/it at 12.53 GB alloc (vs BF16 SDPA: 23.36 s/it at 14.78 GB)
138
+
139
+ ---
140
+
141
+ ## Benchmark (RTX 4090)
142
+
143
+ Measured on NVIDIA RTX 4090 (24 GB), 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):
144
+
145
+ **Environment:** NGC `nvcr.io/nvidia/pytorch:26.01-py3`, Python 3.12.3, PyTorch 2.11.0+cu130, CUDA 13.0.
146
+ SageAttention built from source with `TORCH_CUDA_ARCH_LIST="8.9"`.
147
+
148
+ | Variant | SDPA (s/it) | Sage (s/it) | Speedup | Peak alloc (GB) | Total SDPA (s) | Total Sage (s) |
149
+ |---------|------------|------------|---------|-----------------|----------------|----------------|
150
+ | BF16 | 92.54 | 29.17 | 3.17x | 14.73 | 4665 | 1492 |
151
+ | Q8_0 | 92.51 | 29.18 | 3.17x | 13.02 | 4658 | 1493 |
152
+ | Q6_K | 92.81 | 29.41 | 3.16x | 12.58 | 4673 | 1504 |
153
+ | Q5_K_M | 92.79 | 29.43 | 3.15x | 12.36 | 4672 | 1505 |
154
+ | Q5_1 | 92.67 | 29.34 | 3.16x | 12.45 | 4667 | 1501 |
155
+ | Q5_0 | 92.64 | 29.34 | 3.16x | 12.34 | 4664 | 1500 |
156
+ | Q4_K_M | 92.62 | 29.29 | 3.16x | 12.16 | 4665 | 1502 |
157
+ | Q4_1 | 92.60 | 29.32 | 3.16x | 12.22 | 4668 | 1499 |
158
+ | Q4_0 | 92.64 | 29.32 | 3.16x | 12.11 | 4684 | 1500 |
159
+
160
+ Peak alloc is identical for SDPA/Sage (SageAttention adds no extra alloc overhead on RTX 4090). Peak reserved is ~14 GB with SDPA and ~16 GB with Sage.
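
The distinction between peak allocated and peak reserved memory can be reproduced with PyTorch's CUDA memory statistics. This is a rough sketch, not the script used for the numbers above: `bytes_to_gb` and `peak_memory_gb` are hypothetical helpers, and dividing by 2**30 is an assumption about the unit the tables use.

```python
from typing import Optional, Tuple


def bytes_to_gb(n_bytes: int) -> float:
    """Convert bytes to GB, assuming 1 GB = 2**30 bytes, rounded to 2 decimals."""
    return round(n_bytes / 2**30, 2)


def peak_memory_gb() -> Optional[Tuple[float, float]]:
    """Return (peak allocated, peak reserved) in GB, or None without CUDA."""
    try:
        import torch  # imported lazily so bytes_to_gb works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Allocated = memory held by live tensors; reserved = memory held by
    # PyTorch's caching allocator, which is always >= allocated.
    return (
        bytes_to_gb(torch.cuda.max_memory_allocated()),
        bytes_to_gb(torch.cuda.max_memory_reserved()),
    )
```

Calling `torch.cuda.reset_peak_memory_stats()` before a run and `peak_memory_gb()` after it gives per-run peaks; the reserved figure is the one to compare against the card's 24 GB.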
161
+
162
+ **Key findings (RTX 4090):**
163
+ - **~3.16x faster with SageAttention**: SM89 FP16 kernels deliver a larger relative speedup than H200's FP8 kernels (3.16x vs 1.59x) because SDPA is slower on the 4090 while Sage remains fast
164
+ - **All variants fit in 24 GB**: Q4_0 + Sage peaks at 12.11 GB alloc (~16 GB reserved)
165
+ - **GGUF + Sage stacks**: Q4_K_M + Sage: 29.29 s/it at 12.16 GB (vs BF16 SDPA: 92.54 s/it at 14.73 GB)
166
+ - **No quality degradation**: identical to SDPA outputs across all variants
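
The speedup column and the "~3.16x" headline follow directly from the s/it columns. As a quick check, the sketch below recomputes them from the values in the RTX 4090 table; the dictionary and variable names are illustrative only.

```python
# (SDPA s/it, Sage s/it) copied from the RTX 4090 table above.
RTX4090_S_PER_IT = {
    "BF16": (92.54, 29.17),
    "Q8_0": (92.51, 29.18),
    "Q6_K": (92.81, 29.41),
    "Q5_K_M": (92.79, 29.43),
    "Q5_1": (92.67, 29.34),
    "Q5_0": (92.64, 29.34),
    "Q4_K_M": (92.62, 29.29),
    "Q4_1": (92.60, 29.32),
    "Q4_0": (92.64, 29.32),
}

# Per-variant speedup: SDPA time per iteration divided by Sage time per iteration.
speedups = {
    name: round(sdpa / sage, 2) for name, (sdpa, sage) in RTX4090_S_PER_IT.items()
}

# Unrounded mean across variants, which is where the "~3.16x" headline comes from.
mean_speedup = sum(sdpa / sage for sdpa, sage in RTX4090_S_PER_IT.values()) / len(
    RTX4090_S_PER_IT
)

# Stacked gain: BF16 with SDPA versus Q4_K_M with Sage.
stacked_speedup = RTX4090_S_PER_IT["BF16"][0] / RTX4090_S_PER_IT["Q4_K_M"][1]

if __name__ == "__main__":
    for name, value in speedups.items():
        print(f"{name:7s} {value:.2f}x")
    print(f"mean    {mean_speedup:.2f}x")
    print(f"stacked {stacked_speedup:.2f}x")
```

The per-variant results match the Speedup column, and the stacked figure shows that moving from BF16 + SDPA to Q4_K_M + Sage gives roughly the same ~3.16x while also lowering peak alloc from 14.73 GB to 12.16 GB.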