How to use Motif-Technologies/Motif-Video-2B with Diffusers:
    pip install -U diffusers transformers accelerate

    import torch
    from diffusers import DiffusionPipeline

    # switch to "mps" for apple devices
    pipe = DiffusionPipeline.from_pretrained("Motif-Technologies/Motif-Video-2B", dtype=torch.bfloat16, device_map="cuda")

    prompt = "A vibrant blue jay perches gracefully on a slender branch, its feathers shimmering in the soft morning light. The bird's keen eyes scan the surroundings, capturing the essence of the tranquil forest. It flutters its wings briefly, showcasing the intricate patterns of blue, white, and black on its plumage. The background reveals a lush canopy of green leaves, with rays of sunlight filtering through, creating a dappled effect on the forest floor. The blue jay then tilts its head, emitting a melodious call that echoes through the serene woodland, adding a touch of magic to the peaceful scene."
    image = pipe(prompt).images[0]
add-tech-report
#2
by beomgyu-kim - opened
README.md
CHANGED
    @@ -24,7 +24,7 @@ library_name: diffusers
     </p>

     <p align="center">
    -<a href="
    +<a href="Motifvideo_techreport.pdf">Technical Report</a> |
     🤗 <a href="">Hugging Face</a> |
     <a href="https://motiftech.io/videoshowcase">Project Page</a>
     </p>
    @@ -33,7 +33,7 @@ library_name: diffusers

     ## 🔥 News

    -- **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report]().
    +- **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).

     ---

    @@ -93,7 +93,7 @@ A high-level walkthrough of the role separation:
     2. **Single-stream stage (16 layers).** Text and video tokens attend freely in a joint sequence. **Shared Cross-Attention** is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
     3. **DDT decoder (8 layers).** A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.

    -For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the [technical report]().
    +For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).

     <!--
     Optional: insert Figure 3 (attention heatmaps across the three stages)
    @@ -240,7 +240,7 @@ Notable per-dimension highlights for Motif-Video 2B (open-source):
     - **Semantic Score: 80.44%** – highest among open-source models reporting per-dimension results
     - **Object Class: 92.93%**, **Multiple Objects: 77.29%**, **Imaging Quality: 70.50%** – second-best in their categories

    -The full 16-dimension breakdown is in Table 3 of the [technical report]().
    +The full 16-dimension breakdown is in Table 3 of the [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).

     > **A note on VBench vs. perceptual quality.** Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.

    @@ -295,7 +295,7 @@ If you find Motif-Video 2B useful in your research, please cite:
     author = {Motif Technologies},
     year = {2026},
     institution = {Motif Technologies},
    -url = {}
    +url = {https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf}
     }
     ```

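The change in this PR fills in link targets that had been left empty in README.md: Markdown links of the form `[technical report]()`, an HTML `href`, and the BibTeX `url = {}` field. As a rough illustration, a check along the following lines could flag such placeholders before a README is published. This is only a sketch under assumptions: the function name `find_empty_links`, the `EMPTY_LINK_PATTERNS` table, and the three regular expressions are hypothetical and are not part of the Motif-Video repository or of any Hugging Face tooling.

```python
import re

# Hypothetical patterns, one per placeholder style visible in the diff.
EMPTY_LINK_PATTERNS = {
    "markdown": re.compile(r"\[[^\]]+\]\(\s*\)"),    # [text]()
    "html": re.compile(r"href\s*=\s*\"\s*\""),       # href=""
    "bibtex": re.compile(r"\burl\s*=\s*\{\s*\}"),    # url = {}
}


def find_empty_links(text):
    """Return (line_number, kind, stripped_line) for each empty link target."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in EMPTY_LINK_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, kind, line.strip()))
    return hits


# Sample lines taken from the diff context above.
sample = "\n".join([
    '<a href="">Hugging Face</a> |',
    "see Section 3.3 of the [technical report]().",
    "url = {}",
    '<a href="https://motiftech.io/videoshowcase">Project Page</a>',
])

for number, kind, line in find_empty_links(sample):
    print(f"line {number}: empty {kind} link -> {line}")
```

Applied to the updated side of the diff, a check like this would still report the Hugging Face anchor, whose `href` remains empty after the change.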