Motif-Technologies/Motif-Video-2B

@@ -24,7 +24,7 @@ library_name: diffusers
 </p>
 <p align="center">
-  📑 <a href="https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf">Technical Report</a> &nbsp;|&nbsp;
+  📑 <a href="Motifvideo_techreport.pdf">Technical Report</a> &nbsp;|&nbsp;
   🤗 <a href="">Hugging Face</a> &nbsp;|&nbsp;
   🌐 <a href="https://motiftech.io/videoshowcase">Project Page</a>
 </p>
@@ -33,7 +33,7 @@ library_name: diffusers
 ## 🔥 News
-- **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report]().
+- **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).
 ---
@@ -93,7 +93,7 @@ A high-level walkthrough of the role separation:
 2. **Single-stream stage (16 layers).** Text and video tokens attend freely in a joint sequence. **Shared Cross-Attention** is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
 3. **DDT decoder (8 layers).** A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.
-For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the [technical report]().
+For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).
 <!--
   Optional: insert Figure 3 (attention heatmaps across the three stages)
@@ -240,7 +240,7 @@ Notable per-dimension highlights for Motif-Video 2B (open-source):
 - **Semantic Score: 80.44%** — highest among open-source models reporting per-dimension results
 - **Object Class: 92.93%**, **Multiple Objects: 77.29%**, **Imaging Quality: 70.50%** — second-best in their categories
-The full 16-dimension breakdown is in Table 3 of the [technical report]().
+The full 16-dimension breakdown is in Table 3 of the [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).
 > **A note on VBench vs. perceptual quality.** Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.
@@ -295,7 +295,7 @@ If you find Motif-Video 2B useful in your research, please cite:
   author = {Motif Technologies},
   year   = {2026},
   institution = {Motif Technologies},
-  url    = {}
+  url    = {https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf}
 }
 ```