Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -24,7 +24,7 @@ library_name: diffusers
  </p>
 
  <p align="center">
- 📑 <a href="https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf">Technical Report</a> &nbsp;|&nbsp;
  🤗 <a href="">Hugging Face</a> &nbsp;|&nbsp;
  🌐 <a href="https://motiftech.io/videoshowcase">Project Page</a>
  </p>
@@ -33,7 +33,7 @@ library_name: diffusers
 
  ## 🔥 News
 
- - **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report]().
 
  ---
 
@@ -93,7 +93,7 @@ A high-level walkthrough of the role separation:
  2. **Single-stream stage (16 layers).** Text and video tokens attend freely in a joint sequence. **Shared Cross-Attention** is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
  3. **DDT decoder (8 layers).** A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.
 
- For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the [technical report]().
 
  <!--
  Optional: insert Figure 3 (attention heatmaps across the three stages)
@@ -240,7 +240,7 @@ Notable per-dimension highlights for Motif-Video 2B (open-source):
  - **Semantic Score: 80.44%** - highest among open-source models reporting per-dimension results
  - **Object Class: 92.93%**, **Multiple Objects: 77.29%**, **Imaging Quality: 70.50%** - second-best in their categories
 
- The full 16-dimension breakdown is in Table 3 of the [technical report]().
 
  > **A note on VBench vs. perceptual quality.** Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.
 
@@ -295,7 +295,7 @@ If you find Motif-Video 2B useful in your research, please cite:
  author = {Motif Technologies},
  year = {2026},
  institution = {Motif Technologies},
- url = {}
  }
  ```
 
 
  </p>
 
  <p align="center">
+ 📑 <a href="Motifvideo_techreport.pdf">Technical Report</a> &nbsp;|&nbsp;
  🤗 <a href="">Hugging Face</a> &nbsp;|&nbsp;
  🌐 <a href="https://motiftech.io/videoshowcase">Project Page</a>
  </p>
 
 
  ## 🔥 News
 
+ - **[2026-04-14]** We release **Motif-Video 2B**, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).
 
  ---
 
 
  2. **Single-stream stage (16 layers).** Text and video tokens attend freely in a joint sequence. **Shared Cross-Attention** is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
  3. **DDT decoder (8 layers).** A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.
 
+ For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).
 
  <!--
  Optional: insert Figure 3 (attention heatmaps across the three stages)
 
  - **Semantic Score: 80.44%** - highest among open-source models reporting per-dimension results
  - **Object Class: 92.93%**, **Multiple Objects: 77.29%**, **Imaging Quality: 70.50%** - second-best in their categories
 
+ The full 16-dimension breakdown is in Table 3 of the [technical report](https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf).
 
  > **A note on VBench vs. perceptual quality.** Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.
 
 
  author = {Motif Technologies},
  year = {2026},
  institution = {Motif Technologies},
+ url = {https://huggingface.co/Motif-Technologies/Motif-Video-2B/blob/main/motif-video-technical-report.pdf}
  }
  ```
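
Note: this change fills in the empty `[technical report]()` targets and the empty BibTeX `url`, but `<a href="">Hugging Face</a>` is still empty on both sides of the diff. A minimal sketch of a check that would flag such links is shown below. The function name `find_empty_links` and both regular expressions are illustrative assumptions, not part of this repository, and the patterns only cover the two link styles that appear in this README.

```python
import re

# Markdown inline links with an empty target, e.g. [technical report]()
EMPTY_MD_LINK = re.compile(r"\[([^\]]+)\]\(\s*\)")
# HTML anchors with an empty href, e.g. <a href="">Hugging Face</a>
EMPTY_HREF = re.compile(r'<a\s+[^>]*href=""[^>]*>(.*?)</a>', re.IGNORECASE)


def find_empty_links(text):
    """Return (line_number, link_text) pairs for links that have no target."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in (EMPTY_MD_LINK, EMPTY_HREF):
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(1)))
    return hits


# Sample lines copied from the README shown in this diff.
sample = "\n".join([
    "see Section 3.3 of the [technical report]().",
    '<a href="">Hugging Face</a> &nbsp;|&nbsp;',
    '<a href="https://motiftech.io/videoshowcase">Project Page</a>',
])
print(find_empty_links(sample))
# prints [(1, 'technical report'), (2, 'Hugging Face')]
```

Running something like this over `README.md` before merging would surface the remaining empty Hugging Face link. It does not verify that non-empty targets actually resolve.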