Distilled from VJEPA feature space?
It seems like the ability to lean on VJEPA (pretrained on what, a billion videos?) probably jumpstarted the training a lot. Nice work, will dive into the architecture more. From the visual results, it seems comparable to Wan2.1 1.3B.
Thanks for checking it out, really appreciate it!
Yeah, V-JEPA gives a pretty strong prior on video features, so "distilled from JEPA" isn't a bad way to describe it. We use V-JEPA for REPA-style representation alignment, but only in the early training phase (noted in our tech report).
A couple of implementation details on why early-only:
(1) We disable REPA later in training, inspired by this paper. Interesting contrast: models like Waver go the other direction, applying REPA only from intermediate stages (480p), since REPA compute is relatively heavy during low-res / image pretraining. We went "REPA early, then off" based on that paper plus reason (2).
(2) Following this work, we worked under the assumption that dense features matter a lot for REPA to really pay off. V-JEPA (as the V-JEPA 2.1 paper notes, and as we show with examples in our tech report) isn't particularly dense-feature-rich. We didn't see the order-of-magnitude speedup the original REPA paper reported, and this is probably why. Next time around, I think the right move is a teacher that's both dense-feature-rich and has temporal compression (which V-JEPA 2.1 already does).
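For readers who haven't seen REPA before, here is a rough sketch of what "REPA early, then off" can look like in a training loss. It is not the code from the tech report: the function names, the 0.5 weight, and the cutoff step are all hypothetical, and it assumes the alignment term is a mean negative cosine similarity between the model's (already projected) intermediate tokens and the frozen teacher's tokens.

```python
import numpy as np

def repa_alignment_loss(student_tokens, teacher_tokens, eps=1e-8):
    """Mean negative cosine similarity between per-token features.

    Both inputs have shape (num_tokens, dim). In a real setup the student
    tokens would first go through a small trainable projection head so
    that their dimension matches the teacher's.
    """
    s = student_tokens / (np.linalg.norm(student_tokens, axis=-1, keepdims=True) + eps)
    t = teacher_tokens / (np.linalg.norm(teacher_tokens, axis=-1, keepdims=True) + eps)
    return float(-(s * t).sum(axis=-1).mean())

def repa_weight(step, cutoff_step, base_weight=0.5):
    """Hypothetical schedule: constant weight early, zero from the cutoff on."""
    return base_weight if step < cutoff_step else 0.0

def total_loss(diffusion_loss, student_tokens, teacher_tokens, step, cutoff_step):
    """Diffusion loss plus the alignment term while REPA is still active."""
    w = repa_weight(step, cutoff_step)
    if w == 0.0:
        # REPA is off: the teacher forward pass can be skipped entirely.
        return diffusion_loss
    return diffusion_loss + w * repa_alignment_loss(student_tokens, teacher_tokens)
```

Once the weight drops to zero, the teacher is no longer needed at all, which is where the compute saving in later (higher-resolution) stages would come from.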
More details on the rest of the architecture are in the tech report if you want to dig in.
Thanks again!