Open-VLJEPA: MSRVTT in-domain checkpoint (R@1 = 23.9)
Trained checkpoint for Open-VLJEPA, an open re-implementation of VL-JEPA (Chen et al., 2026, Meta).
Note: this is a resource-poor re-implementation. Training was limited to a single workstation (8 × RTX 4090, 192 GB VRAM total) over a few days, versus the paper's 192 × H200 for 4 weeks on ~3.3 B samples. The numbers below reflect that gap.
My work runs on caffeine. If this checkpoint is useful to you, a small donation helps keep the lights on and more open re-implementations coming.
Results (MSRVTT test, 500-video pool)
This checkpoint is the in-domain MSRVTT Stage B model (50 epochs), initialized from the CC3M image-text Stage A checkpoint.
| Metric | T2V | V2T |
|---|---|---|
| R@1 | 23.90 | - |
| R@5 | 51.44 | - |
| R@10 | 65.05 | - |
See eval_history.json for the full 50-epoch trajectory.
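For readers unfamiliar with the metric, here is a minimal NumPy sketch of how text-to-video R@K can be computed from a similarity matrix. It assumes one caption per video, with the ground-truth video for query i sitting at index i; the function name `recall_at_k` is mine, and the actual protocol in scripts/eval.py may differ.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@K in percent.

    sim[i, j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i (square matrix).
    """
    sim = np.asarray(sim, dtype=np.float64)
    correct = np.diag(sim)[:, None]
    # Rank of the correct video = number of videos scoring strictly higher.
    ranks = (sim > correct).sum(axis=1)
    return {k: 100.0 * float(np.mean(ranks < k)) for k in ks}

# Toy example: 3 queries against a 3-video pool.
toy = [[0.9, 0.1, 0.0],
       [0.8, 0.5, 0.1],
       [0.2, 0.3, 0.1]]
scores = recall_at_k(toy, ks=(1, 2, 3))
```

Transposing the matrix before the call gives the video-to-text direction under the same assumption.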
Architecture
- X-Encoder: facebook/vjepa2-vitl-fpc64-256 (frozen; weights NOT in this ckpt, re-fetched from HF at load time)
- Predictor: last 8 layers of meta-llama/Llama-3.2-1B + projection, bi-directional attention
- Y-Encoder: google/embeddinggemma-300m + projection
- Loss: bi-directional InfoNCE in a shared 1536-D space
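As a rough illustration of the loss named above, here is a NumPy sketch of a symmetric (bi-directional) InfoNCE over a batch of paired embeddings. This is a reference sketch, not the training code from the repository: the function name and the temperature of 0.07 are assumptions, and the real implementation runs in PyTorch with autograd.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def bidirectional_info_nce(pred, target, temperature=0.07):
    """Symmetric InfoNCE; row i of pred is paired with row i of target.

    temperature=0.07 is an assumed value, not taken from the repo configs.
    """
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature            # (N, N) cosine logits
    idx = np.arange(logits.shape[0])
    loss_p2t = -_log_softmax(logits, axis=1)[idx, idx].mean()  # pred -> target
    loss_t2p = -_log_softmax(logits, axis=0)[idx, idx].mean()  # target -> pred
    return 0.5 * (loss_p2t + loss_t2p)
```

When every embedding in the batch is identical, the logits are uniform and the loss equals log(N), which is a handy sanity check.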
Only the trainable parts (predictor + y_encoder + projections, ~800 M params) are saved in best.pt. The frozen V-JEPA 2 encoder is loaded directly from facebook/vjepa2-vitl-fpc64-256 at inference time.
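To make the "trainable parts only" idea concrete, here is a small sketch of how a checkpoint can be stripped of frozen-encoder weights before saving. The key prefix `x_encoder.` and the example key names are hypothetical; I have not mirrored the actual key layout of best.pt here.

```python
def trainable_only(state_dict, frozen_prefixes=("x_encoder.",)):
    """Drop every entry whose name starts with a frozen-module prefix.

    The default prefix is a hypothetical name for the frozen V-JEPA 2 encoder.
    """
    prefixes = tuple(frozen_prefixes)
    return {name: value for name, value in state_dict.items()
            if not name.startswith(prefixes)}

# Plain values stand in for tensors so the example runs without PyTorch.
full = {
    "x_encoder.blocks.0.weight": 1,    # frozen, re-fetched from HF instead
    "predictor.layers.0.weight": 2,    # trainable
    "y_encoder.proj.weight": 3,        # trainable
}
kept = trainable_only(full)
```

On the loading side, a partial state dict like this is typically restored in PyTorch with `model.load_state_dict(state, strict=False)`, so the missing frozen-encoder keys are not treated as an error.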
Usage
git clone https://github.com/dion-jy/open-vljepa
cd open-vljepa
pip install torch transformers webdataset decord pyyaml huggingface-hub
# Download ckpt
huggingface-cli download cun-bjy/open-vljepa best.pt --local-dir checkpoints_msrvtt
# Demo
python scripts/demo_gif.py --ckpt checkpoints_msrvtt/best.pt
# Eval
python scripts/eval.py --ckpt checkpoints_msrvtt/best.pt --n_videos 500
Training
- Stage A: image-text on CC3M (~1 M pairs, 1 frame per visual input), 10 ep
- Stage B (this ckpt): MSRVTT, 50 ep, init from Stage A best
- 8 × RTX 4090, bs = 64/gpu (effective contrastive pool 512 via all-gather), LR 1e-4, bf16 + gradient checkpointing
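The "effective contrastive pool 512" figure follows from 8 GPUs × 64 samples each. The single-process NumPy simulation below sketches what the all-gather achieves: each rank scores its 64 local embeddings against the pooled 512, so every positive is contrasted with 511 negatives. Concatenation stands in for the real collective, and the variable names are illustrative rather than taken from the repository.

```python
import numpy as np

world_size, per_gpu_batch, dim = 8, 64, 1536
rng = np.random.default_rng(0)

# One (64, 1536) embedding block per simulated GPU.
local = [rng.standard_normal((per_gpu_batch, dim)).astype(np.float32)
         for _ in range(world_size)]

# Stand-in for all-gather: every rank ends up with the same pooled matrix.
pool = np.concatenate(local, axis=0)                  # (512, 1536)

rank = 3
logits = local[rank] @ pool.T                         # (64, 512)
# Positives for this rank sit at its own offset inside the pool.
targets = rank * per_gpu_batch + np.arange(per_gpu_batch)
negatives_per_sample = pool.shape[0] - 1              # 511
```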
Configs are in the GitHub repo (configs/stage_a.yaml, configs/stage_b_retrain.yaml).
License & Attribution
Built with Llama (meta-llama/Llama-3.2-1B, Llama 3.2 Community License) and Gemma (google/embeddinggemma-300m, Gemma License). V-JEPA 2 (facebook/vjepa2-vitl-fpc64-256, MIT) is used in frozen form.
By using this checkpoint you agree to:
- The Llama 3.2 Acceptable Use Policy
- The Gemma Acceptable Use Policy
Citation
@article{chen2026vljepa,
title = {VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language},
author = {Chen et al.},
journal = {arXiv preprint arXiv:2512.10942},
year = {2026}
}
@misc{openvljepa2026,
title = {Open-VLJEPA: a small-scale open re-implementation of VL-JEPA},
author = {Baek, Junyeob},
year = {2026},
howpublished = {\url{https://github.com/dion-jy/open-vljepa}},
note = {GitHub repository}
}