Open-VLJEPA: MSRVTT in-domain checkpoint (R@1 = 23.9)

Trained checkpoint for Open-VLJEPA, an open re-implementation of VL-JEPA (Chen et al., 2026, Meta).

Note: this is a resource-poor re-implementation. Training was limited to a single workstation (8 × RTX 4090, 192 GB VRAM total) over a few days, versus the paper's 192 × H200 for 4 weeks on ~3.3 B samples. Numbers below reflect that gap.

My work runs on caffeine ☕. If this checkpoint is useful to you, a small donation helps keep the lights on and more open re-implementations coming.

Results (MSRVTT test, 500-video pool)

This checkpoint is the in-domain MSRVTT Stage B model (50 epochs), initialized from the CC3M image-text Stage A checkpoint.

| Metric | T2V   | V2T |
|--------|-------|-----|
| R@1    | 23.90 | -   |
| R@5    | 51.44 | -   |
| R@10   | 65.05 | -   |

See eval_history.json for the full 50-epoch trajectory.
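
For readers unfamiliar with the metric, here is a minimal sketch of how text-to-video Recall@K can be computed from a similarity matrix. It is not the repository's evaluation code: the function name `recall_at_k`, the `gt` argument, and the toy scores are all illustrative assumptions. `gt[i]` gives the index of the correct video for text query `i`, so the sketch also covers pools where several captions map to the same video.

```python
def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Text-to-video Recall@K in percent.

    sim[i][j] -- similarity between text query i and video j
    gt[i]     -- index of the correct video for query i (hypothetical layout)
    """
    ranks = []
    for row, target in zip(sim, gt):
        # Rank of the correct video = 1 + number of videos scored strictly higher.
        ranks.append(1 + sum(1 for s in row if s > row[target]))
    n = len(ranks)
    return {k: 100.0 * sum(r <= k for r in ranks) / n for k in ks}


# Toy 4-query, 4-video example with made-up scores (not model output).
sim = [
    [0.9, 0.1, 0.2, 0.3],  # correct video 0 is ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # correct video 1 is ranked 2nd
    [0.1, 0.2, 0.9, 0.3],  # correct video 2 is ranked 1st
    [0.5, 0.6, 0.7, 0.4],  # correct video 3 is ranked 4th
]
gt = [0, 1, 2, 3]
scores = recall_at_k(sim, gt, ks=(1, 2, 4))
print(scores)
```

In this toy case two of four queries rank their video first, so R@1 is 50%. Ties are resolved in favor of the correct video here; a stricter evaluation could count equal scores against it.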

Architecture

  • X-Encoder: facebook/vjepa2-vitl-fpc64-256 (frozen; weights NOT in this ckpt, re-fetched from HF at load time)
  • Predictor: last 8 layers of meta-llama/Llama-3.2-1B + projection, bi-directional attention
  • Y-Encoder: google/embeddinggemma-300m + projection
  • Loss: bi-directional InfoNCE in a shared 1536-D space

Only the trainable parts (predictor + y_encoder + projections, ~800 M params) are saved in best.pt. The frozen V-JEPA 2 encoder is loaded directly from facebook/vjepa2-vitl-fpc64-256 at inference time.
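
The bi-directional InfoNCE loss listed above can be illustrated with a small, dependency-free sketch. This is an assumption-laden illustration rather than the training code: the temperature of 0.07, the equal 0.5 weighting of the two directions, and the function names are all hypothetical, and real training would operate on batched tensors in the 1536-D space.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def bidirectional_info_nce(pred, target, temperature=0.07):
    """Average of predictor-to-text and text-to-predictor InfoNCE.

    pred[i] and target[i] are assumed to be a matching pair;
    temperature=0.07 is an assumed value, not taken from the repo.
    """
    p = [l2_normalize(v) for v in pred]
    t = [l2_normalize(v) for v in target]
    logits = [[sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t]
              for pi in p]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the reverse direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


aligned = bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Matching pairs yield a loss near zero, while swapping the targets pushes the loss up, which is the behavior a contrastive objective should show.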

Usage

git clone https://github.com/dion-jy/open-vljepa
cd open-vljepa
pip install torch transformers webdataset decord pyyaml huggingface-hub

# Download ckpt
huggingface-cli download cun-bjy/open-vljepa best.pt --local-dir checkpoints_msrvtt

# Demo
python scripts/demo_gif.py --ckpt checkpoints_msrvtt/best.pt
# Eval
python scripts/eval.py --ckpt checkpoints_msrvtt/best.pt --n_videos 500

Training

  • Stage A: image-text on CC3M (~1 M pairs, 1 frame per visual input), 10 ep
  • Stage B (this ckpt): MSRVTT, 50 ep, init from Stage A best
  • 8 × RTX 4090, bs = 64/gpu (effective contrastive pool 512 via all-gather), LR 1e-4, bf16 + gradient checkpointing
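
To make the "effective contrastive pool 512" figure concrete, the sketch below simulates an all-gather in plain Python. It only models the bookkeeping: the placeholder tuples stand in for embeddings, and `gathered_pool` is a hypothetical helper, not a call from the repository or from `torch.distributed`.

```python
WORLD_SIZE = 8       # number of GPUs, from the training notes above
PER_GPU_BATCH = 64   # pairs per GPU, from the training notes above


def gathered_pool(local_batches):
    """Concatenate every rank's local batch, mimicking what an all-gather yields."""
    return [item for batch in local_batches for item in batch]


# Each rank holds 64 (rank, index) placeholders standing in for embeddings.
local_batches = [[(rank, i) for i in range(PER_GPU_BATCH)]
                 for rank in range(WORLD_SIZE)]
pool = gathered_pool(local_batches)

pool_size = len(pool)                      # 8 * 64
negatives_with_gather = pool_size - 1      # every other pair in the pool is a negative
negatives_without_gather = PER_GPU_BATCH - 1

print(pool_size, negatives_with_gather, negatives_without_gather)
```

Under these assumptions each anchor is contrasted against 511 negatives instead of the 63 it would see if every GPU computed the loss on its local batch alone.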

Configs are in the GitHub repo (configs/stage_a.yaml, configs/stage_b_retrain.yaml).

License & Attribution

Built with Llama (meta-llama/Llama-3.2-1B, Llama 3.2 Community License) and Gemma (google/embeddinggemma-300m, Gemma License). V-JEPA 2 (facebook/vjepa2-vitl-fpc64-256, MIT) is used in frozen form.

By using this checkpoint you agree to:

  • The Llama 3.2 Acceptable Use Policy
  • The Gemma Acceptable Use Policy

Citation

@article{chen2026vljepa,
  title   = {VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language},
  author  = {Chen et al.},
  journal = {arXiv preprint arXiv:2512.10942},
  year    = {2026}
}

@misc{openvljepa2026,
  title        = {Open-VLJEPA: a small-scale open re-implementation of VL-JEPA},
  author       = {Baek, Junyeob},
  year         = {2026},
  howpublished = {\url{https://github.com/dion-jy/open-vljepa}},
  note         = {GitHub repository}
}