Open-VLJEPA — MSRVTT in-domain checkpoint (R@1 = 23.9)

Trained checkpoint for Open-VLJEPA — an open re-implementation of VL-JEPA (Chen et al., 2026, Meta).

Note — this is a resource-poor re-implementation. Training was limited to a single workstation (8 × RTX 4090, 192 GB VRAM total) over a few days, versus the paper's 192 × H200 for 4 weeks on ~3.3 B samples. Numbers below reflect that gap.

My work runs on caffeine ☕. If this checkpoint is useful to you, a small donation helps keep the lights on and more open re-implementations coming.

Results (MSRVTT test, 500-video pool)

This checkpoint is the Stage B model: trained in-domain on MSRVTT for 50 epochs, initialized from the Stage A checkpoint trained on CC3M image-text pairs.

Metric	T2V	V2T
R@1	23.90	—
R@5	51.44	—
R@10	65.05	—

See eval_history.json for the full 50-epoch trajectory.
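
For readers unfamiliar with the metric, the R@K values above can be computed from a text-to-video similarity matrix roughly as follows. This is a minimal pure-Python sketch, not the code in scripts/eval.py: the function name recall_at_k, the gt index list, and the toy similarity values are illustrative assumptions.

```python
def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Text-to-video Recall@K, in percent.

    sim[i][j] is the similarity between text query i and video j.
    gt[i] is the index of the correct video for query i (several
    queries may point at the same video).
    """
    hits = {k: 0 for k in ks}
    for row, target in zip(sim, gt):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for s in row if s > row[target])
        for k in ks:
            if rank < k:
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(gt) for k in ks}


# Toy example: 3 text queries against a pool of 3 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video (index 0) is ranked first
    [0.8, 0.5, 0.1],  # query 1: correct video (index 1) is ranked second
    [0.3, 0.2, 0.1],  # query 2: correct video (index 2) is ranked third
]
scores = recall_at_k(sim, gt=[0, 1, 2], ks=(1, 2, 3))
print(scores)
```

In this toy case one of three queries is a hit at K=1, two at K=2, and all three at K=3. Ties are resolved in favour of the correct video here, which is a choice of this sketch and may differ from the repository's evaluation.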

Architecture

X-Encoder: facebook/vjepa2-vitl-fpc64-256 (frozen, weights NOT in this ckpt — re-fetched from HF at load time)
Predictor: last 8 layers of meta-llama/Llama-3.2-1B + projection, bi-directional attention
Y-Encoder: google/embeddinggemma-300m + projection
Loss: bi-directional InfoNCE in a shared 1536-D space

Only the trainable parts (predictor + y_encoder + projections, ~800 M params) are saved in best.pt. The frozen V-JEPA 2 encoder is loaded directly from facebook/vjepa2-vitl-fpc64-256 at inference time.
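
To make the loss listed above concrete, here is a framework-free sketch of a bi-directional (symmetric) InfoNCE objective. It is an illustration under stated assumptions, not the repository's training code: the function names, the temperature of 0.07, and the 2-D toy vectors are hypothetical, and the real model operates on 1536-D embeddings with PyTorch tensors.

```python
import math


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def bidirectional_info_nce(pred, target, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings.

    pred[i] and target[i] are L2-normalised vectors forming a positive pair;
    every other target in the batch acts as a negative for pred[i], and
    vice versa. The temperature value is an assumption of this sketch.
    """
    logits = [
        [sum(a * b for a, b in zip(p, t)) / temperature for t in target]
        for p in pred
    ]
    logits_t = [list(col) for col in zip(*logits)]  # target-to-pred direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy batch of two pairs in 2-D.
aligned = bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
uniform = bidirectional_info_nce([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

Correctly aligned pairs drive the loss toward zero, swapped pairs are penalised heavily, and indistinguishable embeddings give the chance-level value ln(batch size).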

Usage

git clone https://github.com/dion-jy/open-vljepa
cd open-vljepa
pip install torch transformers webdataset decord pyyaml huggingface-hub

# Download ckpt
huggingface-cli download cun-bjy/open-vljepa best.pt --local-dir checkpoints_msrvtt

# Demo
python scripts/demo_gif.py --ckpt checkpoints_msrvtt/best.pt
# Eval
python scripts/eval.py --ckpt checkpoints_msrvtt/best.pt --n_videos 500

Training

Stage A: image-text training on CC3M (~1 M pairs, 1 frame per visual input), 10 epochs
Stage B (this ckpt): MSRVTT, 50 epochs, initialized from the best Stage A checkpoint
Hardware and hyperparameters: 8 × RTX 4090, batch size 64 per GPU (effective contrastive pool of 512 via all-gather), learning rate 1e-4, bf16 with gradient checkpointing
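
The effective pool size quoted above is plain arithmetic, sketched below. The helper name contrastive_pool is hypothetical, and the sketch assumes every anchor is contrasted against the full all-gathered batch, with exactly one positive per anchor.

```python
import math


def contrastive_pool(world_size, per_gpu_batch):
    """Size of the all-gathered contrastive pool and related quantities."""
    pool = world_size * per_gpu_batch
    return {
        "pool": pool,                      # candidates each anchor is scored against
        "negatives_per_anchor": pool - 1,  # everything except the single positive
        "loss_at_chance": math.log(pool),  # InfoNCE value when all logits are equal
    }


stats = contrastive_pool(world_size=8, per_gpu_batch=64)
print(stats["pool"], stats["negatives_per_anchor"])
```

With 8 GPUs and 64 samples per GPU, each anchor sees 512 candidates, of which 511 are negatives.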

Configs are in the GitHub repo (configs/stage_a.yaml, configs/stage_b_retrain.yaml).

License & Attribution

Built with Llama (meta-llama/Llama-3.2-1B, Llama 3.2 Community License) and Gemma (google/embeddinggemma-300m, Gemma License). V-JEPA 2 (facebook/vjepa2-vitl-fpc64-256, MIT) is used in frozen form.

By using this checkpoint you agree to:

The Llama 3.2 Acceptable Use Policy
The Gemma Acceptable Use Policy

Citation

@article{chen2026vljepa,
  title   = {VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language},
  author  = {Chen and others},
  journal = {arXiv preprint arXiv:2512.10942},
  year    = {2026}
}

@misc{openvljepa2026,
  title        = {Open-VLJEPA: a small-scale open re-implementation of VL-JEPA},
  author       = {Baek, Junyeob},
  year         = {2026},
  howpublished = {\url{https://github.com/dion-jy/open-vljepa}},
  note         = {GitHub repository}
}
