VL-JEPA: Simplified Video-Language Alignment
A simplified implementation of the Video-Language Joint Embedding Predictive Architecture (VL-JEPA) for Temporal Moment Retrieval (Temporal Grounding).
This project uses V-JEPA 2 for video understanding and Qwen 2.5 0.5B as a predictor to align video features with language queries in a high-dimensional embedding space.
Architecture
The model follows the JEPA framework by aligning video features (X) and text descriptions (Y) through a predictor (P):
- X-Encoder (Video): Frozen V-JEPA 2 (ViT-L). High-fidelity hierarchical video features.
- Y-Encoder (Text): Frozen MiniLM (all-MiniLM-L6-v2). Compact and efficient semantic text embeddings.
- Predictor (Alignment): Qwen 2.5 0.5B with LoRA (Low-Rank Adaptation). Learns to predict the target text embedding from the joint video+query representation.
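Conceptually, the predictor maps the joint video representation into the text-embedding space, where it is compared against the frozen Y-encoder output. A minimal NumPy sketch of that data flow (the dimensions and the single linear "predictor" are illustrative stand-ins, not the real V-JEPA 2 / MiniLM / Qwen modules):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the actual model sizes)
d_video, d_text = 1024, 384

# Stand-ins for frozen encoder outputs of one clip/query pair
video_feats = rng.standard_normal(d_video)   # X-encoder (video) output
target_text = rng.standard_normal(d_text)    # Y-encoder (text) output

# Toy "predictor": one linear map standing in for the Qwen+LoRA module
W = rng.standard_normal((d_text, d_video)) * 0.01

def predict(x):
    # Map video features into the text-embedding space
    return W @ x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Training would push this cosine toward 1 for matching pairs
y_hat = predict(video_feats)
print(f"alignment before training: {cosine(y_hat, target_text):+.3f}")
```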
Installation
This project uses uv for lightning-fast dependency management.
# Clone the repository
git clone https://github.com/max044/vl-jepa.git
cd vl-jepa
# Create environment and install dependencies
uv sync
Data Preparation
The model is trained on the Charades-STA dataset for temporal grounding.
- Videos: Download Charades v1 and place the files in data/Charades_v1_480.
- Annotations: Use download_annotations.py to download the annotations.
Structure:
data/
├── Charades_v1_480/          # Video files (.mp4)
├── charades_sta_train.txt
└── charades_sta_test.txt
Training
Start training with default hyperparameters:
# Regular training (local, MPS/CPU)
uv run train.py
# Debug mode (small subset, only 2 epochs)
uv run train.py --debug --device mps
Key Training Features:
- Bidirectional InfoNCE Loss: Maximizes mutual information between predicted and target embeddings.
- LoRA Tuning: Only 0.2% of the predictor parameters (Qwen) are trained, making it extremely memory-efficient.
- MPS Support: Optimized for Mac M1/M2/M3 chips.
- W&B Integration: Full experiment tracking with model versioning.
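The bidirectional InfoNCE objective above can be sketched in plain NumPy (a toy stand-in for the actual training loss; batch size, dimension, and temperature here are illustrative):

```python
import numpy as np

def bidirectional_infonce(pred, target, temperature=0.07):
    """Symmetric InfoNCE over a batch of predicted vs. target embeddings.

    pred, target: (B, D) arrays; matching rows are the positive pairs.
    """
    # L2-normalize so dot products are cosine similarities
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)

    logits = pred @ target.T / temperature   # (B, B) similarity matrix
    labels = np.arange(len(pred))            # positives sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the pred->target and target->pred directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
loss_random = bidirectional_infonce(x, rng.standard_normal((8, 64)))
loss_aligned = bidirectional_infonce(x, x)   # perfectly aligned pairs
print(loss_random, loss_aligned)
```

With aligned pairs the diagonal dominates the similarity matrix and the loss collapses toward zero, while random pairs stay near log(B).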
Cloud GPU Training
Train on GPU with Vast.ai (~$0.50–2/h for A100/H100).
Quick Start
# 1. On the cloud instance: bootstrap
curl -sSL https://raw.githubusercontent.com/max044/vl-jepa/main/scripts/bootstrap.sh | bash
# 2. Configure W&B
cd ~/vl-jepa
cp .env.example .env
nano .env # Set WANDB_API_KEY (get it at https://wandb.ai/authorize)
# 3. Download videos
wget -P data/ https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1_480.zip
unzip data/Charades_v1_480.zip -d data/
or
uv run hf download max044/Charades_v1_480 --local-dir data/Charades_v1_480 --repo-type dataset
# 4. Launch training
bash scripts/train_cloud.sh
W&B Experiment Tracking
All training runs are tracked on Weights & Biases:
- Metrics: loss, InfoNCE, learning rate (per step + per epoch)
- System: GPU utilization, memory usage (automatic)
- Model versioning: checkpoints uploaded as W&B Artifacts (vl-jepa-best, vl-jepa-last); every version is preserved and downloadable
# Train with W&B (default)
uv run train.py --device cuda --wandb-project vl-jepa
# Train without W&B
uv run train.py --device cuda --no-wandb
# Custom W&B run name
uv run train.py --device cuda --wandb-run-name "exp-lr3e4-bs16"
Environment Variables
| Variable | Description | Required |
|---|---|---|
| WANDB_API_KEY | W&B API key (get it at https://wandb.ai/authorize) | For tracking |
| WANDB_PROJECT | W&B project name (default: vl-jepa) | No |
| WANDB_ENTITY | W&B team/organization | No |
| EPOCHS | Override epoch count | No |
| BATCH_SIZE | Override batch size | No |
Inference (Moment Retrieval)
Once trained, you can use the model to find specific moments in a video based on a text query. The script uses a sliding window approach with NMS to find the best matching segments.
# Example: Local inference
uv run infer.py \
--video data/Charades_v1_480/3MSZA.mp4 \
--query "person turns on the light" \
--checkpoint checkpoints/best.pth \
--device mps
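The sliding-window approach scores many overlapping segments, so overlapping high-scoring windows must be pruned. A minimal sketch of greedy 1-D temporal NMS over scored segments (the function name and data layout are illustrative, not the actual infer.py API):

```python
def temporal_nms(segments, iou_threshold=0.5):
    """Greedy 1-D non-maximum suppression over scored time segments.

    segments: list of (start_sec, end_sec, score) tuples.
    Returns the kept segments, highest score first.
    """
    def iou(a, b):
        # Temporal intersection-over-union of two [start, end) intervals
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    remaining = sorted(segments, key=lambda s: s[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)        # keep the highest-scoring window
        kept.append(best)
        # Drop windows that overlap it too much
        remaining = [s for s in remaining if iou(best, s) < iou_threshold]
    return kept

windows = [(0, 4, 0.9), (1, 5, 0.8), (10, 14, 0.7), (0, 3, 0.6)]
print(temporal_nms(windows))   # keeps (0, 4, 0.9) and (10, 14, 0.7)
```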
Implementation Details
Unlike standard VLMs (vision-language models) that use generative heads, this VL-JEPA implementation focuses on embedding alignment. This makes it an order of magnitude faster for retrieval tasks (search), since embeddings can be pre-computed and indexed with vector databases (Faiss, Milvus, Chroma).
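As a toy illustration of why pre-computed embeddings make retrieval fast, here is a brute-force cosine search over an in-memory index (a NumPy stand-in for a real vector database such as Faiss; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are pre-computed clip embeddings from the trained model
n_clips, dim = 1000, 64
index = rng.standard_normal((n_clips, dim))
index /= np.linalg.norm(index, axis=1, keepdims=True)   # unit vectors

def search(query, k=5):
    """Brute-force cosine search; Faiss/Milvus would replace this at scale."""
    q = query / np.linalg.norm(query)
    scores = index @ q                    # cosine similarity per clip
    top = np.argsort(-scores)[:k]         # indices of the k best clips
    return top, scores[top]

# A query embedding close to clip 42 should retrieve it first
query = index[42] + 0.05 * rng.standard_normal(dim)
ids, scores = search(query)
print(ids[0])   # -> 42
```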
References
This implementation is based on the official VL-JEPA paper:
@misc{chen2026vljepajointembeddingpredictive,
title={VL-JEPA: Joint Embedding Predictive Architecture for Vision-language},
author={Delong Chen and Mustafa Shukor and Theo Moutakanni and Willy Chung and Jade Yu and Tejaswi Kasarla and Yejin Bang and Allen Bolourchi and Yann LeCun and Pascale Fung},
year={2026},
eprint={2512.10942},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.10942},
}
License
MIT