
VL-JEPA: Simplified Video-Language Alignment

A simplified implementation of the Video-Language Joint Embedding Predictive Architecture (VL-JEPA) for Temporal Moment Retrieval (Temporal Grounding).

This project uses V-JEPA 2 for video understanding and Qwen 2.5 0.5B as a predictor to align video features with language queries in a high-dimensional embedding space.

🚀 Architecture

The model follows the JEPA framework by aligning video features (X) and text descriptions (Y) through a predictor (P):

  • X-Encoder (Video): Frozen V-JEPA 2 (ViT-L). High-fidelity hierarchical video features.
  • Y-Encoder (Text): Frozen MiniLM (all-MiniLM-L6-v2). Compact and efficient semantic text embeddings.
  • Predictor (Alignment): Qwen 2.5 0.5B with LoRA (Low-Rank Adaptation). Learns to predict the target text embedding from the joint video+query representation.
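At a glance, the predictor maps frozen video features into the text-embedding space, where the result is compared against the frozen text target. A minimal NumPy sketch of that data flow (the dimensions, pooling, and single projection matrix are illustrative stand-ins, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: V-JEPA 2 ViT-L temporal patch features and
# MiniLM sentence embeddings.
T, D_VID, D_TXT = 16, 1024, 384

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

# X: frozen video features for T temporal patches (stand-in for V-JEPA 2 output).
video_feats = rng.standard_normal((T, D_VID))

# Predictor (stand-in for the LoRA-tuned Qwen head): pools the video tokens
# and projects them into the text-embedding space.
W_proj = rng.standard_normal((D_VID, D_TXT)) * 0.02
z_pred = l2_normalize(video_feats.mean(axis=0) @ W_proj)

# Y: frozen MiniLM embedding of the target description (stand-in).
y_target = l2_normalize(rng.standard_normal(D_TXT))

# Training pushes the cosine similarity of (z_pred, y_target) up.
cos_sim = float(z_pred @ y_target)
```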

🛠️ Installation

This project uses uv for lightning-fast dependency management.

# Clone the repository
git clone https://github.com/max044/vl-jepa.git
cd vl-jepa

# Create environment and install dependencies
uv sync

📊 Data Preparation

The model is trained on the Charades-STA dataset for temporal grounding.

  1. Videos: Download Charades v1 and place them in data/Charades_v1_480.
  2. Annotations: Use download_annotations.py to download the annotations.

Structure:

data/
├── Charades_v1_480/      # Video files (.mp4)
├── charades_sta_train.txt
└── charades_sta_test.txt
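Each annotation line pairs a video ID and a time span with a free-text query. Assuming the standard Charades-STA layout (`video_id start end##sentence`; the sample line below is hypothetical), a minimal parser might look like:

```python
# Minimal parser for Charades-STA style annotation lines.
# Assumed layout: "video_id start_sec end_sec##sentence".
def parse_charades_sta(line: str) -> dict:
    header, query = line.strip().split("##", 1)
    video_id, start, end = header.split()
    return {
        "video_id": video_id,
        "start_sec": float(start),
        "end_sec": float(end),
        "query": query,
    }

# Hypothetical sample line in the assumed format.
sample = "3MSZA 24.3 30.4##person turns on the light."
ann = parse_charades_sta(sample)
```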

πŸ‹οΈ Training

Start training with default hyperparameters:

# Regular training (local, MPS/CPU)
uv run train.py

# Debug mode (small subset, only 2 epochs)
uv run train.py --debug --device mps

Key Training Features:

  • Bidirectional InfoNCE Loss: Maximizes mutual information between predicted and target embeddings.
  • LoRA Tuning: Only 0.2% of the predictor parameters (Qwen) are trained, making it extremely memory-efficient.
  • MPS Support: Optimized for Mac M1/M2/M3 chips.
  • W&B Integration: Full experiment tracking with model versioning.
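The bidirectional InfoNCE objective treats each matched (predicted, target) pair in a batch as a positive and every other pairing as a negative, averaging the loss over both directions. A self-contained NumPy sketch (the temperature and shapes are illustrative, not the project's actual hyperparameters):

```python
import numpy as np

def bidirectional_infonce(pred, target, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched (pred, target) rows are
    positives; every other row in the batch serves as a negative."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(pred))               # positives on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average of the pred->target and target->pred directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
targets = rng.standard_normal((8, 384))
loss_aligned = bidirectional_infonce(targets.copy(), targets)  # perfect match
loss_random = bidirectional_infonce(rng.standard_normal((8, 384)), targets)
```

Perfectly aligned predictions drive the loss toward zero, while random predictions stay near log(batch size).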

☁️ Cloud GPU Training

Train on a cloud GPU via Vast.ai (~$0.50–2/hr for an A100/H100).

Quick Start

# 1. On the cloud instance: bootstrap
curl -sSL https://raw.githubusercontent.com/max044/vl-jepa/main/scripts/bootstrap.sh | bash

# 2. Configure W&B
cd ~/vl-jepa
cp .env.example .env
nano .env  # Set WANDB_API_KEY (get it at https://wandb.ai/authorize)

# 3. Download videos
wget -P data/ https://ai2-public-datasets.s3-us-west-2.amazonaws.com/charades/Charades_v1_480.zip
unzip data/Charades_v1_480.zip -d data/

# ... or download from the Hugging Face Hub instead:

uv run hf download max044/Charades_v1_480 --local-dir data/Charades_v1_480 --repo-type dataset

# 4. Launch training
bash scripts/train_cloud.sh

W&B Experiment Tracking

All training runs are tracked on Weights & Biases:

  • Metrics: loss, InfoNCE, learning rate (per step + per epoch)
  • System: GPU utilization, memory usage (automatic)
  • Model versioning: checkpoints uploaded as W&B Artifacts (vl-jepa-best, vl-jepa-last); every version is preserved and downloadable

# Train with W&B (default)
uv run train.py --device cuda --wandb-project vl-jepa

# Train without W&B
uv run train.py --device cuda --no-wandb

# Custom W&B run name
uv run train.py --device cuda --wandb-run-name "exp-lr3e4-bs16"

Environment Variables

Variable        Description                                 Required
WANDB_API_KEY   W&B API key (https://wandb.ai/authorize)    For tracking
WANDB_PROJECT   W&B project name (default: vl-jepa)         No
WANDB_ENTITY    W&B team/organization                       No
EPOCHS          Override epoch count                        No
BATCH_SIZE      Override batch size                         No

🔍 Inference (Moment Retrieval)

Once trained, the model can find specific moments in a video from a text query. The inference script scores candidate segments with a sliding window and removes overlapping candidates via non-maximum suppression (NMS), keeping the best-matching segments.

# Example: Local inference
uv run infer.py \
    --video data/Charades_v1_480/3MSZA.mp4 \
    --query "person turns on the light" \
    --checkpoint checkpoints/best.pth \
    --device mps
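The sliding-window + NMS step can be sketched as a greedy 1-D non-maximum suppression over scored (start, end) candidates. This is a generic illustration, not the project's exact infer.py logic, and the candidate segments below are made up:

```python
def temporal_nms(segments, iou_threshold=0.5):
    """Greedy NMS over (start_sec, end_sec, score) segments: keep the
    highest-scoring segment, drop any later one that overlaps it too much."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(iou(seg, k) < iou_threshold for k in kept):
            kept.append(seg)
    return kept

# Hypothetical overlapping sliding-window candidates.
candidates = [(24.0, 30.0, 0.91), (25.0, 31.0, 0.88), (40.0, 46.0, 0.75)]
best = temporal_nms(candidates, iou_threshold=0.5)
```

Here the second candidate overlaps the first by IoU ≈ 0.71 and is suppressed, leaving two disjoint moments.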

🔍 Implementation Details

Unlike standard VLMs (Vision-Language Models), which decode answers through a generative head, this VL-JEPA implementation focuses on embedding alignment. That makes retrieval tasks (search) an order of magnitude faster, since embeddings can be pre-computed and indexed with vector databases (Faiss, Milvus, Chroma).
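For example, once clip embeddings are pre-computed and L2-normalized, retrieval reduces to a single matrix multiplication against the index. A NumPy stand-in for a vector-database lookup (index size and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-computed, L2-normalized clip embeddings (stand-in for a Faiss/Milvus index).
index = rng.standard_normal((1000, 384))
index /= np.linalg.norm(index, axis=1, keepdims=True)

def search(query_emb, index, top_k=5):
    """Cosine-similarity retrieval: one matmul instead of a generative pass."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = index @ q
    top = np.argsort(-scores)[:top_k]   # indices of best-matching clips
    return top, scores[top]

ids, scores = search(rng.standard_normal(384), index, top_k=5)
```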

📚 References

This implementation is based on the official VL-JEPA paper:

@misc{chen2026vljepajointembeddingpredictive,
      title={VL-JEPA: Joint Embedding Predictive Architecture for Vision-language}, 
      author={Delong Chen and Mustafa Shukor and Theo Moutakanni and Willy Chung and Jade Yu and Tejaswi Kasarla and Yejin Bang and Allen Bolourchi and Yann LeCun and Pascale Fung},
      year={2026},
      eprint={2512.10942},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10942}, 
}

📄 License

MIT