---
language:
- en
- fr
license: apache-2.0
library_name: transformers
tags:
- video-search
- v-jepa
- multi-modal
- temporal-grounding
- action-retrieval
datasets:
- max044/Charades_v1_480
metrics:
- loss
---

# VL-JEPA Custom (V-JEPA 2 + Qwen 2.5 + MiniLM)

## Description

This model is a custom implementation of **VL-JEPA** (Video-Language Joint Embedding Predictive Architecture), inspired by Meta AI's research. It is designed for **Temporal Moment Retrieval** (localizing specific actions in videos).

### Architecture

- **X-Encoder (Video)**: Frozen [V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256).
- **Predictor (Refinement)**: [Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) fine-tuned with **LoRA** (Low-Rank Adaptation).
- **Y-Encoder (Text Target)**: Frozen [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

### Training Details

- **Dataset**: [Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480) (academic dataset for video action localization).
- **Optimization**: LoRA with \\(r=64\\) and \\(\alpha=128\\), targeting the `q_proj` and `v_proj` modules of Qwen.
- **Learning Rate**: 3e-4 with cosine warmup.
- **Outcome**: Only 0.2% of the parameters are trainable, making training extremely lightweight.
- **Cost**: Very economical; trained for roughly $5 of compute on Vast.ai.

## Usage

```python
import torch

from vljepa.config import Config
from vljepa.models import VLJepa

# Build the model and restore the trained weights: only the LoRA-adapted
# predictor and the Y-encoder projection were trained, so only those are loaded.
config = Config()
model = VLJepa(config)

checkpoint = torch.load("best.pth", map_location="cpu")
model.predictor.load_state_dict(checkpoint["predictor_state_dict"])
model.y_encoder.projection.load_state_dict(checkpoint["y_projection_state_dict"])
model.eval()

# Localizing an action
# (requires preprocessing the video frames and tokenizing the text query)
```

Refer to the source code for the full inference pipeline with sliding-window scoring and NMS.
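The sliding-window + NMS stage can be sketched as follows. This is a minimal illustration, not the repository's actual pipeline: `temporal_nms` and `score_windows` are hypothetical helpers, and it assumes each candidate window of frames has already been encoded and passed through the predictor to yield an embedding comparable to the query's MiniLM embedding.

```python
import numpy as np


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy 1-D non-maximum suppression over (start, end) time segments.

    Returns the indices of the kept segments, highest-scoring first.
    """
    segments = np.asarray(segments, dtype=float)
    order = np.argsort(scores)[::-1]  # candidate indices, sorted by score descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Temporal IoU between the kept segment and the remaining candidates
        inter_start = np.maximum(segments[i, 0], segments[rest, 0])
        inter_end = np.minimum(segments[i, 1], segments[rest, 1])
        inter = np.clip(inter_end - inter_start, 0.0, None)
        union = ((segments[i, 1] - segments[i, 0])
                 + (segments[rest, 1] - segments[rest, 0]) - inter)
        iou = inter / np.maximum(union, 1e-8)
        order = rest[iou <= iou_threshold]  # drop heavily overlapping candidates
    return keep


def score_windows(window_embeddings, query_embedding):
    """Cosine similarity between each predicted window embedding and the query."""
    w = window_embeddings / np.linalg.norm(window_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    return w @ q


# Example: four candidate windows (in seconds) with similarity scores
segments = [(0.0, 10.0), (2.0, 12.0), (20.0, 30.0), (21.0, 31.0)]
scores = np.array([0.9, 0.8, 0.7, 0.95])
kept = temporal_nms(segments, scores, iou_threshold=0.5)
print(kept)  # [3, 0]: lower-scoring windows that overlap a kept one are suppressed
```

In practice the windows would be generated at several temporal scales and strides, scored with `score_windows`, and the top surviving segments after NMS returned as the retrieved moments.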