| --- |
| language: |
| - en |
| - fr |
| license: apache-2.0 |
| library_name: transformers |
| tags: |
| - video-search |
| - v-jepa |
| - multi-modal |
| - temporal-grounding |
| - action-retrieval |
| datasets: |
| - max044/Charades_v1_480 |
| metrics: |
| - loss |
| --- |
| |
| # VL-JEPA Custom (V-JEPA 2 + Qwen 2.5 + MiniLM) |
|
|
| ## English Description |
|
|
| This model is a custom implementation of the **VL-JEPA** (Video-Language Joint |
| Embedding Predictive Architecture) inspired by Meta AI's research. It is |
| designed for **Temporal Moment Retrieval** (finding specific actions in videos). |
|
|
| ### Architecture |
|
|
| - **X-Encoder (Video)**: Frozen |
| [V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256). |
| - **Predictor (Refinement)**: |
| [Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) fine-tuned using |
| **LoRA** (Low-Rank Adaptation). |
| - **Y-Encoder (Text Target)**: Frozen |
| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). |
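
In this architecture, retrieval reduces to comparing embeddings: the predictor maps V-JEPA 2 video features toward the MiniLM text-embedding space, and a query matches a clip when the two vectors are close in cosine similarity. A minimal, dependency-free sketch of that final comparison (the example vectors are made up; the real embeddings would be 384-dimensional MiniLM outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: in the real pipeline, `z_pred` would come from the
# Qwen predictor applied to V-JEPA 2 video features, and `z_text` from the
# frozen MiniLM Y-encoder, both projected to a shared dimension.
z_pred = [0.2, 0.7, -0.1]
z_text = [0.25, 0.65, -0.05]
score = cosine_similarity(z_pred, z_text)  # higher = better video/query match
```

Ranking candidate clips by this score against a single query embedding is what turns the joint-embedding model into a retrieval system.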
|
|
| ### Training Details |
|
|
| - **Dataset**: |
| [Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480) |
| (Academic dataset for video action localization). |
- **Optimization**: LoRA with r = 64 and α = 128, targeting the `q_proj` and
  `v_proj` modules in Qwen.
- **Learning Rate**: 3e-4 with a cosine schedule and warmup.
- **Outcome**: Only about 0.2% of the parameters are trainable, which keeps
  training extremely lightweight.
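
The trainable-parameter share follows from LoRA's bookkeeping: each adapted weight matrix of shape (d_out, d_in) gains two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r), i.e. r·(d_in + d_out) new parameters. A quick illustrative calculation (the 896 hidden size matches the public Qwen2.5-0.5B config; the overall fraction also depends on counting the frozen V-JEPA 2 and MiniLM backbones, so treat the numbers as a sketch):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

R = 64
# Illustrative square attention projection (e.g. an 896x896 q_proj weight).
frozen = 896 * 896                 # parameters that stay frozen in that matrix
added = lora_params(896, 896, R)   # 64 * (896 + 896) = 114_688 trainable
ratio = added / frozen             # 1/7 of that one matrix is trainable
```

Summed over only the targeted `q_proj`/`v_proj` matrices and divided by the full frozen stack (V-JEPA 2 + Qwen + MiniLM), the trainable share drops to the sub-percent range reported above.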
|
|
| --- |
|
|
## Description in French


This model is a custom implementation of **VL-JEPA**, inspired by Meta AI's
work. It is optimized for retrieving the temporal moments of actions in videos
(**Temporal Moment Retrieval**).


### Architecture


- **Video Encoder (X)**: frozen
  [V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256).
- **Predictor**: [Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
  adapted with **LoRA**.
- **Text Encoder (Y)**: frozen
  [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).


### Training Details


- **Dataset**:
  [Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480).
- **Method**: Training via LoRA with r = 64, α = 128.
- **Cost**: Very economical; trained for roughly $5 of compute on Vast.ai.
|
|
## Usage
|
|
```python
import torch

from vljepa.config import Config
from vljepa.models import VLJepa

# Build the model and restore the trained parts (LoRA predictor and the
# Y-encoder projection); the V-JEPA 2 and MiniLM backbones stay frozen with
# their pretrained weights.
config = Config()
model = VLJepa(config)
checkpoint = torch.load("best.pth", map_location="cpu")
model.predictor.load_state_dict(checkpoint["predictor_state_dict"])
model.y_encoder.projection.load_state_dict(checkpoint["y_projection_state_dict"])
model.eval()

# Localizing an action additionally requires preprocessing the video frames
# and tokenizing the text query before calling the model.
```
|
|
Refer to the source code for the full inference pipeline, including
sliding-window scoring and non-maximum suppression (NMS).
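
The deduplication step mentioned above can be sketched compactly: greedy 1-D non-maximum suppression over sliding-window proposals, each a `(start_sec, end_sec, score)` tuple. The 0.5 IoU threshold here is an assumption for illustration, not necessarily the project's setting:

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring windows, drop heavy overlaps.

    proposals: list of (start_sec, end_sec, score) tuples.
    """
    kept = []
    for p in sorted(proposals, key=lambda x: x[2], reverse=True):
        if all(temporal_iou(p, k) < iou_threshold for k in kept):
            kept.append(p)
    return kept

windows = [(0.0, 4.0, 0.9), (1.0, 5.0, 0.8), (10.0, 14.0, 0.7)]
print(temporal_nms(windows))  # keeps (0.0, 4.0, 0.9) and (10.0, 14.0, 0.7)
```

The second window overlaps the first at IoU 0.6 and is suppressed, while the non-overlapping third window survives, leaving one proposal per detected moment.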
|
|