---
language:
- en
- fr
license: apache-2.0
library_name: transformers
tags:
- video-search
- v-jepa
- multi-modal
- temporal-grounding
- action-retrieval
datasets:
- max044/Charades_v1_480
metrics:
- loss
---

# VL-JEPA Custom (V-JEPA 2 + Qwen 2.5 + MiniLM)

## Description

This model is a custom implementation of **VL-JEPA** (Video-Language Joint Embedding Predictive Architecture), inspired by Meta AI's research. It is designed for **Temporal Moment Retrieval** (localizing specific actions in videos).

### Architecture

- **X-Encoder (Video)**: Frozen [V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256).
- **Predictor (Refinement)**: [Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) fine-tuned with **LoRA** (Low-Rank Adaptation).
- **Y-Encoder (Text Target)**: Frozen [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

### Training Details

- **Dataset**: [Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480) (academic dataset for video action localization).
- **Optimization**: LoRA with \\(r=64\\) and \\(\alpha=128\\), targeting the `q_proj` and `v_proj` modules of Qwen.
- **Learning Rate**: 3e-4 with cosine warmup.
- **Outcome**: Only 0.2% of the parameters are trainable, making training extremely lightweight.
- **Cost**: Very economical; trained for roughly $5 of compute on Vast.ai.

## Usage

```python
import torch

from vljepa.config import Config
from vljepa.models import VLJepa

# Build the model and restore the trained weights: only the LoRA-adapted
# predictor and the Y-encoder projection were trained, so only those are loaded.
config = Config()
model = VLJepa(config)

checkpoint = torch.load("best.pth", map_location="cpu")
model.predictor.load_state_dict(checkpoint["predictor_state_dict"])
model.y_encoder.projection.load_state_dict(checkpoint["y_projection_state_dict"])
model.eval()

# Localizing an action
# (requires preprocessing the video frames and tokenizing the text query)
```

Refer to the source code for the full inference pipeline with sliding-window scoring and NMS.
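The sliding-window + NMS stage can be sketched as follows. This is a minimal illustration, not the repository's actual pipeline: `temporal_nms` and `score_windows` are hypothetical helpers, and it assumes each candidate window of frames has already been encoded and passed through the predictor to yield an embedding comparable to the query's MiniLM embedding.

```python
import numpy as np


def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy 1-D non-maximum suppression over (start, end) time segments.

    Returns the indices of the kept segments, highest-scoring first.
    """
    segments = np.asarray(segments, dtype=float)
    order = np.argsort(scores)[::-1]  # candidate indices, sorted by score descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Temporal IoU between the kept segment and the remaining candidates
        inter_start = np.maximum(segments[i, 0], segments[rest, 0])
        inter_end = np.minimum(segments[i, 1], segments[rest, 1])
        inter = np.clip(inter_end - inter_start, 0.0, None)
        union = ((segments[i, 1] - segments[i, 0])
                 + (segments[rest, 1] - segments[rest, 0]) - inter)
        iou = inter / np.maximum(union, 1e-8)
        order = rest[iou <= iou_threshold]  # drop heavily overlapping candidates
    return keep


def score_windows(window_embeddings, query_embedding):
    """Cosine similarity between each predicted window embedding and the query."""
    w = window_embeddings / np.linalg.norm(window_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    return w @ q


# Example: four candidate windows (in seconds) with similarity scores
segments = [(0.0, 10.0), (2.0, 12.0), (20.0, 30.0), (21.0, 31.0)]
scores = np.array([0.9, 0.8, 0.7, 0.95])
kept = temporal_nms(segments, scores, iou_threshold=0.5)
print(kept)  # [3, 0]: lower-scoring windows that overlap a kept one are suppressed
```

In practice the windows would be generated at several temporal scales and strides, scored with `score_windows`, and the top surviving segments after NMS returned as the retrieved moments.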