---
language:
- en
- fr
license: apache-2.0
library_name: transformers
tags:
- video-search
- v-jepa
- multi-modal
- temporal-grounding
- action-retrieval
datasets:
- max044/Charades_v1_480
metrics:
- loss
---
# VL-JEPA Custom (V-JEPA 2 + Qwen 2.5 + MiniLM)
## English Description
This model is a custom implementation of the **VL-JEPA** (Video-Language Joint
Embedding Predictive Architecture) inspired by Meta AI's research. It is
designed for **Temporal Moment Retrieval** (finding specific actions in videos).
### Architecture
- **X-Encoder (Video)**: Frozen
[V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256).
- **Predictor (Refinement)**:
[Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) fine-tuned using
**LoRA** (Low-Rank Adaptation).
- **Y-Encoder (Text Target)**: Frozen
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
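The JEPA objective behind this architecture can be sketched in a few lines: the predictor maps frozen video embeddings into the text-embedding space, and training pulls each prediction toward the frozen Y-encoder output. The sketch below is purely conceptual — shapes, the random projection standing in for Qwen, and the simple cosine loss are illustrative assumptions, not the actual `vljepa` API.

```python
import numpy as np

rng = np.random.default_rng(0)
video_emb = rng.normal(size=(4, 1024))   # stand-in for frozen V-JEPA 2 (X-encoder) output
text_emb = rng.normal(size=(4, 384))     # stand-in for frozen MiniLM (Y-encoder) output
W = rng.normal(size=(1024, 384)) * 0.01  # stand-in for the LoRA-tuned Qwen predictor

# Predict a text embedding from the video embedding, then compare
# against the target with cosine similarity.
pred = video_emb @ W
pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
tgt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
scores = (pred * tgt).sum(axis=-1)       # one similarity per (video, query) pair
loss = float((1.0 - scores).mean())      # simple embedding-prediction loss
```

At inference time, the same similarity score ranks candidate video segments against a text query.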
### Training Details
- **Dataset**:
[Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480)
(Academic dataset for video action localization).
- **Optimization**: LoRA with `r=64` and `α=128`, targeting the `q_proj` and
  `v_proj` modules in Qwen.
- **Learning Rate**: 3e-4 with a cosine schedule and warmup.
- **Outcome**: Only 0.2% of parameters are trainable, making the model
  extremely lightweight to fine-tune.
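With the `peft` library, the LoRA setup described above could be reconstructed roughly as follows. This is a hedged reconstruction from the stated hyper-parameters (`r=64`, `α=128`, `q_proj`/`v_proj`); the actual training script may set additional options such as dropout or bias handling.

```python
from peft import LoraConfig

# Illustrative LoRA configuration matching the reported hyper-parameters;
# not the exact training script.
lora_cfg = LoraConfig(
    r=64,                                 # rank of the low-rank update
    lora_alpha=128,                       # scaling factor (alpha / r = 2.0)
    target_modules=["q_proj", "v_proj"],  # attention projections in Qwen
    task_type="CAUSAL_LM",
)
# Applied via peft.get_peft_model(base_model, lora_cfg); only the adapter
# weights (the ~0.2% of parameters noted above) receive gradients.
```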
---
## Description en Français
Ce modèle est une implémentation personnalisée de **VL-JEPA**, inspirée des
travaux de Meta AI. Il est optimisé pour la recherche d'actions temporelles dans
les vidéos (**Temporal Moment Retrieval**).
### Architecture
- **Encodeur Vidéo (X)** :
[V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256)
gelé.
- **Prédicteur** : [Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
adapté avec **LoRA**.
- **Encodeur Texte (Y)** :
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
gelé.
### Détails d'Entraînement
- **Dataset** :
[Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480).
- **Méthode** : Entraînement via LoRA (`r=64`, `α=128`).
- **Coût** : Approche très économique, entraînée pour environ 5$ sur Vast.ai.
## Usage / Utilisation
```python
import torch

from vljepa.config import Config
from vljepa.models import VLJepa

# Build the model and restore the trained components
# (predictor + text projection head)
config = Config()
model = VLJepa(config)

checkpoint = torch.load("best.pth", map_location="cpu")
model.predictor.load_state_dict(checkpoint["predictor_state_dict"])
model.y_encoder.projection.load_state_dict(checkpoint["y_projection_state_dict"])
model.eval()

# Localizing an action
# (Requires preprocessing frames and tokenizing the query)
```
Refer to the source code for the full inference pipeline with sliding window and
NMS.
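The non-maximum-suppression step over sliding-window proposals can be sketched as greedy 1-D NMS on `(start, end, score)` segments. The function and the IoU threshold below are illustrative assumptions, not the repository's actual implementation.

```python
def temporal_nms(segments, iou_threshold=0.5):
    """Greedy 1-D NMS over (start, end, score) proposals.

    Keeps the highest-scoring segment, then drops any remaining segment
    whose temporal IoU with a kept one exceeds the threshold.
    """
    segments = sorted(segments, key=lambda s: s[2], reverse=True)
    kept = []
    for start, end, score in segments:
        suppressed = False
        for ks, ke, _ in kept:
            inter = max(0.0, min(end, ke) - max(start, ks))
            union = (end - start) + (ke - ks) - inter
            if union > 0 and inter / union > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            kept.append((start, end, score))
    return kept

proposals = [(0.0, 5.0, 0.9), (1.0, 6.0, 0.8), (10.0, 14.0, 0.7)]
print(temporal_nms(proposals))  # [(0.0, 5.0, 0.9), (10.0, 14.0, 0.7)]
```

The second proposal overlaps the first beyond the threshold and is suppressed; the disjoint third proposal survives.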