---
language:
- en
- fr
license: apache-2.0
library_name: transformers
tags:
- video-search
- v-jepa
- multi-modal
- temporal-grounding
- action-retrieval
datasets:
- max044/Charades_v1_480
metrics:
- loss
---
# VL-JEPA Custom (V-JEPA 2 + Qwen 2.5 + MiniLM)
## English Description
This model is a custom implementation of the **VL-JEPA** (Video-Language Joint
Embedding Predictive Architecture) inspired by Meta AI's research. It is
designed for **Temporal Moment Retrieval** (finding specific actions in videos).
### Architecture
- **X-Encoder (Video)**: Frozen
[V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256).
- **Predictor (Refinement)**:
[Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) fine-tuned using
**LoRA** (Low-Rank Adaptation).
- **Y-Encoder (Text Target)**: Frozen
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
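The JEPA objective behind this architecture can be sketched in a few lines: the predictor maps frozen video embeddings into the text-embedding space, and training pulls each prediction toward the frozen Y-encoder output. The sketch below is purely conceptual — shapes, the random projection standing in for Qwen, and the simple cosine loss are illustrative assumptions, not the actual `vljepa` API.

```python
import numpy as np

rng = np.random.default_rng(0)
video_emb = rng.normal(size=(4, 1024))   # stand-in for frozen V-JEPA 2 (X-encoder) output
text_emb = rng.normal(size=(4, 384))     # stand-in for frozen MiniLM (Y-encoder) output
W = rng.normal(size=(1024, 384)) * 0.01  # stand-in for the LoRA-tuned Qwen predictor

# Predict a text embedding from the video embedding, then compare
# against the target with cosine similarity.
pred = video_emb @ W
pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
tgt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
scores = (pred * tgt).sum(axis=-1)       # one similarity per (video, query) pair
loss = float((1.0 - scores).mean())      # simple embedding-prediction loss
```

At inference time, the same similarity score ranks candidate video segments against a text query.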
### Training Details
- **Dataset**:
[Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480)
(Academic dataset for video action localization).
- **Optimization**: LoRA with `r=64` and `α=128`, targeting the `q_proj` and
  `v_proj` modules in Qwen.
- **Learning Rate**: 3e-4 with a cosine schedule and warmup.
- **Outcome**: Only 0.2% of parameters are trainable, making the model
  extremely lightweight to fine-tune.
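With the `peft` library, the LoRA setup described above could be reconstructed roughly as follows. This is a hedged reconstruction from the stated hyper-parameters (`r=64`, `α=128`, `q_proj`/`v_proj`); the actual training script may set additional options such as dropout or bias handling.

```python
from peft import LoraConfig

# Illustrative LoRA configuration matching the reported hyper-parameters;
# not the exact training script.
lora_cfg = LoraConfig(
    r=64,                                 # rank of the low-rank update
    lora_alpha=128,                       # scaling factor (alpha / r = 2.0)
    target_modules=["q_proj", "v_proj"],  # attention projections in Qwen
    task_type="CAUSAL_LM",
)
# Applied via peft.get_peft_model(base_model, lora_cfg); only the adapter
# weights (the ~0.2% of parameters noted above) receive gradients.
```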
---
## Description en Français
Ce modèle est une implémentation personnalisée de **VL-JEPA**, inspirée des
travaux de Meta AI. Il est optimisé pour la recherche d'actions temporelles dans
les vidéos (**Temporal Moment Retrieval**).
### Architecture
- **Encodeur Vidéo (X)** :
[V-JEPA 2 (ViT-L)](https://huggingface.co/facebook/vjepa2-vitl-fpc64-256)
gelé.
- **Prédicteur** : [Qwen 2.5 0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
adapté avec **LoRA**.
- **Encodeur Texte (Y)** :
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
gelé.
### Détails d'Entraînement
- **Dataset** :
[Charades-STA](https://huggingface.co/datasets/max044/Charades_v1_480).
- **Méthode** : Entraînement via LoRA (`r=64`, `α=128`).
- **Coût** : Approche très économique, entraînée pour environ 5$ sur Vast.ai.
## Usage / Utilisation
```python
import torch

from vljepa.config import Config
from vljepa.models import VLJepa

# Build the model and restore the trained components
# (predictor + text projection head)
config = Config()
model = VLJepa(config)

checkpoint = torch.load("best.pth", map_location="cpu")
model.predictor.load_state_dict(checkpoint["predictor_state_dict"])
model.y_encoder.projection.load_state_dict(checkpoint["y_projection_state_dict"])
model.eval()

# Localizing an action
# (Requires preprocessing frames and tokenizing the query)
```
Refer to the source code for the full inference pipeline with sliding window and
NMS.
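The non-maximum-suppression step over sliding-window proposals can be sketched as greedy 1-D NMS on `(start, end, score)` segments. The function and the IoU threshold below are illustrative assumptions, not the repository's actual implementation.

```python
def temporal_nms(segments, iou_threshold=0.5):
    """Greedy 1-D NMS over (start, end, score) proposals.

    Keeps the highest-scoring segment, then drops any remaining segment
    whose temporal IoU with a kept one exceeds the threshold.
    """
    segments = sorted(segments, key=lambda s: s[2], reverse=True)
    kept = []
    for start, end, score in segments:
        suppressed = False
        for ks, ke, _ in kept:
            inter = max(0.0, min(end, ke) - max(start, ks))
            union = (end - start) + (ke - ks) - inter
            if union > 0 and inter / union > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            kept.append((start, end, score))
    return kept

proposals = [(0.0, 5.0, 0.9), (1.0, 6.0, 0.8), (10.0, 14.0, 0.7)]
print(temporal_nms(proposals))  # [(0.0, 5.0, 0.9), (10.0, 14.0, 0.7)]
```

The second proposal overlaps the first beyond the threshold and is suppressed; the disjoint third proposal survives.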