
language:
  - en
  - fr
license: apache-2.0
library_name: transformers
tags:
  - video-search
  - v-jepa
  - multi-modal
  - temporal-grounding
  - action-retrieval
datasets:
  - max044/Charades_v1_480
metrics:
  - loss

VL-JEPA Custom (V-JEPA 2 + Qwen 2.5 + MiniLM)

English Description

This model is a custom implementation of VL-JEPA (Video-Language Joint Embedding Predictive Architecture), inspired by Meta AI's research. It is designed for Temporal Moment Retrieval: given a text query, it locates the moment in a video where the described action occurs.

Architecture

X-Encoder (Video): Frozen V-JEPA 2 (ViT-L).
Predictor (Refinement): Qwen 2.5 0.5B fine-tuned using LoRA (Low-Rank Adaptation).
Y-Encoder (Text Target): Frozen all-MiniLM-L6-v2.
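
The three components above suggest a simple matching step: the predictor's output for a video segment is compared against the text target embedding. The sketch below illustrates only that comparison, in plain Python. It assumes cosine similarity as the matching score (this README does not state the metric) and uses toy 4-dimensional vectors in place of real embeddings. The names `cosine_similarity`, `predicted`, and `text_target` are illustrative and are not taken from the project's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for degenerate (all-zero) vectors
    return dot / (norm_a * norm_b)

# Toy stand-ins for predictor outputs, one per video window (assumed values)
predicted = {
    "window_0": [0.9, 0.1, 0.0, 0.0],
    "window_1": [0.1, 0.8, 0.1, 0.0],
}
# Toy stand-in for the text target embedding of the query (assumed values)
text_target = [1.0, 0.0, 0.0, 0.0]

# Score every window against the query and pick the best match
scores = {name: cosine_similarity(vec, text_target) for name, vec in predicted.items()}
best_window = max(scores, key=scores.get)
```

In a real pipeline the vectors would come from the frozen encoders and the LoRA-tuned predictor rather than from hand-written lists.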

Training Details

Dataset: Charades-STA (Academic dataset for video action localization).
Optimization: LoRA with $r = 64$ and $\alpha = 128$, targeting q_proj and v_proj in Qwen.
Learning Rate: 3e-4 with warmup and a cosine schedule.
Outcome: Only about 0.2% of the parameters are trainable, which makes training very lightweight.
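
To make the optimization settings above concrete, here is a minimal sketch in plain Python. It shows the standard LoRA scaling factor $\alpha / r$, the number of trainable parameters LoRA adds to one linear layer, and one common reading of a warmup-plus-cosine learning-rate schedule. The schedule shape (linear warmup, cosine decay to zero), the step counts, and the 1024-wide layer are assumptions for illustration. They are not values taken from this project's training code.

```python
import math

def lora_scale(r, alpha):
    """Scaling factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r

def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer.

    LoRA adds A (r x d_in) and B (d_out x r); the original weight stays frozen.
    """
    return r * (d_in + d_out)

def lr_at_step(step, total_steps, warmup_steps, peak_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# With r = 64 and alpha = 128, the low-rank update is scaled by 2
scale = lora_scale(r=64, alpha=128)

# Hypothetical 1024 -> 1024 projection, for illustration only
extra_params = lora_param_count(d_in=1024, d_out=1024, r=64)
```

Because the added parameter count grows with $r (d_{in} + d_{out})$ rather than $d_{in} \cdot d_{out}$, adapting only two projections per layer leaves the vast majority of weights frozen.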

Description (translated from French)

This model is a custom implementation of VL-JEPA, inspired by Meta AI's work. It is optimized for Temporal Moment Retrieval, that is, finding when an action occurs in a video.

Architecture

Video Encoder (X): frozen V-JEPA 2 (ViT-L).
Predictor: Qwen 2.5 0.5B adapted with LoRA.
Text Encoder (Y): frozen all-MiniLM-L6-v2.

Training Details

Dataset: Charades-STA.
Method: LoRA training with $r = 64$, $\alpha = 128$.
Cost: a very economical approach, trained for about 5 USD on Vast.ai.

Usage

```python
import torch
from vljepa.config import Config
from vljepa.models import VLJepa

# Build the model; the frozen encoders are set up by the project code
config = Config()
model = VLJepa(config)

# Load the trained weights: the LoRA predictor and the Y-encoder projection
checkpoint = torch.load("best.pth", map_location="cpu")
model.predictor.load_state_dict(checkpoint["predictor_state_dict"])
model.y_encoder.projection.load_state_dict(checkpoint["y_projection_state_dict"])
model.eval()

# Localizing an action
# (Requires preprocessing frames and tokenizing the query)
```

Refer to the source code for the full inference pipeline, which uses a sliding window over the video and non-maximum suppression (NMS).
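
As a rough illustration of how a sliding window and NMS fit together for temporal localization, the sketch below generates candidate segments, then greedily keeps the best-scoring ones while dropping overlapping duplicates. It is a generic plain-Python version, not the project's implementation. The function names, the window and stride values, the IoU threshold, and the scores are all assumptions made up for the example. In practice the scores would come from the model.

```python
def sliding_windows(duration, window, stride):
    """Return (start, end) candidate segments in seconds covering [0, duration]."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    segments = []
    start = 0.0
    while start < duration:
        segments.append((start, min(start + window, duration)))
        start += stride
    return segments

def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: visit segments by descending score and keep a segment
    only if it does not overlap an already kept one above the threshold."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return [(segments[i], scores[i]) for i in kept]

# Example: a 10-second video, 4-second windows, 2-second stride
candidates = sliding_windows(duration=10.0, window=4.0, stride=2.0)
# Made-up relevance scores, one per candidate segment
example_scores = [0.2, 0.9, 0.8, 0.3, 0.1]
results = temporal_nms(candidates, example_scores, iou_threshold=0.3)
```

Here the top-scoring segment (2 s to 6 s) suppresses its two neighbours, which each overlap it by half their length, so only non-redundant moments remain in `results`.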