---
license: mit
pipeline_tag: video-classification
tags:
- video
library_name: transformers
datasets:
- HuggingFaceM4/something_something_v2
base_model:
- facebook/vjepa2-vitg-fpc64-384
---

# V-JEPA 2

A frontier video understanding model developed by FAIR, Meta. It extends the pretraining objectives of [VJEPA](https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/), leveraging data and model scale to achieve state-of-the-art video understanding.
The code is released [in this repository](https://github.com/facebookresearch/vjepa2).

<div style="background-color: rgba(251, 255, 120, 0.4); padding: 10px; color: black; border-radius: 5px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
💡 This is the V-JEPA 2 <a href="https://huggingface.co/facebook/vjepa2-vitg-fpc64-384">ViT-g 384</a> model with a video classification head trained on the <a href="https://paperswithcode.com/dataset/something-something-v2" style="color: black;">Something-Something-V2</a> dataset.
</div>
<br/>

<img src="https://github.com/user-attachments/assets/914942d8-6a1e-409d-86ff-ff856b7346ab">

## Installation

To run the V-JEPA 2 model, make sure you have the latest version of transformers installed:

```bash
pip install -U git+https://github.com/huggingface/transformers
```
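
The snippet below also decodes video with `torchcodec`, which is shipped separately from transformers (and expects FFmpeg to be available on the system). If you don't have it yet:

```bash
pip install torchcodec
```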

## Video classification code snippet

```python
import torch
import numpy as np

from torchcodec.decoders import VideoDecoder
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and video preprocessor
hf_repo = "facebook/vjepa2-vitg-fpc64-384-ssv2"

model = AutoModelForVideoClassification.from_pretrained(hf_repo).to(device)
processor = AutoVideoProcessor.from_pretrained(hf_repo)

# To load a video, sample the number of frames the model expects:
# for this model that is 64 (model.config.frames_per_clip).
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/bowling/-WH-lxmGJVY_000005_000015.mp4"
vr = VideoDecoder(video_url)
# Take every other frame, collecting 64 frames in total; a more flexible
# sampling strategy is sketched after the output below.
frame_idx = np.arange(0, model.config.frames_per_clip * 2, 2)
video = vr.get_frames_at(indices=frame_idx).data  # frames x channels x height x width

# Preprocess and run inference
inputs = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Report the top-5 classes and their softmax probabilities
print("Top 5 predicted class names:")
top5_probs, top5_indices = torch.softmax(logits, dim=-1).topk(5)
for idx, prob in zip(top5_indices[0], top5_probs[0]):
    text_label = model.config.id2label[idx.item()]
    print(f" - {text_label}: {prob:.2f}")
```
Output:
```
Top 5 predicted class names:
 - Putting [something] onto [something]: 0.39
 - Putting [something similar to other things that are already on the table]: 0.23
 - Stacking [number of] [something]: 0.07
 - Putting [something] into [something]: 0.04
 - Putting [number of] [something] onto [something]: 0.03
```
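
The snippet above takes every other frame from the start of the video. As a minimal sketch of the "more complex sampling strategy" the comment alludes to, you could instead spread the 64 frames uniformly over the whole clip; this assumes the `vr` decoder and `model` from the snippet above and uses torchcodec's stream metadata:

```python
import numpy as np

# Uniformly sample frames_per_clip frames across the entire video, rather
# than only its first few seconds. torchcodec exposes the total frame count
# via the decoder's metadata.
num_frames = vr.metadata.num_frames
frame_idx = np.linspace(0, num_frames - 1, model.config.frames_per_clip).astype(int).tolist()
video = vr.get_frames_at(indices=frame_idx).data  # frames x channels x height x width
```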
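
ViT-g is on the order of a billion parameters, so memory-constrained GPUs may benefit from half-precision inference. A minimal sketch, assuming a CUDA device and the `video` tensor from the snippet above (probabilities may differ slightly from the float32 run):

```python
import torch
from transformers import AutoModelForVideoClassification, AutoVideoProcessor

hf_repo = "facebook/vjepa2-vitg-fpc64-384-ssv2"

# Load the weights directly in float16 to roughly halve GPU memory use
model = AutoModelForVideoClassification.from_pretrained(
    hf_repo, torch_dtype=torch.float16
).to("cuda")
processor = AutoVideoProcessor.from_pretrained(hf_repo)

# Cast the preprocessed pixel values to the model's dtype before running
inputs = processor(video, return_tensors="pt").to(model.device, dtype=model.dtype)
with torch.no_grad():
    logits = model(**inputs).logits
```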

## Citation

```
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and
          Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and
          Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and
          Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and
          Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and
          Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```