[CLS] Token
#1
by insaf-im - opened
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
list(last_hidden_states.shape)
[1, 1568, 768]
The output of the VideoMAE encoder is a sequence of 1568 tokens, each with 768 features (the leading 1 is the batch size).
Which of these tokens, if any, is the [CLS] token?
Hi,
VideoMAE does not use a CLS token. The sequence length is equal to (num_frames // tubelet_size) * num_patches_per_frame, with num_patches_per_frame = (image_size // patch_size) ** 2.
Hence, in this case: (16 // 2) * (224 // 16) ** 2 = 8 * 196 = 1568.
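The arithmetic above can be checked with a small sketch. The helper name `videomae_sequence_length` is hypothetical (it is not part of Transformers), and the default values are the ones quoted in this thread for this checkpoint:

```python
def videomae_sequence_length(num_frames=16, tubelet_size=2,
                             image_size=224, patch_size=16):
    """Hypothetical helper: number of tokens the encoder produces per video."""
    # Spatial patches in one frame: (224 // 16) ** 2 = 14 * 14 = 196
    num_patches_per_frame = (image_size // patch_size) ** 2
    # Each tubelet spans `tubelet_size` frames: 16 // 2 = 8 temporal positions
    return (num_frames // tubelet_size) * num_patches_per_frame


seq_len = videomae_sequence_length()
print(seq_len)  # 1568
```

Every one of the 1568 positions is a patch (tubelet) token; none of them is reserved for a [CLS] token.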
To get a representation of an entire video, you can simply average pool the last hidden states along the sequence dimension:
import torch
video_features = torch.mean(last_hidden_states, dim=1)
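As a quick shape check for the pooling step, here is a minimal sketch that assumes a random tensor as a stand-in for the real encoder output, so it runs without downloading the model. Only the shapes are meaningful, not the values:

```python
import torch

# Stand-in for the encoder output, shaped (batch_size, sequence_length, hidden_size).
# Assumption: random values replace the real last hidden states.
last_hidden_states = torch.randn(1, 1568, 768)

# Average over the sequence dimension (dim=1) to get one vector per video.
video_features = torch.mean(last_hidden_states, dim=1)

print(list(video_features.shape))  # [1, 768]
```

Pooling over dim=1 collapses the 1568 patch tokens and keeps the batch and feature dimensions, giving a single 768-dimensional representation per video.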