VideoMAE-Base: HD-EPIC Verb & Noun Recognition

Pipeline code and example clips: cstokl3/bimanual_manip_data_processing. Clone that repo and follow action_recognition/README.md to run the full pipeline.

VideoMAE-Base (pretrained on SSv2), fine-tuned on HD-EPIC, an egocentric kitchen-activity dataset recorded in high definition with a head-mounted GoPro camera.

The model predicts three things simultaneously from a 16-frame video clip:

  • Verb (30 classes): the action being performed
  • Noun (50 classes): the object being interacted with
  • Hand side (3 classes): left / right / both
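As a rough illustration of how a clip of arbitrary length might be reduced to the 16 frames the model expects, the sketch below picks evenly spaced frame indices and applies the standard ImageNet channel statistics to a single pixel. The helper names `sample_frame_indices` and `normalise_pixel` are hypothetical, and uniform sampling is an assumption; the pipeline repo may sample and preprocess frames differently.

```python
# Standard ImageNet channel statistics (RGB order, pixel values scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def sample_frame_indices(num_video_frames, clip_len=16):
    """Pick clip_len frame indices spread evenly across the video.

    Hypothetical helper: takes the centre of each of clip_len equal
    segments, so short videos simply repeat frames.
    """
    if num_video_frames <= 0:
        raise ValueError("video has no frames")
    step = num_video_frames / clip_len
    return [min(num_video_frames - 1, int(step * (i + 0.5)))
            for i in range(clip_len)]


def normalise_pixel(rgb):
    """Map one 8-bit RGB pixel to ImageNet-normalised floats."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


if __name__ == "__main__":
    print(sample_frame_indices(32))      # 16 indices from a 32-frame video
    print(normalise_pixel((128, 128, 128)))
```

In practice the same normalisation would be applied to every pixel of every sampled frame before stacking the frames into a (B, T, C, H, W) tensor.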

Performance

Metric            Value
Val Verb top-1    41.4%
Val Verb MCA      41.5%
Epoch             5
Training set      16,852 clips (≤800/class)
Val set           2,973 clips
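The two verb metrics differ in how they weight classes. The sketch below assumes that MCA stands for mean class accuracy (per-class accuracy averaged over classes), which is the usual reading but is not spelled out in this card. The function names and the toy labels are illustrative only.

```python
from collections import defaultdict


def top1_accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label matches the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so rare classes count as much as common ones."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


if __name__ == "__main__":
    # Toy example: a model that always predicts the majority class.
    y_true = ["take"] * 8 + ["flip"] * 2
    y_pred = ["take"] * 10
    print(top1_accuracy(y_true, y_pred))        # 0.8
    print(mean_class_accuracy(y_true, y_pred))  # 0.5
```

On an imbalanced validation set, top-1 can be inflated by frequent verbs, whereas mean class accuracy penalises a model that ignores rare ones.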

Training is ongoing; this checkpoint reflects the best result so far.

Verb Classes (30)

take, put, wash, open, close, turn-on, cut, turn-off, pour, mix, move, remove, throw, dry, shake, scoop, adjust, squeeze, press, flip, turn, check, scrub, pull, pat, lift, hold, drop, transition, reach

Training Details

Setting                   Value
Base model                MCG-NJU/videomae-base-finetuned-ssv2
Input                     16 × 224×224, ImageNet-normalised, (B, T, C, H, W)
Batch size                8
Base LR                   5e-5 (head)
LR schedule               5-epoch linear warmup → cosine decay
Layer-wise LR decay       0.75 per block (14 param groups)
Backbone blocks frozen    0 (full fine-tune)
Dropout                   0.1
Label smoothing           0.1
Weight decay              0.05
Optimiser                 AdamW
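To make the learning-rate settings concrete, here is a minimal sketch of a warmup-plus-cosine multiplier and of layer-wise LR decay. Several details are assumptions rather than facts from this card: `total_epochs=30` is a placeholder (the card does not state the training length), warmup is assumed to start from zero, and the 14 groups are assumed to be ordered as patch embeddings, 12 transformer blocks, then the heads. The actual training script may group parameters or step the schedule differently.

```python
import math


def lr_scale(epoch, warmup_epochs=5, total_epochs=30):
    """Multiplier on the base LR: linear warmup, then cosine decay to zero.

    total_epochs is a hypothetical value chosen for illustration.
    """
    if epoch < warmup_epochs:
        return epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def layerwise_lrs(base_lr=5e-5, decay=0.75, num_groups=14):
    """Base LR per parameter group under layer-wise decay.

    Assumed ordering: index 0 is the earliest group (embeddings) and the
    last index is the heads, which keep the full base LR.
    """
    return [base_lr * decay ** (num_groups - 1 - i) for i in range(num_groups)]


if __name__ == "__main__":
    lrs = layerwise_lrs()
    for epoch in (0, 2.5, 5, 15, 30):
        head_lr = lrs[-1] * lr_scale(epoch)
        print(f"epoch {epoch:>4}: head LR = {head_lr:.2e}")
    print(f"earliest group base LR = {lrs[0]:.2e}")
```

Under these assumptions the effective LR of group `g` at a given epoch is `layerwise_lrs()[g] * lr_scale(epoch)`, so earlier layers always move more slowly than the heads.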

Quick Start

# 1. Clone the pipeline repo
git clone https://github.com/cstokl3/bimanual_manip_data_processing
cd bimanual_manip_data_processing

# 2. Download weights into action_recognition/
wget https://huggingface.co/cstokl3/videomae-hdepic-verb-recognition/resolve/main/best.pt \
     -O action_recognition/videomae_best.pt
wget https://huggingface.co/cstokl3/videomae-hdepic-verb-recognition/resolve/main/yolo_epic_kitchens.pt \
     -O action_recognition/yolo_epic_kitchens.pt

# 3. Run on your clips
python action_recognition/full_pipeline.py \
    --backbone videomae \
    --ckpt action_recognition/videomae_best.pt \
    --clips_dir egocentric_clips \
    --json results.json

Python Usage

import torch
from transformers import VideoMAEForVideoClassification

# Load checkpoint
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)

# Rebuild model (requires the VideoMAEMultiHead wrapper; see the pipeline repo)
from models.videomae_model import build_videomae

model = build_videomae(
    n_verb_classes=ckpt["n_action_classes"],   # 30
    n_noun_classes=ckpt["n_noun_classes"],     # 50
    pretrained="MCG-NJU/videomae-base-finetuned-ssv2",
)
model.load_state_dict(ckpt["model"])
model.eval()

verb_names = ckpt["verb_names"]   # list of 30 strings
noun_names = ckpt["noun_names"]   # list of 50 strings

# Inference. `frames` must be a float32 tensor of shape (1, 16, 3, 224, 224), ImageNet-normalised
with torch.no_grad():
    verb_logits, hand_logits, noun_logits = model(frames)

verb = verb_names[verb_logits.argmax(1).item()]
print("Predicted verb:", verb)
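The snippet above reports only the argmax verb. If ranked predictions with probabilities are more useful, a softmax followed by a sort is enough. The sketch below works on plain Python lists so it can be read in isolation; `top_k` is a hypothetical helper, not part of the pipeline repo, and the logits in the demo are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, names, k=5):
    """Return the k most probable (name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up logits for three of the verb classes.
    demo_logits = [2.0, 0.5, 1.0]
    demo_names = ["take", "put", "wash"]
    for name, prob in top_k(demo_logits, demo_names, k=2):
        print(f"{name}: {prob:.3f}")
```

With the tensors from the snippet above, `top_k(verb_logits[0].tolist(), verb_names)` would give the five highest-scoring verbs; the same call works for `noun_logits` with `noun_names`.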

Citation

If you use this model, please cite the HD-EPIC dataset and VideoMAE:

@inproceedings{tong2022videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  booktitle={NeurIPS},
  year={2022}
}