VideoMAE-Base — HD-EPIC Verb & Noun Recognition

Pipeline code and example clips: cstokl3/bimanual_manip_data_processing. Clone that repo and follow action_recognition/README.md to run the full pipeline.

Fine-tuned VideoMAE-Base (SSv2 pretrained) on HD-EPIC, an egocentric kitchen activity dataset collected in high-definition using a head-mounted GoPro camera.

The model predicts three things simultaneously from a 16-frame video clip:

Verb (30 classes): the action being performed
Noun (50 classes): the object being interacted with
Hand side (3 classes): left / right / both
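
The 16 input frames have to be picked from a clip that is usually longer. The sketch below shows one common choice, uniform temporal sampling. The function name `sample_frame_indices` and the sampling strategy itself are assumptions for illustration; the pipeline repo may select frames differently.

```python
def sample_frame_indices(num_frames_in_clip: int, num_samples: int = 16) -> list[int]:
    """Pick `num_samples` frame indices spread evenly across a clip.

    Assumption: uniform sampling that takes the centre of each equal-length
    segment. This is illustrative and not taken from the pipeline repo.
    """
    if num_frames_in_clip <= 0:
        raise ValueError("clip must contain at least one frame")
    # Split the clip into `num_samples` equal segments and take each centre.
    segment = num_frames_in_clip / num_samples
    indices = [int(segment * (i + 0.5)) for i in range(num_samples)]
    # Defensive clamp against float rounding at the end of the clip.
    return [min(idx, num_frames_in_clip - 1) for idx in indices]


if __name__ == "__main__":
    print(sample_frame_indices(160))  # a 160-frame clip
    print(sample_frame_indices(4))    # a clip shorter than 16 frames repeats frames
```

Clips with fewer than 16 frames still yield 16 indices, with frames repeated, so the input shape stays fixed.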

Performance

Metric	Value
Val Verb top-1	41.4%
Val Verb MCA	41.5%
Epoch	5
Training set	16,852 clips (≤800/class)
Val set	2,973 clips

Training is ongoing — this checkpoint reflects the best result so far.
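
The two verb metrics in the table answer different questions. The sketch below assumes that MCA stands for mean class accuracy, i.e. per-class accuracy averaged with equal weight per class, whereas top-1 weights every clip equally. The function names and the toy labels are illustrative and not taken from the pipeline repo.

```python
from collections import defaultdict


def top1_accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label equals the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so rare classes count as much as common ones."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Toy data: 8 "take" clips all correct, 2 "flip" clips with one mistake.
    y_true = ["take"] * 8 + ["flip"] * 2
    y_pred = ["take"] * 8 + ["take", "flip"]
    print(top1_accuracy(y_true, y_pred))        # 0.9
    print(mean_class_accuracy(y_true, y_pred))  # 0.75
```

On imbalanced data the two numbers can diverge sharply, as in the toy example, which is why both are worth reporting.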

Verb Classes (30)

take, put, wash, open, close, turn-on, cut, turn-off, pour, mix, move, remove, throw, dry, shake, scoop, adjust, squeeze, press, flip, turn, check, scrub, pull, pat, lift, hold, drop, transition, reach

Training Details

Setting	Value
Base model	MCG-NJU/videomae-base-finetuned-ssv2
Input	16 × 224×224, ImageNet-normalised, (B, T, C, H, W)
Batch size	8
Base LR	5e-5 (head)
LR schedule	5-epoch linear warmup → cosine decay
Layer-wise LR decay	0.75 per block (14 param groups)
Backbone blocks frozen	0 (full fine-tune)
Dropout	0.1
Label smoothing	0.1
Weight decay	0.05
Optimiser	AdamW
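
The learning-rate settings above interact: each parameter group gets the base LR scaled by a layer-wise factor and by the schedule. The sketch below shows that arithmetic under several assumptions: the 14 groups are read as embeddings + 12 transformer blocks + heads, the schedule is stepped per epoch, the cosine decays to zero, and `TOTAL_EPOCHS` is a hypothetical value because the total is not stated here. The actual training code may differ on any of these.

```python
import math

BASE_LR = 5e-5      # from the table: LR applied to the heads
LAYER_DECAY = 0.75  # from the table: decay per block
NUM_BLOCKS = 12     # VideoMAE-Base has 12 transformer blocks
WARMUP_EPOCHS = 5   # from the table


def layer_lr_scales(num_blocks=NUM_BLOCKS, decay=LAYER_DECAY):
    """LR multipliers ordered [embeddings, block 0 .. block N-1, heads].

    Assumed grouping: num_blocks + 2 groups, with the heads at scale 1.0
    and each earlier group multiplied by `decay` once more.
    """
    num_groups = num_blocks + 2
    return [decay ** (num_groups - 1 - g) for g in range(num_groups)]


def schedule_factor(epoch, total_epochs, warmup_epochs=WARMUP_EPOCHS):
    """Linear warmup followed by cosine decay to zero (per-epoch, assumed)."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    TOTAL_EPOCHS = 30  # hypothetical; not stated in the model card
    scales = layer_lr_scales()
    for epoch in (0, 4, 5, 29):
        factor = schedule_factor(epoch, TOTAL_EPOCHS)
        head_lr = BASE_LR * scales[-1] * factor
        embed_lr = BASE_LR * scales[0] * factor
        print(f"epoch {epoch:2d}: head lr={head_lr:.2e}, embedding lr={embed_lr:.2e}")
```

With a decay of 0.75 the earliest group trains at roughly 2% of the head LR (0.75 to the power 13), which keeps the pretrained low-level features close to their SSv2 values while the heads adapt quickly.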

Quick Start

# 1. Clone the pipeline repo
git clone https://github.com/cstokl3/bimanual_manip_data_processing
cd bimanual_manip_data_processing

# 2. Download weights into action_recognition/
wget https://huggingface.co/cstokl3/videomae-hdepic-verb-recognition/resolve/main/best.pt \
     -O action_recognition/videomae_best.pt
wget https://huggingface.co/cstokl3/videomae-hdepic-verb-recognition/resolve/main/yolo_epic_kitchens.pt \
     -O action_recognition/yolo_epic_kitchens.pt

# 3. Run on your clips
python action_recognition/full_pipeline.py \
    --backbone videomae \
    --ckpt action_recognition/videomae_best.pt \
    --clips_dir egocentric_clips \
    --json results.json

Python Usage

import torch

# build_videomae adds the verb / hand / noun heads (the VideoMAEMultiHead
# wrapper). It lives in the pipeline repo, so `models` must be importable.
from models.videomae_model import build_videomae

# Load checkpoint. weights_only=False unpickles arbitrary objects, so only
# load checkpoints from a source you trust.
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)

# Rebuild the model and restore the fine-tuned weights
model = build_videomae(
    n_verb_classes=ckpt["n_action_classes"],   # 30
    n_noun_classes=ckpt["n_noun_classes"],     # 50
    pretrained="MCG-NJU/videomae-base-finetuned-ssv2",
)
model.load_state_dict(ckpt["model"])
model.eval()

verb_names = ckpt["verb_names"]   # list of 30 strings
noun_names = ckpt["noun_names"]   # list of 50 strings

# Inference. frames: (1, 16, 3, 224, 224) float32, ImageNet-normalised.
# The random tensor is a stand-in; replace it with a real preprocessed clip.
frames = torch.randn(1, 16, 3, 224, 224)
with torch.no_grad():
    verb_logits, hand_logits, noun_logits = model(frames)

verb = verb_names[verb_logits.argmax(1).item()]
noun = noun_names[noun_logits.argmax(1).item()]
print("Predicted verb:", verb)
print("Predicted noun:", noun)
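
With a verb top-1 around 41%, the single argmax label is often wrong, so it can help to inspect the top few candidates with their probabilities. The sketch below is pure Python so it runs without the checkpoint. `softmax` and `top_k` are illustrative helpers, not part of the pipeline repo, and the toy logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy logits for three of the verb classes.
    toy_names = ["take", "put", "wash"]
    toy_logits = [2.0, 0.5, 1.0]
    for name, prob in top_k(toy_logits, toy_names, k=2):
        print(f"{name}: {prob:.3f}")
```

With the real model, one row of logits can be passed as `top_k(verb_logits[0].tolist(), verb_names)`.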

Citation

If you use this model, please cite the HD-EPIC dataset and VideoMAE:

@inproceedings{tong2022videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  booktitle={NeurIPS},
  year={2022}
}
