VideoMAE-Base: HD-EPIC Verb & Noun Recognition
Pipeline code + example clips: cstokl3/bimanual_manip_data_processing. Clone that repo and follow
action_recognition/README.md to run the full pipeline.
This is VideoMAE-Base (SSv2 pretrained) fine-tuned on HD-EPIC, an egocentric kitchen activity dataset collected in high definition using a head-mounted GoPro camera.
The model predicts three things simultaneously from a 16-frame video clip:
- Verb (30 classes): the action being performed
- Noun (50 classes): the object being interacted with
- Hand side (3 classes): left / right / both
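As a rough illustration of the 16-frame, ImageNet-normalised input, the sketch below picks 16 evenly spaced frame indices from a longer clip and normalises a single RGB pixel. The even-spacing strategy and both helper names (`sample_frame_indices`, `normalise_pixel`) are assumptions made for illustration; the pipeline repo may sample and preprocess frames differently. The mean and standard deviation are the standard ImageNet values.

```python
# Illustrative only: the pipeline repo may sample and normalise frames differently.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def sample_frame_indices(n_frames_in_clip, n_samples=16):
    """Pick n_samples frame indices spread evenly across the clip (hypothetical helper)."""
    if n_frames_in_clip <= 0:
        raise ValueError("clip must contain at least one frame")
    # Take the centre of each of n_samples equal-length segments.
    return [
        min(n_frames_in_clip - 1, int((i + 0.5) * n_frames_in_clip / n_samples))
        for i in range(n_samples)
    ]


def normalise_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then apply ImageNet mean/std."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


indices = sample_frame_indices(64)
print(indices)
print(normalise_pixel((255, 255, 255)))
```

Clips shorter than 16 frames simply repeat some indices under this scheme, which keeps the output length fixed at 16.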
Performance
| Metric | Value |
|---|---|
| Val Verb top-1 | 41.4% |
| Val Verb MCA | 41.5% |
| Epoch | 5 |
| Training set | 16,852 clips (≤800/class) |
| Val set | 2,973 clips |
Training is ongoing; this checkpoint reflects the best result so far.
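To show how the two verb metrics can diverge, here is a minimal sketch that assumes MCA stands for mean class accuracy, i.e. per-class accuracy averaged over classes. The function names and the toy labels are made up for illustration and are not taken from the pipeline repo.

```python
from collections import defaultdict


def top1_accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label matches the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy labels (made up): "take" dominates the set.
y_true = ["take"] * 8 + ["pour"] * 2
y_pred = ["take"] * 8 + ["take", "pour"]
print(top1_accuracy(y_true, y_pred))        # 0.9
print(mean_class_accuracy(y_true, y_pred))  # 0.75
```

On an imbalanced set, top-1 is dominated by frequent classes while the class-averaged figure penalises errors on rare ones; the two values in the table being close suggests errors are spread fairly evenly across verbs.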
Verb Classes (30)
take, put, wash, open, close, turn-on, cut, turn-off, pour, mix,
move, remove, throw, dry, shake, scoop, adjust, squeeze, press,
flip, turn, check, scrub, pull, pat, lift, hold, drop,
transition, reach
Training Details
| Setting | Value |
|---|---|
| Base model | MCG-NJU/videomae-base-finetuned-ssv2 |
| Input | 16 × 224×224, ImageNet-normalised, (B, T, C, H, W) |
| Batch size | 8 |
| Base LR | 5e-5 (head) |
| LR schedule | 5-epoch linear warmup → cosine decay |
| Layer-wise LR decay | 0.75 per block (14 param groups) |
| Backbone blocks frozen | 0 (full fine-tune) |
| Dropout | 0.1 |
| Label smoothing | 0.1 |
| Weight decay | 0.05 |
| Optimiser | AdamW |
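The learning-rate settings in the table can be sketched as follows. Base LR, decay factor, group count and warmup length come from the table; the total epoch count (`TOTAL_EPOCHS`) is hypothetical because the card does not state it, and the group layout (index 0 for the earliest layers, the last index for the head) is an assumption about how the 14 groups are ordered.

```python
import math

BASE_LR = 5e-5       # head learning rate, from the table
LAYER_DECAY = 0.75   # per-block decay, from the table
N_GROUPS = 14        # from the table; the group ordering below is an assumption
WARMUP_EPOCHS = 5    # from the table
TOTAL_EPOCHS = 30    # hypothetical; the card does not state the total


def group_lr_scales(n_groups=N_GROUPS, decay=LAYER_DECAY):
    """LR scale per parameter group: index 0 = earliest layers, last = head."""
    return [decay ** (n_groups - 1 - i) for i in range(n_groups)]


def schedule_factor(epoch, warmup=WARMUP_EPOCHS, total=TOTAL_EPOCHS):
    """Multiplier applied to every group's LR at a (possibly fractional) epoch."""
    if epoch < warmup:
        return epoch / warmup  # linear warmup from 0 to 1
    progress = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay to 0


scales = group_lr_scales()
for epoch in (0, 2.5, 5, 17.5, 30):
    factor = schedule_factor(epoch)
    print(
        f"epoch {epoch:>4}: head lr={BASE_LR * scales[-1] * factor:.2e}, "
        f"first-group lr={BASE_LR * scales[0] * factor:.2e}"
    )
```

Under these assumptions the head trains at the full 5e-5 while the earliest group is scaled by 0.75^13, roughly 2% of the head rate, which limits how far the pretrained low-level features drift.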
Quick Start
# 1. Clone the pipeline repo
git clone https://github.com/cstokl3/bimanual_manip_data_processing
cd bimanual_manip_data_processing
# 2. Download weights into action_recognition/
wget https://huggingface.co/cstokl3/videomae-hdepic-verb-recognition/resolve/main/best.pt \
    -O action_recognition/videomae_best.pt
wget https://huggingface.co/cstokl3/videomae-hdepic-verb-recognition/resolve/main/yolo_epic_kitchens.pt \
    -O action_recognition/yolo_epic_kitchens.pt
# 3. Run on your clips
python action_recognition/full_pipeline.py \
    --backbone videomae \
    --ckpt action_recognition/videomae_best.pt \
    --clips_dir egocentric_clips \
    --json results.json
Python Usage
import torch
from transformers import VideoMAEForVideoClassification
# Load checkpoint. weights_only=False unpickles arbitrary Python objects,
# so only load checkpoint files you trust.
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
# Rebuild model (requires the VideoMAEMultiHead wrapper; see pipeline repo)
from models.videomae_model import build_videomae
model = build_videomae(
    n_verb_classes=ckpt["n_action_classes"],  # 30
    n_noun_classes=ckpt["n_noun_classes"],    # 50
    pretrained="MCG-NJU/videomae-base-finetuned-ssv2",
)
model.load_state_dict(ckpt["model"])
model.eval()
verb_names = ckpt["verb_names"]  # list of 30 strings
noun_names = ckpt["noun_names"]  # list of 50 strings
# Inference. frames must be a float32 tensor of shape (1, 16, 3, 224, 224),
# ImageNet-normalised; prepare it from your own clip before this step.
with torch.no_grad():
    verb_logits, hand_logits, noun_logits = model(frames)
verb = verb_names[verb_logits.argmax(1).item()]
print("Predicted verb:", verb)
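The snippet above only reads the top verb. A small sketch of decoding ranked predictions with probabilities from any of the three heads follows. It works on plain Python lists (for example `hand_logits[0].tolist()`), the `top_k` and `softmax` helpers are illustrative rather than part of the pipeline repo, and the index order in `HAND_NAMES` is an assumption that should be checked against the repo.

```python
import math

HAND_NAMES = ["left", "right", "both"]  # assumed index order; check the pipeline repo


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(names[i], probs[i]) for i in ranked[:k]]


# Toy logits standing in for hand_logits[0].tolist()
hand_predictions = top_k([0.2, 2.1, -0.5], HAND_NAMES, k=2)
print(hand_predictions)
```

The same call works for the other heads, e.g. `top_k(verb_logits[0].tolist(), verb_names, k=5)`.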
Citation
If you use this model, please cite the HD-EPIC dataset and VideoMAE:
@inproceedings{tong2022videomae,
title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
booktitle={NeurIPS},
year={2022}
}