HD-EPIC Action Recognition

Egocentric kitchen action recognition from short video clips. Given a first-person clip, predicts:

Verb: the action being performed (30 classes)
Hand side: which hand(s) are active (left / right / both)
Object: what each hand is interacting with (90 EPIC-KITCHENS object classes)

Model

Three VideoMAE checkpoints are fine-tuned on HD-EPIC (9 participants, 2024 HD kitchen recordings). At inference, logits are averaged across all three models and three temporal sampling windows (uniform / middle-third / final-half), which gives nine forward passes per clip. Spatial test-time augmentation (TTA) is not used because it hurts accuracy on egocentric video.
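
The averaging step can be sketched roughly as below. This is an illustration, not the repository's code: the names `sample_indices` and `ensemble_logits`, the 16-frame input length, and the exact window boundaries (middle third = frames between 1/3 and 2/3 of the clip, final half = frames from the midpoint onward) are all assumptions. Each model is represented by a stand-in callable that maps frame indices to a logit vector.

```python
import numpy as np

NUM_FRAMES = 16  # assumed number of frames fed to each model


def sample_indices(total_frames, strategy, num_frames=NUM_FRAMES):
    """Return `num_frames` frame indices spread evenly over one temporal window."""
    if strategy == "uniform":
        start, end = 0, total_frames
    elif strategy == "middle_third":
        start, end = total_frames // 3, (2 * total_frames) // 3
    elif strategy == "final_half":
        start, end = total_frames // 2, total_frames
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    end = max(end, start + 1)  # guard against empty windows on very short clips
    return np.linspace(start, end - 1, num_frames).round().astype(int)


def ensemble_logits(models, total_frames,
                    strategies=("uniform", "middle_third", "final_half")):
    """Average the logits over every (model, window) pair."""
    all_logits = [
        model(sample_indices(total_frames, strategy))
        for model in models
        for strategy in strategies
    ]
    return np.mean(all_logits, axis=0)
```

With three models and three windows, `ensemble_logits` averages nine logit vectors; the predicted verb would then be the argmax of the result.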

Checkpoint	Backbone	Pretrain	Val MCA
`checkpoints/hdepic_videomae_v2/best.pt`	VideoMAE-Base	Kinetics-400	45.5%
`checkpoints/hdepic_videomae_ssv2/best.pt`	VideoMAE-Base	Something-Something-v2	44.9%
`checkpoints/hdepic_videomae_large_kinetics/best.pt`	VideoMAE-Large	Kinetics-400	~43%

MCA = mean class accuracy (the unweighted average of per-class accuracy) over the 30 verb classes on held-out HD-EPIC clips.
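
For reference, mean class accuracy can be computed along these lines. The function name `mean_class_accuracy` is hypothetical, and skipping classes that have no validation samples is an assumption about how the metric is handled here, not something the model card states.

```python
import numpy as np


def mean_class_accuracy(y_true, y_pred, num_classes=30):
    """Average per-class accuracy over the classes that appear in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():  # assumption: classes with no samples are skipped
            per_class.append((y_pred[mask] == c).mean())
    return float(np.mean(per_class))


# Three correct clips of class 0 and one wrong clip of class 1:
# plain accuracy is 0.75, but mean class accuracy is (1.0 + 0.0) / 2.
print(mean_class_accuracy([0, 0, 0, 1], [0, 0, 0, 2]))  # 0.5
```

Unlike plain accuracy, this metric gives rare verbs the same weight as frequent ones, so a model that only gets the common classes right scores lower.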

Object detection uses a YOLOv11l-seg model fine-tuned on EPIC-KITCHENS (90 object classes).
Hand side is detected with MediaPipe.

Usage

git clone https://github.com/cstokl3/bimanual_manip_data_processing
cd bimanual_manip_data_processing/action_recognition

pip install ultralytics mediapipe torch torchvision transformers opencv-python huggingface_hub

# Download all weights
huggingface-cli download cstokl3/hdepic-action-recognition \
    --local-dir . --local-dir-use-symlinks False

# Run ensemble pipeline on a directory of clips
python full_pipeline.py \
    --backbone videomae \
    --hdepic_ensemble \
    --clips_dir /path/to/clips \
    --json results.json

Output per clip:

{
  "clip_1.mp4": {
    "verb": "reach",
    "hand": "right",
    "left_object": "none",
    "right_object": "sauce"
  }
}
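
A small sketch of how this output might be consumed downstream. The helper name `summarize` is hypothetical, and treating `"none"` as the marker for an empty hand is an assumption based on the example above.

```python
import json
from collections import Counter


def summarize(results):
    """Count predicted verbs and list the objects seen in either hand."""
    verbs = Counter(entry["verb"] for entry in results.values())
    objects = sorted(
        {entry[key] for entry in results.values()
         for key in ("left_object", "right_object")} - {"none"}
    )
    return verbs, objects


# Inline copy of the example output; for a real run, load results.json instead.
example = json.loads("""
{
  "clip_1.mp4": {
    "verb": "reach",
    "hand": "right",
    "left_object": "none",
    "right_object": "sauce"
  }
}
""")
verbs, objects = summarize(example)
print(verbs.most_common(), objects)  # [('reach', 1)] ['sauce']
```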

Verb Classes (30)

take put wash open close turn-on cut turn-off pour mix move remove throw dry shake scoop adjust squeeze press flip turn check scrub pull pat lift hold drop transition reach

Performance on Lab Clips

Six held-out egocentric clips from a separate lab kitchen setup (unseen during training):

Clip	Ground truth	Predicted
clip_1	reach	reach ✓
clip_2	lift	lift ✓
clip_3	pour	pour ✓
clip_4	put	put ✓
clip_5	shake / pour	shake ✓
clip_6	put / place	press ✗

The 3-model × 3-strategy ensemble predicts 5 of the 6 clips correctly.
