HD-EPIC Action Recognition

Egocentric kitchen action recognition from short video clips. Given a first-person clip, the pipeline predicts:

  • Verb: what action is being performed (30 classes)
  • Hand side: which hand(s) are active (left / right / both)
  • Object: what each hand is interacting with (90 EPIC-KITCHENS classes)

Model

Three VideoMAE checkpoints are fine-tuned on HD-EPIC (9 participants, 2024 HD kitchen recordings). At inference, logits are averaged across all three models and three temporal sampling windows (uniform / middle-third / final-half). No spatial test-time augmentation (TTA) is applied, because it hurts egocentric accuracy.
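The averaging step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the names `window_indices` and `ensemble_logits`, the exact window boundaries, and the 16-frame sample count are assumptions, and each model is treated as a hypothetical callable that maps a stack of frames to a logit vector.

```python
import numpy as np

def window_indices(num_frames: int, num_samples: int, strategy: str) -> np.ndarray:
    """Return `num_samples` frame indices spread evenly over the chosen window."""
    if strategy == "uniform":
        start, end = 0, num_frames
    elif strategy == "middle_third":
        start, end = num_frames // 3, (2 * num_frames) // 3
    elif strategy == "final_half":
        start, end = num_frames // 2, num_frames
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    end = max(end, start + 1)  # guard against an empty window on very short clips
    return np.linspace(start, end - 1, num_samples).round().astype(int)

def ensemble_logits(models, frames: np.ndarray, num_samples: int = 16) -> np.ndarray:
    """Average logits over every (model, temporal window) pair."""
    strategies = ("uniform", "middle_third", "final_half")
    all_logits = []
    for model in models:  # hypothetical callables: frames -> logits
        for strategy in strategies:
            idx = window_indices(len(frames), num_samples, strategy)
            all_logits.append(model(frames[idx]))
    return np.mean(all_logits, axis=0)
```

With three models and three windows, nine logit vectors are averaged per clip, and the predicted verb would be the argmax of the result.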

| Checkpoint | Backbone | Pretraining | Val MCA |
|---|---|---|---|
| checkpoints/hdepic_videomae_v2/best.pt | VideoMAE-Base | Kinetics-400 | 45.5% |
| checkpoints/hdepic_videomae_ssv2/best.pt | VideoMAE-Base | Something-Something-v2 | 44.9% |
| checkpoints/hdepic_videomae_large_kinetics/best.pt | VideoMAE-Large | Kinetics-400 | ~43% |

MCA = Mean Class Accuracy over 30 verb classes on held-out HD-EPIC clips.
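As a rough illustration of the metric, mean class accuracy averages the per-class recall so that rare verbs count as much as frequent ones. The function name `mean_class_accuracy` is hypothetical, and the sketch assumes that only classes present in the ground truth are averaged; the evaluation script may handle absent classes differently.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred) -> float:
    """Average of per-class accuracies over the classes that occur in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for c in np.unique(y_true):
        mask = y_true == c  # all samples whose true label is class c
        per_class.append(float(np.mean(y_pred[mask] == c)))
    return float(np.mean(per_class))

# Three samples of class 0 (all correct) and one of class 1 (wrong):
# plain accuracy is 0.75, but MCA is (1.0 + 0.0) / 2 = 0.5.
print(mean_class_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))
```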

Object detection uses a YOLOv11l-seg model fine-tuned on EPIC-KITCHENS (90 object classes).
Hand side is detected with MediaPipe.

Usage

git clone https://github.com/cstokl3/bimanual_manip_data_processing
cd bimanual_manip_data_processing/action_recognition

pip install ultralytics mediapipe torch torchvision transformers opencv-python huggingface_hub

# Download all weights
huggingface-cli download cstokl3/hdepic-action-recognition \
    --local-dir . --local-dir-use-symlinks False

# Run ensemble pipeline on a directory of clips
python full_pipeline.py \
    --backbone videomae \
    --hdepic_ensemble \
    --clips_dir /path/to/clips \
    --json results.json

Output per clip:

{
  "clip_1.mp4": {
    "verb": "reach",
    "hand": "right",
    "left_object": "none",
    "right_object": "sauce"
  }
}
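A small sketch of how this output could be consumed downstream. The helper `summarize_results` is hypothetical, and the code assumes that every entry has the four keys shown above and that the string "none" marks a hand holding no object, as in the sample.

```python
import json
from collections import Counter

def summarize_results(results: dict) -> dict:
    """Count predicted verbs, hand sides, and non-empty objects across clips."""
    verbs = Counter(entry["verb"] for entry in results.values())
    hands = Counter(entry["hand"] for entry in results.values())
    objects = Counter()
    for entry in results.values():
        for key in ("left_object", "right_object"):
            if entry[key] != "none":  # assumption: "none" means no object
                objects[entry[key]] += 1
    return {"verbs": dict(verbs), "hands": dict(hands), "objects": dict(objects)}

# Inline copy of the sample output above; for a real run, parse results.json instead.
sample = json.loads("""
{
  "clip_1.mp4": {
    "verb": "reach",
    "hand": "right",
    "left_object": "none",
    "right_object": "sauce"
  }
}
""")
summary = summarize_results(sample)
print(summary)
```

To summarize an actual run, load the file written by `--json` with `json.load` and pass the resulting dictionary to the same function.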

Verb Classes (30)

take put wash open close turn-on cut turn-off pour mix move remove throw dry shake scoop adjust squeeze press flip turn check scrub pull pat lift hold drop transition reach

Performance on Lab Clips

6 held-out egocentric clips from a separate lab kitchen setup (unseen during training):

| Clip | Ground truth | Predicted | Correct |
|---|---|---|---|
| clip_1 | reach | reach | ✓ |
| clip_2 | lift | lift | ✓ |
| clip_3 | pour | pour | ✓ |
| clip_4 | put | put | ✓ |
| clip_5 | shake / pour | shake | ✓ |
| clip_6 | put / place | press | ✗ |

5/6 correct with the 3-model × 3-strategy ensemble.
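The tally above counts a prediction as correct when it matches any of the listed ground-truth labels, which is why clip_5 passes. A minimal sketch of that scoring rule follows, using the rows from the table; the helper `is_correct` and the "/"-splitting convention are assumptions made for illustration.

```python
# (clip, ground truth, predicted) rows copied from the table above.
lab_clips = [
    ("clip_1", "reach", "reach"),
    ("clip_2", "lift", "lift"),
    ("clip_3", "pour", "pour"),
    ("clip_4", "put", "put"),
    ("clip_5", "shake / pour", "shake"),
    ("clip_6", "put / place", "press"),
]

def is_correct(ground_truth: str, predicted: str) -> bool:
    """A prediction counts if it equals any '/'-separated ground-truth label."""
    accepted = {label.strip() for label in ground_truth.split("/")}
    return predicted in accepted

correct = sum(is_correct(gt, pred) for _, gt, pred in lab_clips)
print(f"{correct}/{len(lab_clips)} correct")  # prints "5/6 correct"
```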
