HD-EPIC Action Recognition
Egocentric kitchen action recognition from short video clips. Given a first-person clip, predicts:
- Verb: what action is being performed (30 classes)
- Hand side: which hand(s) are active (left / right / both)
- Object: what each hand is interacting with (90 EPIC-KITCHENS classes)
Model
Three VideoMAE checkpoints are fine-tuned on HD-EPIC (9 participants, 2024 HD kitchen recordings). At inference, logits are averaged across all three models and three temporal sampling windows (uniform / middle-third / final-half), giving nine forward passes per clip. Spatial test-time augmentation (TTA) is not used because it hurts egocentric accuracy.
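The temporal sampling and logit averaging can be sketched as follows. This is a minimal illustration, not the code in `full_pipeline.py`: the function names `sample_indices` and `ensemble_logits`, the exact window boundaries, and the 16-frame clip length are assumptions made for the example.

```python
import numpy as np

# Hypothetical strategy names; the pipeline's own identifiers may differ.
STRATEGIES = ("uniform", "middle_third", "final_half")


def sample_indices(num_frames: int, clip_len: int, strategy: str) -> np.ndarray:
    """Pick clip_len evenly spaced frame indices from one temporal window."""
    if strategy == "uniform":
        start, end = 0, num_frames
    elif strategy == "middle_third":
        start, end = num_frames // 3, (2 * num_frames) // 3
    elif strategy == "final_half":
        start, end = num_frames // 2, num_frames
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    end = max(end, start + 1)  # guard against empty windows on very short clips
    return np.linspace(start, end - 1, clip_len).round().astype(int)


def ensemble_logits(logits_per_pass: list) -> np.ndarray:
    """Average verb logits over every (model, window) forward pass."""
    return np.mean(np.stack(logits_per_pass, axis=0), axis=0)


# Example: frame indices for a 90-frame clip, assuming 16 frames per pass.
windows = {s: sample_indices(90, 16, s) for s in STRATEGIES}
```

With three models and three windows, `ensemble_logits` would receive nine logit vectors of length 30, and the predicted verb is the argmax of the averaged vector.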
| Checkpoint | Backbone | Pretrain | Val MCA |
|---|---|---|---|
| checkpoints/hdepic_videomae_v2/best.pt | VideoMAE-Base | Kinetics-400 | 45.5% |
| checkpoints/hdepic_videomae_ssv2/best.pt | VideoMAE-Base | Something-Something-v2 | 44.9% |
| checkpoints/hdepic_videomae_large_kinetics/best.pt | VideoMAE-Large | Kinetics-400 | ~43% |
MCA = Mean Class Accuracy over 30 verb classes on held-out HD-EPIC clips.
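For reference, mean class accuracy can be computed as in the sketch below. It averages per-class recall so that rare verbs count as much as frequent ones. The function name `mean_class_accuracy` is hypothetical, and skipping classes that have no validation samples is an assumption; the repository's evaluation code may handle that case differently.

```python
import numpy as np


def mean_class_accuracy(y_true, y_pred, num_classes: int = 30) -> float:
    """Average of per-class accuracies over classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():  # assumption: classes with no samples are skipped
            per_class.append((y_pred[mask] == c).mean())
    return float(np.mean(per_class))


# Three clips of class 0 (all correct) and one clip of class 1 (wrong):
# plain accuracy is 0.75, but MCA is (1.0 + 0.0) / 2 = 0.5.
mca = mean_class_accuracy([0, 0, 0, 1], [0, 0, 0, 2])
```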
Object detection uses a YOLOv11l-seg model fine-tuned on EPIC-KITCHENS (90 object classes).
Hand side is detected with MediaPipe.
Usage
git clone https://github.com/cstokl3/bimanual_manip_data_processing
cd bimanual_manip_data_processing/action_recognition
pip install ultralytics mediapipe torch torchvision transformers opencv-python huggingface_hub
# Download all weights
huggingface-cli download cstokl3/hdepic-action-recognition \
--local-dir . --local-dir-use-symlinks False
# Run ensemble pipeline on a directory of clips
python full_pipeline.py \
--backbone videomae \
--hdepic_ensemble \
--clips_dir /path/to/clips \
--json results.json
Output per clip:
{
  "clip_1.mp4": {
    "verb": "reach",
    "hand": "right",
    "left_object": "none",
    "right_object": "sauce"
  }
}
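A short sketch of how the output might be consumed downstream, assuming the JSON layout shown above. The helper names `verb_counts` and `clips_with_object` are hypothetical and not part of the repository; the sample string stands in for the contents of `results.json`.

```python
import json
from collections import Counter

# Stand-in for open("results.json").read(), using the example entry above.
SAMPLE = """
{
  "clip_1.mp4": {
    "verb": "reach",
    "hand": "right",
    "left_object": "none",
    "right_object": "sauce"
  }
}
"""


def verb_counts(results: dict) -> Counter:
    """Count how often each verb was predicted across clips."""
    return Counter(entry["verb"] for entry in results.values())


def clips_with_object(results: dict, side: str) -> list:
    """List clips where the given hand ("left" or "right") holds an object."""
    key = f"{side}_object"
    return [name for name, entry in results.items() if entry[key] != "none"]


results = json.loads(SAMPLE)
counts = verb_counts(results)
right_clips = clips_with_object(results, "right")
left_clips = clips_with_object(results, "left")
```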
Verb Classes (30)
take put wash open close turn-on cut turn-off pour mix
move remove throw dry shake scoop adjust squeeze press
flip turn check scrub pull pat lift hold drop
transition reach
Performance on Lab Clips
Results on six held-out egocentric clips from a separate lab kitchen setup (unseen during training):
| Clip | Ground truth | Predicted |
|---|---|---|
| clip_1 | reach | reach ✓ |
| clip_2 | lift | lift ✓ |
| clip_3 | pour | pour ✓ |
| clip_4 | put | put ✓ |
| clip_5 | shake / pour | shake ✓ |
| clip_6 | put / place | press ✗ |
5/6 correct with the 3-model × 3-strategy ensemble.
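The score can be reproduced from the table with a few lines of Python. The sketch assumes that a ground-truth entry such as "shake / pour" means either label counts as correct, which matches the reported total; the variable names are illustrative only.

```python
# (accepted ground-truth verbs, predicted verb) for each lab clip in the table.
lab_clips = {
    "clip_1": ({"reach"}, "reach"),
    "clip_2": ({"lift"}, "lift"),
    "clip_3": ({"pour"}, "pour"),
    "clip_4": ({"put"}, "put"),
    "clip_5": ({"shake", "pour"}, "shake"),
    "clip_6": ({"put", "place"}, "press"),
}

# A prediction is correct if it matches any accepted label (assumption).
correct = sum(pred in accepted for accepted, pred in lab_clips.values())
accuracy = correct / len(lab_clips)
print(f"{correct}/{len(lab_clips)} correct ({accuracy:.1%})")  # 5/6 correct (83.3%)
```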