HOPformer and its components are released for non-commercial research use only (CC BY-NC 4.0). The hand meshes/features derive from MANO and WiLoR (CC BY-NC-ND), and the models build on ARCTIC and EPIC-Kitchens — each under its own non-commercial terms. By requesting access you agree to use these models for non-commercial research only and to comply with the MANO, WiLoR, ARCTIC and EPIC-Kitchens licenses.
HOPformer — pretrained models
Pretrained checkpoints for HOPformer, a transformer for 3D hand–object pose estimation from a single RGB image (MANO hand meshes + articulated/rigid object pose). Accompanies "Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation" (Bansal et al., ECCV 2026).
- Code + setup: https://github.com/Sid2697/HOPformer
- Project page: https://sid2697.github.io/epic-contact
- Paper: Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation — Siddhant Bansal, Zhifan Zhu, Shashank Tripathi, Jiahe Zhao, Michael J. Black, Dima Damen (University of Bristol; Max Planck Institute for Intelligent Systems). ECCV 2026.
HOPformer tackles 3D hand–object pose estimation in unconstrained egocentric video by conditioning object pose on hand priors, predicting both hands and the object in a single forward pass. It is trained on ARCTIC and on EPIC-Contact — a new in-the-wild dataset (~2.3K clips, 62.3K frames) with dense 3D hand–object contact annotations. HOPformer reaches 82.4% success rate on ARCTIC (+6.2 pts over prior SOTA) and 20.7 mm contact deviation on EPIC-Contact.
Training is a chain: p1 (exocentric) → p2 (ARCTIC egocentric, from p1) → EPIC (egocentric, from p2). Each checkpoint is ready for inference or fine-tuning (optimizer state removed); load one with --load_ckpt <file>.
| File | Stage | --setup | --dataset | Paper |
|---|---|---|---|---|
| p1_exo_epoch22.ckpt | exocentric pre-training | p1 | arctic | Supp. (ARCTIC exocentric) |
| p2_arctic_ego_epoch8.ckpt | egocentric fine-tune (from p1) | p2 | arctic | Table 1 (ARCTIC egocentric) |
| epic_epoch125.ckpt | egocentric fine-tune (from p2) | p2 | epic | Table 2 (EPIC-Contact) |
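The checkpoint-to-flag mapping in the table above can be kept in a small lookup so that a checkpoint is never paired with the wrong --setup or --dataset value. The sketch below only builds the flag list. The helper name `cli_flags` and the `release_models` default directory are illustrative assumptions, and the entry-point script is deliberately left out because this card does not name it.

```python
# Flags per released checkpoint, copied from the table above.
CHECKPOINTS = {
    "p1_exo_epoch22.ckpt": {"setup": "p1", "dataset": "arctic"},
    "p2_arctic_ego_epoch8.ckpt": {"setup": "p2", "dataset": "arctic"},
    "epic_epoch125.ckpt": {"setup": "p2", "dataset": "epic"},
}


def cli_flags(ckpt_name, ckpt_dir="release_models"):
    """Return the --setup/--dataset/--load_ckpt flags for one checkpoint.

    `cli_flags` is a hypothetical helper, not part of the HOPformer code.
    Raises KeyError if the checkpoint name is not in the table.
    """
    entry = CHECKPOINTS[ckpt_name]
    return [
        "--setup", entry["setup"],
        "--dataset", entry["dataset"],
        "--load_ckpt", f"{ckpt_dir}/{ckpt_name}",
    ]
```

Note that the EPIC checkpoint uses --setup p2 together with --dataset epic, which is easy to get wrong when typing the flags by hand.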
Architecture. DINOv2 ViT-G scene backbone + WiLoR hand features, 12-layer cross-attention decoder, 6D rotation. Learning rate: linear warm-up from 1e-7 to the peak over the first 5% of steps, then cosine decay back to 1e-7 (peak 5e-5 for p1, 3e-5 for p2 and EPIC). Each released checkpoint is the best-validation epoch of its stage (early stopping).
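The learning-rate schedule described above can be sketched as a pure function of the step index. This is a minimal illustration, not the repository's scheduler. It assumes the schedule is evaluated per optimizer step and uses the standard half-cosine form. The function name and its parameters are hypothetical.

```python
import math


def learning_rate(step, total_steps, peak_lr, floor_lr=1e-7, warmup_frac=0.05):
    """Linear warm-up from floor_lr to peak_lr, then cosine decay to floor_lr.

    Assumption: per-step evaluation with a half-cosine decay; the actual
    training code may differ in detail.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp over the first warmup_frac of training.
        return floor_lr + (peak_lr - floor_lr) * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half cosine: 1 at the peak, 0 at the end of training.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor_lr + (peak_lr - floor_lr) * cosine
```

With `peak_lr=5e-5` this follows the p1 setting, and with `peak_lr=3e-5` the p2 and EPIC setting.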
Validated metrics (val split)
| Model | MPJPE ↓ | CDev ↓ | MDev ↓ | SR@0.05 ↑ | Cls ↑ | Notes |
|---|---|---|---|---|---|---|
| p1 (ARCTIC exo) | 12.7 | 24.5 | 5.4 | 90.3 | 99.7 | AAE 4.8 |
| p2 (ARCTIC ego) | 16.1 | 31.9 | 7.3 | 82.4 | 99.5 | AAE 5.0 |
| EPIC | 19.9 | 20.7 | 11.4 | 29.8 | 52.9 | symmetric CDev/MDev; ACC h/o 2.5/4.1 |
CDev/MDev/MPJPE in mm; SR@0.05 and Cls in %. For EPIC (symmetric objects) CDev/MDev/ACC use the symmetry-aware variants and SR@0.05 is ADD-S–based. See the repository README ("Reproduce the paper numbers") for the eval commands and metric → JSON-key mapping.
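For readers unfamiliar with the hand metric, MPJPE is commonly defined as the mean Euclidean distance between predicted and ground-truth joints. The sketch below shows only that textbook definition. It applies no alignment step, the function name is hypothetical, and the repository's evaluation code remains the reference for the reported numbers.

```python
import math


def mpjpe_mm(pred_joints, gt_joints):
    """Mean per-joint position error for one hand, in the units of the inputs.

    Both arguments are equal-length sequences of (x, y, z) joint positions,
    given in mm here. Assumption: no root or Procrustes alignment is applied.
    """
    if not pred_joints or len(pred_joints) != len(gt_joints):
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)
```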
Repository contents
- *.ckpt: the three checkpoints above (~7.3 GB each).
- predictions/{p2_arctic,epic}_eval.tar: per-sample prediction dumps on the validation split (the outputs extract_predicts produced). Evaluate these directly to reproduce Table 1 / Table 2 without re-running the model. (The p1 dump is too large to host; reproduce p1 by re-running extraction, as described in the repository README.)
- results/*agg_metrics.json: the reference metric values for each model.
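Once the results files listed above are downloaded, the reference values can be loaded for comparison with a local evaluation run. The sketch below assumes only that each file is valid JSON. It makes no assumption about the key names, which are documented in the repository README, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def load_reference_metrics(results_dir="results"):
    """Map each *agg_metrics.json file name to its parsed JSON content."""
    metrics = {}
    for path in sorted(Path(results_dir).glob("*agg_metrics.json")):
        with path.open() as fh:
            metrics[path.name] = json.load(fh)
    return metrics
```

Calling `load_reference_metrics()` from the directory that contains `results/` returns one entry per model file; files that do not end in `agg_metrics.json` are ignored.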
Usage
Download selectively — the checkpoints and the prediction tars are in the same repo:
# just the checkpoints (~22 GB):
hf download Sid2697/HOPformer --include "*.ckpt" --local-dir release_models
# (optional) the eval prediction dumps for reproduction (~43 GB):
hf download Sid2697/HOPformer --include "predictions/*" "results/*" --local-dir .
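The same selective download can be scripted from Python with `huggingface_hub.snapshot_download`, which accepts the include patterns through `allow_patterns`. This is a sketch under assumptions: the helper names are hypothetical, the `matches` preview uses `fnmatch` to approximate how the patterns are applied, and the download presumably requires being logged in with an account that has accepted the access conditions.

```python
from fnmatch import fnmatch

REPO_ID = "Sid2697/HOPformer"
CHECKPOINT_PATTERNS = ["*.ckpt"]                        # ~22 GB
REPRODUCTION_PATTERNS = ["predictions/*", "results/*"]  # ~43 GB


def matches(filename, patterns):
    """Preview whether a repo file would be selected by the given patterns."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


def download(patterns, local_dir):
    """Download only the files matching `patterns` into `local_dir`."""
    # Imported lazily so the preview helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=patterns,
        local_dir=local_dir,
    )
```

For example, `download(CHECKPOINT_PATTERNS, "release_models")` mirrors the first command above, and `download(REPRODUCTION_PATTERNS, ".")` mirrors the second.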
Then follow the HOPformer repository README to set up the environment (including the smplx 21-joint patch, the MANO model, and WiLoR weights/config, which are not included here), and pass a checkpoint when running, e.g. --load_ckpt release_models/p2_arctic_ego_epoch8.ckpt. When running extraction with the egocentric ARCTIC model (p2), use --precision 32, because FP16 can overflow the hand head; p1 and EPIC are FP16-stable.
License
Released under CC BY-NC 4.0 (non-commercial). The hand model/parameters derive from MANO; hand features from WiLoR (CC BY-NC-ND); the models build on ARCTIC and EPIC-Kitchens — each retains its own non-commercial terms. WiLoR weights and the MANO model are not distributed here and must be obtained from their sources.
Citation
@inproceedings{bansal2026hopformer,
  title     = {Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation},
  author    = {Bansal, Siddhant and Zhu, Zhifan and Tripathi, Shashank and
               Zhao, Jiahe and Black, Michael J. and Damen, Dima},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
