# rtmw-l-256x192

This is a Hugging Face-compatible port of rtmw-l-256x192 from OpenMMLab MMPose.

RTMW (Real-Time Multi-person Whole-body pose estimation) extends RTMPose to predict 133 whole-body keypoints covering the body, face, hands, and feet simultaneously. The model is trained on Cocktail14, a mixture of 14 public datasets, and evaluated on COCO-WholeBody v1.0 val.
## Model description
- Architecture: CSPNeXt backbone + CSPNeXtPAFPN neck + RTMWHead (SimCC with GAU)
- Keypoints: 133 (17 body + 6 feet + 68 face + 21 left hand + 21 right hand)
- Codec: SimCC with Gaussian label smoothing
- Uses custom code – load with `trust_remote_code=True`
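For intuition about the SimCC codec: instead of regressing 2-D heatmaps, the head classifies each keypoint's x and y into 1-D bins. A minimal decode sketch in plain Python (the 2.0 bins-per-pixel ratio is the usual RTMPose-family setting, and averaging the two axis maxima for the confidence is one common convention; neither is guaranteed to match this port's internals exactly):

```python
def decode_simcc(simcc_x, simcc_y, bins_per_pixel=2.0):
    """Decode per-keypoint SimCC bin scores into (x, y) coords and a confidence.

    simcc_x / simcc_y: one list of bin scores per keypoint.
    Coordinates come out in model-input pixel space.
    """
    coords, confs = [], []
    for row_x, row_y in zip(simcc_x, simcc_y):
        ix = max(range(len(row_x)), key=row_x.__getitem__)  # argmax over x-bins
        iy = max(range(len(row_y)), key=row_y.__getitem__)  # argmax over y-bins
        coords.append((ix / bins_per_pixel, iy / bins_per_pixel))
        confs.append(0.5 * (row_x[ix] + row_y[iy]))  # mean of the axis maxima
    return coords, confs
```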
## Performance on COCO-WholeBody v1.0 val

Detector: human AP = 56.4 on COCO val2017.
| Model | Input | Body AP | Body AR | Foot AP | Foot AR | Face AP | Face AR | Hand AP | Hand AR | Whole AP | Whole AR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rtmw-m-256x192 | 256×192 | 0.676 | 0.747 | 0.671 | 0.794 | 0.783 | 0.854 | 0.491 | 0.604 | 0.582 | 0.673 |
| rtmw-l-256x192 (this model) | 256×192 | 0.743 | 0.807 | 0.763 | 0.868 | 0.834 | 0.889 | 0.598 | 0.701 | 0.660 | 0.746 |
| rtmw-x-256x192 | 256×192 | 0.746 | 0.808 | 0.770 | 0.869 | 0.844 | 0.896 | 0.610 | 0.710 | 0.672 | 0.752 |
| rtmw-l-384x288 | 384×288 | 0.761 | 0.824 | 0.793 | 0.885 | 0.884 | 0.921 | 0.663 | 0.752 | 0.701 | 0.780 |
| rtmw-x-384x288 | 384×288 | 0.763 | 0.826 | 0.796 | 0.888 | 0.884 | 0.923 | 0.664 | 0.755 | 0.702 | 0.781 |
## Usage

### Single cropped person (model-space coordinates)
```python
from transformers import AutoConfig, AutoModel, AutoImageProcessor
from PIL import Image
import torch

config = AutoConfig.from_pretrained("akore/rtmw-l-256x192", trust_remote_code=True)
model = AutoModel.from_pretrained("akore/rtmw-l-256x192", trust_remote_code=True)
model.eval()
processor = AutoImageProcessor.from_pretrained("akore/rtmw-l-256x192")

# Supply a pre-cropped person patch (it will be resized to the model input resolution)
image = Image.open("person_crop.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # coordinate_mode="model" – raw 192×256 (or 288×384 for the larger variants) pixel coords
    outputs = model(**inputs, coordinate_mode="model")

# outputs.keypoints: (1, 133, 2) – [x, y] in model-input pixel space
# outputs.scores:    (1, 133)    – confidence in [0, 1]
print(outputs.keypoints.shape, outputs.scores.shape)
```
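The 133 outputs follow the COCO-WholeBody ordering given in the model description (body, then feet, face, left hand, right hand). A small helper to split them by part; the slice boundaries follow the 17 + 6 + 68 + 21 + 21 split stated above, and the dictionary keys are illustrative names, not part of the model API:

```python
# Index layout of the 133 COCO-WholeBody keypoints (17 body + 6 feet +
# 68 face + 21 left hand + 21 right hand).
KEYPOINT_SLICES = {
    "body": slice(0, 17),
    "feet": slice(17, 23),
    "face": slice(23, 91),
    "left_hand": slice(91, 112),
    "right_hand": slice(112, 133),
}

def split_keypoints(keypoints):
    """Split a (133, 2) keypoint sequence into per-part groups."""
    return {name: keypoints[s] for name, s in KEYPOINT_SLICES.items()}
```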
### Coordinate modes

The `coordinate_mode` argument controls how keypoints are expressed:

| Mode | Description | Extra arg |
|---|---|---|
| `"model"` | Raw SimCC space – same resolution as the model input (e.g. 288×384) | – |
| `"image"` | Original image pixel coordinates, rescaled via the person bounding box | `bbox=[x1,y1,x2,y2]` |
| `"root_relative"` | Origin at mid-hip, unit = half inter-hip distance (hips at ±1) | – |
```python
import torch

# Mode 1 – model space (no extra args)
out_model = model(**inputs, coordinate_mode="model")

# Mode 2 – image space (pass the bbox used to crop the person)
bbox = torch.tensor([[120, 40, 380, 620]])  # [x1, y1, x2, y2] in the original image
out_image = model(**inputs, coordinate_mode="image", bbox=bbox)

# Mode 3 – root-relative (skeleton-normalised, useful for action recognition)
out_root = model(**inputs, coordinate_mode="root_relative")
```
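For intuition, `"image"` mode amounts to an affine rescale of model-space coordinates back through the crop box. A hand-rolled sketch of that mapping (assuming a plain stretch of the crop to the model input with no aspect-ratio padding, which may differ from the port's exact internals):

```python
def model_to_image(x, y, bbox, input_w=192, input_h=256):
    """Map a model-space keypoint to original-image pixels via the crop bbox.

    bbox: (x1, y1, x2, y2) of the person crop in the original image.
    """
    x1, y1, x2, y2 = bbox
    return (x1 + x * (x2 - x1) / input_w,
            y1 + y * (y2 - y1) / input_h)
```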
### End-to-end with RTMDet person detector

Uses akore/rtmdet-tiny for detection and RTMW for pose estimation. Both preprocessors handle all the resize/normalise bookkeeping; no manual mean/std or scaling arithmetic is required.
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# ── Load once ────────────────────────────────────────────────────────────────
rtmdet = AutoModel.from_pretrained("akore/rtmdet-tiny", trust_remote_code=True).eval()
rtmdet_proc = AutoImageProcessor.from_pretrained("akore/rtmdet-tiny")
rtmw = AutoModel.from_pretrained("akore/rtmw-l-256x192", trust_remote_code=True).eval()
rtmw_proc = AutoImageProcessor.from_pretrained("akore/rtmw-l-256x192")

# ── Load image ───────────────────────────────────────────────────────────────
pil_img = Image.open("photo.jpg").convert("RGB")
orig_w, orig_h = pil_img.size  # PIL gives (width, height)

# ── Detect people – boxes returned in original image pixel coords ────────────
det_inputs = rtmdet_proc(images=pil_img, return_tensors="pt")
with torch.no_grad():
    det_out = rtmdet(pixel_values=det_inputs["pixel_values"],
                     original_size=(orig_h, orig_w))  # rescaling happens inside
boxes = det_out.boxes[0]    # (N, 4) already in original image pixels
labels = det_out.labels[0]  # (N,)
scores = det_out.scores[0]  # (N,)

# ── Batch all person crops through the RTMW preprocessor ─────────────────────
person_boxes = [
    (boxes[i], scores[i]) for i in range(len(labels))
    if int(labels[i]) == 0 and float(scores[i]) > 0.3
]
if person_boxes:
    # crop extracts each person patch; the processor handles resize + normalize + batching
    crops = [pil_img.crop(tuple(b.tolist())) for b, _ in person_boxes]
    bboxes = torch.stack([b for b, _ in person_boxes])  # (P, 4)
    inputs = rtmw_proc(images=crops, return_tensors="pt")  # resize + normalize
    with torch.no_grad():
        out = rtmw(pixel_values=inputs["pixel_values"],
                   coordinate_mode="image", bbox=bboxes)
    # out.keypoints: (P, 133, 2) – [x, y] in original image pixels
    # out.scores:    (P, 133)    – confidence in [0, 1]
    for i, (_, sc) in enumerate(person_boxes):
        visible = (out.scores[i] > 0.3).sum()
        print(f"Person {float(sc):.2f}: {visible} / 133 keypoints visible")
```
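Downstream tools often expect keypoints in COCO's flat `[x, y, v]` triplet format. A small conversion helper (the visibility convention, 2 = labeled-and-visible / 0 = not labeled, follows the COCO annotation format; the 0.3 threshold mirrors the one used above and is a tunable choice, not a model requirement):

```python
def to_coco_keypoints(keypoints, scores, threshold=0.3):
    """Flatten (K, 2) keypoints and (K,) scores into a COCO-style
    [x1, y1, v1, x2, y2, v2, ...] list, zeroing out low-confidence points."""
    flat = []
    for (x, y), s in zip(keypoints, scores):
        if s > threshold:
            flat.extend([float(x), float(y), 2])
        else:
            flat.extend([0.0, 0.0, 0])
    return flat
```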
## Cocktail14 training datasets

| Dataset | Link |
|---|---|
| AI Challenger | mmpose docs |
| CrowdPose | mmpose docs |
| MPII | mmpose docs |
| sub-JHMDB | mmpose docs |
| Halpe | mmpose docs |
| PoseTrack18 | mmpose docs |
| COCO-WholeBody | GitHub |
| UBody | GitHub |
| Human-Art | mmpose docs |
| WFLW | project page |
| 300W | project page |
| COFW | project page |
| LaPa | GitHub |
| InterHand | project page |
## Score normalization

Raw SimCC confidence scores vary across model variants (0–1 for 256×192 models, 0–10 for 384×288 models). This port applies fixed min–max normalization so all model variants output scores in [0, 1]. The `score_min` and `score_max` hyperparameters used are stored in the config and were determined empirically from real-world inference.
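Concretely, this is the usual clamped min–max map. A sketch (the default `score_min`/`score_max` values below are placeholders for illustration; the real ones live in the model config):

```python
def normalize_score(raw, score_min=0.0, score_max=10.0):
    """Clamped min-max normalization: maps a raw SimCC confidence to [0, 1]."""
    span = score_max - score_min
    return min(max((raw - score_min) / span, 0.0), 1.0)
```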
## Citation

```bibtex
@article{jiang2024rtmw,
  title={RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation},
  author={Jiang, Tao and Xie, Xinchen and Li, Yining},
  journal={arXiv preprint arXiv:2407.08634},
  year={2024}
}

@misc{jiang2023rtmpose,
  title={RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose},
  author={Jiang, Tao and Lu, Peng and Zhang, Li and Ma, Ningsheng and Han, Rui and Lyu, Chengqi and Li, Yining and Chen, Kai},
  year={2023},
  eprint={2303.07399},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{mmpose2020,
  title={OpenMMLab Pose Estimation Toolbox and Benchmark},
  author={MMPose Contributors},
  howpublished={\url{https://github.com/open-mmlab/mmpose}},
  year={2020}
}

@misc{lyu2022rtmdet,
  title={RTMDet: An Empirical Study of Designing Real-Time Object Detectors},
  author={Chengqi Lyu and Wenwei Zhang and Haian Huang and Yue Zhou and Yudong Wang and Yanyi Liu and Shilong Zhang and Kai Chen},
  year={2022},
  eprint={2212.07784},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@inproceedings{jin2020whole,
  title={Whole-Body Human Pose Estimation in the Wild},
  author={Jin, Sheng and Xu, Lumin and Xu, Jin and Wang, Can and Liu, Wentao and Qian, Chen and Ouyang, Wanli and Luo, Ping},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}
```