rtmw-m-256x192

This is a Hugging Face-compatible port of rtmw-m-256x192 from OpenMMLab MMPose.

RTMW (Real-Time Multi-person Whole-body pose estimation) extends RTMPose to predict 133 whole-body keypoints covering the body, face, hands, and feet simultaneously.

The model is trained on Cocktail14, a mixture of 14 public datasets, and evaluated on COCO-WholeBody v1.0 val.

Model description

  • Architecture: CSPNeXt backbone + CSPNeXtPAFPN neck + RTMWHead (SimCC with GAU)
  • Keypoints: 133 (17 body + 6 feet + 68 face + 21 left hand + 21 right hand)
  • Codec: SimCC with Gaussian label smoothing
  • Uses custom code: load with trust_remote_code=True
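As a quick orientation, the 133 indices follow the COCO-WholeBody ordering. A minimal sketch of the part ranges, derived from the keypoint counts above rather than read from this repo's config:

```python
# COCO-WholeBody index ranges (0-based, end-exclusive) over the 133 keypoints
BODY       = slice(0, 17)     # 17 body joints
FEET       = slice(17, 23)    # 6 foot points
FACE       = slice(23, 91)    # 68 face landmarks
LEFT_HAND  = slice(91, 112)   # 21 left-hand joints
RIGHT_HAND = slice(112, 133)  # 21 right-hand joints

PARTS = {"body": BODY, "feet": FEET, "face": FACE,
         "left_hand": LEFT_HAND, "right_hand": RIGHT_HAND}

# Sanity check: the five ranges tile the 133 keypoints exactly
total = sum(s.stop - s.start for s in PARTS.values())
print(total)  # 133
```

With an `outputs.keypoints` tensor of shape (B, 133, 2), `outputs.keypoints[:, FACE]` then selects the 68 face landmarks.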

Performance on COCO-WholeBody v1.0 val

Detector: human AP = 56.4 on COCO val2017.

Model                        Input    Body AP  Body AR  Foot AP  Foot AR  Face AP  Face AR  Hand AP  Hand AR  Whole AP  Whole AR
rtmw-m-256x192 (this model)  256×192  0.676    0.747    0.671    0.794    0.783    0.854    0.491    0.604    0.582     0.673
rtmw-l-256x192               256×192  0.743    0.807    0.763    0.868    0.834    0.889    0.598    0.701    0.660     0.746
rtmw-x-256x192               256×192  0.746    0.808    0.770    0.869    0.844    0.896    0.610    0.710    0.672     0.752
rtmw-l-384x288               384×288  0.761    0.824    0.793    0.885    0.884    0.921    0.663    0.752    0.701     0.780
rtmw-x-384x288               384×288  0.763    0.826    0.796    0.888    0.884    0.923    0.664    0.755    0.702     0.781

Usage

Single cropped person (model-space coordinates)

from transformers import AutoConfig, AutoModel, AutoImageProcessor
from PIL import Image
import torch

config = AutoConfig.from_pretrained("akore/rtmw-m-256x192", trust_remote_code=True)
model = AutoModel.from_pretrained("akore/rtmw-m-256x192", trust_remote_code=True)
model.eval()

processor = AutoImageProcessor.from_pretrained("akore/rtmw-m-256x192")
# Supply a pre-cropped person patch (will be resized to model input resolution)
image = Image.open("person_crop.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # coordinate_mode="model" -> raw 192×256 pixel coords (this model's input resolution)
    outputs = model(**inputs, coordinate_mode="model")

# outputs.keypoints: (1, 133, 2) - [x, y] in model-input pixel space
# outputs.scores:    (1, 133)    - confidence in [0, 1]
print(outputs.keypoints.shape, outputs.scores.shape)

Coordinate modes

The coordinate_mode argument controls how keypoints are expressed:

Mode             Description                                                             Extra arg
"model"          Raw SimCC space; same resolution as the model input (192×256 here)     -
"image"          Original image pixel coordinates, rescaled via the person bounding box  bbox=[x1,y1,x2,y2]
"root_relative"  Origin at mid-hip, unit = half inter-hip distance (hips at ±1)          -

import torch

# Mode 1 - model space (no extra args)
out_model = model(**inputs, coordinate_mode="model")

# Mode 2 - image space (pass the bbox used to crop the person)
bbox = torch.tensor([[120, 40, 380, 620]])   # [x1, y1, x2, y2] in original image
out_image = model(**inputs, coordinate_mode="image", bbox=bbox)

# Mode 3 - root-relative (skeleton-normalised, useful for action recognition)
out_root = model(**inputs, coordinate_mode="root_relative")
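For intuition, "image" mode is roughly a rescale of model-space coordinates through the crop bbox. A minimal sketch under the assumption of a plain affine mapping with no aspect-ratio padding (the model's actual internal mapping may differ):

```python
def model_to_image(x, y, bbox, input_w=192, input_h=256):
    """Map model-space coords (192x256 input) into original-image coords.

    A sketch of what coordinate_mode="image" presumably does internally,
    assuming a plain affine rescale of the bbox with no padding.
    """
    x1, y1, x2, y2 = bbox
    return (x1 + x * (x2 - x1) / input_w,
            y1 + y * (y2 - y1) / input_h)

# Centre of the model input maps to the centre of the bbox
print(model_to_image(96.0, 128.0, (120, 40, 380, 620)))  # (250.0, 330.0)
```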

End-to-end with RTMDet person detector

Uses akore/rtmdet-tiny for detection and RTMW for pose estimation. Both preprocessors handle all the resize / normalise bookkeeping; no manual mean/std or scaling arithmetic is required.

from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# ── Load once ────────────────────────────────────────────────────────────────
rtmdet      = AutoModel.from_pretrained("akore/rtmdet-tiny",          trust_remote_code=True).eval()
rtmdet_proc = AutoImageProcessor.from_pretrained("akore/rtmdet-tiny")

rtmw      = AutoModel.from_pretrained("akore/rtmw-m-256x192", trust_remote_code=True).eval()
rtmw_proc = AutoImageProcessor.from_pretrained("akore/rtmw-m-256x192")

# ── Load image ───────────────────────────────────────────────────────────────
pil_img = Image.open("photo.jpg").convert("RGB")
orig_w, orig_h = pil_img.size   # PIL gives (width, height)

# ── Detect people: boxes returned in original image pixel coords ─────────────
det_inputs = rtmdet_proc(images=pil_img, return_tensors="pt")
with torch.no_grad():
    det_out = rtmdet(pixel_values=det_inputs["pixel_values"],
                     original_size=(orig_h, orig_w))   # ← scale happens inside

boxes  = det_out.boxes[0]    # (N, 4)  already in original image pixels
labels = det_out.labels[0]   # (N,)
scores = det_out.scores[0]   # (N,)

# ── Batch all person crops through the RTMW preprocessor ─────────────────────
person_boxes = [
    (boxes[i], scores[i]) for i in range(len(labels))
    if int(labels[i]) == 0 and float(scores[i]) > 0.3
]

if person_boxes:
    # crop each detected person; the processor handles resize + normalize + batching
    crops  = [pil_img.crop(b.tolist()) for b, _ in person_boxes]
    bboxes = torch.stack([b for b, _ in person_boxes])            # (P, 4)

    inputs = rtmw_proc(images=crops, return_tensors="pt")         # resize + normalize
    with torch.no_grad():
        out = rtmw(pixel_values=inputs["pixel_values"],
                   coordinate_mode="image", bbox=bboxes)

    # out.keypoints: (P, 133, 2)  - [x, y] in original image pixels
    # out.scores:    (P, 133)     - confidence in [0, 1]
    for i, (_, sc) in enumerate(person_boxes):
        visible = (out.scores[i] > 0.3).sum()
        print(f"Person {float(sc):.2f}: {visible} / 133 keypoints visible")
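For downstream COCO-style evaluation or serialization, the per-person outputs can be flattened into keypoint triplets. `to_coco_triplets` is a hypothetical helper sketched here, not part of this repo:

```python
def to_coco_triplets(keypoints, scores, thresh=0.3):
    """Flatten per-person keypoints into COCO-style [x1, y1, v1, x2, y2, v2, ...].

    keypoints: list of K [x, y] pairs; scores: list of K floats.
    Low-confidence points get x = y = 0 with visibility v = 0; the rest get v = 2.
    """
    flat = []
    for (x, y), s in zip(keypoints, scores):
        if s > thresh:
            flat += [float(x), float(y), 2]
        else:
            flat += [0.0, 0.0, 0]
    return flat

# Usage with the outputs above: for each person i,
#   to_coco_triplets(out.keypoints[i].tolist(), out.scores[i].tolist())
print(to_coco_triplets([[10.0, 20.0], [30.0, 40.0]], [0.9, 0.1]))
# [10.0, 20.0, 2, 0.0, 0.0, 0]
```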

Cocktail14 training datasets

  • AI Challenger
  • CrowdPose
  • MPII
  • sub-JHMDB
  • Halpe
  • PoseTrack18
  • COCO-WholeBody
  • UBody
  • Human-Art
  • WFLW
  • 300W
  • COFW
  • LaPa
  • InterHand

Score normalization

Raw SimCC confidence scores vary across model variants (0–1 for 256Γ—192 models, 0–10 for 384Γ—288 models). This port applies fixed min–max normalization so all model variants output scores in [0, 1]. The score_min and score_max hyperparameters used are stored in the config and were determined empirically from real-world inference.
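A minimal sketch of that normalization, with illustrative score_min / score_max placeholders (the real values are per-variant and stored in the config):

```python
def normalize_score(raw, score_min, score_max):
    """Min-max map a raw SimCC confidence into [0, 1], clamped at both ends."""
    t = (raw - score_min) / (score_max - score_min)
    return min(max(t, 0.0), 1.0)

# e.g. a 384x288 variant with raw scores roughly in [0, 10] (illustrative bounds):
print(normalize_score(7.5, score_min=0.0, score_max=10.0))  # 0.75
```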

Citation

@article{jiang2024rtmw,
  title={RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation},
  author={Jiang, Tao and Xie, Xinchen and Li, Yining},
  journal={arXiv preprint arXiv:2407.08634},
  year={2024}
}

@misc{jiang2023rtmpose,
  title={RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose},
  author={Jiang, Tao and Lu, Peng and Zhang, Li and Ma, Ningsheng and Han, Rui and Lyu, Chengqi and Li, Yining and Chen, Kai},
  year={2023},
  eprint={2303.07399},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{mmpose2020,
    title={OpenMMLab Pose Estimation Toolbox and Benchmark},
    author={MMPose Contributors},
    howpublished = {\url{https://github.com/open-mmlab/mmpose}},
    year={2020}
}

@misc{lyu2022rtmdet,
  title={RTMDet: An Empirical Study of Designing Real-Time Object Detectors},
  author={Chengqi Lyu and Wenwei Zhang and Haian Huang and Yue Zhou and Yudong Wang and Yanyi Liu and Shilong Zhang and Kai Chen},
  year={2022},
  eprint={2212.07784},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@inproceedings{jin2020whole,
  title={Whole-Body Human Pose Estimation in the Wild},
  author={Jin, Sheng and Xu, Lumin and Xu, Jin and Wang, Can and Liu, Wentao and Qian, Chen and Ouyang, Wanli and Luo, Ping},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}