# rtmw-l-256x192

This is a Hugging Face-compatible port of rtmw-l-256x192 from OpenMMLab MMPose.

RTMW (Real-Time Multi-person Whole-body pose estimation) extends RTMPose to predict 133 whole-body keypoints covering the body, face, hands, and feet simultaneously. The model is trained on Cocktail14, a mixture of 14 public datasets, and evaluated on COCO-WholeBody v1.0 val.
## Model description
- Architecture: CSPNeXt backbone + CSPNeXtPAFPN neck + RTMWHead (SimCC with GAU)
- Keypoints: 133 (17 body + 6 feet + 68 face + 21 left hand + 21 right hand)
- Codec: SimCC with Gaussian label smoothing
- Uses custom code – load with `trust_remote_code=True`
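For intuition about the SimCC codec: instead of regressing 2-D heatmaps, the head classifies each keypoint's x and y into 1-D bins. A minimal decode sketch in plain Python (the 2.0 bins-per-pixel ratio is the usual RTMPose-family setting, and averaging the two axis maxima for the confidence is one common convention; neither is guaranteed to match this port's internals exactly):

```python
def decode_simcc(simcc_x, simcc_y, bins_per_pixel=2.0):
    """Decode per-keypoint SimCC bin scores into (x, y) coords and a confidence.

    simcc_x / simcc_y: one list of bin scores per keypoint.
    Coordinates come out in model-input pixel space.
    """
    coords, confs = [], []
    for row_x, row_y in zip(simcc_x, simcc_y):
        ix = max(range(len(row_x)), key=row_x.__getitem__)  # argmax over x-bins
        iy = max(range(len(row_y)), key=row_y.__getitem__)  # argmax over y-bins
        coords.append((ix / bins_per_pixel, iy / bins_per_pixel))
        confs.append(0.5 * (row_x[ix] + row_y[iy]))  # mean of the axis maxima
    return coords, confs
```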
## Performance on COCO-WholeBody v1.0 val

Detector: human AP = 56.4 on COCO val2017.
| Model | Input | Body AP | Body AR | Foot AP | Foot AR | Face AP | Face AR | Hand AP | Hand AR | Whole AP | Whole AR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rtmw-m-256x192 | 256×192 | 0.676 | 0.747 | 0.671 | 0.794 | 0.783 | 0.854 | 0.491 | 0.604 | 0.582 | 0.673 |
| rtmw-l-256x192 (this model) | 256×192 | 0.743 | 0.807 | 0.763 | 0.868 | 0.834 | 0.889 | 0.598 | 0.701 | 0.660 | 0.746 |
| rtmw-x-256x192 | 256×192 | 0.746 | 0.808 | 0.770 | 0.869 | 0.844 | 0.896 | 0.610 | 0.710 | 0.672 | 0.752 |
| rtmw-l-384x288 | 384×288 | 0.761 | 0.824 | 0.793 | 0.885 | 0.884 | 0.921 | 0.663 | 0.752 | 0.701 | 0.780 |
| rtmw-x-384x288 | 384×288 | 0.763 | 0.826 | 0.796 | 0.888 | 0.884 | 0.923 | 0.664 | 0.755 | 0.702 | 0.781 |
## Usage

### Single cropped person (model-space coordinates)
```python
from transformers import AutoConfig, AutoModel, AutoImageProcessor
from PIL import Image
import torch

config = AutoConfig.from_pretrained("akore/rtmw-l-256x192", trust_remote_code=True)
model = AutoModel.from_pretrained("akore/rtmw-l-256x192", trust_remote_code=True)
model.eval()
processor = AutoImageProcessor.from_pretrained("akore/rtmw-l-256x192")

# Supply a pre-cropped person patch (it will be resized to the model input resolution)
image = Image.open("person_crop.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # coordinate_mode="model" – raw 192×256 (or 288×384 for the larger variants) pixel coords
    outputs = model(**inputs, coordinate_mode="model")

# outputs.keypoints: (1, 133, 2) – [x, y] in model-input pixel space
# outputs.scores:    (1, 133)    – confidence in [0, 1]
print(outputs.keypoints.shape, outputs.scores.shape)
```
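The 133 outputs follow the COCO-WholeBody ordering given in the model description (body, then feet, face, left hand, right hand). A small helper to split them by part; the slice boundaries follow the 17 + 6 + 68 + 21 + 21 split stated above, and the dictionary keys are illustrative names, not part of the model API:

```python
# Index layout of the 133 COCO-WholeBody keypoints (17 body + 6 feet +
# 68 face + 21 left hand + 21 right hand).
KEYPOINT_SLICES = {
    "body": slice(0, 17),
    "feet": slice(17, 23),
    "face": slice(23, 91),
    "left_hand": slice(91, 112),
    "right_hand": slice(112, 133),
}

def split_keypoints(keypoints):
    """Split a (133, 2) keypoint sequence into per-part groups."""
    return {name: keypoints[s] for name, s in KEYPOINT_SLICES.items()}
```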
### Coordinate modes

The `coordinate_mode` argument controls how keypoints are expressed:

| Mode | Description | Extra arg |
|---|---|---|
| `"model"` | Raw SimCC space – same resolution as the model input (e.g. 288×384) | – |
| `"image"` | Original image pixel coordinates, rescaled via the person bounding box | `bbox=[x1,y1,x2,y2]` |
| `"root_relative"` | Origin at mid-hip, unit = half inter-hip distance (hips at ±1) | – |
```python
import torch

# Mode 1 – model space (no extra args)
out_model = model(**inputs, coordinate_mode="model")

# Mode 2 – image space (pass the bbox used to crop the person)
bbox = torch.tensor([[120, 40, 380, 620]])  # [x1, y1, x2, y2] in the original image
out_image = model(**inputs, coordinate_mode="image", bbox=bbox)

# Mode 3 – root-relative (skeleton-normalised, useful for action recognition)
out_root = model(**inputs, coordinate_mode="root_relative")
```
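For intuition, `"image"` mode amounts to an affine rescale of model-space coordinates back through the crop box. A hand-rolled sketch of that mapping (assuming a plain stretch of the crop to the model input with no aspect-ratio padding, which may differ from the port's exact internals):

```python
def model_to_image(x, y, bbox, input_w=192, input_h=256):
    """Map a model-space keypoint to original-image pixels via the crop bbox.

    bbox: (x1, y1, x2, y2) of the person crop in the original image.
    """
    x1, y1, x2, y2 = bbox
    return (x1 + x * (x2 - x1) / input_w,
            y1 + y * (y2 - y1) / input_h)
```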
### End-to-end with RTMDet person detector

Uses akore/rtmdet-tiny for detection and RTMW for pose estimation. Both preprocessors handle all the resize/normalise bookkeeping; no manual mean/std or scaling arithmetic is required.
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

# ── Load once ────────────────────────────────────────────────────────────────
rtmdet = AutoModel.from_pretrained("akore/rtmdet-tiny", trust_remote_code=True).eval()
rtmdet_proc = AutoImageProcessor.from_pretrained("akore/rtmdet-tiny")
rtmw = AutoModel.from_pretrained("akore/rtmw-l-256x192", trust_remote_code=True).eval()
rtmw_proc = AutoImageProcessor.from_pretrained("akore/rtmw-l-256x192")

# ── Load image ───────────────────────────────────────────────────────────────
pil_img = Image.open("photo.jpg").convert("RGB")
orig_w, orig_h = pil_img.size  # PIL gives (width, height)

# ── Detect people – boxes returned in original image pixel coords ────────────
det_inputs = rtmdet_proc(images=pil_img, return_tensors="pt")
with torch.no_grad():
    det_out = rtmdet(pixel_values=det_inputs["pixel_values"],
                     original_size=(orig_h, orig_w))  # rescaling happens inside
boxes = det_out.boxes[0]    # (N, 4) already in original image pixels
labels = det_out.labels[0]  # (N,)
scores = det_out.scores[0]  # (N,)

# ── Batch all person crops through the RTMW preprocessor ─────────────────────
person_boxes = [
    (boxes[i], scores[i]) for i in range(len(labels))
    if int(labels[i]) == 0 and float(scores[i]) > 0.3
]
if person_boxes:
    # crop extracts each person patch; the processor handles resize + normalize + batching
    crops = [pil_img.crop(tuple(b.tolist())) for b, _ in person_boxes]
    bboxes = torch.stack([b for b, _ in person_boxes])  # (P, 4)
    inputs = rtmw_proc(images=crops, return_tensors="pt")  # resize + normalize
    with torch.no_grad():
        out = rtmw(pixel_values=inputs["pixel_values"],
                   coordinate_mode="image", bbox=bboxes)
    # out.keypoints: (P, 133, 2) – [x, y] in original image pixels
    # out.scores:    (P, 133)    – confidence in [0, 1]
    for i, (_, sc) in enumerate(person_boxes):
        visible = (out.scores[i] > 0.3).sum()
        print(f"Person {float(sc):.2f}: {visible} / 133 keypoints visible")
```
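Downstream tools often expect keypoints in COCO's flat `[x, y, v]` triplet format. A small conversion helper (the visibility convention, 2 = labeled-and-visible / 0 = not labeled, follows the COCO annotation format; the 0.3 threshold mirrors the one used above and is a tunable choice, not a model requirement):

```python
def to_coco_keypoints(keypoints, scores, threshold=0.3):
    """Flatten (K, 2) keypoints and (K,) scores into a COCO-style
    [x1, y1, v1, x2, y2, v2, ...] list, zeroing out low-confidence points."""
    flat = []
    for (x, y), s in zip(keypoints, scores):
        if s > threshold:
            flat.extend([float(x), float(y), 2])
        else:
            flat.extend([0.0, 0.0, 0])
    return flat
```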
## Cocktail14 training datasets

| Dataset | Link |
|---|---|
| AI Challenger | mmpose docs |
| CrowdPose | mmpose docs |
| MPII | mmpose docs |
| sub-JHMDB | mmpose docs |
| Halpe | mmpose docs |
| PoseTrack18 | mmpose docs |
| COCO-WholeBody | GitHub |
| UBody | GitHub |
| Human-Art | mmpose docs |
| WFLW | project page |
| 300W | project page |
| COFW | project page |
| LaPa | GitHub |
| InterHand | project page |
## Score normalization

Raw SimCC confidence scores vary across model variants (0–1 for 256×192 models, 0–10 for 384×288 models). This port applies fixed min–max normalization so all model variants output scores in [0, 1]. The `score_min` and `score_max` hyperparameters used are stored in the config and were determined empirically from real-world inference.
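Concretely, this is the usual clamped min–max map. A sketch (the default `score_min`/`score_max` values below are placeholders for illustration; the real ones live in the model config):

```python
def normalize_score(raw, score_min=0.0, score_max=10.0):
    """Clamped min-max normalization: maps a raw SimCC confidence to [0, 1]."""
    span = score_max - score_min
    return min(max((raw - score_min) / span, 0.0), 1.0)
```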
## Citation

```bibtex
@article{jiang2024rtmw,
  title={RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation},
  author={Jiang, Tao and Xie, Xinchen and Li, Yining},
  journal={arXiv preprint arXiv:2407.08634},
  year={2024}
}

@misc{jiang2023rtmpose,
  title={RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose},
  author={Jiang, Tao and Lu, Peng and Zhang, Li and Ma, Ningsheng and Han, Rui and Lyu, Chengqi and Li, Yining and Chen, Kai},
  year={2023},
  eprint={2303.07399},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{mmpose2020,
  title={OpenMMLab Pose Estimation Toolbox and Benchmark},
  author={MMPose Contributors},
  howpublished={\url{https://github.com/open-mmlab/mmpose}},
  year={2020}
}

@misc{lyu2022rtmdet,
  title={RTMDet: An Empirical Study of Designing Real-Time Object Detectors},
  author={Chengqi Lyu and Wenwei Zhang and Haian Huang and Yue Zhou and Yudong Wang and Yanyi Liu and Shilong Zhang and Kai Chen},
  year={2022},
  eprint={2212.07784},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@inproceedings{jin2020whole,
  title={Whole-Body Human Pose Estimation in the Wild},
  author={Jin, Sheng and Xu, Lumin and Xu, Jin and Wang, Can and Liu, Wentao and Qian, Chen and Ouyang, Wanli and Luo, Ping},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}
```