RTMPose-m Distill — blur-robust 2D hand keypoints for sign language video

RTMPose-m (21 hand keypoints, SimCC, 256×256) fine-tuned by self-distillation on degraded video: the base model produces pseudo-labels on clean frames, and the student is trained on artificially degraded versions of those frames. The goal is to keep tracking hands through low resolution and motion blur, the main failure modes of off-the-shelf hand pose models on real-world sign language footage.

[Image: 21-keypoint hand skeleton correctly placed on a heavily motion-blurred hand]
Model output on a heavily motion-blurred frame: the skeleton stays on the fingers. PyTorch and ONNX Runtime outputs are byte-identical (assets/output_pytorch.jpg vs assets/output_onnxruntime.jpg).

Compared to the base RTMPose-m Hand5 checkpoint, this model:

retains hands under high confidence thresholds: at thr 0.3 it keeps 98.0% of hand detections vs 93.1% for the base model, so you can raise the threshold to cut false positives without losing recall;
detects hands the base model misses on hard frames (motion blur during fast signing, crossed/interlocked hands, hands pressed against the body): at thr 0.3 it fires on 2,672 frames (52% of a test video) where the base model returns nothing;
produces temporally smoother keypoints: ~39% less frame-to-frame jitter at thr 0.3, which directly reduces ragged keypoint sequences fed into downstream sign language models (Uni-Sign, streaming/wait-k translation pipelines);
does not regress on clean frames — on sharp, unoccluded frames the two models are visually indistinguishable.

Same architecture, same input size, same 21-keypoint COCO hand skeleton as the original — a drop-in replacement for the rtmpose-m_simcc-hand5 checkpoint in any mmpose / rtmlib / mmdeploy pipeline.

Files

File	Description
`rtmpose-m_hand_distill-256x256-a996d9ec.pth`	PyTorch weights (EMA, epoch 100), mmpose format, 55 MB
`rtmpose-m_hand_distill.py`	mmpose/mmengine training and inference config
`degrade_video.py`	Video degradation script used to build the "dirty" half of the training set (opencv + numpy only)
`onnx/rtmpose-m-distill-256x256.onnx`	ONNX export (opset 11, dynamic batch, FP32), outputs `simcc_x`/`simcc_y`
`onnx/deploy.json`, `onnx/pipeline.json`	mmdeploy SDK configs for the ONNX model
`assets/`	PyTorch vs ONNX Runtime output parity check (byte-identical)

How it was trained

Self-distillation on degraded video — the model is its own teacher:

Pseudo-labels. The base RTMPose-m Hand5 checkpoint with a hand-crop pipeline was run offline over the original clean FullHD frames of the Slovo Russian Sign Language video dataset, producing hand crops with 21-keypoint pseudo-labels.
Input degradation. 50% of the source videos were then degraded (the "dirty" half) with the included degrade_video.py, targeting the dominant real-world failure mode, low source resolution. The full frame is downscaled so its short side lands around 300 px (randomized per clip), then resized back to the original size (INTER_AREA down, bilinear up), so teacher coordinates taken from the clean frames stay valid. The degradation toolkit also includes optical-flow-based motion blur (Farneback flow, accumulated along the flow field), gamma/lighting shift, Gaussian noise and JPEG compression, organized into severity profiles 1–5. Degradation is applied to the full frame before hand cropping, so crops don't retain more detail than a real low-res source would have, and per-clip seeding (crc32(filename) + seed) makes it fully reproducible. The student therefore learns to predict sharp-frame keypoints from corrupted inputs.
Fine-tuning. The student — the same RTMPose-m Hand5 checkpoint that produced the labels — was fine-tuned for 100 epochs on the resulting handset_mix set: 300,238 training crops, 33,347 validation crops (16,697 clean / 16,650 dirty).
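
The resolution degradation described above can be sketched roughly as follows. This is an illustration, not the contents of degrade_video.py: the function names, the 260–340 px range around the 300 px target, the example file name and the use of Python's random module are assumptions. Only the crc32(filename) + seed seeding and the INTER_AREA-down / bilinear-up round trip come from the description.

```python
import random
import zlib


def clip_rng(filename: str, seed: int = 0) -> random.Random:
    """Per-clip RNG seeded with crc32(filename) + seed, so reruns match."""
    return random.Random(zlib.crc32(filename.encode("utf-8")) + seed)


def low_res_size(width: int, height: int, rng: random.Random,
                 short_lo: int = 260, short_hi: int = 340) -> tuple[int, int]:
    """Pick a per-clip short side (assumed range) and keep the aspect ratio."""
    short = rng.randint(short_lo, short_hi)
    scale = short / min(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def degrade_resolution(frame, small_size: tuple[int, int]):
    """Downscale the full frame, then resize it back to its original size."""
    import cv2  # imported lazily so the helpers above work without OpenCV

    h, w = frame.shape[:2]
    small = cv2.resize(frame, small_size, interpolation=cv2.INTER_AREA)
    # Output size equals input size, so clean-frame keypoints stay valid.
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)


rng = clip_rng("clip_0001.mp4", seed=0)     # hypothetical file name
small_size = low_res_size(1920, 1080, rng)  # one size for the whole clip
```

Drawing the size once per clip, not once per frame, keeps the simulated source resolution constant within a video, which matches the "randomized per clip" wording.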

Training setup: AdamW (lr 4e-4, wd 0.05), batch 1024, cosine schedule, AMP, EMA (ExpMomentumEMA, momentum 2e-4), flip/rotate/scale augmentation, seed 21. Single NVIDIA RTX PRO 6000 Blackwell GPU, PyTorch 2.7.0 / CUDA 12.8 / MMEngine 0.10.7, ~10 h wall-clock. Full details in rtmpose-m_hand_distill.py.
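
For orientation, the listed hyperparameters would map onto an mmengine-style config roughly as below. This fragment is a sketch assembled from the values in the paragraph above, not a copy of rtmpose-m_hand_distill.py. The scheduler span (T_max over all epochs) is an assumption, and warmup, augmentation and dataset settings are left out.

```python
# Sketch of an mmengine-style config fragment; the shipped
# rtmpose-m_hand_distill.py is the authoritative source.
max_epochs = 100
randomness = dict(seed=21)

optim_wrapper = dict(
    type='AmpOptimWrapper',  # mixed-precision (AMP) training
    optimizer=dict(type='AdamW', lr=4e-4, weight_decay=0.05),
)

param_scheduler = [
    dict(
        type='CosineAnnealingLR',
        T_max=max_epochs,  # assumption: anneal over the whole run
        by_epoch=True,
        convert_to_iter_based=True,
    ),
]

custom_hooks = [
    # EMA of the weights; the released checkpoint is the EMA copy.
    dict(type='EMAHook', ema_type='ExpMomentumEMA', momentum=2e-4),
]

train_dataloader = dict(batch_size=1024)  # dataset/sampler fields omitted
```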

Held-out validation against pseudo-labels (mixed clean + dirty, 33,347 crops): the released checkpoint (EMA weights at epoch 100) reaches PCK@0.2 (bbox-normalized) 0.9893 and EPE 5.96 px. The best raw validation score during training was PCK 0.9896 / EPE 5.87 px at epoch 42; the validation curve is flat from roughly epoch 20 onward.
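
As a reminder of what the two validation metrics measure, here is a minimal NumPy sketch. It assumes the common convention of dividing the x and y errors by the hand box width and height before thresholding. The function names and toy data are illustrative, and the evaluation code actually used may differ in details such as visibility masking.

```python
import numpy as np


def epe(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean Euclidean distance in pixels; pred/target have shape (N, K, 2)."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())


def pck(pred: np.ndarray, target: np.ndarray,
        bbox_wh: np.ndarray, thr: float = 0.2) -> float:
    """Fraction of keypoints whose bbox-normalized error is below thr.

    bbox_wh has shape (N, 2): the width and height of each hand box.
    """
    normalized = (pred - target) / bbox_wh[:, None, :]
    return float((np.linalg.norm(normalized, axis=-1) < thr).mean())


# Toy check: 2 crops, 21 keypoints, every prediction off by (3, 4) px.
target = np.zeros((2, 21, 2))
pred = target + np.array([3.0, 4.0])
bbox_wh = np.full((2, 2), 100.0)
print(epe(pred, target))           # 5.0
print(pck(pred, target, bbox_wh))  # 1.0
```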

Evaluation vs the base model

Side-by-side comparison on a sign language test video (~5,100 frames). Hand retention is measured relative to detections at thr 0.1:

Confidence threshold	0.1	0.15	0.2	0.3
Hand retention, base	100%	98.8%	97.3%	93.1%
Hand retention, this model	100%	99.8%	99.4%	98.0%
Frames where this model detects a hand and base does not	431 (8%)	875 (17%)	1,338 (26%)	2,672 (52%)
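
The rows of the table can be reproduced from per-frame confidences along these lines. This sketch assumes each frame is summarized by one hand confidence per model (for example the mean keypoint score, with 0.0 when nothing is returned). That reduction and the sample values are assumptions for illustration, not the evaluation script.

```python
def retention(conf: list[float], thr: float, ref_thr: float = 0.1) -> float:
    """Share of hands detected at ref_thr that are still detected at thr."""
    ref = sum(c >= ref_thr for c in conf)
    kept = sum(c >= thr for c in conf)
    return kept / ref if ref else 0.0


def recovered(conf_new: list[float], conf_base: list[float], thr: float) -> int:
    """Frames where the new model fires at thr and the base model does not."""
    return sum(n >= thr and b < thr for n, b in zip(conf_new, conf_base))


# Made-up per-frame confidences, for illustration only.
base = [0.12, 0.25, 0.40, 0.05, 0.18]
new = [0.35, 0.45, 0.50, 0.32, 0.28]
print(retention(base, 0.3))       # 0.25
print(retention(new, 0.3))        # 0.8
print(recovered(new, base, 0.3))  # 3
```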

Frame-to-frame keypoint jitter at thr 0.3 is ~39% lower than the base model. The frames recovered by this model are dominated by motion blur during fast signing, crossed/interlocked hands, and hands pressed against the torso; visual inspection confirms the recovered skeletons lie on the fingers rather than being spurious detections.
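
The exact jitter definition is not given here. One plausible reading, sketched below as an assumption, is the mean per-keypoint displacement between consecutive frames in which a hand is detected at the chosen threshold.

```python
import numpy as np


def mean_jitter(kpts: np.ndarray, valid: np.ndarray) -> float:
    """Mean per-keypoint displacement between consecutive valid frames.

    kpts: (T, K, 2) keypoints in pixels; valid: (T,) bool, hand detected.
    """
    both = valid[1:] & valid[:-1]  # frame pairs with a hand in both frames
    step = np.linalg.norm(kpts[1:] - kpts[:-1], axis=-1)  # (T-1, K)
    return float(step[both].mean()) if both.any() else 0.0


# Toy sequence: 4 frames, 21 keypoints, hand missing in the last frame.
kpts = np.zeros((4, 21, 2))
kpts[1:] += np.array([3.0, 4.0])  # one 5 px move, then the hand stays still
valid = np.array([True, True, True, False])
print(mean_jitter(kpts, valid))   # 2.5
```

A relative reduction would then be 1 - mean_jitter(new) / mean_jitter(base) on the same video. Raw displacement also includes genuine hand motion, so the comparison is only meaningful when both models are run on identical frames.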

Recommended operating point: thr 0.2–0.3 (the base model effectively requires thr ≤ 0.15 to avoid dropping hands).

Usage

mmpose

from mmpose.apis import init_model, inference_topdown

model = init_model(
    'rtmpose-m_hand_distill.py',
    'rtmpose-m_hand_distill-256x256-a996d9ec.pth',
    device='cuda:0',
)
results = inference_topdown(model, 'hand_crop.jpg')
keypoints = results[0].pred_instances.keypoints  # (1, 21, 2)
scores = results[0].pred_instances.keypoint_scores  # (1, 21)

ONNX Runtime (no mmpose dependency)

import cv2
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('onnx/rtmpose-m-distill-256x256.onnx')

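img = cv2.imread('hand_crop.jpg')  # BGR hand crop
assert img is not None, 'hand_crop.jpg could not be read'
# Resize to the model input size and convert BGR -> RGB
inp = cv2.resize(img, (256, 256))[:, :, ::-1].astype(np.float32)
# Per-channel mean/std normalization, RGB order
mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
inp = ((inp - mean) / std).transpose(2, 0, 1)[None]  # HWC -> NCHW, batch of 1
inp = np.ascontiguousarray(inp, dtype=np.float32)

simcc_x, simcc_y = sess.run(None, {'input': inp})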
# SimCC decode: argmax over each axis, divide by split ratio (2.0)
x = simcc_x[0].argmax(axis=1) / 2.0  # (21,) in 256x256 crop coords
y = simcc_y[0].argmax(axis=1) / 2.0
conf = np.minimum(simcc_x[0].max(axis=1), simcc_y[0].max(axis=1))

The ONNX file is also compatible with rtmlib and the mmdeploy SDK (use onnx/ as the SDK model directory).
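
The ONNX Runtime example above returns coordinates in the 256×256 model input. Mapping them back to the full frame could look like the sketch below. It assumes the crop was stretched directly to the input size, as in that example. A pipeline that pads the box or preserves aspect ratio would need the inverse of its own transform instead. The function name and box format are illustrative.

```python
def crop_to_image(x, y, crop_box, input_size=(256, 256)):
    """Map keypoints from model-input pixels back to full-frame pixels.

    crop_box is (x0, y0, x1, y1) of the hand crop in the original frame.
    Assumes the crop was resized straight to input_size (no padding).
    """
    x0, y0, x1, y1 = crop_box
    sx = (x1 - x0) / input_size[0]  # frame pixels per input pixel, x axis
    sy = (y1 - y0) / input_size[1]  # frame pixels per input pixel, y axis
    return [x0 + xi * sx for xi in x], [y0 + yi * sy for yi in y]


# A 256x128 crop whose top-left corner sits at (100, 200) in the frame.
xs, ys = crop_to_image([0, 128, 256], [0, 64, 256], (100, 200, 356, 328))
print(xs)  # [100.0, 228.0, 356.0]
print(ys)  # [200.0, 232.0, 328.0]
```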

Limitations

Pseudo-label supervision. Training targets are the base model's own predictions, not human annotations; systematic biases of RTMPose-m Hand5 are inherited rather than corrected. Validation PCK/EPE above are measured against pseudo-labels, not ground truth.
Comparative evaluation. The improvement numbers compare this model against its own teacher on sign language video; the model has not been benchmarked on GT hand datasets (FreiHAND, COCO-WholeBody Hand).
Domain. Tuned on Russian Sign Language studio-style recordings (frontal upper-body view, 194 signers). Behavior on in-the-wild hands (egocentric, object interaction, outdoor) is untested.
Top-down model: expects a hand crop; you still need a hand/person detector upstream.

Training data attribution

Pseudo-labels and training crops are derived from the Slovo Russian Sign Language dataset (SaluteDevices), distributed under a variant of CC BY-SA 4.0. The dataset itself is not included in this repository — only model weights.

Citations

@misc{jiang2023rtmpose,
  title={RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose},
  author={Jiang, Tao and Lu, Peng and Zhang, Li and Ma, Ningsheng and Han, Rui and Lyu, Chengqi and Li, Yining and Chen, Kai},
  year={2023},
  eprint={2303.07399},
  archivePrefix={arXiv}
}

@inproceedings{kapitanov2023slovo,
  title={Slovo: Russian Sign Language Dataset},
  author={Kapitanov, Alexander and Kvanchiani, Karina and Nagaev, Alexander and Petrova, Elizaveta},
  booktitle={International Conference on Computer Vision Systems},
  pages={63--73},
  year={2023},
  organization={Springer}
}
