RTMPose-m Distill: blur-robust 2D hand keypoints for sign language video
RTMPose-m (21 hand keypoints, SimCC, 256×256) fine-tuned via self-distillation on degraded video (pseudo-labels produced by the model itself on clean frames, training inputs artificially degraded) to keep tracking hands through low resolution and motion blur, the main failure modes of off-the-shelf hand pose models on real-world sign language footage.
Model output on a heavily motion-blurred frame: the skeleton stays on the fingers. PyTorch and ONNX Runtime outputs are byte-identical (assets/output_pytorch.jpg vs assets/output_onnxruntime.jpg).
Compared to the base RTMPose-m Hand5 checkpoint, this model:
- retains hands under high confidence thresholds: at thr 0.3 it keeps 98.0% of hand detections vs 93.1% for the base model, so you can raise the threshold to cut false positives while losing few true detections;
- detects hands the base model misses on hard frames (motion blur during fast signing, crossed/interlocked hands, hands pressed against the body): at thr 0.3 it fires on 2,672 frames (52% of a test video) where the base model returns nothing;
- produces temporally smoother keypoints: ~39% less frame-to-frame jitter at thr 0.3, which directly reduces ragged keypoint sequences fed into downstream sign language models (Uni-Sign, streaming/wait-k translation pipelines);
- does not regress on clean frames: on sharp, unoccluded frames the two models are visually indistinguishable.
Same architecture, same input size, same 21-keypoint COCO hand skeleton as the original: a drop-in replacement for the rtmpose-m_simcc-hand5 checkpoint in any mmpose / rtmlib / mmdeploy pipeline.
Files
| File | Description |
|---|---|
| rtmpose-m_hand_distill-256x256-a996d9ec.pth | PyTorch weights (EMA, epoch 100), mmpose format, 55 MB |
| rtmpose-m_hand_distill.py | mmpose/mmengine training and inference config |
| degrade_video.py | Video degradation script used to build the "dirty" half of the training set (opencv + numpy only) |
| onnx/rtmpose-m-distill-256x256.onnx | ONNX export (opset 11, dynamic batch, FP32), outputs simcc_x/simcc_y |
| onnx/deploy.json, onnx/pipeline.json | mmdeploy SDK configs for the ONNX model |
| assets/ | PyTorch vs ONNX Runtime output parity check (byte-identical) |
How it was trained
Self-distillation on degraded video, in which the model is its own teacher:
- Pseudo-labels. The base RTMPose-m Hand5 checkpoint with a hand-crop pipeline was run offline over the original clean FullHD frames of the Slovo Russian Sign Language video dataset, producing hand crops with 21-keypoint pseudo-labels.
- Input degradation. 50% of the source videos were then degraded (the "dirty" half) with the included degrade_video.py, targeting the dominant real-world failure mode, low source resolution: the full frame is downscaled so its short side lands around 300 px (randomized per clip), then resized back to the original size (INTER_AREA down, bilinear up), so teacher coordinates taken from the clean frames stay valid. The degradation toolkit also includes optical-flow-based motion blur (Farneback flow, accumulated along the flow field), gamma/lighting shift, Gaussian noise and JPEG compression, organized into severity profiles 1-5. Degradation is applied to the full frame before hand cropping (so crops don't retain more detail than a real low-res source would have), and per-clip seeding (crc32(filename) + seed) makes it fully reproducible. The student therefore learns to predict sharp-frame keypoints from corrupted inputs.
- Fine-tuning. The student (the same RTMPose-m Hand5 checkpoint that produced the labels) was fine-tuned for 100 epochs on the resulting handset_mix set: 300,238 training crops, 33,347 validation crops (16,697 clean / 16,650 dirty).
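The resolution round trip and per-clip seeding described above can be sketched as follows. This is an illustration, not the contents of degrade_video.py: the function names, the clip filename and the 270-330 px range are assumptions made for the example; only the crc32(filename) + seed seeding, the "around 300 px" short side and the INTER_AREA / bilinear pair come from the description.

```python
import random
import zlib


def clip_rng(filename, seed=0):
    """Per-clip RNG seeded with crc32(filename) + seed, so reruns match."""
    return random.Random(zlib.crc32(filename.encode("utf-8")) + seed)


def lowres_size(width, height, short_side):
    """(width, height) after scaling so the shorter side equals short_side."""
    scale = short_side / min(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def degrade_frame(frame, short_side):
    """Downscale with INTER_AREA, then resize back bilinearly to the original size."""
    import cv2  # imported lazily so the helpers above work without OpenCV

    h, w = frame.shape[:2]
    small = cv2.resize(frame, lowres_size(w, h, short_side),
                       interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)


# One short-side target per clip. The filename is hypothetical and the
# 270-330 px range is an assumed stand-in for "around 300 px".
rng = clip_rng("clip_0001.mp4", seed=0)
short_side = rng.randint(270, 330)
```

Because the output frame has the same size as the input, keypoint coordinates computed on the clean frame can be reused unchanged as targets for the degraded one.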
Training setup: AdamW (lr 4e-4, wd 0.05), batch 1024, cosine schedule, AMP, EMA (ExpMomentumEMA, momentum 2e-4), flip/rotate/scale augmentation, seed 21. Single NVIDIA RTX PRO 6000 Blackwell GPU, PyTorch 2.7.0 / CUDA 12.8 / MMEngine 0.10.7, ~10 h wall-clock. Full details in rtmpose-m_hand_distill.py.
Held-out validation against pseudo-labels (mixed clean + dirty, 33,347 crops): the released checkpoint, the EMA weights at epoch 100, reaches PCK@0.2 (bbox-normalized) 0.9893 and EPE 5.96 px. Best raw validation score during training was PCK 0.9896 / EPE 5.87 at epoch 42; the validation curve is flat from roughly epoch 20 onward.
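For readers unfamiliar with the two metrics, here is a minimal NumPy sketch of how bbox-normalized PCK and EPE are commonly computed. It assumes per-axis normalization by bbox width and height and treats every keypoint as visible; the function name and the toy values are illustrative, and the exact evaluator settings are those in rtmpose-m_hand_distill.py.

```python
import numpy as np


def pck_and_epe(pred, target, bbox_wh, thr=0.2):
    """pred, target: (N, K, 2) pixel coords; bbox_wh: (N, 2) box width and height."""
    diff = pred - target
    # EPE: mean Euclidean distance in pixels over all keypoints.
    epe = np.linalg.norm(diff, axis=-1).mean()
    # PCK: a keypoint is correct if its bbox-normalized distance is below thr.
    norm_dist = np.linalg.norm(diff / bbox_wh[:, None, :], axis=-1)
    pck = (norm_dist < thr).mean()
    return float(pck), float(epe)


# Toy example: one 100x100 px box, three keypoints off by 10, 30 and 15 px.
target = np.zeros((1, 3, 2))
pred = np.array([[[10.0, 0.0], [30.0, 0.0], [0.0, 15.0]]])
pck, epe = pck_and_epe(pred, target, bbox_wh=np.array([[100.0, 100.0]]))
```

In the toy example two of the three keypoints fall within 0.2 of the box size, so PCK is 2/3, while EPE is the plain average of the three pixel errors.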
Evaluation vs the base model
Side-by-side comparison on a sign language test video (~5,100 frames), hand retention relative to detections at thr 0.1:
| Confidence threshold | 0.1 | 0.15 | 0.2 | 0.3 |
|---|---|---|---|---|
| Hand retention, base | 100% | 98.8% | 97.3% | 93.1% |
| Hand retention, this model | 100% | 99.8% | 99.4% | 98.0% |
| Frames where this model detects a hand and base does not | 431 (8%) | 875 (17%) | 1,338 (26%) | 2,672 (52%) |
Frame-to-frame keypoint jitter at thr 0.3 is ~39% lower than the base model. The frames recovered by this model are dominated by motion blur during fast signing, crossed/interlocked hands, and hands pressed against the torso; visual inspection confirms the recovered skeletons lie on the fingers rather than being spurious detections.
Recommended operating point: thr 0.2-0.3 (the base model effectively requires thr ≤ 0.15 to avoid dropping hands).
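The retention and jitter figures above can be reproduced in spirit with a few lines of NumPy. The sketch below rests on assumptions: one confidence value per detected hand, and jitter measured as mean frame-to-frame keypoint displacement in pixels. The function names and toy data are illustrative, and the exact definitions behind the reported numbers may differ.

```python
import numpy as np


def retention(hand_scores, thr, ref_thr=0.1):
    """Share of hands detected at ref_thr that are still kept at thr."""
    hand_scores = np.asarray(hand_scores, dtype=float)
    kept_at_ref = (hand_scores >= ref_thr).sum()
    return float((hand_scores >= thr).sum() / kept_at_ref)


def jitter(keypoints):
    """Mean frame-to-frame keypoint displacement in pixels; keypoints: (T, K, 2)."""
    steps = np.diff(np.asarray(keypoints, dtype=float), axis=0)
    return float(np.linalg.norm(steps, axis=-1).mean())


# Toy data: five per-hand confidences, and one keypoint tracked over 3 frames.
scores = [0.05, 0.12, 0.25, 0.40, 0.90]
track = np.array([[[0.0, 0.0]], [[3.0, 4.0]], [[3.0, 4.0]]])
r = retention(scores, thr=0.3)  # 2 of the 4 hands found at thr 0.1 survive
j = jitter(track)               # displacements of 5 px and 0 px
```

Running the same two functions on the outputs of both models over the same video gives a like-for-like comparison at any chosen threshold.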
Usage
mmpose
```python
from mmpose.apis import init_model, inference_topdown

model = init_model(
    'rtmpose-m_hand_distill.py',
    'rtmpose-m_hand_distill-256x256-a996d9ec.pth',
    device='cuda:0',
)
results = inference_topdown(model, 'hand_crop.jpg')
keypoints = results[0].pred_instances.keypoints      # (1, 21, 2)
scores = results[0].pred_instances.keypoint_scores   # (1, 21)
```
ONNX Runtime (no mmpose dependency)
```python
import cv2
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('onnx/rtmpose-m-distill-256x256.onnx')
img = cv2.imread('hand_crop.jpg')  # BGR hand crop
inp = cv2.resize(img, (256, 256))[:, :, ::-1].astype(np.float32)  # to RGB
inp = (inp - [123.675, 116.28, 103.53]) / [58.395, 57.12, 57.375]
inp = inp.transpose(2, 0, 1)[None]
simcc_x, simcc_y = sess.run(None, {'input': inp.astype(np.float32)})

# SimCC decode: argmax over each axis, divide by split ratio (2.0)
x = simcc_x[0].argmax(axis=1) / 2.0  # (21,) in 256x256 crop coords
y = simcc_y[0].argmax(axis=1) / 2.0
conf = np.minimum(simcc_x[0].max(axis=1), simcc_y[0].max(axis=1))
```
The ONNX file is also compatible with rtmlib and the mmdeploy SDK (use onnx/ as the SDK model directory).
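The ONNX Runtime snippet above returns coordinates in the 256x256 resized input. Since it uses a plain cv2.resize (a stretch, with no aspect-preserving padding), mapping back to the crop and then to the full frame is a per-axis rescale plus an offset. The helper names and crop position below are hypothetical; the mmpose API performs this mapping itself.

```python
import numpy as np


def to_crop_pixels(x, y, crop_w, crop_h, input_size=256):
    """Rescale keypoints from the resized model input back to the crop's pixels."""
    x = np.asarray(x, dtype=np.float32) * (crop_w / input_size)
    y = np.asarray(y, dtype=np.float32) * (crop_h / input_size)
    return np.stack([x, y], axis=-1)  # (21, 2) for a 21-keypoint hand


def to_frame_pixels(kpts_crop, crop_left, crop_top):
    """Shift crop-relative keypoints by the crop's top-left corner in the frame."""
    return kpts_crop + np.array([crop_left, crop_top], dtype=np.float32)


# Toy values: two keypoints, a 128x64 crop taken at (100, 50) in the full frame.
kpts = to_crop_pixels([128.0, 256.0], [128.0, 0.0], crop_w=128, crop_h=64)
kpts_frame = to_frame_pixels(kpts, crop_left=100, crop_top=50)
```

If your upstream detector pads or letterboxes the crop before resizing, the inverse of that step has to be applied as well.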
Limitations
- Pseudo-label supervision. Training targets are the base model's own predictions, not human annotations; systematic biases of RTMPose-m Hand5 are inherited rather than corrected. Validation PCK/EPE above are measured against pseudo-labels, not ground truth.
- Comparative evaluation. The improvement numbers compare this model against its own teacher on sign language video; the model has not been benchmarked on GT hand datasets (FreiHAND, COCO-WholeBody Hand).
- Domain. Tuned on Russian Sign Language studio-style recordings (frontal upper-body view, 194 signers). Behavior on in-the-wild hands (egocentric, object interaction, outdoor) is untested.
- Top-down model: expects a hand crop; you still need a hand/person detector upstream.
Training data attribution
Pseudo-labels and training crops are derived from the Slovo Russian Sign Language dataset (SaluteDevices), distributed under a variant of CC BY-SA 4.0. The dataset itself is not included in this repository, only the model weights.
Citations
@misc{jiang2023rtmpose,
title={RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose},
author={Jiang, Tao and Lu, Peng and Zhang, Li and Ma, Ningsheng and Han, Rui and Lyu, Chengqi and Li, Yining and Chen, Kai},
year={2023},
eprint={2303.07399},
archivePrefix={arXiv}
}
@inproceedings{kapitanov2023slovo,
title={Slovo: Russian Sign Language Dataset},
author={Kapitanov, Alexander and Kvanchiani, Karina and Nagaev, Alexander and Petrova, Elizaveta},
booktitle={International Conference on Computer Vision Systems},
pages={63--73},
year={2023},
organization={Springer}
}