RTMPose-s — LiteRT (on-device real-time 2D human pose, fully-GPU)

RTMPose-s (mmpose; CSPNeXt backbone with an RTMCC/SimCC head) is a top-down 2D human pose model, converted here to LiteRT and running entirely on the CompiledModel GPU backend (ML Drift) on Android. It estimates 17 COCO keypoints for a single centered person and has been verified end to end on device.

On-device (Pixel 8a, Tensor G3 — verified)


nodes on GPU	256 / 256 LITERT_CL (full residency)
inference	~4 ms (256×192)
size	11.1 MB (fp16)
accuracy	device-vs-PyTorch SimCC corr 0.999, keypoints within 0.3 px (max 1 px)

image[1,3,256,192] (ImageNet 0-255 norm) →[GPU: CSPNeXt + RTMCC]→ simcc_x[1,17,384], simcc_y[1,17,512]

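The SimCC head emits two 1D distributions per keypoint, one over x bins and one over y bins. With a split ratio of 2 there are two bins per input pixel (384 = 2 × 192, 512 = 2 × 256), so the argmax bin index divided by 2 gives the pixel x/y in the 192×256 input.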
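The decoding step can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic SimCC arrays, not the model's own post-processing code; the function name `decode_simcc` and the constants are hypothetical.

```python
import numpy as np

SPLIT = 2.0            # SimCC split ratio: bins per input pixel
IN_W, IN_H = 192, 256  # model input width and height

def decode_simcc(simcc_x, simcc_y, split=SPLIT):
    """Map [K, bins_x] and [K, bins_y] SimCC outputs to [K, 2] (x, y) pixels."""
    kx = simcc_x.argmax(axis=-1) / split
    ky = simcc_y.argmax(axis=-1) / split
    return np.stack([kx, ky], axis=-1)

# Synthetic input: 17 keypoints, with keypoint 0 peaking at x bin 101 and y bin 300.
sx = np.zeros((17, int(IN_W * SPLIT)), np.float32)  # [17, 384]
sy = np.zeros((17, int(IN_H * SPLIT)), np.float32)  # [17, 512]
sx[0, 101] = 1.0
sy[0, 300] = 1.0

kpts = decode_simcc(sx, sy)
# Keypoint 0 decodes to x = 101 / 2 = 50.5 and y = 300 / 2 = 150.0.
print(kpts[0])
```

Because each bin covers half a pixel, the decoded coordinates have a resolution of 0.5 px in the 192×256 input.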

Minimal usage

Android (Kotlin, CompiledModel GPU)

val model = CompiledModel.create(context.assets, "rtmpose_s_fp16.tflite",
    CompiledModel.Options(Accelerator.GPU), null)
val inputs = model.createInputBuffers()
val outputs = model.createOutputBuffers()
inputs[0].writeFloat(chw)              // [1,3,256,192] mmpose mean/std (0-255 RGB), NCHW
model.run(inputs, outputs)
val simccX = outputs[0].readFloat()    // [1,17,384]
val simccY = outputs[1].readFloat()    // [1,17,512]; keypoint = argmax / 2

Python (desktop verification)

import numpy as np
from PIL import Image
from ai_edge_litert.interpreter import Interpreter

# mmpose mean/std for RGB pixel values in 0-255
MEAN = np.array([123.675, 116.28, 103.53], np.float32)
STD  = np.array([58.395, 57.12, 57.375], np.float32)

img = Image.open("person.jpg").convert("RGB").resize((192, 256))  # centered subject crop
x = ((np.asarray(img, np.float32) - MEAN) / STD).transpose(2, 0, 1)[None]

it = Interpreter(model_path="rtmpose_s_fp16.tflite")
it.allocate_tensors()
it.set_tensor(it.get_input_details()[0]["index"], x)
it.invoke()
od = it.get_output_details()
sx, sy = (it.get_tensor(o["index"])[0] for o in od)              # [17,384], [17,512]
if sx.shape[-1] != 384: sx, sy = sy, sx                          # identify by bin count
kx, ky = sx.argmax(-1) / 2.0, sy.argmax(-1) / 2.0                 # 17 keypoints, px in 192x256
for i, (a, b) in enumerate(zip(kx, ky)):
    print(f"kp{i}: ({a:.1f}, {b:.1f})")

How it converts (litert-torch) — two numerically-exact re-authorings

Both address issues that appear only on the device's Mali GPU: the unmodified graph passes the desktop op-check and reports full LITERT_CL residency, yet the device output was wrong until these fixes were applied (residency ≠ correctness):

ScaleNorm (RMS norm) fp16 overflow → all-zero head. The RTMCC ScaleNorm input reaches |x| ≈ 274, so its channel Σ x² ≈ 3.6M overflows fp16 (max 65504) on the Mali delegate, which reduces in fp16 even for an fp32 graph. The norm becomes ∞, x/∞ = 0, and the whole head collapses to zero. Fix: scale x down by S=64 before squaring, then rescale (mathematically identical), giving a SafeRMSNorm.
GAU attention act@act BMM → broadcast-reduce. The Gated Attention Unit's q@kᵀ and kernel@v are activation×activation batch-matmuls that the Mali delegate mis-computes; at K=17 tokens the exact replacement is (q[:,:,None,:]·k[:,None,:,:]).sum(-1).
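Both re-authorings can be reproduced on the desktop with NumPy. The sketch below is an illustration under assumptions, not the conversion code itself: fp16 arithmetic is imitated with `np.float16`, the learnable gain and epsilon of ScaleNorm are omitted, the channel widths (256 and 128) and all function names are hypothetical, and the `kernel@v` form is the analogous broadcast-reduce derived here rather than quoted from the source.

```python
import numpy as np

def rms_norm_fp16_naive(x):
    """RMS norm with squares and their sum held in fp16 (imitates an fp16 reduction)."""
    x = x.astype(np.float16)
    with np.errstate(over="ignore"):  # silence the expected overflow warning
        sq_sum = (x * x).sum(axis=-1, keepdims=True, dtype=np.float16)
    rms = np.sqrt(sq_sum / np.float16(x.shape[-1]))
    return x / rms  # sq_sum overflowed to inf, so every output is 0

def rms_norm_fp16_safe(x, s=64.0):
    """Same norm, but x is divided by s before squaring and the norm is rescaled by s."""
    x = x.astype(np.float16)
    xs = x / np.float16(s)  # |x| <= 274 becomes |xs| <= ~4.3
    sq_sum = (xs * xs).sum(axis=-1, keepdims=True, dtype=np.float16)
    rms = np.float16(s) * np.sqrt(sq_sum / np.float16(x.shape[-1]))
    return x / rms

def qk_broadcast_reduce(q, k):
    """q @ k^T for q, k of shape [B, K, D], using a 4D multiply and a sum."""
    return (q[:, :, None, :] * k[:, None, :, :]).sum(-1)

def kernel_v_broadcast_reduce(kernel, v):
    """kernel @ v for kernel [B, K, K] and v [B, K, E], in the same style."""
    return (kernel[:, :, :, None] * v[:, None, :, :]).sum(2)

rng = np.random.default_rng(0)

# Activations with magnitudes up to about 274, as reported for the ScaleNorm input.
x = rng.uniform(-274.0, 274.0, size=(17, 256)).astype(np.float32)
ref = x / np.sqrt((x * x).mean(axis=-1, keepdims=True))  # fp32 reference
naive = rms_norm_fp16_naive(x)
safe = rms_norm_fp16_safe(x)
print("naive all zero:", bool((naive == 0).all()))
print("safe max abs error:", float(np.abs(safe.astype(np.float32) - ref).max()))

# Attention products at K = 17 tokens.
q = rng.standard_normal((1, 17, 128)).astype(np.float32)
k = rng.standard_normal((1, 17, 128)).astype(np.float32)
v = rng.standard_normal((1, 17, 256)).astype(np.float32)
scores = qk_broadcast_reduce(q, k)         # [1, 17, 17]
out = kernel_v_broadcast_reduce(scores, v)  # [1, 17, 256]
```

The broadcast form materializes a [B, K, K, D] intermediate, which stays small at K=17 and keeps every tensor at 4D or below.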

Result: banned ops NONE, all tensors ≤4D, tflite-vs-torch corr 1.0, device-vs-torch corr 0.999.
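The card does not state how the correlation and pixel figures were computed. One plausible check, assuming a Pearson correlation over the flattened SimCC outputs and a per-keypoint distance between decoded argmax positions, might look like the sketch below; `simcc_agreement` is a hypothetical helper and the inputs are synthetic.

```python
import numpy as np

def simcc_agreement(ref_x, ref_y, out_x, out_y, split=2.0):
    """Return (Pearson corr, mean px distance, max px distance) between two SimCC outputs."""
    ref = np.concatenate([ref_x.ravel(), ref_y.ravel()]).astype(np.float64)
    out = np.concatenate([out_x.ravel(), out_y.ravel()]).astype(np.float64)
    corr = np.corrcoef(ref, out)[0, 1]
    # Decode both outputs to pixels and measure the per-keypoint offset.
    dx = (ref_x.argmax(-1) - out_x.argmax(-1)) / split
    dy = (ref_y.argmax(-1) - out_y.argmax(-1)) / split
    dist = np.hypot(dx, dy)
    return corr, dist.mean(), dist.max()

# Synthetic stand-ins for the PyTorch reference and the device output.
rng = np.random.default_rng(0)
ref_x = rng.standard_normal((17, 384)).astype(np.float32)
ref_y = rng.standard_normal((17, 512)).astype(np.float32)
dev_x = ref_x + 0.01 * rng.standard_normal(ref_x.shape).astype(np.float32)
dev_y = ref_y + 0.01 * rng.standard_normal(ref_y.shape).astype(np.float32)

same = simcc_agreement(ref_x, ref_y, ref_x, ref_y)
noisy = simcc_agreement(ref_x, ref_y, dev_x, dev_y)
print("identical:", same)
print("with noise:", noisy)
```

In practice `ref_*` would come from the PyTorch model and `out_*` from the LiteRT run on the same preprocessed image.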

Preprocessing

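Center-crop to a 3:4 (width:height) aspect ratio, resize to 192×256 (W×H), normalize with the ImageNet mean/std on the 0-255 scale (mean [123.675, 116.28, 103.53], std [58.395, 57.12, 57.375]), and lay the result out as planar NCHW. The model is top-down and expects one roughly centered person.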
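The steps above can be sketched as follows. This is a minimal version under assumptions: the helper names are hypothetical, `preprocess` expects an RGB `PIL.Image` (for example from `Image.open(path).convert("RGB")`), and the resize uses Pillow's default filter, which may differ from the one used for the reported numbers.

```python
import numpy as np

MEAN = np.array([123.675, 116.28, 103.53], np.float32)
STD = np.array([58.395, 57.12, 57.375], np.float32)

def center_crop_box(w, h, aspect=192 / 256):
    """Largest centered (left, top, right, bottom) box with width/height == aspect."""
    if w / h > aspect:   # too wide: trim left and right
        cw, ch = int(round(h * aspect)), h
    else:                # too tall (or exact): trim top and bottom
        cw, ch = w, int(round(w / aspect))
    left, top = (w - cw) // 2, (h - ch) // 2
    return left, top, left + cw, top + ch

def normalize_chw(rgb_hwc):
    """[256, 192, 3] RGB in 0-255 -> float32 [1, 3, 256, 192], mean/std normalized."""
    x = (np.asarray(rgb_hwc, np.float32) - MEAN) / STD
    return x.transpose(2, 0, 1)[None]

def preprocess(img):
    """img: RGB PIL.Image of any size. Returns the model input tensor."""
    box = center_crop_box(*img.size)        # PIL size is (width, height)
    img = img.crop(box).resize((192, 256))  # PIL resize takes (width, height)
    return normalize_chw(img)

# A 640x480 landscape frame keeps its full height and a 360 px wide center strip.
print(center_crop_box(640, 480))
```

To map a decoded keypoint back to the source image, scale it by the crop size over the input size and add the crop offset: x_src = left + x · crop_w / 192, y_src = top + y · crop_h / 256.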

License

Apache-2.0. Upstream: open-mmlab/mmpose RTMPose-s.
