# mascot-openpose-detect-test
⚠️ TEST RELEASE – non-commercial, scheduled for deletion. This package is fine-tuned on only six original mascot characters owned by the author and is intended as an internal baseline for an upcoming generic mascot OpenPose detector. It will be removed from HuggingFace once the generic successor is published. Use only for evaluation / reproducibility of the related research notes; do not assume it generalises to other mascot characters.
Two-stage OpenPose keypoint detector for chibi / kemono mascot characters. Stage 1 is a 7-class bounding-box detector (RTMDet-tiny); Stage 2 is a body-keypoint regressor (ViTPose-L 17-kp head, with an optional RTMPose-s 25-kp variant). Hand keypoints are filled in by a separate template fitter on the consumer side. All models were fine-tuned on a small dataset distilled from GPT-5.5 Vision and exported to ONNX for portable inference (CUDA / CPU; no mmdet / mmpose runtime dependency).
The package is the "test" companion of an in-progress generic mascot
OpenPose detector trained on a much larger and more diverse character
pool. Once that generic model is ready, the production-grade weights
will live at grmchn/mascot-openpose-detect (Apache 2.0) and this
test repo will be deleted.
## License – non-commercial
The package as a whole is licensed CC-BY-NC 4.0 because it includes ViTPose-L weights whose pretraining chain (Facebook MAE, CC-BY-NC 4.0) imposes the non-commercial constraint. Per-component license map:
| Component | Upstream license | Effective on this package |
|---|---|---|
| RTMDet-tiny code & ImageNet/COCO pretrain | Apache 2.0 (mmdetection) | Apache 2.0 |
| RTMPose-s code & COCO pretrain | Apache 2.0 (mmpose) | Apache 2.0 |
| ViTPose code | Apache 2.0 | Apache 2.0 |
| Facebook MAE pretrained backbone (used by ViTPose-L) | CC-BY-NC 4.0 | CC-BY-NC 4.0 |
| AI Challenger / MPII training data (referenced by ViTPose) | Non-commercial / BSD | Inherited as non-commercial |
| Author's fine-tune contributions | CC-BY-NC 4.0 | CC-BY-NC 4.0 |
If you only need the bbox detector and the `rtmpose_s` keypoint variant, the actual weights you load are Apache 2.0 derivatives end-to-end. The repo as a whole is nevertheless marked CC-BY-NC because ViTPose-L ships in the same bundle and its non-commercial terms apply at the package level. The future generic-mascot release will avoid the MAE-pretrained ViTPose-L and ship under Apache 2.0.
## Sister project
grmchn/kemono-face-detect – face-parts detector for the same character set.
## Repository contents
```
grmchn/mascot-openpose-detect-test/
├── bbox/
│   ├── model.onnx          # RTMDet-tiny, static 1×3×640×640, 6 outputs
│   ├── classes.json        # idx -> class name (7 classes)
│   └── decode_params.json  # normalisation / strides / NMS params
├── keypoint/
│   ├── vitpose_l/          # accuracy-first (CC-BY-NC due to MAE backbone)
│   │   ├── model.onnx      # ViTPose-L heatmap, 1×3×256×192 -> 1×17×64×48
│   │   └── meta.json
│   └── rtmpose_s/          # speed-first, Apache-2.0-derivative weights
│       ├── model.onnx      # RTMPose-s simcc, 1×3×256×192 -> simcc_x/simcc_y
│       └── meta.json
├── README.md               # this file
└── LICENSE                 # CC-BY-NC 4.0 (package-level)
```
`snapshot_download(repo_id="grmchn/mascot-openpose-detect-test", allow_patterns=...)` lets consumers fetch only the parts they need:
| Use case | allow_patterns |
|---|---|
| Full ComfyUI pipeline (default) | ["bbox/*", "keypoint/vitpose_l/*"] |
| Speed-optimized (Apache 2.0 derivatives only) | ["bbox/*", "keypoint/rtmpose_s/*"] |
| BBox-only (Stage 1 standalone) | ["bbox/*"] |
| Browser inference (onnxruntime-web) | ["bbox/*", "keypoint/rtmpose_s/*"] |
## Stage 1: BBox detector (`bbox/`)
Seven-class bounding-box detector. It emits at most one box per class; any class may be absent when the corresponding part is occluded or off-canvas:
| index | name | notes |
|---|---|---|
| 0 | `full` | Full character body |
| 1 | `head` | Head including ears / hair |
| 2 | `body` | Torso (clothing included) |
| 3 | `hand_left` | Left hand from the character's perspective |
| 4 | `hand_right` | Right hand from the character's perspective |
| 5 | `foot_left` | Left foot or shoe equivalent |
| 6 | `foot_right` | Right foot or shoe equivalent |
Left / right follow the OpenPose anatomical convention (character's own left / right, not the screen's left / right).
ONNX I/O:
- Input: `(1, 3, 640, 640)` float32, RGB, ImageNet-normalised
- Outputs: `cls_0`, `cls_1`, `cls_2` and `bbox_0`, `bbox_1`, `bbox_2` (multi-scale anchor-free decode at strides 8 / 16 / 32; the consumer runs decode + per-class NMS – see `decode_params.json`)
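The consumer-side decode + per-class selection can be sketched as follows. This is a minimal sketch, not the reference implementation: it assumes `cls_i` are per-class logits of shape `(1, 7, H, W)` and `bbox_i` are l/t/r/b pixel distances from each cell centre; the authoritative tensor semantics and thresholds live in `decode_params.json`.

```python
import numpy as np

NUM_CLASSES, STRIDES, SCORE_THR = 7, (8, 16, 32), 0.3


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_rtmdet(cls_outs, bbox_outs):
    """Anchor-free decode: each feature-map cell predicts per-class scores
    plus (l, t, r, b) distances from the cell centre to the box edges."""
    boxes, scores, labels = [], [], []
    for cls, bbox, stride in zip(cls_outs, bbox_outs, STRIDES):
        _, _, h, w = cls.shape
        ys, xs = np.mgrid[0:h, 0:w]
        cx, cy = (xs + 0.5) * stride, (ys + 0.5) * stride  # cell centres (px)
        prob = sigmoid(cls[0])                             # (7, h, w)
        lab, sc = prob.argmax(0), prob.max(0)
        keep = sc > SCORE_THR
        l, t, r, b = bbox[0]                               # distances in px (assumed)
        boxes.append(np.stack([cx - l, cy - t, cx + r, cy + b], -1)[keep])
        scores.append(sc[keep])
        labels.append(lab[keep])
    return np.concatenate(boxes), np.concatenate(scores), np.concatenate(labels)


def best_box_per_class(boxes, scores, labels):
    """At most one box per class: a degenerate per-class 'NMS' that keeps
    only the top-scoring candidate for each of the 7 classes."""
    out = {}
    for c in range(NUM_CLASSES):
        m = labels == c
        if m.any():
            i = scores[m].argmax()
            out[c] = (boxes[m][i], float(scores[m][i]))
    return out
```

Keeping only the top candidate per class matches the "at most one box per class" contract above; a full soft/hard NMS is unnecessary under that contract.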
## Stage 2: Keypoint regressor (`keypoint/`)
Two interchangeable variants:
| Subpath | Architecture | Head output | Size | Use when |
|---|---|---|---|---|
| `vitpose_l/` | ViTPose-L (HeatmapHead, COCO-17) | heatmap 1×17×64×48 | ~330 MB (fp32 ~1.2 GB) | Accuracy-first, GPU available, non-commercial only |
| `rtmpose_s/` | RTMPose-s (RTMCCHead, DWPose-25) | simcc `simcc_x` 1×25×384, `simcc_y` 1×25×512 | ~22 MB | Speed-first, CPU / onnxruntime-web, Apache 2.0 derivative |
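For the `rtmpose_s/` variant, decoding reduces to two 1-D argmaxes per keypoint. A minimal sketch, assuming a SimCC split ratio of 2.0 bins per input pixel (the real value should be read from `meta.json`):

```python
import numpy as np

SIMCC_RATIO = 2.0  # SimCC bins per input pixel (assumed; see meta.json)


def decode_simcc(simcc_x, simcc_y):
    """SimCC decode: each keypoint is a 1-D classification over x bins and
    y bins. Argmax each axis independently, divide by the bin ratio to get
    crop pixel coordinates, and take the weaker axis peak as confidence."""
    x_bins = simcc_x[0]                       # (25, 384)
    y_bins = simcc_y[0]                       # (25, 512)
    xs = x_bins.argmax(axis=1) / SIMCC_RATIO
    ys = y_bins.argmax(axis=1) / SIMCC_RATIO
    conf = np.minimum(x_bins.max(axis=1), y_bins.max(axis=1))
    return np.stack([xs, ys], axis=1), conf   # (25, 2) crop coords, (25,) scores
```

The coordinates come out in the 192×256 crop frame; the consumer still maps them back through the Stage-1 crop transform.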
ViTPose-L emits 17 COCO-style keypoints; the consumer post-processes them into the 25-point DWPose layout (synthesises neck from shoulders, big-toes from foot bbox, reserved indices stay zero). RTMPose-s emits the 25-point DWPose layout directly, but its big-toe predictions on mascots are unreliable, so consumers typically overwrite indices 18-19 from the foot bbox center.
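The COCO-17 → DWPose-25 lift described above might look like the sketch below. The COCO → OpenPose index table is the widely used standard one, and the big-toe index order (18 = left, 19 = right) is an assumption here; the authoritative mapping is in the spec doc.

```python
import numpy as np

# COCO-17 index of each OpenPose-18 joint; -1 marks the synthesised neck.
# (standard COCO -> OpenPose mapping; confirm against the spec doc)
COCO_TO_OP18 = [0, -1, 6, 8, 10, 5, 7, 9, 12, 14, 16, 11, 13, 15, 2, 1, 4, 3]


def lift_coco17_to_dwpose25(kps17, foot_left_bbox=None, foot_right_bbox=None):
    """kps17: (17, 3) array of (x, y, score) in COCO order.
    Returns (25, 3): OpenPose-18 body + big-toe pair + zeroed reserved slots."""
    out = np.zeros((25, 3), dtype=np.float32)
    for op_idx, coco_idx in enumerate(COCO_TO_OP18):
        if coco_idx >= 0:
            out[op_idx] = kps17[coco_idx]
    # Neck (index 1): midpoint of the two shoulders (COCO 5 / 6).
    l_sho, r_sho = kps17[5], kps17[6]
    out[1, :2] = (l_sho[:2] + r_sho[:2]) / 2.0
    out[1, 2] = min(l_sho[2], r_sho[2])
    # Big toes (assumed order: 18 = left, 19 = right) from foot bbox centres.
    for idx, bbox in ((18, foot_left_bbox), (19, foot_right_bbox)):
        if bbox is not None:
            x0, y0, x1, y1 = bbox
            out[idx] = ((x0 + x1) / 2.0, (y0 + y1) / 2.0, 1.0)
    return out
```

Indices 20-24 stay zero, matching the reserved-slot convention in the output-layout section.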
A complete reference post-processing implementation (RTMDet anchor-free decode + NMS, ViTPose heatmap argmax with sub-pixel refine, RTMPose simcc argmax, COCO-17 → DWPose-25 lift, toe synthesis, hand template) is in `specs/proportionchanger_node_spec.md` of the source repository.
## Output layout (after post-processing)
25-point DWPose body skeleton. Indices 0-17 are the standard 18-point OpenPose body, 18 / 19 are the big-toe pair, and 20-24 are reserved slots that consumers should leave at zero.
The pipeline downstream of this model emits a ComfyUI-compatible
POSE_KEYPOINT JSON with normalized 0-1 coordinates and a fixed
canvas_width / canvas_height metadata pair.
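A minimal packer for that JSON could look like this. The field names (`canvas_width`, `canvas_height`, `people`, `pose_keypoints_2d`) follow the common ComfyUI OpenPose convention and are an assumption here; check them against the consumer node.

```python
import numpy as np


def to_pose_keypoint(kps25, canvas_width, canvas_height):
    """Pack a (25, 3) (x, y, score) array into a POSE_KEYPOINT-style dict:
    a flat [x, y, c, ...] list under "people", with coordinates normalised
    to 0-1 by the fixed canvas size."""
    norm = kps25.astype(np.float64).copy()
    norm[:, 0] /= canvas_width
    norm[:, 1] /= canvas_height
    return {
        "canvas_width": canvas_width,
        "canvas_height": canvas_height,
        "people": [
            {"pose_keypoints_2d": [round(float(v), 6) for v in norm.reshape(-1)]}
        ],
    }
```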
## Training data
- Roughly 600 frames sampled from short videos of six original mascot characters owned by the author (used with permission).
- Bounding-box and keypoint labels were generated by GPT-5.5 Vision and reviewed through lightweight HTML / tkinter editors before training.
- ViTPose-L was fine-tuned with COCO-17 labels (dropping the synthetic neck / toes so the official ViTPose-L weights load with `strict=True`).
- RTMPose-s was fine-tuned with the full 25-point DWPose layout.
- The raw images and videos are not redistributed.
## Usage (onnxruntime – full pipeline sketch)
```python
import json

import cv2
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download

local = snapshot_download(
    repo_id="grmchn/mascot-openpose-detect-test",
    allow_patterns=["bbox/*", "keypoint/vitpose_l/*"],
)

bbox_session = ort.InferenceSession(
    f"{local}/bbox/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
kp_session = ort.InferenceSession(
    f"{local}/keypoint/vitpose_l/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

with open(f"{local}/bbox/decode_params.json") as f:
    decode_params = json.load(f)
with open(f"{local}/bbox/classes.json") as f:
    classes_map = json.load(f)
with open(f"{local}/keypoint/vitpose_l/meta.json") as f:
    kp_meta = json.load(f)

# 1. Stage 1: letterbox -> normalise -> RTMDet -> decode + NMS -> {class: bbox}
# 2. Stage 2: top-down crop full/body bbox -> ViTPose -> heatmap argmax
# 3. coco17 -> dwpose25 lift, toe synthesis, hand template (see spec doc)
```
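Step 2's heatmap decode could be sketched as follows. The quarter-pixel shift toward the stronger neighbour is the common heatmap refinement used by top-down pose pipelines, assumed here; the reference implementation is in the spec doc.

```python
import numpy as np


def heatmap_to_keypoints(heatmaps, crop_w=192, crop_h=256):
    """Decode a (1, 17, 64, 48) ViTPose heatmap: per-joint argmax plus a
    quarter-pixel shift toward the stronger neighbour, then scale heatmap
    coordinates (48x64) up to the 192x256 input crop."""
    hm = heatmaps[0]                          # (17, 64, 48)
    k, h, w = hm.shape
    kps = np.zeros((k, 3), dtype=np.float32)
    for j in range(k):
        y, x = divmod(int(hm[j].argmax()), w)
        fx, fy = float(x), float(y)
        # Sub-pixel refine: shift 0.25 px toward the larger neighbour.
        if 0 < x < w - 1:
            fx += 0.25 * np.sign(hm[j, y, x + 1] - hm[j, y, x - 1])
        if 0 < y < h - 1:
            fy += 0.25 * np.sign(hm[j, y + 1, x] - hm[j, y - 1, x])
        kps[j] = (fx * crop_w / w, fy * crop_h / h, hm[j, y, x])
    return kps
```

The returned coordinates are still in the crop frame; mapping back to the full image reverses the Stage-1 crop transform.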
## Caveats
- Six characters only: the models are fine-tuned on just six original characters. Recall on visually different mascots is noticeably worse – this is a known-limited baseline that the upcoming generic release is meant to fix.
- Hand and foot boxes are deliberately drawn to the outer silhouette for characters whose fingers or toes are not rendered. This keeps Stage 2 keypoint placement and the hand-template fitter stable, but it means the boxes are not anatomically tight.
- Hand keypoints are deliberately not produced by the keypoint model. The pipeline computes them separately using the shoulder-elbow-wrist vector and a neutral 21-point hand template scaled to the bounding box from Stage 1.
- ONNX export uses static input shapes. Batched inference requires re-exporting with dynamic axes.
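The hand-template fit mentioned above can be sketched roughly like this. `template21` is a hypothetical canonical template (wrist at the origin, fingers along +x, unit span); the actual template and scaling rule live in the spec doc.

```python
import numpy as np


def place_hand_template(template21, elbow, wrist, hand_bbox):
    """Place a neutral 21-point hand template: rotate it to the elbow->wrist
    direction, scale it to the Stage-1 hand bbox, centre it on that bbox.
    template21: (21, 2) canonical points, wrist at origin, fingers along +x."""
    direction = np.asarray(wrist, float) - np.asarray(elbow, float)
    angle = np.arctan2(direction[1], direction[0])
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    x0, y0, x1, y1 = hand_bbox
    scale = max(x1 - x0, y1 - y0)
    centre = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
    return template21 @ rot.T * scale + centre
```

Because the boxes are drawn to the outer silhouette (see above), the bbox-derived scale stays stable even for characters without rendered fingers.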
## License
CC-BY-NC 4.0 β non-commercial use only. See the LICENSE file in
this repo. Per-component breakdown above clarifies which parts (RTMDet
bbox + RTMPose-s) would be Apache 2.0 if shipped separately; the future
grmchn/mascot-openpose-detect (without the MAE-derived ViTPose-L) will
ship as Apache 2.0.