# mascot-openpose-detect-test
⚠️ TEST RELEASE – non-commercial, scheduled for deletion. This package is fine-tuned on only six original mascot characters owned by the author and is intended as an internal baseline for an upcoming generic mascot OpenPose detector. It will be removed from HuggingFace once the generic successor is published. Use only for evaluation / reproducibility of the related research notes; do not assume it generalises to other mascot characters.
Two-stage OpenPose keypoint detector for chibi / kemono mascot characters. Stage 1 is a 7-class bounding-box detector (RTMDet-tiny); Stage 2 is a body-keypoint regressor (ViTPose-L 17-kp head, with an optional RTMPose-s 25-kp variant). Hand keypoints are filled in by a separate template fitter on the consumer side. All models were fine-tuned on a small dataset distilled from GPT-5.5 Vision and exported to ONNX for portable inference (CUDA / CPU; no mmdet / mmpose runtime dependency).
The package is the "test" companion of an in-progress generic mascot
OpenPose detector trained on a much larger and more diverse character
pool. Once that generic model is ready, the production-grade weights
will live at grmchn/mascot-openpose-detect (Apache 2.0) and this
test repo will be deleted.
## License – non-commercial
The package as a whole is licensed CC-BY-NC 4.0 because it includes ViTPose-L weights whose pretraining chain (Facebook MAE, CC-BY-NC 4.0) imposes the non-commercial constraint. Per-component license map:
| Component | Upstream license | Effective on this package |
|---|---|---|
| RTMDet-tiny code & ImageNet/COCO pretrain | Apache 2.0 (mmdetection) | Apache 2.0 |
| RTMPose-s code & COCO pretrain | Apache 2.0 (mmpose) | Apache 2.0 |
| ViTPose code | Apache 2.0 | Apache 2.0 |
| Facebook MAE pretrained backbone (used by ViTPose-L) | CC-BY-NC 4.0 | CC-BY-NC 4.0 |
| AI Challenger / MPII training data (referenced by ViTPose) | Non-commercial / BSD | Inherited as non-commercial |
| Author's fine-tune contributions | CC-BY-NC 4.0 | CC-BY-NC 4.0 |
If you only need the bbox detector and the `rtmpose_s` keypoint variant, the actual weights you load are Apache 2.0 derivatives end-to-end. The repo as a whole is nevertheless marked CC-BY-NC because ViTPose-L ships in the same bundle and its non-commercial terms apply at the package level. The future generic-mascot release will avoid the MAE-pretrained ViTPose-L and ship under Apache 2.0.
## Sister project
grmchn/kemono-face-detect – face-parts detector for the same character set.
## Repository contents
```
grmchn/mascot-openpose-detect-test/
├── bbox/
│   ├── model.onnx          # RTMDet-tiny, static 1×3×640×640, 6 outputs
│   ├── classes.json        # idx -> class name (7 classes)
│   └── decode_params.json  # normalisation / strides / NMS params
├── keypoint/
│   ├── vitpose_l/          # accuracy-first (CC-BY-NC due to MAE backbone)
│   │   ├── model.onnx      # ViTPose-L heatmap, 1×3×256×192 -> 1×17×64×48
│   │   └── meta.json
│   └── rtmpose_s/          # speed-first, Apache-2.0-derivative weights
│       ├── model.onnx      # RTMPose-s simcc, 1×3×256×192 -> simcc_x/simcc_y
│       └── meta.json
├── README.md               # this file
└── LICENSE                 # CC-BY-NC 4.0 (package-level)
```
`snapshot_download(repo_id="grmchn/mascot-openpose-detect-test", allow_patterns=...)` lets consumers fetch only the parts they need:
| Use case | allow_patterns |
|---|---|
| Full ComfyUI pipeline (default) | ["bbox/*", "keypoint/vitpose_l/*"] |
| Speed-optimized (Apache 2.0 derivatives only) | ["bbox/*", "keypoint/rtmpose_s/*"] |
| BBox-only (Stage 1 standalone) | ["bbox/*"] |
| Browser inference (onnxruntime-web) | ["bbox/*", "keypoint/rtmpose_s/*"] |
## Stage 1: BBox detector (`bbox/`)
Seven-class bounding-box detector. It emits at most one box per class; any class may be absent when the corresponding part is occluded or off-canvas:
| index | name | notes |
|---|---|---|
| 0 | `full` | Full character body |
| 1 | `head` | Head including ears / hair |
| 2 | `body` | Torso (clothing included) |
| 3 | `hand_left` | Left hand from the character's perspective |
| 4 | `hand_right` | Right hand from the character's perspective |
| 5 | `foot_left` | Left foot or shoe equivalent |
| 6 | `foot_right` | Right foot or shoe equivalent |
Left / right follow the OpenPose anatomical convention (character's own left / right, not the screen's left / right).
ONNX I/O:
- Input: `(1, 3, 640, 640)` float32, RGB, ImageNet-normalised
- Outputs: `cls_0`, `cls_1`, `cls_2` and `bbox_0`, `bbox_1`, `bbox_2` (multi-scale anchor-free decode at strides 8 / 16 / 32; the consumer runs decode + per-class NMS – see `decode_params.json`)
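The consumer-side decode + per-class selection can be sketched as follows. This is a minimal sketch, not the reference implementation: it assumes `cls_i` are per-class logits of shape `(1, 7, H, W)` and `bbox_i` are l/t/r/b pixel distances from each cell centre; the authoritative tensor semantics and thresholds live in `decode_params.json`.

```python
import numpy as np

NUM_CLASSES, STRIDES, SCORE_THR = 7, (8, 16, 32), 0.3


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_rtmdet(cls_outs, bbox_outs):
    """Anchor-free decode: each feature-map cell predicts per-class scores
    plus (l, t, r, b) distances from the cell centre to the box edges."""
    boxes, scores, labels = [], [], []
    for cls, bbox, stride in zip(cls_outs, bbox_outs, STRIDES):
        _, _, h, w = cls.shape
        ys, xs = np.mgrid[0:h, 0:w]
        cx, cy = (xs + 0.5) * stride, (ys + 0.5) * stride  # cell centres (px)
        prob = sigmoid(cls[0])                             # (7, h, w)
        lab, sc = prob.argmax(0), prob.max(0)
        keep = sc > SCORE_THR
        l, t, r, b = bbox[0]                               # distances in px (assumed)
        boxes.append(np.stack([cx - l, cy - t, cx + r, cy + b], -1)[keep])
        scores.append(sc[keep])
        labels.append(lab[keep])
    return np.concatenate(boxes), np.concatenate(scores), np.concatenate(labels)


def best_box_per_class(boxes, scores, labels):
    """At most one box per class: a degenerate per-class 'NMS' that keeps
    only the top-scoring candidate for each of the 7 classes."""
    out = {}
    for c in range(NUM_CLASSES):
        m = labels == c
        if m.any():
            i = scores[m].argmax()
            out[c] = (boxes[m][i], float(scores[m][i]))
    return out
```

Keeping only the top candidate per class matches the "at most one box per class" contract above; a full soft/hard NMS is unnecessary under that contract.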
## Stage 2: Keypoint regressor (`keypoint/`)
Two interchangeable variants:
| Subpath | Architecture | Head output | Size | Use when |
|---|---|---|---|---|
| `vitpose_l/` | ViTPose-L (HeatmapHead, COCO-17) | heatmap 1×17×64×48 | ~330 MB (fp32 ~1.2 GB) | Accuracy-first, GPU available, non-commercial only |
| `rtmpose_s/` | RTMPose-s (RTMCCHead, DWPose-25) | simcc `simcc_x` 1×25×384, `simcc_y` 1×25×512 | ~22 MB | Speed-first, CPU / onnxruntime-web, Apache 2.0 derivative |
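For the `rtmpose_s/` variant, decoding reduces to two 1-D argmaxes per keypoint. A minimal sketch, assuming a SimCC split ratio of 2.0 bins per input pixel (the real value should be read from `meta.json`):

```python
import numpy as np

SIMCC_RATIO = 2.0  # SimCC bins per input pixel (assumed; see meta.json)


def decode_simcc(simcc_x, simcc_y):
    """SimCC decode: each keypoint is a 1-D classification over x bins and
    y bins. Argmax each axis independently, divide by the bin ratio to get
    crop pixel coordinates, and take the weaker axis peak as confidence."""
    x_bins = simcc_x[0]                       # (25, 384)
    y_bins = simcc_y[0]                       # (25, 512)
    xs = x_bins.argmax(axis=1) / SIMCC_RATIO
    ys = y_bins.argmax(axis=1) / SIMCC_RATIO
    conf = np.minimum(x_bins.max(axis=1), y_bins.max(axis=1))
    return np.stack([xs, ys], axis=1), conf   # (25, 2) crop coords, (25,) scores
```

The coordinates come out in the 192×256 crop frame; the consumer still maps them back through the Stage-1 crop transform.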
ViTPose-L emits 17 COCO-style keypoints; the consumer post-processes them into the 25-point DWPose layout (synthesises neck from shoulders, big-toes from foot bbox, reserved indices stay zero). RTMPose-s emits the 25-point DWPose layout directly, but its big-toe predictions on mascots are unreliable, so consumers typically overwrite indices 18-19 from the foot bbox center.
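The COCO-17 → DWPose-25 lift described above might look like the sketch below. The COCO → OpenPose index table is the widely used standard one, and the big-toe index order (18 = left, 19 = right) is an assumption here; the authoritative mapping is in the spec doc.

```python
import numpy as np

# COCO-17 index of each OpenPose-18 joint; -1 marks the synthesised neck.
# (standard COCO -> OpenPose mapping; confirm against the spec doc)
COCO_TO_OP18 = [0, -1, 6, 8, 10, 5, 7, 9, 12, 14, 16, 11, 13, 15, 2, 1, 4, 3]


def lift_coco17_to_dwpose25(kps17, foot_left_bbox=None, foot_right_bbox=None):
    """kps17: (17, 3) array of (x, y, score) in COCO order.
    Returns (25, 3): OpenPose-18 body + big-toe pair + zeroed reserved slots."""
    out = np.zeros((25, 3), dtype=np.float32)
    for op_idx, coco_idx in enumerate(COCO_TO_OP18):
        if coco_idx >= 0:
            out[op_idx] = kps17[coco_idx]
    # Neck (index 1): midpoint of the two shoulders (COCO 5 / 6).
    l_sho, r_sho = kps17[5], kps17[6]
    out[1, :2] = (l_sho[:2] + r_sho[:2]) / 2.0
    out[1, 2] = min(l_sho[2], r_sho[2])
    # Big toes (assumed order: 18 = left, 19 = right) from foot bbox centres.
    for idx, bbox in ((18, foot_left_bbox), (19, foot_right_bbox)):
        if bbox is not None:
            x0, y0, x1, y1 = bbox
            out[idx] = ((x0 + x1) / 2.0, (y0 + y1) / 2.0, 1.0)
    return out
```

Indices 20-24 stay zero, matching the reserved-slot convention in the output-layout section.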
A complete reference post-processing implementation (RTMDet anchor-free decode + NMS, ViTPose heatmap argmax with sub-pixel refine, RTMPose simcc argmax, COCO-17 → DWPose-25 lift, toe synthesis, hand template) is in `specs/proportionchanger_node_spec.md` of the source repository.
## Output layout (after post-processing)
25-point DWPose body skeleton. Indices 0-17 are the standard 18-point OpenPose body, 18 / 19 are the big-toe pair, and 20-24 are reserved slots that consumers should leave at zero.
The pipeline downstream of this model emits a ComfyUI-compatible
POSE_KEYPOINT JSON with normalized 0-1 coordinates and a fixed
canvas_width / canvas_height metadata pair.
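A minimal packer for that JSON could look like this. The field names (`canvas_width`, `canvas_height`, `people`, `pose_keypoints_2d`) follow the common ComfyUI OpenPose convention and are an assumption here; check them against the consumer node.

```python
import numpy as np


def to_pose_keypoint(kps25, canvas_width, canvas_height):
    """Pack a (25, 3) (x, y, score) array into a POSE_KEYPOINT-style dict:
    a flat [x, y, c, ...] list under "people", with coordinates normalised
    to 0-1 by the fixed canvas size."""
    norm = kps25.astype(np.float64).copy()
    norm[:, 0] /= canvas_width
    norm[:, 1] /= canvas_height
    return {
        "canvas_width": canvas_width,
        "canvas_height": canvas_height,
        "people": [
            {"pose_keypoints_2d": [round(float(v), 6) for v in norm.reshape(-1)]}
        ],
    }
```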
## Training data
- Roughly 600 frames sampled from short videos of six original mascot characters owned by the author (used with permission).
- Bounding-box and keypoint labels were generated by GPT-5.5 Vision and reviewed through lightweight HTML / tkinter editors before training.
- ViTPose-L was fine-tuned with COCO-17 labels (dropping the synthetic neck / toes so the official ViTPose-L weights load with `strict=True`).
- RTMPose-s was fine-tuned with the full 25-point DWPose layout.
- The raw images and videos are not redistributed.
## Usage (onnxruntime – full pipeline sketch)
```python
import json

import cv2
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download

local = snapshot_download(
    repo_id="grmchn/mascot-openpose-detect-test",
    allow_patterns=["bbox/*", "keypoint/vitpose_l/*"],
)

bbox_session = ort.InferenceSession(
    f"{local}/bbox/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
kp_session = ort.InferenceSession(
    f"{local}/keypoint/vitpose_l/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

with open(f"{local}/bbox/decode_params.json") as f:
    decode_params = json.load(f)
with open(f"{local}/bbox/classes.json") as f:
    classes_map = json.load(f)
with open(f"{local}/keypoint/vitpose_l/meta.json") as f:
    kp_meta = json.load(f)

# 1. Stage 1: letterbox -> normalise -> RTMDet -> decode + NMS -> {class: bbox}
# 2. Stage 2: top-down crop full/body bbox -> ViTPose -> heatmap argmax
# 3. coco17 -> dwpose25 lift, toe synthesis, hand template (see spec doc)
```
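Step 2's heatmap decode could be sketched as follows. The quarter-pixel shift toward the stronger neighbour is the common heatmap refinement used by top-down pose pipelines, assumed here; the reference implementation is in the spec doc.

```python
import numpy as np


def heatmap_to_keypoints(heatmaps, crop_w=192, crop_h=256):
    """Decode a (1, 17, 64, 48) ViTPose heatmap: per-joint argmax plus a
    quarter-pixel shift toward the stronger neighbour, then scale heatmap
    coordinates (48x64) up to the 192x256 input crop."""
    hm = heatmaps[0]                          # (17, 64, 48)
    k, h, w = hm.shape
    kps = np.zeros((k, 3), dtype=np.float32)
    for j in range(k):
        y, x = divmod(int(hm[j].argmax()), w)
        fx, fy = float(x), float(y)
        # Sub-pixel refine: shift 0.25 px toward the larger neighbour.
        if 0 < x < w - 1:
            fx += 0.25 * np.sign(hm[j, y, x + 1] - hm[j, y, x - 1])
        if 0 < y < h - 1:
            fy += 0.25 * np.sign(hm[j, y + 1, x] - hm[j, y - 1, x])
        kps[j] = (fx * crop_w / w, fy * crop_h / h, hm[j, y, x])
    return kps
```

The returned coordinates are still in the crop frame; mapping back to the full image reverses the Stage-1 crop transform.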
## Caveats
- Six characters only: the models are fine-tuned on just six original characters. Recall on visually different mascots is noticeably worse – this is a known-limited baseline that the upcoming generic release is meant to fix.
- Hand and foot boxes are deliberately drawn to the outer silhouette for characters whose fingers or toes are not rendered. This keeps Stage 2 keypoint placement and the hand-template fitter stable, but it means the boxes are not anatomically tight.
- Hand keypoints are deliberately not produced by the keypoint model. The pipeline computes them separately using the shoulder-elbow-wrist vector and a neutral 21-point hand template scaled to the bounding box from Stage 1.
- ONNX export uses static input shapes. Batched inference requires re-exporting with dynamic axes.
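The hand-template fit mentioned above can be sketched roughly like this. `template21` is a hypothetical canonical template (wrist at the origin, fingers along +x, unit span); the actual template and scaling rule live in the spec doc.

```python
import numpy as np


def place_hand_template(template21, elbow, wrist, hand_bbox):
    """Place a neutral 21-point hand template: rotate it to the elbow->wrist
    direction, scale it to the Stage-1 hand bbox, centre it on that bbox.
    template21: (21, 2) canonical points, wrist at origin, fingers along +x."""
    direction = np.asarray(wrist, float) - np.asarray(elbow, float)
    angle = np.arctan2(direction[1], direction[0])
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    x0, y0, x1, y1 = hand_bbox
    scale = max(x1 - x0, y1 - y0)
    centre = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
    return template21 @ rot.T * scale + centre
```

Because the boxes are drawn to the outer silhouette (see above), the bbox-derived scale stays stable even for characters without rendered fingers.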
## License
CC-BY-NC 4.0 β non-commercial use only. See the LICENSE file in
this repo. Per-component breakdown above clarifies which parts (RTMDet
bbox + RTMPose-s) would be Apache 2.0 if shipped separately; the future
grmchn/mascot-openpose-detect (without the MAE-derived ViTPose-L) will
ship as Apache 2.0.