mascot-openpose-detect-test

⚠️ TEST RELEASE: non-commercial, scheduled for deletion. This package is fine-tuned on only six original mascot characters owned by the author and is intended as an internal baseline for an upcoming generic mascot OpenPose detector. It will be removed from HuggingFace once the generic successor is published. Use only for evaluation / reproducibility of the related research notes; do not assume it generalises to other mascot characters.

Two-stage OpenPose keypoint detector for chibi / kemono mascot characters. Stage 1 is a 7-class bounding-box detector (RTMDet-tiny); Stage 2 is a body-keypoint regressor (ViTPose-L 17-kp head, with an optional RTMPose-s 25-kp variant). Hand keypoints are filled in by a separate template fitter on the consumer side. All models were fine-tuned on a small dataset distilled from GPT-5.5 Vision and exported to ONNX for portable inference (CUDA / CPU; no mmdet / mmpose runtime dependency).

The package is the "test" companion of an in-progress generic mascot OpenPose detector trained on a much larger and more diverse character pool. Once that generic model is ready, the production-grade weights will live at grmchn/mascot-openpose-detect (Apache 2.0) and this test repo will be deleted.

License (non-commercial)

The package as a whole is licensed CC-BY-NC 4.0 because it includes ViTPose-L weights whose pretraining chain (Facebook MAE, CC-BY-NC 4.0) imposes the non-commercial constraint. Per-component license map:

| Component | Upstream license | Effective on this package |
|---|---|---|
| RTMDet-tiny code & ImageNet/COCO pretrain | Apache 2.0 (mmdetection) | Apache 2.0 |
| RTMPose-s code & COCO pretrain | Apache 2.0 (mmpose) | Apache 2.0 |
| ViTPose code | Apache 2.0 | Apache 2.0 |
| Facebook MAE pretrained backbone (used by ViTPose-L) | CC-BY-NC 4.0 | CC-BY-NC 4.0 |
| AI Challenger / MPII training data (referenced by ViTPose) | Non-commercial / BSD | Inherited as non-commercial |
| Author's fine-tune contributions | CC-BY-NC 4.0 | CC-BY-NC 4.0 |

If you only need the bbox detector and the rtmpose_s keypoint variant, the weights you actually load are Apache 2.0 derivatives end-to-end. The repo as a whole is marked CC-BY-NC only because shipping ViTPose-L alongside them makes the bundle non-commercial. The future generic-mascot release will avoid the MAE-pretrained ViTPose-L and ship under Apache 2.0.

Sister project

The planned generic successor, grmchn/mascot-openpose-detect (Apache 2.0, trained on a much larger and more diverse character pool), will supersede this test release once published.

Repository contents

```
grmchn/mascot-openpose-detect-test/
├── bbox/
│   ├── model.onnx           # RTMDet-tiny, static 1×3×640×640, 6 outputs
│   ├── classes.json         # idx -> class name (7 classes)
│   └── decode_params.json   # normalisation / strides / NMS params
├── keypoint/
│   ├── vitpose_l/           # accuracy-first (CC-BY-NC due to MAE backbone)
│   │   ├── model.onnx       # ViTPose-L heatmap, 1×3×256×192 -> 1×17×64×48
│   │   └── meta.json
│   └── rtmpose_s/           # speed-first, Apache-2.0-derivative weights
│       ├── model.onnx       # RTMPose-s simcc, 1×3×256×192 -> simcc_x/simcc_y
│       └── meta.json
├── README.md                # this file
└── LICENSE                  # CC-BY-NC 4.0 (package-level)
```

`snapshot_download(repo_id="grmchn/mascot-openpose-detect-test", allow_patterns=...)` lets consumers fetch only the parts they need:

| Use case | `allow_patterns` |
|---|---|
| Full ComfyUI pipeline (default) | `["bbox/*", "keypoint/vitpose_l/*"]` |
| Speed-optimized (Apache 2.0 derivatives only) | `["bbox/*", "keypoint/rtmpose_s/*"]` |
| BBox-only (Stage 1 standalone) | `["bbox/*"]` |
| Browser inference (onnxruntime-web) | `["bbox/*", "keypoint/rtmpose_s/*"]` |

Stage 1: BBox detector (bbox/)

Seven-class bounding-box detector. The detector emits at most one box per class (any of which may be absent when occluded or off-canvas):

| index | name | notes |
|---|---|---|
| 0 | full | Full character body |
| 1 | head | Head including ears / hair |
| 2 | body | Torso (clothing included) |
| 3 | hand_left | Left hand from the character's perspective |
| 4 | hand_right | Right hand from the character's perspective |
| 5 | foot_left | Left foot or shoe equivalent |
| 6 | foot_right | Right foot or shoe equivalent |

Left / right follow the OpenPose anatomical convention (character's own left / right, not the screen's left / right).

ONNX I/O:

  • Input: (1, 3, 640, 640) float32, RGB, ImageNet-normalised
  • Outputs: cls_0, cls_1, cls_2 and bbox_0, bbox_1, bbox_2 (multi-scale anchor-free decode at strides 8 / 16 / 32; the consumer runs decode + per-class NMS, see decode_params.json).
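As a sketch of what the consumer-side decode involves: the snippet below assumes one (C, H, W) logit / (4, H, W) distance pair per stride, sigmoid scoring, and l/t/r/b pixel offsets from each cell centre. These conventions are illustrative assumptions; the authoritative parameters live in decode_params.json. Because the detector emits at most one box per class, a per-class argmax can stand in for full NMS here.

```python
import numpy as np

def decode_level(cls_logits, bbox_pred, stride):
    """Decode one FPN level of an anchor-free RTMDet-style head.
    cls_logits: (C, H, W) raw logits; bbox_pred: (4, H, W) l/t/r/b distances
    in input pixels. Layout is an assumption; see decode_params.json."""
    C, H, W = cls_logits.shape
    scores = 1.0 / (1.0 + np.exp(-cls_logits))            # sigmoid
    ys, xs = np.mgrid[0:H, 0:W]
    cx, cy = (xs + 0.5) * stride, (ys + 0.5) * stride     # cell centres
    l, t, r, b = bbox_pred
    boxes = np.stack([cx - l, cy - t, cx + r, cy + b], -1)
    return boxes.reshape(-1, 4), scores.reshape(C, -1).T  # (HW,4), (HW,C)

def per_class_best(boxes, scores, score_thr=0.3):
    """At most one box per class, so a per-class argmax stands in for NMS."""
    out = {}
    for c in range(scores.shape[1]):
        i = int(scores[:, c].argmax())
        if scores[i, c] > score_thr:
            out[c] = (boxes[i], float(scores[i, c]))
    return out
```

In practice you would run `decode_level` once per stride (8 / 16 / 32), concatenate the results, and then pick per-class winners.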

Stage 2: Keypoint regressor (keypoint/)

Two interchangeable variants:

| Subpath | Architecture | Head output | Size | Use when |
|---|---|---|---|---|
| `vitpose_l/` | ViTPose-L (HeatmapHead, COCO-17) | heatmap 1×17×64×48 | ~330 MB (fp32 ~1.2 GB) | Accuracy-first, GPU available, non-commercial only |
| `rtmpose_s/` | RTMPose-s (RTMCCHead, DWPose-25) | simcc: simcc_x 1×25×384, simcc_y 1×25×512 | ~22 MB | Speed-first, CPU / onnxruntime-web, Apache 2.0 derivative |
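For the rtmpose_s variant, a minimal SimCC decode might look like the sketch below (plain argmax over the bin axis, no sub-bin refinement; the bin counts match the shapes above, and the confidence heuristic is an assumption):

```python
import numpy as np

def simcc_decode(simcc_x, simcc_y, crop_w=192, crop_h=256):
    """Decode RTMPose SimCC logits (1x25x384 / 1x25x512, i.e. a 2x bin split
    over the 192x256 crop) into (25, 3) keypoints in crop pixels."""
    sx, sy = simcc_x[0], simcc_y[0]                  # (25, Wbins), (25, Hbins)
    xs = sx.argmax(axis=1) * crop_w / sx.shape[1]    # bin index -> pixels
    ys = sy.argmax(axis=1) * crop_h / sy.shape[1]
    conf = np.minimum(sx.max(axis=1), sy.max(axis=1))  # crude confidence proxy
    return np.stack([xs, ys, conf], axis=1)
```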

ViTPose-L emits 17 COCO-style keypoints; the consumer post-processes them into the 25-point DWPose layout (synthesises neck from shoulders, big-toes from foot bbox, reserved indices stay zero). RTMPose-s emits the 25-point DWPose layout directly, but its big-toe predictions on mascots are unreliable, so consumers typically overwrite indices 18-19 from the foot bbox center.
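The COCO-17 to DWPose-25 lift described above can be sketched roughly as follows. The index permutation is the standard COCO-to-OpenPose mapping; the toe heuristic (foot-bbox centre), the left/right toe ordering, and the confidence handling are assumptions here, with the reference implementation in the spec doc.

```python
import numpy as np

# COCO-17 index feeding each OpenPose-18 slot; slot 1 (neck) is synthesised.
COCO_TO_OP18 = [0, -1, 6, 8, 10, 5, 7, 9, 12, 14, 16, 11, 13, 15, 2, 1, 4, 3]

def lift_coco17_to_dwpose25(kps17, foot_boxes=None):
    """Lift (17, 3) COCO keypoints into the 25-point DWPose body layout:
    neck = shoulder midpoint, indices 18/19 = big toes (synthesised from
    optional per-foot bboxes), 20-24 left at zero."""
    out = np.zeros((25, 3), dtype=np.float32)
    for op_i, coco_i in enumerate(COCO_TO_OP18):
        if coco_i >= 0:
            out[op_i] = kps17[coco_i]
    l_sh, r_sh = kps17[5], kps17[6]
    if l_sh[2] > 0 and r_sh[2] > 0:                   # neck from shoulders
        out[1] = [(l_sh[0] + r_sh[0]) / 2, (l_sh[1] + r_sh[1]) / 2,
                  min(l_sh[2], r_sh[2])]
    if foot_boxes:                                    # toes from foot bboxes
        for idx, box in zip((18, 19), foot_boxes):    # (x1, y1, x2, y2) or None
            if box is not None:
                x1, y1, x2, y2 = box
                out[idx] = [(x1 + x2) / 2, (y1 + y2) / 2, 1.0]
    return out
```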

A complete reference post-processing implementation (RTMDet anchor-free decode + NMS, ViTPose heatmap argmax with sub-pixel refine, RTMPose simcc argmax, COCO-17 β†’ DWPose-25 lift, toe synthesis, hand template) is in specs/proportionchanger_node_spec.md of the source repository.
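As one illustration of those steps, the ViTPose heatmap decode can be done with an argmax plus a quarter-pixel refinement toward the larger neighbour. This is a common heatmap decode, not necessarily the exact one in the spec:

```python
import numpy as np

def heatmap_to_keypoints(heatmaps, crop_w=192, crop_h=256):
    """Argmax + quarter-pixel refinement over a (K, Hh, Wh) heatmap stack
    (17x64x48 for the ViTPose-L head). Returns (K, 3) = x, y, confidence
    in crop-pixel coordinates."""
    K, Hh, Wh = heatmaps.shape
    kps = np.zeros((K, 3), dtype=np.float32)
    for k in range(K):
        hm = heatmaps[k]
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        fx, fy = float(x), float(y)
        if 0 < x < Wh - 1:                            # sub-pixel refine in x
            fx += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
        if 0 < y < Hh - 1:                            # sub-pixel refine in y
            fy += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
        kps[k] = (fx * crop_w / Wh, fy * crop_h / Hh, hm[y, x])
    return kps
```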

Output layout (after post-processing)

25-point DWPose body skeleton. Indices 0-17 are the standard 18-point OpenPose body, 18 / 19 are the big-toe pair, and 20-24 are reserved slots that consumers should leave at zero.

The pipeline downstream of this model emits a ComfyUI-compatible POSE_KEYPOINT JSON with normalized 0-1 coordinates and a fixed canvas_width / canvas_height metadata pair.
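A minimal packing step for that output might look like the following. The field names follow the common OpenPose/ComfyUI POSE_KEYPOINT convention, and the default canvas size is a placeholder, so treat both as assumptions:

```python
def to_pose_keypoint(kps25, img_w, img_h, canvas_w=512, canvas_h=768):
    """Pack a (25, 3) pixel-space skeleton into an OpenPose-style
    POSE_KEYPOINT dict with 0-1 normalised coordinates and fixed canvas
    metadata (x, y, confidence flattened per point)."""
    flat = []
    for x, y, c in kps25:
        flat += [float(x) / img_w, float(y) / img_h, float(c)]
    return {
        "people": [{"pose_keypoints_2d": flat}],
        "canvas_width": canvas_w,
        "canvas_height": canvas_h,
    }
```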

Training data

  • Roughly 600 frames sampled from short videos of six original mascot characters owned by the author (used with permission).
  • Bounding-box and keypoint labels were generated by GPT-5.5 Vision and reviewed through lightweight HTML / tkinter editors before training.
  • ViTPose-L was fine-tuned with COCO-17 labels (drops the synthetic neck / toes so the official ViTPose-L weights load with `strict=True`).
  • RTMPose-s was fine-tuned with the full 25-point DWPose layout.
  • The raw images and videos are not redistributed.

Usage (onnxruntime, full pipeline sketch)

```python
import json
import cv2
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download

local = snapshot_download(
    repo_id="grmchn/mascot-openpose-detect-test",
    allow_patterns=["bbox/*", "keypoint/vitpose_l/*"],
)
bbox_session = ort.InferenceSession(
    f"{local}/bbox/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
kp_session = ort.InferenceSession(
    f"{local}/keypoint/vitpose_l/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
with open(f"{local}/bbox/decode_params.json") as f:
    decode_params = json.load(f)
with open(f"{local}/bbox/classes.json") as f:
    classes_map = json.load(f)
with open(f"{local}/keypoint/vitpose_l/meta.json") as f:
    kp_meta = json.load(f)

# 1. Stage 1: letterbox -> normalise -> RTMDet -> decode + NMS -> {class: bbox}
# 2. Stage 2: top-down crop full/body bbox -> ViTPose -> heatmap argmax
# 3. coco17 -> dwpose25 lift, toe synthesis, hand template (see spec doc)
```
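Step 1's letterbox could be implemented along these lines. The nearest-neighbour resize and grey pad value are illustrative shortcuts (use cv2.resize in practice), and the ImageNet mean/std are the common defaults; the authoritative constants are in decode_params.json.

```python
import numpy as np

def letterbox(img, size=640):
    """Resize-with-padding to the detector's static 640x640 input, then
    ImageNet-normalise. Returns the NCHW tensor plus the scale needed to
    map boxes back to source pixels."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    yi = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xi = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)   # grey padding
    canvas[:nh, :nw] = img[yi[:, None], xi[None, :]]         # top-left anchored
    mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
    std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
    x = (canvas.astype(np.float32) - mean) / std
    return x.transpose(2, 0, 1)[None], scale                 # (1, 3, 640, 640)
```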

Caveats

  • Six characters only: the models are fine-tuned on exactly six original characters. Recall on visually different mascots is noticeably worse; this is a known-bad baseline that the upcoming generic release is meant to fix.
  • Hand and foot boxes are deliberately drawn to the outer silhouette for characters whose fingers or toes are not rendered. This keeps Stage 2 keypoint placement and the hand-template fitter stable, but it means the boxes are not anatomically tight.
  • Hand keypoints are deliberately not produced by the keypoint model. The pipeline computes them separately using the shoulder-elbow-wrist vector and a neutral 21-point hand template scaled to the bounding box from Stage 1.
  • ONNX export uses static input shapes. Batched inference requires re-exporting with dynamic axes.
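To illustrate the hand-template idea from the caveats above, here is a toy fitter. The flat 5-ray template and the scaling rule are invented stand-ins for illustration only; the pipeline's real template and fitting logic live in the spec doc.

```python
import numpy as np

def fit_hand_template(elbow, wrist, hand_box):
    """Place a neutral 21-point hand template (wrist + 5 fingers x 4 joints)
    along the elbow->wrist direction, scaled to the Stage-1 hand bbox."""
    # hypothetical neutral template in a unit frame: wrist at the origin,
    # finger rays fanned around +x
    template = np.zeros((21, 2), dtype=np.float32)
    for f in range(5):
        ang = np.deg2rad(-30 + 15 * f)
        d = np.array([np.cos(ang), np.sin(ang)])
        for j in range(4):
            template[1 + f * 4 + j] = d * (0.25 * (j + 1))
    x1, y1, x2, y2 = hand_box
    scale = max(x2 - x1, y2 - y1)                # template unit -> pixels
    fwd = np.asarray(wrist, np.float32) - np.asarray(elbow, np.float32)
    fwd /= (np.linalg.norm(fwd) + 1e-6)          # forearm direction
    rot = np.array([[fwd[0], -fwd[1]],
                    [fwd[1], fwd[0]]])           # rotate +x onto fwd
    return template @ rot.T * scale + np.asarray(wrist, np.float32)
```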

License

CC-BY-NC 4.0, non-commercial use only. See the LICENSE file in this repo. The per-component breakdown above clarifies which parts (RTMDet bbox + RTMPose-s) would be Apache 2.0 if shipped separately; the future grmchn/mascot-openpose-detect (without the MAE-derived ViTPose-L) will ship as Apache 2.0.
