---
license: apache-2.0
tags:
- object-detection
- keypoint-detection
- pose-estimation
- onnx
- dinov2
- vitpose
- rtmdet
- mascot
- chibi
- kemono
library_name: onnx
base_model:
- facebook/dinov2-large
inference: false
---

# mascot-pose-detect

Two-stage mascot pose detector for chibi, kemono, and other stylized mascot characters.

This repository provides ONNX artifacts for portable inference:

1. Stage 1: a 7-class RTMDet-tiny bounding-box detector.
2. Stage 2: a DINOv2-Large backbone with a ViTPose-style COCO-17 heatmap head.

The keypoint model is fine-tuned for stylized mascot bodies whose proportions differ strongly from real human pose datasets. The consumer should lift the COCO-17 keypoints to DWPose-25 / POSE_KEYPOINT format and derive toe points from the foot bounding boxes. Hand keypoints are expected to be generated by a separate hand-template fitter when required.

## License

This model package is released under the Apache License 2.0.

The Stage 2 keypoint model is based on `facebook/dinov2-large`, which is also released under Apache 2.0. It does not use the older MAE-pretrained ViTPose-L checkpoint that constrained the previous test bundle to non-commercial use.

The training annotations and source images are not included in this repository.

## Repository Contents

```text
grmchn/mascot-pose-detect/
├── bbox/
│   ├── model.onnx
│   ├── classes.json
│   └── decode_params.json
└── keypoint/
    ├── dinov2_vitpose_l/
    │   ├── model.onnx
    │   └── meta.json
    └── dinov2_vitpose_l_v2/
        ├── model.onnx
        └── meta.json
```

## Stage 1: BBox Detector

The bbox model detects seven mascot body regions:

| index | name |
|------:|------|
| 0 | `full` |
| 1 | `head` |
| 2 | `body` |
| 3 | `hand_left` |
| 4 | `hand_right` |
| 5 | `foot_left` |
| 6 | `foot_right` |

Left and right follow anatomical / character-view naming. For a front-facing character, the character's right side usually appears on the screen-left side.

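
The class table above can be mirrored in consumer code roughly as follows. This is a minimal sketch, not the repository's own decoder: the detection tuple layout `(class_index, score, box)`, the helper name `best_per_class`, and the `min_score` default are hypothetical, and `bbox/classes.json` remains the authoritative source for the class order.

```python
# Class names in index order, copied from the Stage 1 table.
# Assumption: bbox/classes.json lists the same names in the same order.
CLASS_NAMES = [
    "full",
    "head",
    "body",
    "hand_left",
    "hand_right",
    "foot_left",
    "foot_right",
]


def best_per_class(detections, min_score=0.3):
    """Keep the highest-scoring detection for each class name.

    `detections` is assumed to be an iterable of
    (class_index, score, (x1, y1, x2, y2)) tuples that have already been
    decoded from the raw detector output. `min_score` is an illustrative
    threshold, not a value taken from this repository.
    """
    best = {}
    for class_index, score, box in detections:
        if score < min_score:
            continue
        name = CLASS_NAMES[class_index]
        if name not in best or score > best[name][0]:
            best[name] = (score, box)
    return best
```

Because the names follow the character's own left and right, a consumer that needs screen-side labels has to account for the character's facing direction separately.
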
## Stage 2: Keypoint Detector

The keypoint model input is a top-down crop from the Stage 1 `full` or `body` bbox.

## Model Versions

| version | keypoint path | training run | status |
|---|---|---|---|
| v1 | `keypoint/dinov2_vitpose_l/` | `general_filtered` | Stable baseline release |
| v2 | `keypoint/dinov2_vitpose_l_v2/` | `final_v3_from_final_v2` | Updated model with additional hard-example training |

Both versions use the same architecture, input shape, output heatmap shape, and post-processing contract. Switching from v1 to v2 only requires changing the keypoint variant path from `dinov2_vitpose_l` to `dinov2_vitpose_l_v2`.

| field | value |
|---|---|
| Architecture | `dinov2_vitpose_l` |
| Backbone | `facebook/dinov2-large` |
| Input | `1x3x224x168` NCHW RGB, ImageNet-normalized |
| Output | `heatmap` |
| Keypoint layout | COCO-17 |
| Post-process layout | DWPose-25 / POSE_KEYPOINT-compatible |

See each version's `meta.json` for exact input size, normalization values, output names, and post-processing notes.

## Download

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="grmchn/mascot-pose-detect",
    allow_patterns=[
        "bbox/*",
        "keypoint/dinov2_vitpose_l_v2/*",
    ],
)
```

## Notes

This is not an OpenPose implementation and does not include OpenPose weights. It produces keypoints that can be converted into an OpenPose-compatible JSON schema for downstream tools.

The model was trained for stylized mascot characters. It may not generalize to realistic human photos without additional fine-tuning.
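
For the top-down crop that feeds Stage 2, one common approach is to grow the Stage 1 box until it matches the model's 168:224 width-to-height ratio, then map heatmap peaks back through the same box. The sketch below illustrates that geometry only. Whether this repository's reference pipeline expands, pads, or stretches the crop is not stated here, so treat the `padding` factor and both helper names as assumptions and defer to each version's `meta.json`.

```python
# Input size taken from the documented 1x3x224x168 NCHW shape (H=224, W=168).
INPUT_H, INPUT_W = 224, 168


def fit_box_to_aspect(x1, y1, x2, y2, aspect=INPUT_W / INPUT_H, padding=1.25):
    """Expand a box around its center so that width / height == aspect.

    `padding` enlarges the result to leave a margin around the character.
    The 1.25 default is a convention from other top-down pose pipelines,
    not a value confirmed for this model.
    """
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    if w > aspect * h:
        h = w / aspect  # box is too wide: grow the height
    else:
        w = h * aspect  # box is too tall: grow the width
    w *= padding
    h *= padding
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


def heatmap_to_image(px, py, heat_w, heat_h, crop_box):
    """Map a heatmap cell (px, py) to source-image coordinates.

    Uses the cell center, assuming the crop was resized to the model
    input without further padding. `heat_w` and `heat_h` should be read
    from the actual output tensor shape rather than hard-coded.
    """
    x1, y1, x2, y2 = crop_box
    x = x1 + (px + 0.5) / heat_w * (x2 - x1)
    y = y1 + (py + 0.5) / heat_h * (y2 - y1)
    return (x, y)
```

The expanded box may extend past the image border, so the cropping step needs to pad or clamp before resizing to 168 x 224.
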