---
license: apache-2.0
tags:
- object-detection
- keypoint-detection
- pose-estimation
- onnx
- dinov2
- vitpose
- rtmdet
- mascot
- chibi
- kemono
library_name: onnx
base_model:
- facebook/dinov2-large
inference: false
---
# mascot-pose-detect
Two-stage mascot pose detector for chibi, kemono, and other stylized mascot characters.
This repository provides ONNX artifacts for portable inference:
1. Stage 1: a 7-class RTMDet-tiny bounding-box detector.
2. Stage 2: a DINOv2-Large backbone with a ViTPose-style COCO-17 heatmap head.
The keypoint model is fine-tuned for stylized mascot bodies, whose proportions differ strongly from the human bodies found in real-world pose datasets.
The model itself outputs COCO-17 keypoints only. The consumer is expected to lift them to the DWPose-25 / POSE_KEYPOINT format and to derive toe points from the Stage 1 foot bounding boxes.
Hand keypoints are expected to be generated by a separate hand-template fitter when required.
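
The lifting step could look roughly like the sketch below. It assumes the 25-point target uses the OpenPose BODY_25 ordering, which this README does not confirm, and that each keypoint is an `(x, y, score)` tuple. The helper names are hypothetical. Neck and mid-hip are synthesized as midpoints, a common convention rather than necessarily the one used by this project. Toe and heel slots are left empty because the rule for deriving them from the foot boxes is not specified here.

```python
# Assumed mapping: BODY_25 slot -> COCO-17 index, for slots with a direct match.
BODY25_FROM_COCO17 = {
    0: 0,                      # nose
    2: 6, 3: 8, 4: 10,         # right shoulder, elbow, wrist
    5: 5, 6: 7, 7: 9,          # left shoulder, elbow, wrist
    9: 12, 10: 14, 11: 16,     # right hip, knee, ankle
    12: 11, 13: 13, 14: 15,    # left hip, knee, ankle
    15: 2, 16: 1,              # right eye, left eye
    17: 4, 18: 3,              # right ear, left ear
}


def midpoint(a, b):
    """Average two (x, y, score) points; keep the weaker of the two scores."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0, min(a[2], b[2]))


def lift_coco17_to_body25(coco):
    """Return 25 (x, y, score) tuples; slots 19-24 (toes, heels) stay empty."""
    if len(coco) != 17:
        raise ValueError("expected 17 COCO keypoints")
    out = [(0.0, 0.0, 0.0)] * 25
    for dst, src in BODY25_FROM_COCO17.items():
        out[dst] = tuple(coco[src])
    out[1] = midpoint(coco[5], coco[6])    # neck from both shoulders
    out[8] = midpoint(coco[11], coco[12])  # mid-hip from both hips
    return out
```

A consumer would then fill slots 19 to 24 from the `foot_left` and `foot_right` detections using whatever toe rule the downstream pipeline defines.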
## License
This model package is released under the Apache License 2.0.
The Stage 2 keypoint model is based on `facebook/dinov2-large`, which is also released under Apache 2.0.
It does not use the older MAE-pretrained ViTPose-L checkpoint that constrained the previous test bundle to non-commercial use.
The training annotations and source images are not included in this repository.
## Repository Contents
```text
grmchn/mascot-pose-detect/
├── bbox/
│   ├── model.onnx
│   ├── classes.json
│   └── decode_params.json
└── keypoint/
    ├── dinov2_vitpose_l/
    │   ├── model.onnx
    │   └── meta.json
    └── dinov2_vitpose_l_v2/
        ├── model.onnx
        └── meta.json
```
## Stage 1: BBox Detector
The bbox model detects seven mascot body regions:
| index | name |
|------:|------|
| 0 | `full` |
| 1 | `head` |
| 2 | `body` |
| 3 | `hand_left` |
| 4 | `hand_right` |
| 5 | `foot_left` |
| 6 | `foot_right` |
Left and right follow anatomical / character-view naming.
For a front-facing character, the character's right side usually appears on the screen-left side.
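
As a small illustration of the naming convention, the sketch below mirrors the class table in a list and reports on which side of the image a left/right class would typically appear. `CLASS_NAMES` and `expected_screen_side` are hypothetical names. The shipped `bbox/classes.json` remains the authoritative class list, and the front-facing rule is only the usual case described above.

```python
# Mirrors the class table; bbox/classes.json is the authoritative source.
CLASS_NAMES = [
    "full", "head", "body",
    "hand_left", "hand_right",
    "foot_left", "foot_right",
]


def expected_screen_side(class_name, front_facing=True):
    """Return 'left' or 'right' (screen side), or None for unsided classes."""
    if class_name.endswith("_left"):
        side = "left"
    elif class_name.endswith("_right"):
        side = "right"
    else:
        return None
    if front_facing:
        # A front-facing character is mirrored relative to the viewer.
        return "right" if side == "left" else "left"
    return side
```

For example, `expected_screen_side("hand_right")` returns `"left"` for a front-facing character.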
## Stage 2: Keypoint Detector
The keypoint model input is a top-down crop from the Stage 1 `full` or `body` bbox.
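
One way to prepare such a crop is to grow the Stage 1 box until it matches the model's 168x224 (width x height) aspect ratio before resizing. The sketch below is a minimal version of that idea. The function name and the `padding` factor of 1.25 are assumptions borrowed from common top-down pose pipelines, not values taken from this repository, so check `meta.json` for the actual crop rule.

```python
def expand_bbox_to_aspect(x1, y1, x2, y2, target_w=168, target_h=224, padding=1.25):
    """Grow a box around its center to the target aspect ratio, then pad it.

    Returns (x1, y1, x2, y2) as floats; the result may extend past the image,
    so the caller still has to clamp or pad when cropping.
    """
    cx = (x1 + x2) / 2.0
    cy = (y1 + y2) / 2.0
    w = float(x2 - x1)
    h = float(y2 - y1)
    aspect = target_w / target_h  # 0.75 for a 168x224 input
    if w > h * aspect:
        h = w / aspect  # box is too wide: increase the height
    else:
        w = h * aspect  # box is too tall: increase the width
    w *= padding  # padding is an assumed margin, not a documented value
    h *= padding
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

The expanded box can then be cropped and resized to 168x224 without distorting the character.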
## Model Versions
| version | keypoint path | training run | status |
|---|---|---|---|
| v1 | `keypoint/dinov2_vitpose_l/` | `general_filtered` | Stable baseline release |
| v2 | `keypoint/dinov2_vitpose_l_v2/` | `final_v3_from_final_v2` | Updated model with additional hard-example training |
Both versions use the same architecture, input shape, output heatmap shape, and post-processing contract.
Switching from v1 to v2 only requires changing the keypoint variant path from `dinov2_vitpose_l` to `dinov2_vitpose_l_v2`.
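
A minimal sketch of that switch, assuming the repository has been downloaded to a local directory with the layout shown under Repository Contents. `KEYPOINT_VARIANTS` and `keypoint_files` are hypothetical helper names.

```python
from pathlib import Path

# Version label -> directory name under keypoint/, as listed in the table above.
KEYPOINT_VARIANTS = {
    "v1": "dinov2_vitpose_l",
    "v2": "dinov2_vitpose_l_v2",
}


def keypoint_files(local_dir, version="v2"):
    """Return (model.onnx path, meta.json path) for the chosen version."""
    variant_dir = Path(local_dir) / "keypoint" / KEYPOINT_VARIANTS[version]
    return variant_dir / "model.onnx", variant_dir / "meta.json"
```

Because both versions share the same contract, the rest of the inference code can stay unchanged when `version` changes.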
| field | value |
|---|---|
| Architecture | `dinov2_vitpose_l` |
| Backbone | `facebook/dinov2-large` |
| Input | `1x3x224x168` NCHW RGB, ImageNet-normalized |
| Output | `heatmap` |
| Keypoint layout | COCO-17 |
| Post-process layout | DWPose-25 / POSE_KEYPOINT-compatible |
See each version's `meta.json` for exact input size, normalization values, output names, and post-processing notes.
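
The input and output contract can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the reference implementation. It uses the standard ImageNet mean and standard deviation on pixels scaled to the 0 to 1 range, reads the heatmap resolution from the array shape, and takes a plain argmax per keypoint without sub-pixel refinement. The function names are hypothetical, and the values in `meta.json` take precedence wherever they differ.

```python
import numpy as np

# Standard ImageNet statistics (assumed; confirm against meta.json).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_model_input(crop_rgb):
    """Convert a uint8 RGB crop of shape (224, 168, 3) to 1x3x224x168 float32."""
    if crop_rgb.shape != (224, 168, 3):
        raise ValueError("expected an RGB crop of height 224 and width 168")
    x = crop_rgb.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    # HWC -> CHW, then add the batch dimension.
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None])


def decode_heatmaps(heatmap, input_w=168, input_h=224):
    """Argmax-decode a (1, K, Hh, Wh) heatmap to K rows of (x, y, score).

    Coordinates are in input-crop pixels, taken at the center of the peak cell.
    """
    _, k, hh, wh = heatmap.shape
    flat = heatmap[0].reshape(k, -1)
    idx = flat.argmax(axis=1)
    scores = flat[np.arange(k), idx]
    ys, xs = np.divmod(idx, wh)
    px = (xs + 0.5) * input_w / wh
    py = (ys + 0.5) * input_h / hh
    return np.stack([px, py, scores], axis=1)
```

The decoded coordinates are relative to the crop, so they still need to be mapped back through the crop box to reach original image coordinates.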
## Download
```python
from huggingface_hub import snapshot_download

# Fetch only the bbox detector and the v2 keypoint model.
local_dir = snapshot_download(
    repo_id="grmchn/mascot-pose-detect",
    allow_patterns=[
        "bbox/*",
        "keypoint/dinov2_vitpose_l_v2/*",
    ],
)
```
## Notes
This is not an OpenPose implementation and does not include OpenPose weights.
It produces keypoints that can be converted into an OpenPose-compatible JSON schema for downstream tools.
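
As a rough sketch of that conversion, the helper below serializes a list of `(x, y, score)` keypoints into the `people[].pose_keypoints_2d` layout used by OpenPose JSON output. The function name is hypothetical, and downstream tools may expect extra fields (for example canvas size or hand and face arrays) that are not covered here.

```python
import json


def to_openpose_json(keypoints):
    """Serialize (x, y, score) keypoints for one character as OpenPose-style JSON."""
    flat = []
    for x, y, score in keypoints:
        flat.extend([float(x), float(y), float(score)])
    # "version" follows the value written by OpenPose itself; adjust as needed.
    document = {"version": 1.3, "people": [{"pose_keypoints_2d": flat}]}
    return json.dumps(document)
```

Each keypoint contributes three consecutive numbers, so 25 keypoints yield a flat list of 75 values.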
The model was trained for stylized mascot characters.
It may not generalize to realistic human photos without additional fine-tuning.