---
license: apache-2.0
tags:
  - object-detection
  - keypoint-detection
  - pose-estimation
  - onnx
  - dinov2
  - vitpose
  - rtmdet
  - mascot
  - chibi
  - kemono
library_name: onnx
base_model:
  - facebook/dinov2-large
inference: false
---

# mascot-pose-detect

Two-stage mascot pose detector for chibi, kemono, and other stylized mascot characters.

This repository provides ONNX artifacts for portable inference:

1. Stage 1: a 7-class RTMDet-tiny bounding-box detector.
2. Stage 2: a DINOv2-Large backbone with a ViTPose-style COCO-17 heatmap head.

The keypoint model is fine-tuned for stylized mascot bodies whose proportions differ strongly from real human pose datasets.
Downstream code is expected to lift the COCO-17 keypoints to the DWPose-25 / POSE_KEYPOINT format and to derive toe points from the Stage 1 foot bounding boxes.
This model does not output hand keypoints; generate them with a separate hand-template fitter when required.

## License

This model package is released under the Apache License 2.0.

The Stage 2 keypoint model is based on `facebook/dinov2-large`, which is also released under Apache 2.0.
It does not use the older MAE-pretrained ViTPose-L checkpoint that constrained the previous test bundle to non-commercial use.

The training annotations and source images are not included in this repository.

## Repository Contents

```text
grmchn/mascot-pose-detect/
├── bbox/
│   ├── model.onnx
│   ├── classes.json
│   └── decode_params.json
└── keypoint/
    ├── dinov2_vitpose_l/
    │   ├── model.onnx
    │   └── meta.json
    └── dinov2_vitpose_l_v2/
        ├── model.onnx
        └── meta.json
```

## Stage 1: BBox Detector

The bbox model detects seven mascot body regions:

| index | name |
|------:|------|
| 0 | `full` |
| 1 | `head` |
| 2 | `body` |
| 3 | `hand_left` |
| 4 | `hand_right` |
| 5 | `foot_left` |
| 6 | `foot_right` |

Left and right are named from the character's own point of view (anatomical naming), not the viewer's.
For a front-facing character, the character's right side, and therefore `hand_right` and `foot_right`, usually appears on the left side of the image.
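
As an illustration of how the class indices above might be consumed, the following sketch keeps the top-scoring box per class from a list of already-decoded detections. The `(box, score, label)` tuple layout, the `best_box_per_class` helper, and the `min_score` threshold are assumptions made for this example; `bbox/classes.json` and `bbox/decode_params.json` remain the authoritative description of the detector output.

```python
# Class order copied from the table above; classes.json is the authoritative source.
CLASS_NAMES = [
    "full", "head", "body",
    "hand_left", "hand_right",
    "foot_left", "foot_right",
]


def best_box_per_class(detections, min_score=0.3):
    """Keep the highest-scoring detection for each class.

    detections: iterable of (box, score, label) where box is (x1, y1, x2, y2)
                in image pixels and label is an integer class index.
                This layout is an assumption for the sketch.
    min_score:  illustrative threshold, not a value taken from the repository.
    Returns a dict mapping class name -> (box, score).
    """
    best = {}
    for box, score, label in detections:
        if score < min_score:
            continue
        name = CLASS_NAMES[label]
        if name not in best or score > best[name][1]:
            best[name] = (tuple(box), score)
    return best
```

A caller would typically read `best["full"]` (or `best["body"]`) to get the Stage 2 crop region, and `best.get("foot_left")` / `best.get("foot_right")` when deriving toe points.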

## Stage 2: Keypoint Detector

The keypoint model input is a top-down crop from the Stage 1 `full` or `body` bbox.
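
One common way to build a top-down crop is to grow the bbox around its center until it matches the model's input aspect ratio, then add some margin. The sketch below does that for a 168x224 (width x height) input. The `bbox_to_crop_rect` name and the `padding` factor are hypothetical; the preprocessing actually used for training may differ, so check `meta.json` before relying on it.

```python
def bbox_to_crop_rect(box, input_w=168, input_h=224, padding=1.25):
    """Expand a bbox to the input aspect ratio, keeping its center fixed.

    box:     (x1, y1, x2, y2) in image pixels.
    padding: extra margin multiplier; 1.25 is an illustrative default,
             not a value taken from the repository.
    Returns (x1, y1, x2, y2) of the crop rectangle. The rectangle may extend
    past the image borders, in which case the caller has to pad the image.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        raise ValueError(f"degenerate box: {box}")
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    aspect = input_w / input_h  # width / height of the model input
    if w > h * aspect:
        h = w / aspect  # box is too wide: grow the height
    else:
        w = h * aspect  # box is too tall: grow the width
    w, h = w * padding, h * padding
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

The resulting rectangle is then resized to 168x224 pixels before normalization.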

## Model Versions

| version | keypoint path | training run | status |
|---|---|---|---|
| v1 | `keypoint/dinov2_vitpose_l/` | `general_filtered` | Stable baseline release |
| v2 | `keypoint/dinov2_vitpose_l_v2/` | `final_v3_from_final_v2` | Updated model with additional hard-example training |

Both versions use the same architecture, input shape, output heatmap shape, and post-processing contract.
Switching from v1 to v2 only requires changing the keypoint variant path from `dinov2_vitpose_l` to `dinov2_vitpose_l_v2`.

| field | value |
|---|---|
| Architecture | `dinov2_vitpose_l` |
| Backbone | `facebook/dinov2-large` |
| Input | `1x3x224x168` NCHW RGB, ImageNet-normalized |
| Output | `heatmap` |
| Keypoint layout | COCO-17 |
| Post-process layout | DWPose-25 / POSE_KEYPOINT-compatible |

See each version's `meta.json` for exact input size, normalization values, output names, and post-processing notes.
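
The input and output contract in the table can be sketched with NumPy as follows. This assumes the commonly used ImageNet mean and standard deviation, a crop that has already been resized to 224x168 (height x width), and a plain per-channel argmax decode with no sub-pixel refinement. The function names are hypothetical, and the values in `meta.json` take precedence over anything shown here.

```python
import numpy as np

# Widely used ImageNet statistics (RGB, 0..1 scale); verify against meta.json.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess_crop(crop_rgb):
    """Turn a (224, 168, 3) uint8 RGB crop into a (1, 3, 224, 168) float32 tensor."""
    if crop_rgb.shape != (224, 168, 3):
        raise ValueError(f"expected (224, 168, 3), got {crop_rgb.shape}")
    x = crop_rgb.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    # HWC -> CHW, then add the batch dimension.
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None])


def decode_heatmaps(heatmaps, input_w=168, input_h=224):
    """Pick the peak of each heatmap channel.

    heatmaps: (1, K, Hh, Wh) array, one channel per keypoint.
    Returns a (K, 3) float32 array of [x, y, score] in input-crop pixels,
    using heatmap cell centers (an assumption of this sketch).
    """
    _, k, hh, wh = heatmaps.shape
    flat = heatmaps[0].reshape(k, -1)
    idx = flat.argmax(axis=1)
    scores = flat[np.arange(k), idx]
    ys, xs = np.divmod(idx, wh)
    x = (xs + 0.5) * (input_w / wh)
    y = (ys + 0.5) * (input_h / hh)
    return np.stack([x, y, scores], axis=1).astype(np.float32)
```

Coordinates returned by `decode_heatmaps` are relative to the crop, so they still need to be mapped back to the original image using the crop rectangle.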

## Download

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="grmchn/mascot-pose-detect",
    allow_patterns=[
        "bbox/*",
        "keypoint/dinov2_vitpose_l_v2/*",
    ],
)
```
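
After downloading, the files for a given keypoint version can be located from the repository layout shown earlier. The helper below is a minimal sketch; `keypoint_files` is a hypothetical name, and the code makes no assumption about which keys `meta.json` contains.

```python
import json
from pathlib import Path


def keypoint_files(local_dir, variant="dinov2_vitpose_l_v2"):
    """Return (model_path, meta) for one keypoint variant.

    local_dir: directory returned by snapshot_download.
    variant:   "dinov2_vitpose_l" (v1) or "dinov2_vitpose_l_v2" (v2).
    """
    root = Path(local_dir) / "keypoint" / variant
    model_path = root / "model.onnx"
    meta_path = root / "meta.json"
    for path in (model_path, meta_path):
        if not path.is_file():
            raise FileNotFoundError(f"missing file: {path}")
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return model_path, meta
```

The returned `model_path` can then be passed to an ONNX runtime of your choice, for example `onnxruntime.InferenceSession(str(model_path))`.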

## Notes

This is not an OpenPose implementation and does not include OpenPose weights.
It produces keypoints that can be converted into an OpenPose-compatible JSON schema for downstream tools.
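
As a rough illustration of that conversion, the sketch below serializes keypoints that have already been lifted to the target layout into a dictionary with a `people` list and flat `pose_keypoints_2d` arrays of `x, y, confidence` triples. The `to_openpose_dict` helper is hypothetical, and the `canvas_width` / `canvas_height` fields are an assumption about what POSE_KEYPOINT-style consumers expect; verify the exact schema against the downstream tool.

```python
import json


def to_openpose_dict(people, canvas_width, canvas_height):
    """Build an OpenPose-style document from per-person keypoints.

    people: list of persons, each a list of (x, y, confidence) tuples that
            are already ordered in the layout the consumer expects.
    canvas_width / canvas_height: image size in pixels (assumed to be wanted
            by POSE_KEYPOINT-style consumers).
    """
    return {
        "people": [
            {"pose_keypoints_2d": [float(v) for kp in person for v in kp]}
            for person in people
        ],
        "canvas_width": int(canvas_width),
        "canvas_height": int(canvas_height),
    }


def to_openpose_json(people, canvas_width, canvas_height):
    """Same as to_openpose_dict, serialized to a JSON string."""
    return json.dumps(to_openpose_dict(people, canvas_width, canvas_height))
```

This step only changes the container format; it does not reorder or synthesize keypoints.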

The model was trained for stylized mascot characters.
It may not generalize to realistic human photos without additional fine-tuning.