---
language: en
license: mit
tags:
- pose-estimation
- computer-vision
- keypoint-detection
- diffusion-models
- stable-diffusion
- out-of-distribution
- human-pose
- top-down-pose-estimation
- coco
- mmpose
library_name: pytorch
---

# SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation (Body - 17 Keypoints)

<div align="center">

[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2509.24980)
[![Project Page](https://img.shields.io/badge/Project-Website-pink?logo=googlechrome&logoColor=white)](https://t-s-liang.github.io/SDPose)
[![HuggingFace Demo](https://img.shields.io/badge/🤗%20HuggingFace-Demo-yellow)](https://huggingface.co/spaces/teemosliang/SDPose-Body)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

</div>

## Model Description

**SDPose** is a human pose estimation model that reuses the visual priors of **Stable Diffusion** to stay robust in out-of-distribution (OOD) scenarios such as paintings, anime, and sketches. This model variant estimates the **17 COCO body keypoints**: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles.

### Model Architecture

SDPose employs a **U-Net backbone** initialized with Stable Diffusion v2 weights, combined with a specialized heatmap head for keypoint prediction. The model operates in a top-down manner:

1. **Person Detection**: Detect human bounding boxes using an object detector (e.g., YOLO11-x)
2. **Cropping**: Crop each detected person and resize the crop to the model input resolution
3. **Heatmap Prediction and Decoding**: Predict one confidence heatmap per keypoint and decode it into a coordinate with a confidence score
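
In a top-down pipeline, each detected box is usually expanded to the model's aspect ratio before it is cropped and resized, so the person is not distorted. The sketch below illustrates that step in plain Python. The function name `fit_bbox_to_aspect` and the default `padding=1.25` are assumptions for illustration and are not taken from the SDPose code.

```python
def fit_bbox_to_aspect(x1, y1, x2, y2, target_w=768, target_h=1024, padding=1.25):
    """Expand an (x1, y1, x2, y2) box around its center to the target W:H ratio.

    Hypothetical helper: the padding factor is an assumed value, not a
    documented SDPose setting.
    """
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = float(x2 - x1), float(y2 - y1)
    aspect = target_w / target_h

    # Grow the shorter side so that w / h matches the target aspect ratio.
    if w > h * aspect:
        h = w / aspect
    else:
        w = h * aspect

    # Add context around the person.
    w, h = w * padding, h * padding
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


# Example: a square detection becomes a 3:4 (W:H) crop region.
print(fit_bbox_to_aspect(0, 0, 100, 100))
```

The returned box may extend past the image border; a real pipeline would pad or use an affine warp when sampling the crop.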

**Model Specifications:**
- **Backbone**: Stable Diffusion v2 U-Net (fine-tuned; minimal architectural changes)
- **Head**: Custom heatmap prediction head
- **Input Resolution**: 1024×768 (H×W)
- **Output**: 17 keypoint heatmaps + coordinates with confidence scores
- **Framework**: MMPose
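
To make the output format concrete, here is a minimal sketch of how per-keypoint heatmaps can be turned into coordinates and confidence scores. It uses a plain argmax, not the sub-pixel UDP decoding that MMPose provides, and the heatmap resolution in the example (256×192) is an assumed value. `decode_heatmaps` is a hypothetical helper, not part of the SDPose repository.

```python
import numpy as np


def decode_heatmaps(heatmaps, input_size=(768, 1024)):
    """Decode (K, H, W) heatmaps into (K, 2) xy coordinates and (K,) scores.

    input_size is (width, height) of the model input in pixels.
    Simplified argmax decoding for illustration only.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)                # peak location per keypoint
    scores = flat[np.arange(k), idx]         # peak value used as confidence
    ys, xs = np.unravel_index(idx, (h, w))

    # Map heatmap indices [0, w-1] x [0, h-1] onto input pixels
    # [0, in_w-1] x [0, in_h-1] (assumed scaling convention).
    in_w, in_h = input_size
    coords = np.stack(
        [xs * (in_w - 1) / (w - 1), ys * (in_h - 1) / (h - 1)], axis=1
    )
    return coords, scores


# Example with synthetic heatmaps (17 keypoints, assumed 256x192 resolution).
heatmaps = np.zeros((17, 256, 192), dtype=np.float32)
heatmaps[0, 128, 96] = 0.8
coords, scores = decode_heatmaps(heatmaps)
print(coords[0], scores[0])
```

The coordinates are relative to the cropped input; mapping them back to the original image requires inverting the crop transform.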

## Supported Keypoints (COCO Format)

The model predicts 17 body keypoints following the COCO keypoint format:

```
0: nose
1: left_eye
2: right_eye
3: left_ear
4: right_ear
5: left_shoulder
6: right_shoulder
7: left_elbow
8: right_elbow
9: left_wrist
10: right_wrist
11: left_hip
12: right_hip
13: left_knee
14: right_knee
15: left_ankle
16: right_ankle
```
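
When consuming the predictions, it is often convenient to address keypoints by name instead of index. The sketch below encodes the index table above as a Python list and attaches names to decoded results. `name_keypoints` and the `min_score=0.3` threshold are illustrative assumptions, not values defined by SDPose.

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Reverse lookup: keypoint name -> COCO index.
KEYPOINT_INDEX = {name: i for i, name in enumerate(COCO_KEYPOINTS)}


def name_keypoints(coords, scores, min_score=0.3):
    """Pair each (x, y) with its COCO name, dropping low-confidence points.

    Hypothetical helper; the threshold is an assumed default.
    """
    return {
        name: (float(x), float(y), float(s))
        for name, (x, y), s in zip(COCO_KEYPOINTS, coords, scores)
        if s >= min_score
    }


print(KEYPOINT_INDEX["left_wrist"])
```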

## Intended Use

### Primary Use Cases

- Human pose estimation in natural images
- Pose estimation in artistic and stylized domains (paintings, anime, sketches)
- Animation and video pose tracking
- Cross-domain pose analysis and research
- Applications requiring robust pose estimation under distribution shifts

## How to Use

### Installation

```bash
# Clone the repository
git clone https://github.com/t-s-liang/SDPose-OOD.git
cd SDPose-OOD

# Install dependencies
pip install -r requirements.txt

# Download YOLO11-x for human detection
wget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11x.pt -P models/

# Launch Gradio interface
cd gradio_app
bash launch_gradio.sh
```

## Training Data

### Datasets

Trained exclusively on COCO-2017 train2017 (no extra data).

- **COCO (Common Objects in Context)**: 200K+ images with 17 body keypoints

### Preprocessing

- Images are resized and cropped to 1024×768 (H×W)
- Augmentation: random horizontal flip, half-body and bbox transforms, UDP affine, and Albumentations (Gaussian/Median blur, coarse dropout)
- Heatmaps: UDP codec (MMPose style)
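
Horizontal flipping needs care for pose labels: besides mirroring the x coordinate, left and right keypoints must swap indices. The following sketch shows that logic with NumPy, assuming the COCO ordering listed earlier. It is an illustration of the idea, not the MMPose transform used in training, and `hflip_keypoints` is a hypothetical name.

```python
import numpy as np

# (left, right) index pairs in the COCO 17-keypoint layout.
FLIP_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12), (13, 14), (15, 16)]


def hflip_keypoints(keypoints, image_width):
    """Mirror (17, 2) xy keypoints horizontally and swap left/right labels."""
    flipped = np.asarray(keypoints, dtype=float).copy()
    # Mirror x around the vertical center of the image (pixel-index convention).
    flipped[:, 0] = (image_width - 1) - flipped[:, 0]
    # After mirroring, the subject's left side appears where the right was.
    for left, right in FLIP_PAIRS:
        flipped[[left, right]] = flipped[[right, left]]
    return flipped


# Example: a left shoulder at x=10 becomes a right shoulder near the other edge.
kps = np.zeros((17, 2))
kps[5] = (10, 20)
print(hflip_keypoints(kps, image_width=768)[6])
```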

## Comparison with Baselines

The paper reports that SDPose outperforms strong pose estimation baselines (e.g., Sapiens, ViTPose++) on out-of-distribution benchmarks while remaining competitive on in-domain data.

See our [paper](https://arxiv.org/abs/2509.24980) for comprehensive evaluation results.

## Citation

If you use SDPose in your research, please cite our paper:

```bibtex
@misc{liang2025sdposeexploitingdiffusionpriors,
      title={SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation}, 
      author={Shuang Liang and Jing He and Chuanmeizhi Wang and Lejun Liao and Guo Zhang and Yingcong Chen and Yuan Yuan},
      year={2025},
      eprint={2509.24980},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.24980}, 
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Additional Resources

- 🌐 **Project Website**: [https://t-s-liang.github.io/SDPose](https://t-s-liang.github.io/SDPose)
- 📄 **Paper**: [arXiv:2509.24980](https://arxiv.org/abs/2509.24980)
- 💻 **Code Repository**: [GitHub](https://github.com/t-s-liang/SDPose-OOD)
- 🤗 **Demo**: [HuggingFace Space](https://huggingface.co/spaces/teemosliang/SDPose-Body)
- 📧 **Contact**: tsliang2001@gmail.com

---

<div align="center">

**⭐ Star us on GitHub — it motivates us a lot!**

</div>