---
license: apache-2.0
tags:
- pose-estimation
- 3d-pose
- computer-vision
- pytorch
- rtmpose
datasets:
- cocktail14
metrics:
- mpjpe
library_name: pytorch
---
# RTMPose3D
Real-time multi-person 3D whole-body pose estimation with 133 keypoints per person.
## Model Description
RTMPose3D is a real-time 3D pose estimation model that detects people and estimates 133 keypoints per person:
- **17** body keypoints (COCO format)
- **6** foot keypoints
- **68** facial landmarks
- **42** hand keypoints (21 per hand)
The model outputs both 2D pixel coordinates and 3D spatial coordinates for each keypoint.
## Model Variants
This repository contains checkpoints for:
| Model | Parameters | Speed | Accuracy (MPJPE) | Checkpoint File |
|-------|------------|-------|------------------|-----------------|
| RTMDet-M (Detector) | ~50M | Fast | - | `rtmdet_m_8xb32-100e_coco-obj365-person-235e8209.pth` |
| RTMW3D-L (Large) | ~65M | Real-time | 0.045 | `rtmw3d-l_cock14-0d4ad840_20240422.pth` |
| RTMW3D-X (Extra Large) | ~98M | Slower | 0.057 | `rtmw3d-x_8xb64_cocktail14-384x288-b0a0eab7_20240626.pth` |
## Installation
```bash
pip install git+https://github.com/b-arac/rtmpose3d.git
```
Or clone and install locally:
```bash
git clone https://github.com/b-arac/rtmpose3d.git
cd rtmpose3d
pip install -r requirements.txt
pip install -e .
```
## Quick Start
### Using the HuggingFace Transformers-style API
```python
import cv2
from rtmpose3d import RTMPose3D
# Initialize model (auto-downloads checkpoints from this repo)
model = RTMPose3D.from_pretrained('rbarac/rtmpose3d', device='cuda:0')
# Run inference
image = cv2.imread('person.jpg')
results = model(image, return_tensors='np')
# Access results
keypoints_3d = results['keypoints_3d'] # [N, 133, 3] - 3D coords in meters
keypoints_2d = results['keypoints_2d'] # [N, 133, 2] - pixel coords
scores = results['scores'] # [N, 133] - confidence [0, 1]
```
### Using the Simple Inference API
```python
import cv2
from rtmpose3d import RTMPose3DInference
# Initialize with model size
model = RTMPose3DInference(model_size='l', device='cuda:0')  # or 'x' for extra large
# Run inference
image = cv2.imread('person.jpg')
results = model(image)
print(results['keypoints_3d'].shape)  # [N, 133, 3]
```
### Single Person Detection
Detect only the most prominent person in the image:
```python
# Works with both APIs
results = model(image, single_person=True) # Returns only N=1
```
## Output Format
```python
{
    'keypoints_3d': np.ndarray,  # [N, 133, 3] - (X, Y, Z) in meters
    'keypoints_2d': np.ndarray,  # [N, 133, 2] - (x, y) pixel coordinates
    'scores': np.ndarray,        # [N, 133] - confidence scores [0, 1]
    'bboxes': np.ndarray         # [N, 4] - bounding boxes [x1, y1, x2, y2]
}
```
Where `N` is the number of detected persons.
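Per-person iteration follows directly from these shapes. The sketch below uses synthetic arrays in place of real model output (shapes and the 0.5 threshold are illustrative, not part of the API):

```python
import numpy as np

# Synthetic output for N=2 detected persons, matching the documented shapes
N = 2
results = {
    'keypoints_3d': np.zeros((N, 133, 3)),
    'keypoints_2d': np.zeros((N, 133, 2)),
    'scores': np.tile(np.linspace(0, 1, 133), (N, 1)),
    'bboxes': np.zeros((N, 4)),
}

for i in range(len(results['bboxes'])):
    kpts3d = results['keypoints_3d'][i]   # (133, 3) keypoints for person i
    conf = results['scores'][i]           # (133,) per-keypoint confidence
    visible = kpts3d[conf > 0.5]          # keep confidently detected keypoints
    print(f"person {i}: {len(visible)} keypoints above threshold")
```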
### Coordinate Systems
**2D Keypoints** - Pixel coordinates:
- X: horizontal position [0, image_width]
- Y: vertical position [0, image_height]
**3D Keypoints** - Camera-relative coordinates in meters (Z-up convention):
- X: horizontal (negative=left, positive=right)
- Y: depth (negative=closer, positive=farther)
- Z: vertical (negative=down, positive=up)
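Many 3D viewers and plotting libraries expect a Y-up frame instead, so a small axis swap is often needed before visualization. A minimal sketch, assuming the Z-up convention described above (the helper name is ours, not part of the library):

```python
import numpy as np

def z_up_to_y_up(kpts3d: np.ndarray) -> np.ndarray:
    """Convert (X=right, Y=depth, Z=up) points to a Y-up frame
    (X=right, Y=up, Z=depth), as used by many 3D viewers."""
    x, depth, up = kpts3d[..., 0], kpts3d[..., 1], kpts3d[..., 2]
    return np.stack([x, up, depth], axis=-1)

pt = np.array([[0.1, 2.0, 1.5]])  # 0.1 m right, 2 m away, 1.5 m up
print(z_up_to_y_up(pt))           # [[0.1 1.5 2. ]]
```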
## Keypoint Indices
| Index Range | Body Part | Count | Description |
|-------------|-----------|-------|-------------|
| 0-16 | Body | 17 | Nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles |
| 17-22 | Feet | 6 | Foot keypoints |
| 23-90 | Face | 68 | Facial landmarks |
| 91-111 | Left Hand | 21 | Left hand keypoints |
| 112-132 | Right Hand | 21 | Right hand keypoints |
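The index ranges above make it easy to slice out a single body part from the 133-keypoint array. A small sketch (the `PARTS` mapping is our own helper, not part of the library):

```python
import numpy as np

# Index ranges from the table above
PARTS = {
    'body': slice(0, 17),
    'feet': slice(17, 23),
    'face': slice(23, 91),
    'left_hand': slice(91, 112),
    'right_hand': slice(112, 133),
}

keypoints_3d = np.zeros((1, 133, 3))  # stand-in for model output
left_hand = keypoints_3d[:, PARTS['left_hand']]  # (N, 21, 3)
face = keypoints_3d[:, PARTS['face']]            # (N, 68, 3)
print(left_hand.shape, face.shape)               # (1, 21, 3) (1, 68, 3)
```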
## Training Data
The models were trained on the **Cocktail14** dataset, which combines 14 public 3D pose datasets:
- Human3.6M
- COCO-WholeBody
- UBody
- And 11 more datasets
## Performance
Evaluated on standard 3D pose benchmarks (lower MPJPE is better):
- **RTMW3D-L**: 0.045 MPJPE, real-time inference (~30 FPS on RTX 3090)
- **RTMW3D-X**: 0.057 MPJPE, slower but higher accuracy
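Throughput depends heavily on GPU, image size, and person count, so it is worth measuring on your own hardware. A generic timing sketch that works with any inference callable, such as the `model` object from Quick Start (the helper and the dummy callable are ours, for illustration):

```python
import time
import numpy as np

def measure_fps(infer, image, warmup=3, runs=20):
    """Rough FPS estimate for an inference callable on a fixed image."""
    for _ in range(warmup):       # warm up caches / CUDA kernels
        infer(image)
    t0 = time.perf_counter()
    for _ in range(runs):
        infer(image)
    return runs / (time.perf_counter() - t0)

# Usage with the real model from Quick Start (not run here):
#   fps = measure_fps(model, cv2.imread('person.jpg'))
dummy = lambda img: {'keypoints_3d': np.zeros((1, 133, 3))}
print(f"{measure_fps(dummy, np.zeros((256, 256, 3))):.0f} FPS")
```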
## Requirements
- Python >= 3.8
- PyTorch >= 2.0.0
- CUDA-capable GPU (4GB+ VRAM recommended)
- mmcv >= 2.0.0
- MMPose >= 1.0.0
- MMDetection >= 3.0.0
## Citation
```bibtex
@misc{rtmpose3d2025,
  title={RTMPose3D: Real-Time Multi-Person 3D Pose Estimation},
  author={Arac, Bahadir},
  year={2025},
  publisher={GitHub},
  url={https://github.com/b-arac/rtmpose3d}
}
```
## License
Apache 2.0
## Acknowledgments
Built on [MMPose](https://github.com/open-mmlab/mmpose) by OpenMMLab. Models trained by the MMPose team on the Cocktail14 dataset.
## Links
- **GitHub Repository**: [b-arac/rtmpose3d](https://github.com/b-arac/rtmpose3d)
- **Documentation**: See README in the repository
- **MMPose**: [open-mmlab/mmpose](https://github.com/open-mmlab/mmpose)