FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
Part of the ANIMA Perception Suite by Robot Flow Labs.
Bowen Wen, Wei Yang, Jan Kautz, Stan Birchfield (NVIDIA) · CVPR 2024 · arXiv:2312.08344
Render-and-compare architecture for 6DoF pose estimation:
```
Observed crop → DINOv2 ViT-B/14 (frozen) → obs_features ─┐
                                                         ├→ Concat → PoseScorer (good/bad)
Rendered crop → DINOv2 ViT-B/14 (frozen) → ren_features ─┤
                                                         └→ Concat → PoseRefiner (SE3 delta)
```
| Component | Params | Details |
|---|---|---|
| DINOv2 ViT-B/14 | 86.6M (frozen) | 768-dim CLS token, ImageNet pretrained |
| PoseScorer | 0.427M | MLP: 1536→256→128→1 (BCE + ranking loss) |
| PoseRefiner | 0.427M | MLP: 1536→256→128→6 (geodesic + L1 loss) |
| Total trainable | 0.854M | Scorer + Refiner only |
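The two heads share the same MLP shape and differ only in output width. A minimal sketch of that head (class and variable names here are illustrative, not the repo's actual ones; the layer widths are taken from the table above and reproduce the listed 0.427M parameter count exactly):

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """1536 -> 256 -> 128 -> out_dim MLP over concatenated DINOv2 features.

    Sketch matching the parameter table: out_dim=1 gives the scorer,
    out_dim=6 gives the refiner (rotvec + translation delta).
    """
    def __init__(self, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1536, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, obs_feat: torch.Tensor, ren_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate the two 768-dim CLS tokens -> 1536-dim input
        return self.net(torch.cat([obs_feat, ren_feat], dim=-1))

scorer = PoseHead(1)   # pose-quality logit
refiner = PoseHead(6)  # SE(3) delta
```

With these widths, the scorer has 1536·256+256 + 256·128+128 + 128·1+1 = 426,497 ≈ 0.427M parameters, matching the table.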
| Format | File | Size | Use Case |
|---|---|---|---|
| PyTorch (.pth) | pytorch/ergon_v1.pth | ~3.4 MB | Training, fine-tuning |
| SafeTensors | pytorch/ergon_v1.safetensors | ~3.4 MB | Fast loading, safe |
| ONNX (opset 17) | onnx/ergon_v1.onnx | ~350 MB | Cross-platform inference |
| TensorRT FP16 | tensorrt/ergon_v1_fp16.trt | varies | Edge (Jetson/L4) |
| TensorRT FP32 | tensorrt/ergon_v1_fp32.trt | varies | Full precision |
```python
import torch

from ergon.mlx_backend_cuda import build_cuda_model

model = build_cuda_model(
    backbone_weights="dinov2_vitb14_pretrain.pth",
    checkpoint_path="pytorch/ergon_v1.pth",
    device="cuda",
)
model.eval()

# Score a pose hypothesis
score = model.score(observed_crop, rendered_crop)  # [B, 1] probability

# Refine a pose
delta = model.refine(observed_crop, rendered_crop)  # [B, 6] (rotvec + translation)
```
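The refiner's `[B, 6]` output is a rotation vector plus a translation, which must be composed onto the current pose hypothesis. A NumPy sketch of one such update via the Rodrigues formula (the left-multiplicative, camera-frame convention assumed here is illustrative; verify against the repo's actual update rule):

```python
import numpy as np

def apply_delta(pose: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Compose a 6-dim (rotvec, translation) delta onto a 4x4 pose.

    Sketch only: assumes a left-multiplicative update T_delta @ T_pose;
    the repo may use a different frame or composition order.
    """
    rotvec, t = delta[:3], delta[3:]
    theta = np.linalg.norm(rotvec)
    if theta < 1e-8:
        R = np.eye(3)
    else:
        # Rodrigues formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
        k = rotvec / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T @ pose
```

In a tracking loop this update is typically applied once per refinement iteration, re-rendering the object at the updated pose before the next `model.refine` call.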
```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx/ergon_v1.onnx")
logits, deltas = sess.run(None, {
    "observed": observed_np,  # [B, 3, 224, 224] float32
    "rendered": rendered_np,  # [B, 3, 224, 224] float32
})
```
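The ONNX graph expects normalized NCHW float32 input. A minimal preprocessing sketch using the standard ImageNet statistics that DINOv2 checkpoints are trained with (the normalization choice and the assumption that the crop is already 224x224 are mine; verify against the repo's preprocessing):

```python
import numpy as np

# Standard ImageNet mean/std (assumption: matches the model's training-time
# preprocessing, as is typical for DINOv2 backbones).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(crop_uint8: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 crop -> [1, 3, 224, 224] float32 for the ONNX session.

    Resizing is omitted: the crop is assumed to already be 224x224.
    """
    x = crop_uint8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    return x.transpose(2, 0, 1)[None]  # HWC -> NCHW, add batch dim
```

Batching multiple hypotheses is then a matter of `np.concatenate` along axis 0 before the `sess.run` call.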
```bash
# Build engine on target hardware (TRT engines are NOT portable)
trtexec --onnx=onnx/ergon_v1.onnx \
        --saveEngine=tensorrt/ergon_v1_fp16.trt --fp16
```
| Parameter | Value |
|---|---|
| Dataset | YCB-Video train_synt (481K frames) |
| Split | 90% train / 5% val / 5% test |
| Optimizer | AdamW (lr=3e-4, weight_decay=1e-4) |
| Schedule | Cosine annealing + 5% warmup |
| Precision | bf16 mixed precision |
| Phases | Scorer (10ep) β Refiner (10ep) β Joint (20ep) |
| Hardware | NVIDIA L4 (23GB VRAM) |
| Best val loss | 5.106 |
| Steps | 5 |
| Training time | 0.1h |
| Config | See configs/train.yaml |
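The schedule row combines cosine annealing with a 5% linear warmup. A sketch of the learning-rate multiplier as a function of step, usable with `torch.optim.lr_scheduler.LambdaLR` (the exact implementation may differ from `configs/train.yaml`; the function name is illustrative):

```python
import math

def lr_scale(step: int, total_steps: int, warmup_frac: float = 0.05) -> float:
    """LR multiplier: linear warmup over the first 5% of steps,
    then cosine decay from 1.0 to 0.0 over the remainder."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

With `AdamW(lr=3e-4)` from the table, the effective rate at any step is `3e-4 * lr_scale(step, total_steps)`.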
Target metrics (BOP benchmark):
| Dataset | Paper AR | Our Target (≥90% of paper AR) |
|---|---|---|
| YCB-Video | 84.8 | ≥76.3 |
| T-LESS | 67.6 | ≥60.8 |
```
├── README.md (this file)
├── pytorch/
│   ├── ergon_v1.pth (PyTorch state dict)
│   └── ergon_v1.safetensors (SafeTensors)
├── onnx/
│   └── ergon_v1.onnx (ONNX opset 17)
├── tensorrt/
│   ├── ergon_v1_fp16.trt (TensorRT FP16)
│   └── ergon_v1_fp32.trt (TensorRT FP32)
├── checkpoints/
│   └── best.pth (resume training)
├── configs/
│   └── train.yaml (training config)
└── logs/
    └── training_history.json (loss curves)
```
```bibtex
@inproceedings{wen2024foundationpose,
  title={FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects},
  author={Wen, Bowen and Yang, Wei and Kautz, Jan and Birchfield, Stan},
  booktitle={CVPR},
  year={2024}
}
```
Apache 2.0 · Robot Flow Labs / AIFLOW LABS LIMITED