FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
Part of the ANIMA Perception Suite by Robot Flow Labs.
Bowen Wen, Wei Yang, Jan Kautz, Stan Birchfield (NVIDIA) · CVPR 2024 · arXiv:2312.08344
Render-and-compare architecture for 6DoF pose estimation:
```
Observed crop → DINOv2 ViT-B/14 (frozen) → obs_features ─┐
                                                         ├→ Concat → PoseScorer (good/bad)
Rendered crop → DINOv2 ViT-B/14 (frozen) → ren_features ─┤
                                                         └→ Concat → PoseRefiner (SE3 delta)
```
| Component | Params | Details |
|---|---|---|
| DINOv2 ViT-B/14 | 86.6M (frozen) | 768-dim CLS token, ImageNet pretrained |
| PoseScorer | 0.427M | MLP: 1536→256→128→1 (BCE + ranking loss) |
| PoseRefiner | 0.427M | MLP: 1536→256→128→6 (geodesic + L1 loss) |
| Total trainable | 0.854M | Scorer + Refiner only |
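The two heads share the same MLP shape and differ only in output width. A minimal sketch of that head (class and variable names here are illustrative, not the repo's actual ones; the layer widths are taken from the table above and reproduce the listed 0.427M parameter count exactly):

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """1536 -> 256 -> 128 -> out_dim MLP over concatenated DINOv2 features.

    Sketch matching the parameter table: out_dim=1 gives the scorer,
    out_dim=6 gives the refiner (rotvec + translation delta).
    """
    def __init__(self, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1536, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, obs_feat: torch.Tensor, ren_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate the two 768-dim CLS tokens -> 1536-dim input
        return self.net(torch.cat([obs_feat, ren_feat], dim=-1))

scorer = PoseHead(1)   # pose-quality logit
refiner = PoseHead(6)  # SE(3) delta
```

With these widths, the scorer has 1536·256+256 + 256·128+128 + 128·1+1 = 426,497 ≈ 0.427M parameters, matching the table.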
| Format | File | Size | Use Case |
|---|---|---|---|
| PyTorch (.pth) | pytorch/ergon_v1.pth | ~3.4 MB | Training, fine-tuning |
| SafeTensors | pytorch/ergon_v1.safetensors | ~3.4 MB | Fast loading, safe |
| ONNX (opset 17) | onnx/ergon_v1.onnx | ~350 MB | Cross-platform inference |
| TensorRT FP16 | tensorrt/ergon_v1_fp16.trt | varies | Edge (Jetson/L4) |
| TensorRT FP32 | tensorrt/ergon_v1_fp32.trt | varies | Full precision |
```python
import torch

from ergon.mlx_backend_cuda import build_cuda_model

model = build_cuda_model(
    backbone_weights="dinov2_vitb14_pretrain.pth",
    checkpoint_path="pytorch/ergon_v1.pth",
    device="cuda",
)
model.eval()

# Score a pose hypothesis
score = model.score(observed_crop, rendered_crop)  # [B, 1] probability

# Refine a pose
delta = model.refine(observed_crop, rendered_crop)  # [B, 6] (rotvec + translation)
```
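The refiner's `[B, 6]` output is a rotation vector plus a translation, which must be composed onto the current pose hypothesis. A NumPy sketch of one such update via the Rodrigues formula (the left-multiplicative, camera-frame convention assumed here is illustrative; verify against the repo's actual update rule):

```python
import numpy as np

def apply_delta(pose: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Compose a 6-dim (rotvec, translation) delta onto a 4x4 pose.

    Sketch only: assumes a left-multiplicative update T_delta @ T_pose;
    the repo may use a different frame or composition order.
    """
    rotvec, t = delta[:3], delta[3:]
    theta = np.linalg.norm(rotvec)
    if theta < 1e-8:
        R = np.eye(3)
    else:
        # Rodrigues formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
        k = rotvec / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T @ pose
```

In a tracking loop this update is typically applied once per refinement iteration, re-rendering the object at the updated pose before the next `model.refine` call.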
```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx/ergon_v1.onnx")
logits, deltas = sess.run(None, {
    "observed": observed_np,  # [B, 3, 224, 224] float32
    "rendered": rendered_np,  # [B, 3, 224, 224] float32
})
```
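The ONNX graph expects normalized NCHW float32 input. A minimal preprocessing sketch using the standard ImageNet statistics that DINOv2 checkpoints are trained with (the normalization choice and the assumption that the crop is already 224x224 are mine; verify against the repo's preprocessing):

```python
import numpy as np

# Standard ImageNet mean/std (assumption: matches the model's training-time
# preprocessing, as is typical for DINOv2 backbones).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(crop_uint8: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 crop -> [1, 3, 224, 224] float32 for the ONNX session.

    Resizing is omitted: the crop is assumed to already be 224x224.
    """
    x = crop_uint8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    return x.transpose(2, 0, 1)[None]  # HWC -> NCHW, add batch dim
```

Batching multiple hypotheses is then a matter of `np.concatenate` along axis 0 before the `sess.run` call.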
```bash
# Build engine on target hardware (TRT engines are NOT portable)
trtexec --onnx=onnx/ergon_v1.onnx \
        --saveEngine=tensorrt/ergon_v1_fp16.trt --fp16
```
| Parameter | Value |
|---|---|
| Dataset | YCB-Video train_synt (481K frames) |
| Split | 90% train / 5% val / 5% test |
| Optimizer | AdamW (lr=3e-4, weight_decay=1e-4) |
| Schedule | Cosine annealing + 5% warmup |
| Precision | bf16 mixed precision |
| Phases | Scorer (10ep) β Refiner (10ep) β Joint (20ep) |
| Hardware | NVIDIA L4 (23GB VRAM) |
| Best val loss | 5.106 |
| Steps | 5 |
| Training time | 0.1h |
| Config | See configs/train.yaml |
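The schedule row combines cosine annealing with a 5% linear warmup. A sketch of the learning-rate multiplier as a function of step, usable with `torch.optim.lr_scheduler.LambdaLR` (the exact implementation may differ from `configs/train.yaml`; the function name is illustrative):

```python
import math

def lr_scale(step: int, total_steps: int, warmup_frac: float = 0.05) -> float:
    """LR multiplier: linear warmup over the first 5% of steps,
    then cosine decay from 1.0 to 0.0 over the remainder."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

With `AdamW(lr=3e-4)` from the table, the effective rate at any step is `3e-4 * lr_scale(step, total_steps)`.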
Target metrics (BOP benchmark):
| Dataset | Paper AR | Our Target (≥90% of paper AR) |
|---|---|---|
| YCB-Video | 84.8 | ≥76.3 |
| T-LESS | 67.6 | ≥60.8 |
```
├── README.md (this file)
├── pytorch/
│   ├── ergon_v1.pth (PyTorch state dict)
│   └── ergon_v1.safetensors (SafeTensors)
├── onnx/
│   └── ergon_v1.onnx (ONNX opset 17)
├── tensorrt/
│   ├── ergon_v1_fp16.trt (TensorRT FP16)
│   └── ergon_v1_fp32.trt (TensorRT FP32)
├── checkpoints/
│   └── best.pth (resume training)
├── configs/
│   └── train.yaml (training config)
└── logs/
    └── training_history.json (loss curves)
```
```bibtex
@inproceedings{wen2024foundationpose,
  title={FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects},
  author={Wen, Bowen and Yang, Wei and Kautz, Jan and Birchfield, Stan},
  booktitle={CVPR},
  year={2024}
}
```
Apache 2.0 · Robot Flow Labs / AIFLOW LABS LIMITED