ERGON — 6DoF Pose Estimation (ANIMA Module)

Part of the ANIMA Perception Suite by Robot Flow Labs.

Paper

FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
Bowen Wen, Wei Yang, Jan Kautz, Stan Birchfield (NVIDIA)
CVPR 2024 | arXiv:2312.08344

Architecture

Render-and-compare architecture for 6DoF pose estimation:

Observed crop → DINOv2 ViT-B/14 (frozen) → obs_features ─┐
                                                         ├→ Concat → PoseScorer (good/bad)
Rendered crop → DINOv2 ViT-B/14 (frozen) → ren_features ─┤
                                                         └→ Concat → PoseRefiner (SE3 delta)
| Component | Params | Details |
|---|---|---|
| DINOv2 ViT-B/14 | 86.6M (frozen) | 768-dim CLS token, ImageNet pretrained |
| PoseScorer | 0.427M | MLP: 1536→256→128→1 (BCE + ranking loss) |
| PoseRefiner | 0.427M | MLP: 1536→256→128→6 (geodesic + L1 loss) |
| Total trainable | 0.854M | Scorer + Refiner only |
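The two trainable heads are small MLPs over the concatenated 1536-dim observed/rendered CLS features. A minimal sketch (layer sizes from the table above; the ReLU activation and `make_head` helper are illustrative assumptions, not the actual training code):

```python
import torch
import torch.nn as nn

def make_head(out_dim: int) -> nn.Sequential:
    # 1536-dim input = concat of two 768-dim DINOv2 CLS tokens
    return nn.Sequential(
        nn.Linear(1536, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, out_dim),
    )

scorer = make_head(1)   # good/bad logit
refiner = make_head(6)  # SE(3) delta: rotvec (3) + translation (3)

obs, ren = torch.randn(4, 768), torch.randn(4, 768)
feat = torch.cat([obs, ren], dim=-1)   # [B, 1536]
score = torch.sigmoid(scorer(feat))    # [B, 1]
delta = refiner(feat)                  # [B, 6]
```

With these sizes the scorer has ~0.427M parameters and the refiner ~0.427M, matching the table's 0.854M total trainable.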

Exported Formats

| Format | File | Size | Use Case |
|---|---|---|---|
| PyTorch (.pth) | pytorch/ergon_v1.pth | ~3.4 MB | Training, fine-tuning |
| SafeTensors | pytorch/ergon_v1.safetensors | ~3.4 MB | Fast loading, safe |
| ONNX (opset 17) | onnx/ergon_v1.onnx | ~350 MB | Cross-platform inference |
| TensorRT FP16 | tensorrt/ergon_v1_fp16.trt | varies | Edge (Jetson/L4) |
| TensorRT FP32 | tensorrt/ergon_v1_fp32.trt | varies | Full precision |

Usage

PyTorch

import torch
from ergon.mlx_backend_cuda import build_cuda_model

model = build_cuda_model(
    backbone_weights="dinov2_vitb14_pretrain.pth",
    checkpoint_path="pytorch/ergon_v1.pth",
    device="cuda",
)
model.eval()

# Score a pose hypothesis
score = model.score(observed_crop, rendered_crop)  # [B, 1] probability

# Refine a pose
delta = model.refine(observed_crop, rendered_crop)  # [B, 6] (rotvec + translation)
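The refiner returns a 6-vector, but composing it onto a pose depends on the update convention. A sketch assuming an axis-angle (rotvec) rotation and a left-multiplied update (both are assumptions; `apply_delta` is not part of the ergon API):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def apply_delta(pose: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Compose a [rotvec(3), translation(3)] update onto a 4x4 pose
    (left-multiplication convention assumed)."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(delta[:3]).as_matrix()
    T[:3, 3] = delta[3:]
    return T @ pose
```

A zero delta leaves the pose unchanged, which is a quick sanity check for whichever convention the exported model actually uses.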

ONNX Runtime

import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("onnx/ergon_v1.onnx")
logits, deltas = sess.run(None, {
    "observed": observed_np,   # [B, 3, 224, 224] float32
    "rendered": rendered_np,   # [B, 3, 224, 224] float32
})
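The 224x224 crops must be normalized the way the DINOv2 backbone expects. A preprocessing sketch assuming standard ImageNet statistics (an assumption; check configs/train.yaml for the values actually used in training):

```python
import numpy as np
from PIL import Image

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: Image.Image) -> np.ndarray:
    """Resize, scale to [0, 1], normalize, and reorder to [1, 3, 224, 224]."""
    img = img.convert("RGB").resize((224, 224), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0      # [H, W, 3]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    return x.transpose(2, 0, 1)[None]                  # [1, 3, 224, 224]
```

Stack several preprocessed crops along axis 0 to form the `[B, 3, 224, 224]` batches the ONNX session expects.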

TensorRT (Jetson)

# Build engine on target hardware (TRT engines are NOT portable)
trtexec --onnx=onnx/ergon_v1.onnx \
    --saveEngine=tensorrt/ergon_v1_fp16.trt --fp16

Training

| Parameter | Value |
|---|---|
| Dataset | YCB-Video train_synt (481K frames) |
| Split | 90% train / 5% val / 5% test |
| Optimizer | AdamW (lr=3e-4, weight_decay=1e-4) |
| Schedule | Cosine annealing + 5% warmup |
| Precision | bf16 mixed precision |
| Phases | Scorer (10 ep) → Refiner (10 ep) → Joint (20 ep) |
| Hardware | NVIDIA L4 (23 GB VRAM) |
| Best val loss | 5.1059 |
| Steps | 5 |
| Training time | 0.1 h |
| Config | See configs/train.yaml |

Early Stopping

  • Patience: 10 epochs
  • Plateau LR reduction: factor=0.5 after 5 stagnant epochs
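The two rules above can be combined in a few lines. A sketch of the logic (class name and interface are illustrative, not taken from the training code):

```python
class EarlyStopper:
    """Stop after `patience` epochs without val-loss improvement;
    multiply the LR by `factor` after every `plateau` stagnant epochs."""

    def __init__(self, patience: int = 10, plateau: int = 5, factor: float = 0.5):
        self.patience, self.plateau, self.factor = patience, plateau, factor
        self.best, self.stale = float("inf"), 0

    def step(self, val_loss: float, lr: float) -> tuple[float, bool]:
        if val_loss < self.best:
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
        if self.stale and self.stale % self.plateau == 0:
            lr *= self.factor  # plateau LR reduction
        return lr, self.stale >= self.patience  # (new lr, stop flag)
```

Call `step(val_loss, lr)` once per epoch; with the defaults, the LR halves after 5 stagnant epochs and training stops after 10.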

Evaluation

Target metrics (BOP benchmark):

| Dataset | Paper AR | Our Target (≥90% of paper) |
|---|---|---|
| YCB-Video | 84.8 | ≥76.3 |
| T-LESS | 67.6 | ≥60.8 |
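The target column is simply 90% of the paper's AR, rounded to one decimal:

```python
paper_ar = {"YCB-Video": 84.8, "T-LESS": 67.6}
targets = {k: round(0.9 * v, 1) for k, v in paper_ar.items()}
# {'YCB-Video': 76.3, 'T-LESS': 60.8}
```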

Files

├── README.md                    (this file)
├── pytorch/
│   ├── ergon_v1.pth             (PyTorch state dict)
│   └── ergon_v1.safetensors     (SafeTensors)
├── onnx/
│   └── ergon_v1.onnx            (ONNX opset 17)
├── tensorrt/
│   ├── ergon_v1_fp16.trt        (TensorRT FP16)
│   └── ergon_v1_fp32.trt        (TensorRT FP32)
├── checkpoints/
│   └── best.pth                 (resume training)
├── configs/
│   └── train.yaml               (training config)
└── logs/
    └── training_history.json    (loss curves)

Citation

@inproceedings{wen2024foundationpose,
  title={FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects},
  author={Wen, Bowen and Yang, Wei and Kautz, Jan and Birchfield, Stan},
  booktitle={CVPR},
  year={2024}
}

License

Apache 2.0 — Robot Flow Labs / AIFLOW LABS LIMITED
