URD - UniScale: Unified Scale-Aware Multi-View 3D Reconstruction

Part of the ANIMA Perception Suite by Robot Flow Labs.

Paper

UniScale: Unified Scale-Aware Multi-View 3D Reconstruction arXiv: 2602.23224

Architecture

UniScale combines camera intrinsics, extrinsics, metric depth, and 3D point cloud generation into a single neural network forward pass. The core design leverages frozen DINOv2 ViT-B/14 foundation model features with lightweight scale-aware pose and depth decoders, enabling metrically consistent multi-view 3D reconstruction without iterative optimization (no RANSAC, no bundle adjustment).

Key components:

  • Foundation Encoder: Frozen DINOv2 ViT-B/14 (86M params)
  • Scale-Aware Pose Decoder: Estimates intrinsics + extrinsics + metric scale
  • Metric Depth Generator: Dense depth maps with confidence estimation
  • Point Cloud Generator: Direct 3D point maps in metric world coordinates
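The single-pass design above can be sketched as a minimal PyTorch module: one frozen encoder feeding lightweight dense and global heads. This is an illustrative skeleton under assumed dimensions (`UniScaleSketch` and its layer shapes are hypothetical, not the release code; a plain strided convolution stands in for the frozen DINOv2 ViT-B/14 backbone):

```python
import torch
import torch.nn as nn

class UniScaleSketch(nn.Module):
    """Structural sketch: frozen encoder + lightweight scale-aware heads.
    All names and dimensions are illustrative, not the released model."""
    def __init__(self, embed_dim=768, patch=14):
        super().__init__()
        # Stand-in for the frozen DINOv2 ViT-B/14 backbone (86M params in the real model)
        self.encoder = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        for p in self.encoder.parameters():
            p.requires_grad = False  # encoder stays frozen (cf. Stage 1 below)
        # Lightweight decoders sharing the same features
        self.depth_head = nn.Conv2d(embed_dim, 2, 1)  # per-patch depth + confidence
        self.pose_head = nn.Linear(embed_dim, 9 + 1)  # flattened 3x3 intrinsics + scale

    def forward(self, images):  # images: [B, V, 3, H, W]
        B, V, C, H, W = images.shape
        feats = self.encoder(images.flatten(0, 1))   # [B*V, D, h, w]
        dense = self.depth_head(feats)               # [B*V, 2, h, w]
        depth, conf = dense[:, 0], dense[:, 1]
        pooled = feats.mean(dim=(2, 3))              # global feature per view
        pose = self.pose_head(pooled)
        intrinsics = pose[:, :9].view(-1, 3, 3)
        scale = pose[:, 9]
        return (depth.view(B, V, *depth.shape[1:]),
                conf.view(B, V, *conf.shape[1:]),
                intrinsics, scale)
```

Everything is produced in one forward pass, which is what removes the need for RANSAC or bundle adjustment at inference time.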

Exported Formats

Format           File                           Size      Use Case
PyTorch (.pth)   pytorch/urd_v1.pth             ~1.0 GB   Training, fine-tuning, resume
SafeTensors      pytorch/urd_v1.safetensors     ~347 MB   Fast loading, safe deserialization
ONNX             onnx/urd_v1.onnx               ~347 MB   Cross-platform inference
TensorRT FP16    tensorrt/urd_v1_fp16.engine    ~177 MB   Edge deployment (Jetson/L4)
TensorRT FP32    tensorrt/urd_v1_fp32.engine    ~355 MB   Full-precision inference

Training

  • Dataset: NYU Depth V2 (654 train / 654 val)
  • Hardware: NVIDIA L4 (23GB VRAM)
  • Checkpoint: final.pth (epoch 30/30)
  • Stages: 2-stage curriculum
    • Stage 1 (epochs 1-5): Frozen encoder, batch=64, lr=1e-4
    • Stage 2 (epochs 6-30): Unfrozen encoder, batch=16, lr=1e-5, gradient checkpointing
  • Best val_loss: 0.1175 (epoch 18)
  • Training time: ~97 minutes
  • Optimizer: AdamW (weight_decay=0.01)
  • Scheduler: Cosine annealing with 2-epoch warmup
  • Seed: 42

See configs/ for full hyperparameters and logs/training_history.json for loss curves.
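The 2-stage curriculum above can be sketched as a training driver: Stage 1 freezes the encoder at a higher learning rate, Stage 2 unfreezes it at a lower one, each stage running AdamW with a 2-epoch warmup into cosine annealing. This is a hedged sketch, not the actual training script; `run_curriculum`, `train_step`, and the separate `encoder` handle are assumed names:

```python
import torch
import torch.nn as nn

def run_curriculum(model, encoder, train_step):
    """Illustrative 2-stage schedule matching the hyperparameters above.
    `model`, `encoder`, and `train_step` are placeholders, not the release API."""
    history = []
    for first, last, lr, freeze in [(1, 5, 1e-4, True), (6, 30, 1e-5, False)]:
        for p in encoder.parameters():
            p.requires_grad = not freeze            # Stage 1 freezes the backbone
        params = [p for p in model.parameters() if p.requires_grad]
        opt = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
        # 2-epoch linear warmup, then cosine annealing over the remaining epochs
        warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=2)
        cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=last - first - 1)
        sched = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, cosine], milestones=[2])
        for epoch in range(first, last + 1):
            loss = train_step(model, opt)           # one epoch of optimization
            sched.step()
            history.append((epoch, loss))
    return history
```

Batch size and gradient checkpointing (64 in Stage 1, 16 with checkpointing in Stage 2) would live inside `train_step` in this sketch.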

Usage

import torch
from anima_urd.model import UniScale

# Load from checkpoint
model = UniScale.load("pytorch/urd_v1.pth", device="cuda")
model.eval()

# Inference: 4 multi-view images at 512x512
images = torch.randn(1, 4, 3, 512, 512, device="cuda")
with torch.no_grad():
    output = model(images)
    depth = output.depth_maps           # [1, 4, 512, 512] metric depth (meters)
    confidence = output.depth_confidence # [1, 4, 512, 512]
    intrinsics = output.intrinsics      # [1, 3, 3]
    scale = output.scale_factors        # [1] metric scale
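Since the model returns metric depth, intrinsics, and a scale factor, a depth map can be unprojected into a metric 3D point cloud with standard pinhole geometry. The helper below is a generic sketch, not part of the `anima_urd` API:

```python
import torch

def backproject(depth, K, scale=1.0):
    """Unproject a depth map [H, W] into metric camera-frame points [H*W, 3]
    using pinhole intrinsics K [3, 3]. Generic helper, not part of anima_urd."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u.flatten(), v.flatten(),
                       torch.ones(H * W, dtype=depth.dtype)])   # homogeneous pixels
    rays = torch.linalg.inv(K) @ pix                            # [3, H*W] normalized rays
    pts = rays * (depth.flatten() * scale)                      # scale depth to meters
    return pts.T                                                # [H*W, 3]
```

In the usage example above this would be called per view, e.g. `backproject(depth[0, 0], intrinsics[0], scale[0])`.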

Multi-Robot Deployment

UniScale is designed for multi-robot coordination:

  • Each robot runs feed-forward inference locally (no iterative optimization)
  • Predicted metric scale enables direct point cloud merging across robots
  • Linear scaling O(NM) vs quadratic O(N^2M^2) for traditional BA
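Because each robot's cloud is already metric, merging reduces to one rigid transform per robot plus concatenation, which is where the O(NM) cost comes from. A minimal sketch, assuming camera-to-world extrinsics are available per robot (`merge_clouds` is an illustrative helper, not part of the package):

```python
import torch

def merge_clouds(clouds, poses):
    """Merge per-robot metric point clouds into one world-frame cloud.
    clouds: list of [M_i, 3] tensors in each robot's camera frame (already
    metric via the predicted scale); poses: list of [4, 4] camera-to-world
    extrinsics. One pass over N robots x M points -- no pairwise alignment."""
    merged = []
    for pts, T in zip(clouds, poses):
        R, t = T[:3, :3], T[:3, 3]
        merged.append(pts @ R.T + t)   # rigid transform into the shared world frame
    return torch.cat(merged, dim=0)
```

Traditional bundle adjustment instead couples all pairs of views, hence the quadratic O(N^2 M^2) comparison above.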

Citation

@article{UniScale_2026,
  title={UniScale: Unified Scale-Aware Multi-View 3D Reconstruction},
  year={2026},
  eprint={2602.23224},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.23224}
}

License

Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED
