URD - UniScale: Unified Scale-Aware Multi-View 3D Reconstruction

Part of the ANIMA Perception Suite by Robot Flow Labs.

Paper

UniScale: Unified Scale-Aware Multi-View 3D Reconstruction arXiv: 2602.23224

Architecture

UniScale combines camera intrinsics, extrinsics, metric depth, and 3D point cloud generation into a single neural network forward pass. The core design leverages frozen DINOv2 ViT-B/14 foundation model features with lightweight scale-aware pose and depth decoders, enabling metrically consistent multi-view 3D reconstruction without iterative optimization (no RANSAC, no bundle adjustment).

Key components:

  • Foundation Encoder: Frozen DINOv2 ViT-B/14 (86M params)
  • Scale-Aware Pose Decoder: Estimates intrinsics + extrinsics + metric scale
  • Metric Depth Generator: Dense depth maps with confidence estimation
  • Point Cloud Generator: Direct 3D point maps in metric world coordinates
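The single-pass design above can be sketched as a minimal PyTorch module: one frozen encoder feeding lightweight dense and global heads. This is an illustrative skeleton under assumed dimensions (`UniScaleSketch` and its layer shapes are hypothetical, not the release code; a plain strided convolution stands in for the frozen DINOv2 ViT-B/14 backbone):

```python
import torch
import torch.nn as nn

class UniScaleSketch(nn.Module):
    """Structural sketch: frozen encoder + lightweight scale-aware heads.
    All names and dimensions are illustrative, not the released model."""
    def __init__(self, embed_dim=768, patch=14):
        super().__init__()
        # Stand-in for the frozen DINOv2 ViT-B/14 backbone (86M params in the real model)
        self.encoder = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        for p in self.encoder.parameters():
            p.requires_grad = False  # encoder stays frozen (cf. Stage 1 below)
        # Lightweight decoders sharing the same features
        self.depth_head = nn.Conv2d(embed_dim, 2, 1)  # per-patch depth + confidence
        self.pose_head = nn.Linear(embed_dim, 9 + 1)  # flattened 3x3 intrinsics + scale

    def forward(self, images):  # images: [B, V, 3, H, W]
        B, V, C, H, W = images.shape
        feats = self.encoder(images.flatten(0, 1))   # [B*V, D, h, w]
        dense = self.depth_head(feats)               # [B*V, 2, h, w]
        depth, conf = dense[:, 0], dense[:, 1]
        pooled = feats.mean(dim=(2, 3))              # global feature per view
        pose = self.pose_head(pooled)
        intrinsics = pose[:, :9].view(-1, 3, 3)
        scale = pose[:, 9]
        return (depth.view(B, V, *depth.shape[1:]),
                conf.view(B, V, *conf.shape[1:]),
                intrinsics, scale)
```

Everything is produced in one forward pass, which is what removes the need for RANSAC or bundle adjustment at inference time.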

Exported Formats

Format           File                           Size      Use Case
PyTorch (.pth)   pytorch/urd_v1.pth             ~1.0 GB   Training, fine-tuning, resume
SafeTensors      pytorch/urd_v1.safetensors     ~347 MB   Fast loading, safe deserialization
ONNX             onnx/urd_v1.onnx               ~347 MB   Cross-platform inference
TensorRT FP16    tensorrt/urd_v1_fp16.engine    ~177 MB   Edge deployment (Jetson/L4)
TensorRT FP32    tensorrt/urd_v1_fp32.engine    ~355 MB   Full-precision inference

Training

  • Dataset: NYU Depth V2 (654 train / 654 val)
  • Hardware: NVIDIA L4 (23GB VRAM)
  • Checkpoint: final.pth (epoch 30/30)
  • Stages: 2-stage curriculum
    • Stage 1 (epochs 1-5): Frozen encoder, batch=64, lr=1e-4
    • Stage 2 (epochs 6-30): Unfrozen encoder, batch=16, lr=1e-5, gradient checkpointing
  • Best val_loss: 0.1175 (epoch 18)
  • Training time: ~97 minutes
  • Optimizer: AdamW (weight_decay=0.01)
  • Scheduler: Cosine annealing with 2-epoch warmup
  • Seed: 42

See configs/ for full hyperparameters and logs/training_history.json for loss curves.
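The 2-stage curriculum above can be sketched as a training driver: Stage 1 freezes the encoder at a higher learning rate, Stage 2 unfreezes it at a lower one, each stage running AdamW with a 2-epoch warmup into cosine annealing. This is a hedged sketch, not the actual training script; `run_curriculum`, `train_step`, and the separate `encoder` handle are assumed names:

```python
import torch
import torch.nn as nn

def run_curriculum(model, encoder, train_step):
    """Illustrative 2-stage schedule matching the hyperparameters above.
    `model`, `encoder`, and `train_step` are placeholders, not the release API."""
    history = []
    for first, last, lr, freeze in [(1, 5, 1e-4, True), (6, 30, 1e-5, False)]:
        for p in encoder.parameters():
            p.requires_grad = not freeze            # Stage 1 freezes the backbone
        params = [p for p in model.parameters() if p.requires_grad]
        opt = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
        # 2-epoch linear warmup, then cosine annealing over the remaining epochs
        warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=2)
        cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=last - first - 1)
        sched = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, cosine], milestones=[2])
        for epoch in range(first, last + 1):
            loss = train_step(model, opt)           # one epoch of optimization
            sched.step()
            history.append((epoch, loss))
    return history
```

Batch size and gradient checkpointing (64 in Stage 1, 16 with checkpointing in Stage 2) would live inside `train_step` in this sketch.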

Usage

import torch
from anima_urd.model import UniScale

# Load from checkpoint
model = UniScale.load("pytorch/urd_v1.pth", device="cuda")
model.eval()

# Inference: 4 multi-view images at 512x512
images = torch.randn(1, 4, 3, 512, 512, device="cuda")
with torch.no_grad():
    output = model(images)
    depth = output.depth_maps           # [1, 4, 512, 512] metric depth (meters)
    confidence = output.depth_confidence # [1, 4, 512, 512]
    intrinsics = output.intrinsics      # [1, 3, 3]
    scale = output.scale_factors        # [1] metric scale
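Since the model returns metric depth, intrinsics, and a scale factor, a depth map can be unprojected into a metric 3D point cloud with standard pinhole geometry. The helper below is a generic sketch, not part of the `anima_urd` API:

```python
import torch

def backproject(depth, K, scale=1.0):
    """Unproject a depth map [H, W] into metric camera-frame points [H*W, 3]
    using pinhole intrinsics K [3, 3]. Generic helper, not part of anima_urd."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u.flatten(), v.flatten(),
                       torch.ones(H * W, dtype=depth.dtype)])   # homogeneous pixels
    rays = torch.linalg.inv(K) @ pix                            # [3, H*W] normalized rays
    pts = rays * (depth.flatten() * scale)                      # scale depth to meters
    return pts.T                                                # [H*W, 3]
```

In the usage example above this would be called per view, e.g. `backproject(depth[0, 0], intrinsics[0], scale[0])`.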

Multi-Robot Deployment

UniScale is designed for multi-robot coordination:

  • Each robot runs feed-forward inference locally (no iterative optimization)
  • Predicted metric scale enables direct point cloud merging across robots
  • Linear scaling O(NM) vs quadratic O(N^2M^2) for traditional BA
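Because each robot's cloud is already metric, merging reduces to one rigid transform per robot plus concatenation, which is where the O(NM) cost comes from. A minimal sketch, assuming camera-to-world extrinsics are available per robot (`merge_clouds` is an illustrative helper, not part of the package):

```python
import torch

def merge_clouds(clouds, poses):
    """Merge per-robot metric point clouds into one world-frame cloud.
    clouds: list of [M_i, 3] tensors in each robot's camera frame (already
    metric via the predicted scale); poses: list of [4, 4] camera-to-world
    extrinsics. One pass over N robots x M points -- no pairwise alignment."""
    merged = []
    for pts, T in zip(clouds, poses):
        R, t = T[:3, :3], T[:3, 3]
        merged.append(pts @ R.T + t)   # rigid transform into the shared world frame
    return torch.cat(merged, dim=0)
```

Traditional bundle adjustment instead couples all pairs of views, hence the quadratic O(N^2 M^2) comparison above.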

Citation

@article{UniScale_2026,
  title={UniScale: Unified Scale-Aware Multi-View 3D Reconstruction},
  year={2026},
  eprint={2602.23224},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.23224}
}

License

Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED
