ABYSSOS -- Metric Depth Estimation (ANIMA Module)
Part of the ANIMA Perception Suite by Robot Flow Labs.
Paper
PromptDA: Prompt-Guided Depth Anything for Metric Depth Estimation (arXiv:2502.11277)
Based on Depth Anything V2 with sparse LiDAR prompt fusion for metric depth.
Model
- Architecture: PromptDA ViT-S (Vision Transformer Small + DPT Decoder + Gated Cross-Attention Prompt Fusion)
- Parameters: 25.1M
- Input: RGB image (518x518) + sparse depth prompt (518x518)
- Output: Dense metric depth map (518x518)
- Base model: depth-anything/prompt-depth-anything-vits-hf
Training
| Parameter | Value |
|---|---|
| Hardware | 8x NVIDIA L4 (23GB each) |
| Framework | PyTorch 2.11 + DDP |
| Dataset | NYU Depth V2 (1304 train, 72 val, 73 test) |
| Batch size | 24/GPU, effective 192 |
| Optimizer | AdamW (head lr=5e-5, backbone lr=5e-6) |
| Schedule | 500-step warmup + cosine decay |
| Loss | L1 + 0.5 * gradient (edge-aware) |
| Precision | bf16 mixed precision |
| Steps | 5000 (best at step 250) |
| Training time | 2.5 hours |
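The loss row above (L1 + 0.5 × gradient) can be made concrete. A minimal NumPy sketch of an L1-plus-gradient objective using finite-difference image gradients; the exact edge-aware weighting used in training is not specified here, so treat this as an illustrative reconstruction rather than the training code:

```python
import numpy as np

def depth_loss(pred, gt, grad_weight=0.5):
    """L1 term plus a gradient-matching term (hypothetical reconstruction).

    pred, gt: (H, W) depth maps in meters.
    """
    # pixel-wise L1 on absolute depth
    l1 = np.abs(pred - gt).mean()
    # horizontal / vertical finite differences, matched between pred and gt
    gx = np.abs(np.diff(pred, axis=-1) - np.diff(gt, axis=-1)).mean()
    gy = np.abs(np.diff(pred, axis=-2) - np.diff(gt, axis=-2)).mean()
    return l1 + grad_weight * (gx + gy)
```

The gradient term penalizes mismatched depth discontinuities, which is what makes the objective sharpen edges instead of blurring across them.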
Results
| Metric | Value | Target |
|---|---|---|
| Test AbsRel (with prompt) | 0.0203 | < 0.040 |
| Test RMSE | 0.1009 | -- |
| Test delta<1.25 | 0.9954 | -- |
| Val AbsRel (best, step 250) | 0.0192 | -- |
| DAv2 baseline (no prompt) | 0.073 | -- |
With sparse depth prompts, test AbsRel improves roughly 3.6x over the DAv2 metric baseline (0.073 → 0.0203).
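For reference, the standard monocular-depth metrics reported above (AbsRel, RMSE, δ<1.25) can be computed as follows. This is a generic sketch over arrays of valid pixels, not the project's evaluation script:

```python
import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: flat arrays of valid (gt > 0) depth values in meters."""
    absrel = np.mean(np.abs(pred - gt) / gt)          # mean absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))         # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)          # symmetric ratio per pixel
    delta1 = np.mean(ratio < 1.25)                    # fraction within threshold
    return absrel, rmse, delta1
```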
Exported Formats
| Format | File | Size | Status |
|---|---|---|---|
| SafeTensors | pytorch/model.safetensors | 96 MB | Available |
| ONNX | onnx/abyssos_v1.onnx | 102 MB | Available |
| TensorRT FP16 | -- | -- | Deferred (bilinear upsample TRT limitation) |
| TensorRT FP32 | -- | -- | Deferred |
| Checkpoint (resume) | checkpoints/ | 287 MB | Available (includes optimizer state) |
Usage

```python
from transformers import AutoModelForDepthEstimation, AutoImageProcessor
import torch

model = AutoModelForDepthEstimation.from_pretrained(
    "ilessio-aiflowlab/project_abyssos", subfolder="pytorch"
)
processor = AutoImageProcessor.from_pretrained(
    "ilessio-aiflowlab/project_abyssos", subfolder="pytorch"
)

# Inference with a sparse depth prompt
pixel_values = torch.randn(1, 3, 518, 518)  # preprocessed RGB
prompt_depth = torch.randn(1, 1, 518, 518)  # sparse depth from LiDAR
output = model(pixel_values=pixel_values, prompt_depth=prompt_depth)
depth_map = output.predicted_depth          # (1, 518, 518) metric depth in meters
```
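The random `prompt_depth` above stands in for a real sparse map. A hypothetical helper that scatters projected LiDAR returns into a 518×518 prompt, assuming zeros mark pixels with no measurement (check the model's config for its actual missing-value convention):

```python
import numpy as np

def make_prompt(points_uv, depths, size=518):
    """Scatter sparse LiDAR returns into a dense prompt map.

    points_uv: (N, 2) int array of (u, v) pixel coords in the 518x518 frame.
    depths:    (N,) metric depths in meters.
    Zeros are assumed to mean "no measurement" at that pixel.
    """
    prompt = np.zeros((1, 1, size, size), dtype=np.float32)
    u, v = points_uv[:, 0], points_uv[:, 1]
    prompt[0, 0, v, u] = depths  # rows are v (y), columns are u (x)
    return prompt
```

The resulting array can be passed to the model after wrapping with `torch.from_numpy`.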
ONNX Inference

```python
import onnxruntime as ort
import numpy as np

sess = ort.InferenceSession("onnx/abyssos_v1.onnx")
depth = sess.run(None, {
    "pixel_values": np.random.randn(1, 3, 518, 518).astype(np.float32),
    "prompt_depth": np.random.randn(1, 1, 518, 518).astype(np.float32),
})[0]  # (1, 518, 518)
```
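Both paths expect a preprocessed `(1, 3, 518, 518)` tensor. When `AutoImageProcessor` is unavailable (e.g. a pure-ONNX deployment), a rough NumPy preprocessing sketch; the ImageNet mean/std normalization is an assumption to be confirmed against `pytorch/preprocessor_config.json`, and nearest-neighbor resize stands in for the processor's interpolation:

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)   # assumed

def preprocess(rgb_uint8, size=518):
    """rgb_uint8: (H, W, 3) uint8 image -> (1, 3, size, size) float32."""
    h, w, _ = rgb_uint8.shape
    # nearest-neighbor resize via index sampling (stand-in for bilinear)
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    img = rgb_uint8[ys][:, xs].astype(np.float32) / 255.0
    img = (img - IMAGENET_MEAN) / IMAGENET_STD
    return img.transpose(2, 0, 1)[None]  # (1, 3, size, size)
```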
Repository Structure

```
.
├── README.md
├── anima_module.yaml
├── pytorch/
│   ├── model.safetensors
│   ├── config.json
│   └── preprocessor_config.json
├── onnx/
│   └── abyssos_v1.onnx
├── checkpoints/
│   ├── model.safetensors
│   ├── training_state.pt
│   └── eval_metrics.json
├── configs/
│   ├── train_8gpu_ddp.yaml
│   ├── train_8gpu_dp_safe.yaml
│   └── ...
└── logs/
    ├── train_v3_report.json
    └── training_history.jsonl
```
License
Apache 2.0 -- Robot Flow Labs / AIFLOW LABS LIMITED