---
license: apache-2.0
language:
- en
tags:
- depth-estimation
- depth-completion
- rgb-d
- computer-vision
- robotics
- 3d-vision
- pytorch
- vision-transformer
datasets:
- custom
library_name: pytorch
pipeline_tag: depth-estimation
---
# LingBot-Depth: Masked Depth Modeling for Spatial Perception
LingBot-Depth transforms incomplete and noisy depth sensor data into high-quality, metric-accurate 3D measurements. By jointly aligning RGB appearance and depth geometry in a unified latent space, our model serves as a powerful spatial perception foundation for robot learning and 3D vision applications.
## Available Models
| Model | Hugging Face Model | ModelScope Model | Description |
|---|---|---|---|
| LingBot-Depth | robbyant/lingbot-depth-pretrain-vitl-14 | robbyant/lingbot-depth-pretrain-vitl-14 | General-purpose depth refinement |
| LingBot-Depth-DC | robbyant/lingbot-depth-postrain-dc-vitl14 | robbyant/lingbot-depth-postrain-dc-vitl14 | Optimized for sparse depth completion |
## Quick Start

```python
import torch
from mdm.model.v2 import MDMModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# For general depth refinement
model = MDMModel.from_pretrained('robbyant/lingbot-depth-pretrain-vitl-14').to(device)

# For sparse depth completion (e.g., SfM inputs)
model = MDMModel.from_pretrained('robbyant/lingbot-depth-postrain-dc-vitl14').to(device)
```
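The card does not document the model's forward signature, so here is a minimal sketch of preparing RGB-D inputs, assuming the common convention of an RGB tensor in `[0, 1]` and a metric depth tensor with zeros marking invalid pixels (both conventions are assumptions, as is the commented-out forward call):

```python
import torch

# Hypothetical input preparation: a 3xHxW RGB image in [0, 1] and a
# 1xHxW depth map in metres, with 0 marking missing measurements.
H, W = 448, 448                        # multiples of the ViT-L/14 patch size
rgb = torch.rand(1, 3, H, W)
depth = torch.rand(1, 1, H, W) * 5.0   # metric depth up to 5 m
mask = torch.rand(1, 1, H, W) > 0.9    # keep roughly 10% of pixels
sparse_depth = depth * mask            # zeros elsewhere = invalid

# The forward call below is an assumption about the API, not documented:
# pred = model(rgb.to(device), sparse_depth.to(device))
```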
## Model Overview
### LingBot-Depth (Pretrained)

A general-purpose model trained on 10M RGB-D samples for:
- Depth completion from RGB-D sensor inputs
- Depth refinement for noisy measurements
- Point cloud generation
### LingBot-Depth-DC (Depth Completion)

A post-trained variant optimized for sparse depth completion:
- Recovering dense depth from SfM/SLAM sparse points
- Handling extremely sparse inputs (<5% valid pixels)
- RGB-guided depth densification
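Sparse SfM/SLAM points are typically rasterized into a depth map before being fed to a completion model. A minimal sketch of that rasterization step (the point data and layout here are illustrative, not part of the released API):

```python
import torch

# Hypothetical SfM output: N points with pixel coordinates (u, v) and depth z.
H, W, N = 240, 320, 1500               # at most ~2% of pixels valid
u = torch.randint(0, W, (N,))
v = torch.randint(0, H, (N,))
z = torch.rand(N) * 10.0 + 0.1         # depths in (0.1, 10.1) m

sparse = torch.zeros(H, W)
sparse[v, u] = z                       # zeros elsewhere = missing

valid_frac = (sparse > 0).float().mean().item()
# valid_frac stays well under 0.05, i.e. the "<5% valid pixels" regime
```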
## Key Features
- Masked Depth Modeling: Self-supervised pre-training via depth reconstruction
- Cross-Modal Attention: Joint RGB-Depth alignment in unified latent space
- Metric-Scale Preservation: Maintains real-world measurements for downstream tasks
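To illustrate the masked-modeling idea with a generic MAE-style sketch (not the paper's exact recipe; the 75% mask ratio and patch layout are assumptions): depth patches are randomly hidden, and the model learns to reconstruct them from the visible RGB and depth context.

```python
import torch

patch = 14                             # ViT-L/14 patch size
H, W = 224, 224
depth = torch.rand(1, 1, H, W)

# Split into (H/14) x (W/14) patches and hide a random ~75% of them.
n_h, n_w = H // patch, W // patch
keep = torch.rand(n_h, n_w) > 0.75     # True = patch stays visible
mask = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
masked_depth = depth * mask            # model input; full depth is the target

# A training step would minimize a reconstruction loss on hidden patches:
# loss = (pred - depth).abs()[:, :, ~mask].mean()
```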
## Architecture
- Encoder: ViT-Large/14 (24 layers) with separated patch embeddings for RGB and depth
- Decoder: ConvStack decoder with hierarchical upsampling
- Model size: ~300M parameters
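The ~300M figure is consistent with a standard ViT-Large transformer trunk. A rough back-of-envelope count, assuming standard ViT-L hyperparameters (width 1024, MLP ratio 4; these values are not stated in this card, and the decoder and patch embeddings are excluded):

```python
# Standard ViT-Large hyperparameters (assumed, not stated in this card).
layers, d, mlp_ratio = 24, 1024, 4

attn = 4 * d * d                 # Q, K, V, and output projections
mlp = 2 * d * (mlp_ratio * d)    # two linear layers per MLP block
per_block = attn + mlp
total = layers * per_block       # ~302M, matching the stated model size
```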
## Links

- GitHub: https://github.com/robbyant/lingbot-depth
- Paper: Masked Depth Modeling for Spatial Perception (arXiv:2601.17895)
- Project Page: https://technology.robbyant.com/lingbot-depth
## Citation

```bibtex
@article{lingbot-depth2026,
  title={Masked Depth Modeling for Spatial Perception},
  author={Tan, Bin and Sun, Changjiang and Qin, Xiage and Adai, Hanat and Fu, Zelin and Zhou, Tianxiang and Zhang, Han and Xu, Yinghao and Zhu, Xing and Shen, Yujun and Xue, Nan},
  journal={arXiv preprint arXiv:2601.17895},
  year={2026}
}
```
## License
Apache License 2.0