LingBot-Depth (Pretrained)

LingBot-Depth transforms incomplete and noisy depth sensor data into high-quality, metric-accurate 3D measurements. This is the general-purpose pretrained model for depth refinement tasks.

Model Details

Model Description

LingBot-Depth employs a masked depth modeling (MDM) approach that treats missing depth measurements from RGB-D sensors not as noise, but as a natural masking signal that highlights geometric ambiguities. The model learns joint representations from RGB appearance context and valid depth observations, enabling robust depth reasoning under incomplete observations.
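The core idea can be sketched in a few lines. This is a hypothetical training-step sketch, not the released implementation: the `model` interface, the zero-as-invalid convention, and the L1 objective are all assumptions made for illustration.

```python
import torch

def mdm_loss(model, rgb, raw_depth, gt_depth):
    """Illustrative masked-depth-modeling step.

    Invalid sensor readings (zeros in raw_depth) act as the natural mask:
    the model sees RGB plus only the valid depth values and is supervised
    on the full ground truth, so it must infer the missing regions from
    appearance context.
    """
    valid = (raw_depth > 0).float()        # 1 where the sensor returned depth
    masked_depth = raw_depth * valid       # holes stay zero (the "masked" tokens)
    pred = model(rgb, masked_depth)        # assumed joint RGB + sparse-depth model
    # L1 supervision wherever ground truth exists, including inside the holes
    gt_valid = (gt_depth > 0).float()
    return (torch.abs(pred - gt_depth) * gt_valid).sum() / gt_valid.sum().clamp(min=1)
```

Because the holes come from the sensor itself, no artificial masking schedule is needed: the hard-to-measure regions are exactly the ones the model is forced to reason about.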

  • Developed by: Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, Nan Xue
  • Model type: Vision Transformer for depth completion and refinement
  • License: Apache 2.0

Model Sources

  • Paper: arXiv:2601.17895 (see Citation below)

Related Models

| Model | Hugging Face Model | ModelScope Model | Description |
|---|---|---|---|
| LingBot-Depth | robbyant/lingbot-depth-pretrain-vitl-14 | robbyant/lingbot-depth-pretrain-vitl-14 | General-purpose depth refinement |
| LingBot-Depth-DC | robbyant/lingbot-depth-postrain-dc-vitl14 | robbyant/lingbot-depth-postrain-dc-vitl14 | Optimized for sparse depth completion |

Uses

Direct Use

  • Depth Completion: Filling missing regions in raw RGB-D sensor depth maps with metric accuracy
  • Depth Refinement: Improving noisy depth measurements from consumer-grade depth cameras
  • Point Cloud Generation: Producing clean 3D point clouds from RGB-D inputs
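As an example of the point-cloud use case, a refined metric depth map can be back-projected with a standard pinhole camera model. This helper and its intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative, not values shipped with the model:

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W) into an (N, 3) point cloud
    using a pinhole camera model. Pixels with zero depth are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx  # horizontal offset from principal point, scaled by depth
    y = (v - cy) * z / fy  # vertical offset, likewise
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]
```

With the model's completed depth maps, the resulting clouds are dense and hole-free, unlike those produced from raw sensor depth.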

Downstream Use

  • Scene Reconstruction: High-fidelity indoor mapping with strong depth priors
  • 4D Point Tracking: Accurate dynamic tracking in metric space for robot learning
  • Dexterous Manipulation: Robust robotic grasping with precise geometric understanding
  • Monocular Depth Estimation: As a pretrained backbone for depth estimation models
  • Stereo Matching: As a depth prior for stereo matching networks (e.g., FoundationStereo)

Technical Specifications

Model Architecture

  • Encoder: ViT-Large/14 (24 layers) with separated patch embeddings for RGB and depth
  • Decoder: ConvStack decoder with hierarchical upsampling
  • Objective: Masked depth modeling
  • Model size: ~300M parameters
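The "separated patch embeddings" design can be sketched as follows. This is a minimal sketch under assumed layer sizes (patch 14, width 1024 to match ViT-L/14) and an assumed additive fusion; it is not the released architecture code:

```python
import torch
import torch.nn as nn

class SeparatedPatchEmbed(nn.Module):
    """Illustrative separated patch embeddings: RGB and depth are patchified
    by independent projections before entering a shared ViT encoder."""

    def __init__(self, patch=14, dim=1024):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.depth_proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, rgb, depth):
        # Each modality gets its own tokens from its own projection weights.
        rgb_tok = self.rgb_proj(rgb).flatten(2).transpose(1, 2)      # (B, N, C)
        dep_tok = self.depth_proj(depth).flatten(2).transpose(1, 2)  # (B, N, C)
        # Additive fusion is one plausible choice for combining the streams.
        return rgb_tok + dep_tok
```

Separate projections let the encoder treat a depth hole differently from a dark RGB pixel, which a single fused input channel could not distinguish as cleanly.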

Software Requirements

  • Python >= 3.9
  • PyTorch >= 2.0.0
  • xformers

Citation

@article{lingbot-depth2026,
  title={Masked Depth Modeling for Spatial Perception},
  author={Tan, Bin and Sun, Changjiang and Qin, Xiage and Adai, Hanat and Fu, Zelin and Zhou, Tianxiang and Zhang, Han and Xu, Yinghao and Zhu, Xing and Shen, Yujun and Xue, Nan},
  journal={arXiv preprint arXiv:2601.17895},
  year={2026}
}
