LingBot-Depth-DC (Depth Completion)
LingBot-Depth-DC is a post-trained variant of LingBot-Depth, specifically optimized for sparse depth completion tasks. This model excels at recovering dense depth maps from highly sparse inputs such as SfM/SLAM point clouds.
Model Details
Model Description
This model builds upon the LingBot-Depth pretrained checkpoint with additional post-training focused on sparse depth completion scenarios. It is particularly effective for:
- Recovering complete depth from sparse SfM/SLAM observations
- Handling extremely sparse depth inputs (e.g., <5% valid pixels; see the sparsity sketch after this list)
- Scenarios where depth sensors are unavailable and only sparse geometric cues exist
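To make the "extreme sparsity" regime concrete, here is a minimal sketch that simulates an SfM/SLAM-style sparse observation by keeping a random subset (under 5%) of pixels from a dense depth map. The dense `depth_gt` tensor and the 2% keep ratio are placeholder assumptions for illustration, not values prescribed by the model.

```python
import torch

# Minimal sketch: simulate an SfM/SLAM-style sparse depth observation by
# keeping a random <5% subset of pixels from a dense depth map.
# `depth_gt` (a dense H x W depth map in metres) is a placeholder for
# whatever dense reference or sensor depth you start from.
H, W = 480, 640
depth_gt = torch.rand(H, W) * 10.0            # placeholder dense depth (0-10 m)

keep_ratio = 0.02                              # ~2% valid pixels (extreme sparsity)
valid_mask = torch.rand(H, W) < keep_ratio     # boolean mask of observed pixels
sparse_depth = torch.where(valid_mask, depth_gt, torch.zeros_like(depth_gt))

print(f"valid pixels: {valid_mask.float().mean().item():.2%}")
```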
Developed by: Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, Nan Xue
Model type: Vision Transformer for sparse depth completion
License: Apache 2.0
Finetuned from model: LingBot-Depth (pretrained)
Model Sources
- Repository: https://github.com/robbyant/lingbot-depth
- Paper: Masked Depth Modeling for Spatial Perception
- Project Page: https://technology.robbyant.com/lingbot-depth
Related Models
| Model | Hugging Face Model | ModelScope Model | Description |
|---|---|---|---|
| LingBot-Depth | robbyant/lingbot-depth-pretrain-vitl-14 | robbyant/lingbot-depth-pretrain-vitl-14 | General-purpose depth refinement |
| LingBot-Depth-DC | robbyant/lingbot-depth-postrain-dc-vitl14 | robbyant/lingbot-depth-postrain-dc-vitl14 | Optimized for sparse depth completion |
Uses
Direct Use
- Sparse Depth Completion: Recovering dense depth from SfM/SLAM sparse point clouds
- Extreme Sparsity Handling: Working with <5% valid depth pixels
- RGB-guided Depth Densification: Using visual context to fill large missing regions (a schematic usage sketch follows this list)
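The snippet below is only a schematic sketch of how such a completion call might look. The loader `load_lingbot_depth_dc`, the checkpoint filename, and the `(rgb, sparse_depth, valid_mask)` call signature are hypothetical placeholders, not the repository's actual API; consult https://github.com/robbyant/lingbot-depth for the real entry point.

```python
import torch

# Schematic sketch only: `load_lingbot_depth_dc` and the call signature below
# are hypothetical placeholders, not the repository's documented interface.
def load_lingbot_depth_dc(checkpoint_path: str) -> torch.nn.Module:
    raise NotImplementedError("placeholder; use the official loader from the repo")

model = load_lingbot_depth_dc("lingbot-depth-postrain-dc-vitl14.pth")
model.eval()

rgb = torch.rand(1, 3, 476, 630)             # RGB image, values in [0, 1]
sparse_depth = torch.zeros(1, 1, 476, 630)   # metric depth, 0 where unobserved
valid_mask = sparse_depth > 0                # marks the observed pixels

with torch.no_grad():
    dense_depth = model(rgb, sparse_depth, valid_mask)  # hypothetical signature
```

Image sides are chosen as multiples of 14 (476 = 34 x 14, 630 = 45 x 14) to match the ViT-L/14 patch size; the actual preprocessing expected by the released checkpoint may differ.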
Downstream Use
- SLAM Enhancement: Densifying sparse SLAM outputs for better scene understanding (see the projection sketch after this list)
- Novel View Synthesis: Providing dense geometry for view synthesis pipelines
- 3D Reconstruction: Completing sparse depth for mesh reconstruction
- Robotics Navigation: Dense depth from sparse sensor observations
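For the SLAM-densification use case, a common preprocessing step is projecting the tracked 3D points into the camera frame to form the sparse depth input. The sketch below assumes a pinhole camera model; the intrinsics and the randomly generated points are placeholder values, to be replaced with your calibration and map points.

```python
import numpy as np

# Sketch: project SLAM/SfM 3D points (camera coordinates, in metres) into a
# sparse depth map with a pinhole camera model. Intrinsics and points here
# are placeholder values.
H, W = 480, 640
fx, fy, cx, cy = 525.0, 525.0, W / 2, H / 2                 # assumed intrinsics
points_cam = np.random.uniform([-2, -2, 0.5], [2, 2, 8], size=(3000, 3))  # X, Y, Z

z = points_cam[:, 2]
u = np.round(fx * points_cam[:, 0] / z + cx).astype(int)
v = np.round(fy * points_cam[:, 1] / z + cy).astype(int)

in_view = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
sparse_depth = np.zeros((H, W), dtype=np.float32)
sparse_depth[v[in_view], u[in_view]] = z[in_view]           # last write wins on collisions

print(f"valid pixels: {(sparse_depth > 0).mean():.2%}")
```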
Technical Specifications
Model Architecture
- Encoder: ViT-Large/14 (24 layers) with separated patch embeddings for RGB and depth (illustrated in the sketch after this list)
- Decoder: ConvStack decoder with hierarchical upsampling
- Objective: Masked depth modeling optimized for sparse inputs
- Model size: ~300M parameters
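As a rough illustration of the "separated patch embeddings" idea, the sketch below patchifies RGB and sparse depth with independent projections before the tokens would enter shared transformer blocks. The 1024-d token width and 14-pixel patch follow ViT-L/14 conventions; feeding the validity mask as a second depth channel and fusing the two token streams by addition are assumptions made here for brevity, not necessarily how the released model combines them.

```python
import torch
import torch.nn as nn

class SeparatedPatchEmbed(nn.Module):
    """Illustrative sketch: independent patch embeddings for RGB and depth.

    Token width (1024) and patch size (14) follow ViT-L/14 conventions;
    the additive fusion of the two streams is an assumption for brevity.
    """

    def __init__(self, embed_dim: int = 1024, patch: int = 14):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        # Depth stream: sparse depth plus its validity mask as a second channel.
        self.depth_proj = nn.Conv2d(2, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, rgb, sparse_depth, valid_mask):
        rgb_tokens = self.rgb_proj(rgb).flatten(2).transpose(1, 2)        # (B, N, C)
        depth_in = torch.cat([sparse_depth, valid_mask.float()], dim=1)
        depth_tokens = self.depth_proj(depth_in).flatten(2).transpose(1, 2)
        return rgb_tokens + depth_tokens   # fused tokens fed to the ViT blocks

embed = SeparatedPatchEmbed()
tokens = embed(torch.rand(1, 3, 476, 630),
               torch.rand(1, 1, 476, 630),
               torch.rand(1, 1, 476, 630) > 0.95)
print(tokens.shape)   # torch.Size([1, 1530, 1024]); 34 x 45 patches
```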
Software Requirements
- Python >= 3.9
- PyTorch >= 2.0.0
- xformers (memory-efficient attention; a quick environment check is sketched below)
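A quick check that an environment meets the requirements above; the version thresholds mirror the list, and the optional xformers import simply reports whether the package is present.

```python
import sys
import torch

# Quick environment check mirroring the requirements above.
assert sys.version_info >= (3, 9), "Python >= 3.9 required"

major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (2, 0), "PyTorch >= 2.0.0 required"

try:
    import xformers  # memory-efficient attention backend
    print("xformers", xformers.__version__)
except ImportError:
    print("xformers not installed")
```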
Citation
@article{lingbot-depth2026,
title={Masked Depth Modeling for Spatial Perception},
author={Tan, Bin and Sun, Changjiang and Qin, Xiage and Adai, Hanat and Fu, Zelin and Zhou, Tianxiang and Zhang, Han and Xu, Yinghao and Zhu, Xing and Shen, Yujun and Xue, Nan},
journal={arXiv preprint arXiv:2601.17895},
year={2026}
}