LingBot-Depth-DC (Depth Completion)
LingBot-Depth-DC is a post-trained variant of LingBot-Depth, specifically optimized for sparse depth completion tasks. This model excels at recovering dense depth maps from highly sparse inputs such as SfM/SLAM point clouds.
Model Details
Model Description
This model builds upon the LingBot-Depth pretrained checkpoint with additional post-training focused on sparse depth completion scenarios. It is particularly effective for:
- Recovering complete depth from sparse SfM/SLAM observations
- Handling extremely sparse depth inputs (e.g., <5% valid pixels; see the sparsity sketch after this list)
- Scenarios where depth sensors are unavailable and only sparse geometric cues exist
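To make the "extreme sparsity" regime concrete, here is a minimal sketch that simulates an SfM/SLAM-style sparse observation by keeping a random subset (under 5%) of pixels from a dense depth map. The dense `depth_gt` tensor and the 2% keep ratio are placeholder assumptions for illustration, not values prescribed by the model.

```python
import torch

# Minimal sketch: simulate an SfM/SLAM-style sparse depth observation by
# keeping a random <5% subset of pixels from a dense depth map.
# `depth_gt` (a dense H x W depth map in metres) is a placeholder for
# whatever dense reference or sensor depth you start from.
H, W = 480, 640
depth_gt = torch.rand(H, W) * 10.0            # placeholder dense depth (0-10 m)

keep_ratio = 0.02                              # ~2% valid pixels (extreme sparsity)
valid_mask = torch.rand(H, W) < keep_ratio     # boolean mask of observed pixels
sparse_depth = torch.where(valid_mask, depth_gt, torch.zeros_like(depth_gt))

print(f"valid pixels: {valid_mask.float().mean().item():.2%}")
```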
Developed by: Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, Nan Xue
Model type: Vision Transformer for sparse depth completion
License: Apache 2.0
Finetuned from model: LingBot-Depth (pretrained)
Model Sources
- Repository: https://github.com/robbyant/lingbot-depth
- Paper: Masked Depth Modeling for Spatial Perception
- Project Page: https://technology.robbyant.com/lingbot-depth
Related Models
| Model | Hugging Face Model | ModelScope Model | Description |
|---|---|---|---|
| LingBot-Depth | robbyant/lingbot-depth-pretrain-vitl-14 | robbyant/lingbot-depth-pretrain-vitl-14 | General-purpose depth refinement |
| LingBot-Depth-DC | robbyant/lingbot-depth-postrain-dc-vitl14 | robbyant/lingbot-depth-postrain-dc-vitl14 | Optimized for sparse depth completion |
Uses
Direct Use
- Sparse Depth Completion: Recovering dense depth from SfM/SLAM sparse point clouds
- Extreme Sparsity Handling: Working with <5% valid depth pixels
- RGB-guided Depth Densification: Using visual context to fill large missing regions (a schematic usage sketch follows this list)
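The snippet below is only a schematic sketch of how such a completion call might look. The loader `load_lingbot_depth_dc`, the checkpoint filename, and the `(rgb, sparse_depth, valid_mask)` call signature are hypothetical placeholders, not the repository's actual API; consult https://github.com/robbyant/lingbot-depth for the real entry point.

```python
import torch

# Schematic sketch only: `load_lingbot_depth_dc` and the call signature below
# are hypothetical placeholders, not the repository's documented interface.
def load_lingbot_depth_dc(checkpoint_path: str) -> torch.nn.Module:
    raise NotImplementedError("placeholder; use the official loader from the repo")

model = load_lingbot_depth_dc("lingbot-depth-postrain-dc-vitl14.pth")
model.eval()

rgb = torch.rand(1, 3, 476, 630)             # RGB image, values in [0, 1]
sparse_depth = torch.zeros(1, 1, 476, 630)   # metric depth, 0 where unobserved
valid_mask = sparse_depth > 0                # marks the observed pixels

with torch.no_grad():
    dense_depth = model(rgb, sparse_depth, valid_mask)  # hypothetical signature
```

Image sides are chosen as multiples of 14 (476 = 34 x 14, 630 = 45 x 14) to match the ViT-L/14 patch size; the actual preprocessing expected by the released checkpoint may differ.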
Downstream Use
- SLAM Enhancement: Densifying sparse SLAM outputs for better scene understanding (see the projection sketch after this list)
- Novel View Synthesis: Providing dense geometry for view synthesis pipelines
- 3D Reconstruction: Completing sparse depth for mesh reconstruction
- Robotics Navigation: Dense depth from sparse sensor observations
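For the SLAM-densification use case, a common preprocessing step is projecting the tracked 3D points into the camera frame to form the sparse depth input. The sketch below assumes a pinhole camera model; the intrinsics and the randomly generated points are placeholder values, to be replaced with your calibration and map points.

```python
import numpy as np

# Sketch: project SLAM/SfM 3D points (camera coordinates, in metres) into a
# sparse depth map with a pinhole camera model. Intrinsics and points here
# are placeholder values.
H, W = 480, 640
fx, fy, cx, cy = 525.0, 525.0, W / 2, H / 2                 # assumed intrinsics
points_cam = np.random.uniform([-2, -2, 0.5], [2, 2, 8], size=(3000, 3))  # X, Y, Z

z = points_cam[:, 2]
u = np.round(fx * points_cam[:, 0] / z + cx).astype(int)
v = np.round(fy * points_cam[:, 1] / z + cy).astype(int)

in_view = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
sparse_depth = np.zeros((H, W), dtype=np.float32)
sparse_depth[v[in_view], u[in_view]] = z[in_view]           # last write wins on collisions

print(f"valid pixels: {(sparse_depth > 0).mean():.2%}")
```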
Technical Specifications
Model Architecture
- Encoder: ViT-Large/14 (24 layers) with separated patch embeddings for RGB and depth (illustrated in the sketch after this list)
- Decoder: ConvStack decoder with hierarchical upsampling
- Objective: Masked depth modeling optimized for sparse inputs
- Model size: ~300M parameters
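As a rough illustration of the "separated patch embeddings" idea, the sketch below patchifies RGB and sparse depth with independent projections before the tokens would enter shared transformer blocks. The 1024-d token width and 14-pixel patch follow ViT-L/14 conventions; feeding the validity mask as a second depth channel and fusing the two token streams by addition are assumptions made here for brevity, not necessarily how the released model combines them.

```python
import torch
import torch.nn as nn

class SeparatedPatchEmbed(nn.Module):
    """Illustrative sketch: independent patch embeddings for RGB and depth.

    Token width (1024) and patch size (14) follow ViT-L/14 conventions;
    the additive fusion of the two streams is an assumption for brevity.
    """

    def __init__(self, embed_dim: int = 1024, patch: int = 14):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        # Depth stream: sparse depth plus its validity mask as a second channel.
        self.depth_proj = nn.Conv2d(2, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, rgb, sparse_depth, valid_mask):
        rgb_tokens = self.rgb_proj(rgb).flatten(2).transpose(1, 2)        # (B, N, C)
        depth_in = torch.cat([sparse_depth, valid_mask.float()], dim=1)
        depth_tokens = self.depth_proj(depth_in).flatten(2).transpose(1, 2)
        return rgb_tokens + depth_tokens   # fused tokens fed to the ViT blocks

embed = SeparatedPatchEmbed()
tokens = embed(torch.rand(1, 3, 476, 630),
               torch.rand(1, 1, 476, 630),
               torch.rand(1, 1, 476, 630) > 0.95)
print(tokens.shape)   # torch.Size([1, 1530, 1024]); 34 x 45 patches
```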
Software Requirements
- Python >= 3.9
- PyTorch >= 2.0.0
- xformers (memory-efficient attention; a quick environment check is sketched below)
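A quick check that an environment meets the requirements above; the version thresholds mirror the list, and the optional xformers import simply reports whether the package is present.

```python
import sys
import torch

# Quick environment check mirroring the requirements above.
assert sys.version_info >= (3, 9), "Python >= 3.9 required"

major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (2, 0), "PyTorch >= 2.0.0 required"

try:
    import xformers  # memory-efficient attention backend
    print("xformers", xformers.__version__)
except ImportError:
    print("xformers not installed")
```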
Citation
@article{lingbot-depth2026,
title={Masked Depth Modeling for Spatial Perception},
author={Tan, Bin and Sun, Changjiang and Qin, Xiage and Adai, Hanat and Fu, Zelin and Zhou, Tianxiang and Zhang, Han and Xu, Yinghao and Zhu, Xing and Shen, Yujun and Xue, Nan},
journal={arXiv preprint arXiv:2601.17895},
year={2026}
}