LingBot-Depth (Pretrained)

LingBot-Depth transforms incomplete and noisy depth sensor data into high-quality, metric-accurate 3D measurements. This is the general-purpose pretrained model for depth refinement tasks.

Model Details

Model Description

LingBot-Depth employs a masked depth modeling (MDM) approach that treats missing depth measurements from RGB-D sensors not as noise, but as a natural masking signal that highlights geometric ambiguities. The model learns joint representations from RGB appearance context and valid depth observations, enabling robust depth reasoning under incomplete observations.
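The core idea can be sketched in a few lines. This is a hypothetical training-step sketch, not the released implementation: the `model` interface, the zero-as-invalid convention, and the L1 objective are all assumptions made for illustration.

```python
import torch

def mdm_loss(model, rgb, raw_depth, gt_depth):
    """Illustrative masked-depth-modeling step.

    Invalid sensor readings (zeros in raw_depth) act as the natural mask:
    the model sees RGB plus only the valid depth values and is supervised
    on the full ground truth, so it must infer the missing regions from
    appearance context.
    """
    valid = (raw_depth > 0).float()        # 1 where the sensor returned depth
    masked_depth = raw_depth * valid       # holes stay zero (the "masked" tokens)
    pred = model(rgb, masked_depth)        # assumed joint RGB + sparse-depth model
    # L1 supervision wherever ground truth exists, including inside the holes
    gt_valid = (gt_depth > 0).float()
    return (torch.abs(pred - gt_depth) * gt_valid).sum() / gt_valid.sum().clamp(min=1)
```

Because the holes come from the sensor itself, no artificial masking schedule is needed: the hard-to-measure regions are exactly the ones the model is forced to reason about.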

  • Developed by: Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, Nan Xue
  • Model type: Vision Transformer for depth completion and refinement
  • License: Apache 2.0

Model Sources

  • Paper: arXiv:2601.17895 (see Citation below)

Related Models

| Model | Hugging Face Model | ModelScope Model | Description |
|---|---|---|---|
| LingBot-Depth | robbyant/lingbot-depth-pretrain-vitl-14 | robbyant/lingbot-depth-pretrain-vitl-14 | General-purpose depth refinement |
| LingBot-Depth-DC | robbyant/lingbot-depth-postrain-dc-vitl14 | robbyant/lingbot-depth-postrain-dc-vitl14 | Optimized for sparse depth completion |

Uses

Direct Use

  • Depth Completion: Filling missing regions in raw RGB-D sensor depth maps with metric accuracy
  • Depth Refinement: Improving noisy depth measurements from consumer-grade depth cameras
  • Point Cloud Generation: Producing clean 3D point clouds from RGB-D inputs
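As an example of the point-cloud use case, a refined metric depth map can be back-projected with a standard pinhole camera model. This helper and its intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative, not values shipped with the model:

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W) into an (N, 3) point cloud
    using a pinhole camera model. Pixels with zero depth are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx  # horizontal offset from principal point, scaled by depth
    y = (v - cy) * z / fy  # vertical offset, likewise
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]
```

With the model's completed depth maps, the resulting clouds are dense and hole-free, unlike those produced from raw sensor depth.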

Downstream Use

  • Scene Reconstruction: High-fidelity indoor mapping with strong depth priors
  • 4D Point Tracking: Accurate dynamic tracking in metric space for robot learning
  • Dexterous Manipulation: Robust robotic grasping with precise geometric understanding
  • Monocular Depth Estimation: As a pretrained backbone for depth estimation models
  • Stereo Matching: As a depth prior for stereo matching networks (e.g., FoundationStereo)

Technical Specifications

Model Architecture

  • Encoder: ViT-Large/14 (24 layers) with separated patch embeddings for RGB and depth
  • Decoder: ConvStack decoder with hierarchical upsampling
  • Objective: Masked depth modeling
  • Model size: ~300M parameters
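The "separated patch embeddings" design can be sketched as follows. This is a minimal sketch under assumed layer sizes (patch 14, width 1024 to match ViT-L/14) and an assumed additive fusion; it is not the released architecture code:

```python
import torch
import torch.nn as nn

class SeparatedPatchEmbed(nn.Module):
    """Illustrative separated patch embeddings: RGB and depth are patchified
    by independent projections before entering a shared ViT encoder."""

    def __init__(self, patch=14, dim=1024):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.depth_proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)

    def forward(self, rgb, depth):
        # Each modality gets its own tokens from its own projection weights.
        rgb_tok = self.rgb_proj(rgb).flatten(2).transpose(1, 2)      # (B, N, C)
        dep_tok = self.depth_proj(depth).flatten(2).transpose(1, 2)  # (B, N, C)
        # Additive fusion is one plausible choice for combining the streams.
        return rgb_tok + dep_tok
```

Separate projections let the encoder treat a depth hole differently from a dark RGB pixel, which a single fused input channel could not distinguish as cleanly.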

Software Requirements

  • Python >= 3.9
  • PyTorch >= 2.0.0
  • xformers

Citation

@article{lingbot-depth2026,
  title={Masked Depth Modeling for Spatial Perception},
  author={Tan, Bin and Sun, Changjiang and Qin, Xiage and Adai, Hanat and Fu, Zelin and Zhou, Tianxiang and Zhang, Han and Xu, Yinghao and Zhu, Xing and Shen, Yujun and Xue, Nan},
  journal={arXiv preprint arXiv:2601.17895},
  year={2026}
}
