robbyant/lingbot-depth-pretrain-vitl-14

+---
+license: apache-2.0
+language:
+- en
+tags:
+- depth-estimation
+- depth-completion
+- rgb-d
+- computer-vision
+- robotics
+- 3d-vision
+- pytorch
+- vision-transformer
+datasets:
+- custom
+metrics:
+- rmse
+- mae
+library_name: pytorch
+pipeline_tag: depth-estimation
+---
+# LingBot-Depth (Pretrained)
+**LingBot-Depth** transforms incomplete and noisy depth sensor data into high-quality, metric-accurate 3D measurements. This is the **general-purpose pretrained model** for depth refinement tasks.
+## Model Details
+### Model Description
+LingBot-Depth uses masked depth modeling (MDM), which treats missing depth measurements from RGB-D sensors not as noise but as a natural masking signal that highlights geometric ambiguities. The model learns joint representations from RGB appearance context and the valid depth observations, which supports robust depth reasoning when observations are incomplete.
+- **Developed by:** Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, Nan Xue
+- **Model type:** Vision Transformer for depth completion and refinement
+- **License:** Apache 2.0
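
The masking idea in the model description can be illustrated with a small sketch. It is not the authors' training code: it assumes that missing sensor readings are stored as zeros or non-finite values (a common convention that this card does not state), and the helper names `validity_mask` and `patch_missing_ratio` are hypothetical. The patch size of 14 is taken from the ViT-Large/14 encoder name.

```python
import numpy as np

PATCH = 14  # patch size implied by the "ViT-Large/14" encoder name


def validity_mask(depth: np.ndarray) -> np.ndarray:
    """True where the sensor reported a usable depth (assumed: finite and > 0)."""
    return np.isfinite(depth) & (depth > 0)


def patch_missing_ratio(depth: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Fraction of missing pixels in each non-overlapping patch x patch tile."""
    h, w = depth.shape
    gh, gw = h // patch, w // patch
    # Crop to a whole number of patches, then view as (gh, patch, gw, patch).
    valid = validity_mask(depth)[: gh * patch, : gw * patch]
    tiles = valid.reshape(gh, patch, gw, patch)
    return 1.0 - tiles.mean(axis=(1, 3))


# Toy 28x28 depth map: one fully missing patch, one half-missing patch.
depth = np.ones((28, 28), dtype=np.float32)
depth[:14, :14] = 0.0         # sensor dropout encoded as zeros
depth[14:, 14:21] = np.nan    # invalid readings encoded as NaN
ratios = patch_missing_ratio(depth)
```

Patches with a high missing ratio are the regions where, under this reading of MDM, the model has to rely on RGB context rather than on measured depth.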
+### Model Sources
+- **Repository:** https://github.com/robbyant/lingbot-depth
+- **Paper:** [Masked Depth Modeling for Spatial Perception](https://arxiv.org/abs/2601.xxxxx)
+- **Project Page:** https://technology.robbyant.com/lingbot-depth
+### Related Models
+| Model | Repository | Description |
+|-------|------------|-------------|
+| LingBot-Depth | [robbyant/lingbot-depth-pretrain-vitl-14](https://huggingface.co/robbyant/lingbot-depth-pretrain-vitl-14) | General-purpose depth refinement (this model) |
+| LingBot-Depth-DC | [robbyant/lingbot-depth-postrain-dc-vitl14](https://huggingface.co/robbyant/lingbot-depth-postrain-dc-vitl14) | Optimized for sparse depth completion |
+## Uses
+### Direct Use
+- **Depth Completion**: Filling missing regions in raw RGB-D sensor depth maps with metric accuracy
+- **Depth Refinement**: Improving noisy depth measurements from consumer-grade depth cameras
+- **Point Cloud Generation**: Producing clean 3D point clouds from RGB-D inputs
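
As a rough illustration of the point cloud use case, the sketch below back-projects a metric depth map into 3D camera coordinates with the standard pinhole model. It does not use the LingBot-Depth API: the function name `depth_to_points` and the intrinsics values (`fx`, `fy`, `cx`, `cy`) are placeholders, and pixels with zero or non-finite depth are assumed to be invalid.

```python
import numpy as np


def depth_to_points(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Back-project a depth map (H, W) into an (N, 3) array of XYZ points.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    Pixels with non-finite or non-positive depth are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)


# Toy example with placeholder intrinsics; one pixel has no depth.
toy_depth = np.array([[2.0, 2.0],
                      [0.0, 2.0]])
points = depth_to_points(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In practice the intrinsics come from the camera calibration, and the depth passed in would be the refined output rather than the raw sensor map.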
+### Downstream Use
+- **Scene Reconstruction**: High-fidelity indoor mapping with strong depth priors
+- **4D Point Tracking**: Accurate dynamic tracking in metric space for robot learning
+- **Dexterous Manipulation**: Robust robotic grasping with precise geometric understanding
+- **Monocular Depth Estimation**: As a pretrained backbone for depth estimation models
+- **Stereo Matching**: As a depth prior for stereo matching networks (e.g., FoundationStereo)
+## Technical Specifications
+### Model Architecture
+- **Encoder:** ViT-Large/14 (24 layers) with separate patch embeddings for RGB and depth
+- **Decoder:** ConvStack decoder with hierarchical upsampling
+- **Objective:** Masked depth modeling
+- **Model size:** ~300M parameters
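
Two quick sanity checks follow from these numbers. The sketch assumes a 224x224 input purely as an example (the card does not state an input resolution) and assumes the checkpoint stores float32 weights with little else in the file. Under those assumptions, the 1,284,841,262-byte `model.pt` pointer in this commit works out to roughly 321M parameters, which is in line with the "~300M" figure.

```python
def num_patches(height: int, width: int, patch: int = 14) -> int:
    """Number of non-overlapping patch tokens for a ViT with the given patch size."""
    return (height // patch) * (width // patch)


def approx_params_from_bytes(n_bytes: int, bytes_per_param: int = 4) -> float:
    """Rough parameter count, assuming the file is dominated by float32 weights."""
    return n_bytes / bytes_per_param


tokens = num_patches(224, 224)                    # example resolution, an assumption
params = approx_params_from_bytes(1_284_841_262)  # size from the LFS pointer
print(f"{tokens} patch tokens, about {params / 1e6:.0f}M parameters")
```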
+### Software Requirements
+- Python >= 3.9
+- PyTorch >= 2.0.0
+- xformers
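
One way to check an environment against these requirements is sketched below. It only reads installed package metadata and does not import the libraries. The helper names are hypothetical, and `torch` and `xformers` are assumed to be the installed distribution names.

```python
import re
import sys
from importlib import metadata


def version_tuple(text: str) -> tuple:
    """Leading numeric components of a version string, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    return tuple(int(n) for n in re.findall(r"\d+", text.split("+")[0])[:3])


def check_environment() -> dict:
    """Map each requirement to True/False depending on whether it is satisfied."""
    report = {"python": sys.version_info >= (3, 9)}
    # xformers has no minimum version listed, so only its presence is checked.
    for dist, minimum in (("torch", (2, 0, 0)), ("xformers", None)):
        try:
            installed = version_tuple(metadata.version(dist))
        except metadata.PackageNotFoundError:
            report[dist] = False
            continue
        report[dist] = minimum is None or installed >= minimum
    return report


env_report = check_environment()
print(env_report)
```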
+## Citation
+```bibtex
+@article{lingbot-depth2026,
+  title={Masked Depth Modeling for Spatial Perception},
+  author={Tan, Bin and Sun, Changjiang and Qin, Xiage and Adai, Hanat and Fu, Zelin and Zhou, Tianxiang and Zhang, Han and Xu, Yinghao and Zhu, Xing and Shen, Yujun and Xue, Nan},
+  journal={arXiv preprint arXiv:2601.xxxxx},
+  year={2026}
+}
+```
+## Model Card Contact
+- **Email:** tanbin.tan@antgroup.com, xuenan.xue@antgroup.com
+- **Issues:** https://github.com/robbyant/lingbot-depth/issues

model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6ab1da5822e4fea712202616d1f3b683ce4b2f7f82ea58fb3f5ebd7cfae9c0e0
+size 1284841262