File size: 9,269 Bytes
545af1c 3aa4237 545af1c 3aa4237 545af1c 3aa4237 545af1c 3aa4237 5409063 545af1c 3aa4237 545af1c 3aa4237 545af1c 3aa4237 545af1c 3aa4237 b6fb620 545af1c 3aa4237 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 |
---
base_model:
- depth-anything/Depth-Anything-V2-Small
language:
- en
license: cc-by-nc-4.0
pipeline_tag: depth-estimation
tags:
- depth
- relative depth
- depth anything
---
# Depth Anything at Any Condition
## Paper, Project Page, and Code
The model was presented in the paper [Depth Anything at Any Condition](https://huggingface.co/papers/2507.01634).
Project page: [https://ghost233lism.github.io/depthanything-AC-page/](https://ghost233lism.github.io/depthanything-AC-page/)
Code: [https://github.com/HVision-NKU/DepthAnythingAC](https://github.com/HVision-NKU/DepthAnythingAC)
## Abstract
We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability of generating high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks.
## Introduction
[DepthAnything-AC](https://arxiv.org/abs/2507.01634) is a robust monocular depth estimation (MDE) model fine-tuned from [DepthAnything-V2](https://github.com/DepthAnything/Depth-Anything-V2), designed for **zero-shot depth estimation under diverse and challenging environmental conditions**, including low light, adverse weather, and sensor distortions.
To address the lack of high-quality annotations in corrupted scenes, we introduce a lightweight **unsupervised consistency regularization** framework that enables training on unlabeled data. Additionally, our proposed **Spatial Distance Constraint** helps the model learn patch-level geometric relationships, enhancing semantic boundaries and fine details.
## Installation
### Requirements
- Python>=3.9
- torch==2.3.0
- torchvision==0.18.0
- torchaudio==2.3.0
- cuda==12.1
### Setup
```bash
git clone https://github.com/HVision-NKU/DepthAnythingAC.git
cd DepthAnythingAC
conda create -n depth_anything_ac python=3.9
conda activate depth_anything_ac
pip install -r requirements.txt
```
## Usage
### Get Depth-Anything-AC Model
Download the pre-trained checkpoints from huggingface:
```bash
mkdir checkpoints
cd checkpoints
# (Optional) Using huggingface mirrors
export HF_ENDPOINT=https://hf-mirror.com
# download DepthAnything-AC model from huggingface
huggingface-cli download --resume-download ghost233lism/DepthAnything-AC --local-dir ghost233lism/DepthAnything-AC
```
### Quick Inference
We provide quick inference scripts for single/batch image input in `tools/`. Please refer to the [infer README](https://github.com/HVision-NKU/DepthAnythingAC/blob/main/tools/README.md) for detailed information.
### Training
We provide the full training process of DepthAnything-AC, including consistency regularization, spatial distance extraction/constraint and wide-used Affine-Invariant Loss Function.
Prepare your configuration in `configs/` file and run:
```bash
bash tools/train.sh <num_gpu> <port>
```
### Evaluation
We provide the direct evaluation for DA-2K, enhanced DA-2K, KITTI, NYU-D, Sintel, ETH3D, DIODE, NuScenes-Night, RobotCar-night, DS-rain/cloud/fog, KITTI-C benchmarks. You may refer to `configs/` for more details.
```bash
bash tools/val.sh <num_gpu> <port> <dataset>
```
## Results
### Quantitative Results
#### DA-2K Multi-Condition Robustness Results
Quantitative results on the enhanced multi-condition DA-2K benchmark, including complex light and climate conditions. The evaluation metric is **Accuracy** β.
| Method | Encoder | **DA-2K** | **DA-2K dark** | **DA-2K fog** | **DA-2K snow** | **DA-2K blur** |
|:-------|:-------:|:---------:|:---------------:|:--------------:|:---------------:|:---------------:|
| DynaDepth | ResNet | 0.655 | 0.652 | 0.613 | 0.605 | 0.633 |
| EC-Depth | ViT-S | 0.753 | 0.732 | 0.724 | 0.713 | 0.701 |
| STEPS | ResNet | 0.577 | 0.587 | 0.581 | 0.561 | 0.577 |
| RobustDepth | ViT-S | 0.724 | 0.716 | 0.686 | 0.668 | 0.680 |
| Weather-Depth | ViT-S | 0.745 | 0.724 | 0.716 | 0.697 | 0.666 |
| DepthPro | ViT-S | 0.947 | 0.872 | 0.902 | 0.793 | 0.772 |
| DepthAnything V1 | ViT-S | 0.884 | 0.859 | 0.836 | 0.880 | 0.821 |
| DepthAnything V2 | ViT-S | 0.952 | 0.910 | 0.922 | 0.880 | 0.862 |
| **Depth Anything AC** | ViT-S | **0.953** | **0.923** | **0.929** | **0.892** | **0.880** |
#### Zero-shot Relative Depth Estimation on Real Complex Benchmarks
Zero-shot evaluation results on challenging real-world scenarios including night scenes, adverse weather conditions, and complex environmental factors. All results use ViT-S encoder.
| Method | Encoder | **NuScenes-night** | | **RobotCar-night** | | **DS-rain** | | **DS-cloud** | | **DS-fog** | |
|:-------|:-------:|:----------------:|:---:|:----------------:|:---:|:---------:|:---:|:----------:|:---:|:--------:|:---:|
| | | AbsRel β | Ξ΄β β | AbsRel β | Ξ΄β β | AbsRel β | Ξ΄β β | AbsRel β | Ξ΄β β | AbsRel β | Ξ΄β β |
| DynaDepth | ResNet | 0.381 | 0.394 | 0.512 | 0.294 | 0.239 | 0.606 | 0.172 | 0.608 | 0.144 | 0.901 |
| EC-Depth | ViT-S | 0.243 | 0.623 | 0.228 | 0.552 | 0.155 | 0.766 | 0.158 | 0.767 | 0.109 | 0.861 |
| STEPS | ResNet | 0.252 | 0.588 | 0.350 | 0.367 | 0.301 | 0.480 | 0.252 | 0.588 | 0.216 | 0.641 |
| RobustDepth | ViT-S | 0.260 | 0.597 | 0.311 | 0.521 | 0.167 | 0.755 | 0.168 | 0.775 | 0.105 | 0.882 |
| Weather-Depth | ViT-S | - | - | - | - | 0.158 | 0.764 | 0.160 | 0.767 | 0.105 | 0.879 |
| Syn2Real | ViT-S | - | - | - | - | 0.171 | 0.729 | - | - | 0.128 | 0.845 |
| DepthPro | ViT-S | 0.218 | 0.669 | 0.237 | 0.534 | **0.124** | **0.841** | 0.158 | 0.779 | **0.102** | **0.892** |
| DepthAnything V1 | ViT-S | 0.232 | 0.679 | 0.239 | 0.518 | 0.133 | 0.819 | 0.150 | **0.801** | 0.098 | 0.891 |
| DepthAnything V2 | ViT-S | 0.200 | 0.725 | 0.239 | 0.518 | 0.125 | 0.840 | 0.151 | 0.798 | 0.103 | 0.890 |
| **Depth Anything AC** | ViT-S | **0.198** | **0.727** | **0.227** | **0.555** | 0.125 | 0.840 | **0.149** | **0.801** | 0.103 | 0.889 |
*Bold: Best performance, Underlined: Second best performance. NuScenes-night and RobotCar-night represent nighttime driving scenarios. DS-rain, DS-cloud, and DS-fog are DrivingStereo weather variation datasets.*
#### Zero-shot Relative Depth Estimation on Synthetic KITTI-C Benchmarks
Zero-shot evaluation results on synthetic KITTI-C corruption benchmarks, testing robustness against various image degradations and corruptions.
| Method | Encoder | **Dark** | | **Snow** | | **Motion** | | **Gaussian** | |
|:-------|:-------:|:--------:|:---:|:--------:|:---:|:----------:|:---:|:------------:|:---:|
| | | AbsRel β | Ξ΄β β | AbsRel β | Ξ΄β β | AbsRel β | Ξ΄β β | AbsRel β | Ξ΄β β |
| DynaDepth | ResNet | 0.163 | 0.752 | 0.338 | 0.393 | 0.234 | 0.609 | 0.274 | 0.501 |
| STEPS | ResNet | 0.230 | 0.631 | 0.242 | 0.622 | 0.291 | 0.508 | 0.204 | 0.692 |
| DepthPro | ViT-S | 0.145 | 0.793 | 0.197 | 0.685 | 0.170 | 0.746 | 0.170 | 0.745 |
| DepthAnything V2 | ViT-S | **0.130** | 0.832 | 0.115 | 0.872 | 0.127 | 0.840 | 0.157 | 0.785 |
| **Depth Anything AC** | ViT-S | **0.130** | **0.834** | **0.114** | **0.873** | **0.126** | **0.841** | **0.153** | **0.793** |
*KITTI-C includes synthetic corruptions: Dark (low-light conditions), Snow (weather simulation), Motion (motion blur), and Gaussian (noise corruption).*
## Citation
If you find this work useful, please consider citing:
```bibtex
@article{sun2025depth,
title={Depth Anything at Any Condition},
author={Sun, Boyuan and Modi Jin and Bowen Yin and Hou, Qibin},
journal={arXiv preprint arXiv:2507.01634},
year={2025}
}
```
## License
This code is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) for non-commercial use only.
Please note that any commercial use of this code requires formal permission prior to use.
## Contact
For technical questions, please contact
sbysbysby123[AT]gmail.com or jin_modi[AT]mail.nankai.edu.cn
For commercial licensing, please contact andrewhoux[AT]gmail.com.
## Acknowledgements
We thank the authors of [DepthAnything](https://github.com/LiheYoung/Depth-Anything) and [DepthAnything V2](https://github.com/DepthAnything/Depth-Anything-V2) for their foundational work. We also acknowledge [DINOv2](https://github.com/facebookresearch/dinov2) for the robust visual encoder, [CorrMatch](https://github.com/BBBBchan/CorrMatch) for their codebase, and [RoboDepth](https://github.com/ldkong1205/RoboDepth) for their contributions. |