---
license: apache-2.0
base_model: depth-anything/DA3MONO-LARGE
pipeline_tag: depth-estimation
library_name: coreml
tags:
- coreml
- depth-estimation
- monocular-depth
- depth-anything
- apple-silicon
- stereo
---
# DepthAnythingV3Mono-CoreML
A **CoreML conversion** of [`depth-anything/DA3MONO-LARGE`](https://huggingface.co/depth-anything/DA3MONO-LARGE),
the monocular-depth variant of Depth Anything 3 (DINOv2 ViT-L backbone + DPT head, ~0.35B params),
packaged for on-device inference on Apple Silicon (macOS 14+).
This is a derivative work of the original model, which is licensed **Apache-2.0**; this conversion is
released under the same license. All credit for the model itself goes to ByteDance / the Depth Anything 3
authors. See the [original repo](https://github.com/bytedance-seed/depth-anything-3).
## What's in here
- `DepthAnythingV3Mono.mlpackage`: an ML Program, **FP16** weights, minimum deployment target **macOS 14**.
## Interface
- **Input** `image`: an RGB image, **504×504** (a multiple of the DINOv2 patch size, 14).
ImageNet normalization is **baked into the graph**; the CoreML `ImageType` only rescales 0–255 → 0–1,
so you can hand it a `CVPixelBuffer` built straight from a `CGImage` with no manual preprocessing.
- **Output** `depth`: a single-channel `MLMultiArray` of shape `(1, 504, 504)` holding **relative** depth
(model-relative units). Consumers typically min-max normalize to `0…1`.
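
The min-max normalization mentioned for the `depth` output can be sketched as follows. This is a minimal illustration, not part of the package: the function name `minMaxNormalized` is hypothetical, and it assumes the `MLMultiArray` contents have already been copied into a flat `[Float]` (504 × 504 values for this model).

```swift
/// Min-max normalizes raw relative-depth values to the range 0...1.
/// Assumption: `values` is the `depth` output already copied into a flat array;
/// the copy out of the MLMultiArray is not shown here.
func minMaxNormalized(_ values: [Float]) -> [Float] {
    guard let lo = values.min(), let hi = values.max() else { return [] }
    let range = hi - lo
    // A constant map has no usable range; return zeros instead of dividing by zero.
    guard range > 0 else { return [Float](repeating: 0, count: values.count) }
    return values.map { ($0 - lo) / range }
}

let raw: [Float] = [2.0, 4.0, 6.0, 10.0]
let normalized = minMaxNormalized(raw)
print(normalized) // [0.0, 0.25, 0.5, 1.0]
```

Because the output is relative depth, the normalization is per image: the same scene point can map to a different value in a different frame.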
## Conversion notes
Converted with `coremltools` from a `torch.jit.trace` of `backbone → head → depth`. The full
Depth Anything 3 `forward()` also runs camera-pose, sky and Gaussian-splat post-processing; those are
either inert for the mono model or not traceable (the sky refinement is a data-dependent `torch.quantile`),
so only the raw relative-depth path is converted. DINOv2's bicubic positional-embedding interpolation is
substituted with **bilinear** (coremltools has no `upsample_bicubic2d`); this is a sub-pixel approximation.
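
To make the bilinear substitution concrete, here is a minimal sketch of bilinear resampling for one channel of a square grid, the kind of operation applied when a positional-embedding grid is resized to the 36×36 patch grid implied by a 504×504 input (504 / 14 = 36). The function name `bilinearResize` is hypothetical, and the half-pixel sampling convention is an assumption; this is not the converted graph's code.

```swift
/// Bilinearly resamples a row-major `srcSize x srcSize` grid to `dstSize x dstSize`.
/// Assumption: half-pixel centres, with source coordinates clamped to the grid.
func bilinearResize(_ src: [Float], srcSize: Int, dstSize: Int) -> [Float] {
    precondition(srcSize > 0 && dstSize > 0 && src.count == srcSize * srcSize)
    let scale = Float(srcSize) / Float(dstSize)
    let last = Float(srcSize - 1)
    var dst = [Float](repeating: 0, count: dstSize * dstSize)
    for y in 0..<dstSize {
        // Map the destination cell centre back into source coordinates.
        let sy = min(max((Float(y) + 0.5) * scale - 0.5, 0), last)
        let y0 = Int(sy)
        let y1 = min(y0 + 1, srcSize - 1)
        let wy = sy - Float(y0)
        for x in 0..<dstSize {
            let sx = min(max((Float(x) + 0.5) * scale - 0.5, 0), last)
            let x0 = Int(sx)
            let x1 = min(x0 + 1, srcSize - 1)
            let wx = sx - Float(x0)
            // Blend the four neighbours: first along x, then along y.
            let top = src[y0 * srcSize + x0] * (1 - wx) + src[y0 * srcSize + x1] * wx
            let bottom = src[y1 * srcSize + x0] * (1 - wx) + src[y1 * srcSize + x1] * wx
            dst[y * dstSize + x] = top * (1 - wy) + bottom * wy
        }
    }
    return dst
}

// A 3x3 horizontal ramp resampled to 2x2.
let ramp: [Float] = [0, 1, 2,
                     0, 1, 2,
                     0, 1, 2]
let resized = bilinearResize(ramp, srcSize: 3, dstSize: 2)
print(resized) // [0.25, 1.75, 0.25, 1.75]
```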
**Fidelity:** on a structured test image, the CoreML output matches the FP32 PyTorch reference with a
Pearson correlation of **0.99996** (normalized MAE 0.15%).
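
For readers who want to run a similar comparison, the two metrics could be computed roughly as below. This is a sketch under stated assumptions: both depth maps are flattened to `[Float]` of equal length, and "normalized MAE" is taken to mean the mean absolute error divided by the reference map's value range, which this card does not define. The function names are hypothetical.

```swift
/// Pearson correlation between two flattened depth maps of equal length.
/// Returns NaN if either map is constant (zero variance).
func pearson(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty)
    let n = Double(a.count)
    let meanA = a.reduce(0.0) { $0 + Double($1) } / n
    let meanB = b.reduce(0.0) { $0 + Double($1) } / n
    var cov = 0.0, varA = 0.0, varB = 0.0
    for i in a.indices {
        let da = Double(a[i]) - meanA
        let db = Double(b[i]) - meanB
        cov += da * db
        varA += da * da
        varB += db * db
    }
    return cov / (varA * varB).squareRoot()
}

/// Mean absolute error divided by the reference's value range.
/// Assumption: this is one plausible reading of "normalized MAE".
func normalizedMAE(reference: [Float], candidate: [Float]) -> Double {
    precondition(reference.count == candidate.count && !reference.isEmpty)
    let range = Double(reference.max()! - reference.min()!)
    precondition(range > 0, "reference map must not be constant")
    var total = 0.0
    for i in reference.indices {
        total += abs(Double(reference[i]) - Double(candidate[i]))
    }
    return total / Double(reference.count) / range
}

let reference: [Float] = [0, 1, 2, 4]
let candidate: [Float] = [0, 1, 2, 3]
print(pearson(reference, candidate))
print(normalizedMAE(reference: reference, candidate: candidate)) // 0.0625
```

Pearson correlation ignores any global scale and offset between the two maps, which suits relative depth; the normalized MAE does not, so compare maps in the same units.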
## Usage (Swift / CoreML)
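```swift
import CoreML
import CoreImage

// Compile the .mlpackage first (Xcode does this at build time; at runtime use MLModel.compileModel(at:)).
let model = try MLModel(contentsOf: compiledURL) // compiledURL: the compiled model, supplied by the caller

// `pixelBuffer`: a 504×504 CVPixelBuffer (32BGRA), supplied by the caller.
let input = try MLDictionaryFeatureProvider(dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)])
let output = try model.prediction(from: input)
let depth = output.featureValue(for: "depth")?.multiArrayValue // MLMultiArray, 1×504×504
```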
This model is used as the default depth model in the SBS 3D image viewer (replacing Depth Anything V2-Large),
chosen specifically because DA3MONO-LARGE is Apache-2.0 and therefore safe for commercial distribution.
## License & attribution
Apache-2.0, inherited from the upstream model. If you use this, please cite the original Depth Anything 3 work.