| --- |
| license: apache-2.0 |
| base_model: depth-anything/DA3MONO-LARGE |
| pipeline_tag: depth-estimation |
| library_name: coreml |
| tags: |
| - coreml |
| - depth-estimation |
| - monocular-depth |
| - depth-anything |
| - apple-silicon |
| - stereo |
| --- |
| |
| # DepthAnythingV3Mono-CoreML |
|
|
| A **CoreML conversion** of [`depth-anything/DA3MONO-LARGE`](https://huggingface.co/depth-anything/DA3MONO-LARGE), |
| the monocular-depth variant of Depth Anything 3 (DINOv2 ViT-L backbone + DPT head, ~0.35B params), |
| packaged for on-device inference on Apple Silicon (macOS 14+). |
|
|
| This is a derivative work of the original model, which is licensed **Apache-2.0**; this conversion is |
| released under the same license. All credit for the model itself goes to ByteDance / the Depth Anything 3 |
| authors. See the [original repo](https://github.com/bytedance-seed/depth-anything-3). |
|
|
| ## What's in here |
|
|
| - `DepthAnythingV3Mono.mlpackage`: an ML Program, **FP16** weights, minimum deployment target **macOS 14**. |
|
|
| ## Interface |
|
|
| - **Input** `image`: an RGB image, **504×504** (a multiple of the DINOv2 patch size, 14). |
| ImageNet normalization is **baked into the graph**; the CoreML `ImageType` only rescales 0–255 → 0–1, |
| so you can hand it a `CVPixelBuffer` built straight from a `CGImage` with no manual preprocessing. |
| - **Output** `depth`: a single-channel `MLMultiArray` of shape `(1, 504, 504)` holding **relative** depth |
| (model-relative units). Consumers typically min-max normalize to `0…1`. |
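
The min-max step mentioned for the `depth` output can be sketched in plain Swift. This is a minimal illustration, not part of the package: `minMaxNormalized` is a hypothetical helper name, and it assumes the `depth` values have already been copied out of the `MLMultiArray` into a flat `[Float]`.

```swift
/// Min-max normalizes raw relative-depth values to the 0...1 range.
/// Assumption: `values` is the model's `depth` output copied into a flat array.
func minMaxNormalized(_ values: [Float]) -> [Float] {
    // An empty input has no minimum or maximum; return it unchanged.
    guard let lo = values.min(), let hi = values.max() else { return [] }
    let range = hi - lo
    // A constant map has no range to stretch; return zeros rather than divide by zero.
    guard range > 0 else { return [Float](repeating: 0, count: values.count) }
    return values.map { ($0 - lo) / range }
}

let normalized = minMaxNormalized([2, 4, 6])
print(normalized) // [0.0, 0.5, 1.0]
```

Because the depth is relative, the normalized values are only comparable within a single image; the minimum and maximum are recomputed for every frame.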
|
|
| ## Conversion notes |
|
|
| Converted with `coremltools` from a `torch.jit.trace` of `backbone → head → depth`. The full |
| Depth Anything 3 `forward()` also runs camera-pose, sky and Gaussian-splat post-processing; those are |
| either inert for the mono model or not traceable (the sky refinement is a data-dependent `torch.quantile`), |
| so only the raw relative-depth path is converted. DINOv2's bicubic positional-embedding interpolation is |
| substituted with **bilinear** (coremltools has no `upsample_bicubic2d`); this is a sub-pixel approximation. |
|
|
| **Fidelity:** on a structured test image, the CoreML output matches the FP32 PyTorch reference with a |
| Pearson correlation of **0.99996** (normalized MAE 0.15%). |
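
For readers who want to run a similar comparison on their own images, the two metrics can be sketched as below. This is an illustration under assumptions, not the script used for the numbers above: the function names are hypothetical, both maps are assumed to be flattened to equal-length `[Float]` arrays, and "normalized MAE" is assumed here to mean the mean absolute error divided by the reference map's value range, which the original text does not define.

```swift
/// Pearson correlation between two flattened depth maps of equal length.
func pearsonCorrelation(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty, "inputs must be non-empty and equal in length")
    let n = Double(a.count)
    let meanA = a.reduce(0.0) { $0 + Double($1) } / n
    let meanB = b.reduce(0.0) { $0 + Double($1) } / n
    var cov = 0.0, varA = 0.0, varB = 0.0
    for (x, y) in zip(a, b) {
        let dx = Double(x) - meanA
        let dy = Double(y) - meanB
        cov += dx * dy
        varA += dx * dx
        varB += dy * dy
    }
    let denom = (varA * varB).squareRoot()
    // If either map is constant the correlation is undefined; report 0 in that case.
    return denom > 0 ? cov / denom : 0
}

/// Mean absolute error divided by the reference's value range.
/// Assumption: this is one plausible reading of "normalized MAE".
func normalizedMAE(reference: [Float], candidate: [Float]) -> Double {
    precondition(reference.count == candidate.count && !reference.isEmpty,
                 "inputs must be non-empty and equal in length")
    guard let lo = reference.min(), let hi = reference.max(), hi > lo else {
        return .nan // a flat reference has no range to normalize by
    }
    var sumAbs = 0.0
    for (r, c) in zip(reference, candidate) {
        sumAbs += abs(Double(r) - Double(c))
    }
    return sumAbs / Double(reference.count) / Double(hi - lo)
}
```

Accumulating in `Double` keeps rounding error small over the 504×504 = 254,016 values in one depth map.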
|
|
| ## Usage (Swift / CoreML) |
|
|
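| ```swift |
| import CoreML |
| import CoreImage |
| |
| // Compile the .mlpackage first. `packageURL` (assumed) is the file URL of DepthAnythingV3Mono.mlpackage. |
| let compiledURL = try MLModel.compileModel(at: packageURL) |
| let model = try MLModel(contentsOf: compiledURL) |
| |
| // `pixelBuffer` (assumed, supplied by the caller) is a 504×504 CVPixelBuffer in 32BGRA format. |
| let input = try MLDictionaryFeatureProvider(dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)]) |
| // Read `depth` as an MLMultiArray (1×504×504); the value is nil if the output is missing. |
| let depth = try model.prediction(from: input).featureValue(for: "depth")?.multiArrayValue |
| ``` |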
|
|
| This model is the default depth model in the SBS 3D image viewer (replacing Depth Anything V2-Large), |
| chosen specifically because DA3MONO-LARGE is Apache-2.0 and therefore safe for commercial distribution. |
|
|
| ## License & attribution |
|
|
| Apache-2.0, inherited from the upstream model. If you use this, please cite the original Depth Anything 3 work. |
|
|