sdkv2/DepthAnythingV3Mono-CoreML

	---
	license: apache-2.0
	base_model: depth-anything/DA3MONO-LARGE
	pipeline_tag: depth-estimation
	library_name: coreml
	tags:
	- coreml
	- depth-estimation
	- monocular-depth
	- depth-anything
	- apple-silicon
	- stereo
	---

	# DepthAnythingV3Mono-CoreML

	A CoreML conversion of [`depth-anything/DA3MONO-LARGE`](https://huggingface.co/depth-anything/DA3MONO-LARGE)
	— the monocular-depth variant of Depth Anything 3 (DINOv2 ViT-L backbone + DPT head, ~0.35B params) —
	packaged for on-device inference on Apple Silicon (macOS 14+).

	This is a derivative work of the original model, which is licensed under Apache-2.0; this conversion is
	released under the same license. All credit for the model itself goes to ByteDance and the Depth Anything 3
	authors. See the [original repo](https://github.com/bytedance-seed/depth-anything-3).

	## What's in here

	- `DepthAnythingV3Mono.mlpackage` — an ML Program, FP16 weights, minimum deployment target macOS 14.

	## Interface

	- Input `image`: an RGB image, 504×504 (a multiple of the DINOv2 patch size, 14).
	ImageNet normalization is baked into the graph; the CoreML `ImageType` only rescales 0–255 → 0–1,
	so you can hand it a `CVPixelBuffer` built straight from a `CGImage` with no manual preprocessing.
	- Output `depth`: a single-channel `MLMultiArray` of shape `(1, 504, 504)` holding relative depth
	in model-specific units. Consumers typically min-max normalize it to `0…1`.
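
	The min-max step mentioned above can be sketched in plain Swift as follows. This is a minimal sketch,
	assuming the `depth` values have already been copied out of the `MLMultiArray` into a `[Float]`; the
	function name `minMaxNormalized` and the sample values are illustrative and not part of this package.

	```swift
	import Foundation

	/// Min-max normalizes relative depth values to 0…1.
	/// Returns all zeros when the input is constant, and an empty array for empty input.
	func minMaxNormalized(_ depth: [Float]) -> [Float] {
	    guard let lo = depth.min(), let hi = depth.max() else { return [] }
	    let range = hi - lo
	    guard range > 0 else { return [Float](repeating: 0, count: depth.count) }
	    return depth.map { ($0 - lo) / range }
	}

	// Hypothetical raw values standing in for the flattened 504×504 `depth` output.
	let raw: [Float] = [2.0, 4.0, 6.0, 10.0]
	print(minMaxNormalized(raw)) // [0.0, 0.25, 0.5, 1.0]
	```

	Guarding against a zero range avoids a division by zero on a perfectly flat depth map.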

	## Conversion notes

	Converted with `coremltools` from a `torch.jit.trace` of `backbone → head → depth`. The full
	Depth Anything 3 `forward()` also runs camera-pose, sky and Gaussian-splat post-processing; those are
	either inert for the mono model or not traceable (the sky refinement is a data-dependent `torch.quantile`),
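	so only the raw relative-depth path is converted. DINOv2's bicubic positional-embedding interpolation is
	replaced with bilinear because coremltools has no converter for `upsample_bicubic2d`. Bilinear blends the
	nearest 2×2 grid cells instead of bicubic's 4×4, so the resized embeddings differ only at the sub-pixel level.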
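
	To illustrate the bilinear substitution, here is a minimal sketch of bilinear resampling on a
	single-channel square grid. It is not the converted graph itself: the function name, the half-pixel
	sampling convention, and the tiny 2×2 example are assumptions made for illustration. For the 504×504
	input the target grid would be 36×36 (504 / 14); the source grid size is not stated in this card, so it
	is left as a parameter.

	```swift
	import Foundation

	/// Bilinearly resamples a square, row-major `src × src` grid to `dst × dst`,
	/// using half-pixel-centre sampling with edge clamping.
	func bilinearResize(_ grid: [Float], from src: Int, to dst: Int) -> [Float] {
	    precondition(src > 0 && dst > 0 && grid.count == src * src, "grid must be src × src")
	    let scale = Float(src) / Float(dst)
	    var out = [Float](repeating: 0, count: dst * dst)
	    for y in 0..<dst {
	        // Map the output cell centre back into source coordinates.
	        let fy = max((Float(y) + 0.5) * scale - 0.5, 0)
	        let y0 = min(Int(fy), src - 1)
	        let y1 = min(y0 + 1, src - 1)
	        let wy = fy - Float(y0)
	        for x in 0..<dst {
	            let fx = max((Float(x) + 0.5) * scale - 0.5, 0)
	            let x0 = min(Int(fx), src - 1)
	            let x1 = min(x0 + 1, src - 1)
	            let wx = fx - Float(x0)
	            // Blend the 2×2 neighbourhood: first along x, then along y.
	            let top = grid[y0 * src + x0] * (1 - wx) + grid[y0 * src + x1] * wx
	            let bottom = grid[y1 * src + x0] * (1 - wx) + grid[y1 * src + x1] * wx
	            out[y * dst + x] = top * (1 - wy) + bottom * wy
	        }
	    }
	    return out
	}

	// Hypothetical 2×2 grid (one embedding channel) upsampled to 4×4.
	let resized = bilinearResize([0, 1, 0, 1], from: 2, to: 4)
	print(Array(resized[0..<4])) // first row
	```

	A real positional embedding has many channels; the same resampling would be applied to each channel
	independently.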

	Fidelity: on a structured test image, the CoreML output matches the FP32 PyTorch reference with a
	Pearson correlation of 0.99996 (normalized MAE 0.15%).
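
	A comparison of this kind can be sketched as below. This is an illustrative sketch, not the script
	used to produce the numbers above: the function names are hypothetical, and "normalized MAE" is assumed
	here to mean MAE divided by the reference's value range, which the card does not specify.

	```swift
	import Foundation

	/// Pearson correlation coefficient between two equally sized arrays.
	/// Returns nil for mismatched or empty input, or when either array has zero variance.
	func pearson(_ a: [Double], _ b: [Double]) -> Double? {
	    guard a.count == b.count, !a.isEmpty else { return nil }
	    let n = Double(a.count)
	    let meanA = a.reduce(0, +) / n
	    let meanB = b.reduce(0, +) / n
	    var cov = 0.0, varA = 0.0, varB = 0.0
	    for (x, y) in zip(a, b) {
	        let dx = x - meanA, dy = y - meanB
	        cov += dx * dy
	        varA += dx * dx
	        varB += dy * dy
	    }
	    guard varA > 0, varB > 0 else { return nil }
	    return cov / (varA * varB).squareRoot()
	}

	/// Mean absolute error divided by the reference's value range
	/// (an assumed definition of "normalized MAE").
	func normalizedMAE(reference: [Double], candidate: [Double]) -> Double? {
	    guard reference.count == candidate.count,
	          let lo = reference.min(), let hi = reference.max(), hi > lo else { return nil }
	    let mae = zip(reference, candidate).map { abs($0 - $1) }.reduce(0, +) / Double(reference.count)
	    return mae / (hi - lo)
	}

	// Hypothetical flattened depth maps: FP32 PyTorch reference vs. CoreML output.
	let reference: [Double] = [0, 1, 2, 3, 4]
	let candidate: [Double] = [0, 1, 2, 3, 5]
	if let r = pearson(reference, candidate),
	   let e = normalizedMAE(reference: reference, candidate: candidate) {
	    print("Pearson r = \(r), normalized MAE = \(e * 100)%")
	}
	```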

	## Usage (Swift / CoreML)

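	```swift
	import CoreML
	import CoreImage

	let model = try MLModel(contentsOf: compiledURL) // compile the .mlpackage first
	// `pixelBuffer` is a 504×504 CVPixelBuffer (32BGRA) supplied by the caller.
	let input = try MLDictionaryFeatureProvider(dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)])
	// `depth` comes back as an MLMultiArray of shape 1×504×504.
	let depth = try model.prediction(from: input).featureValue(for: "depth")?.multiArrayValue
	```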

	This model is the default depth model in the SBS 3D image viewer, where it replaces Depth Anything V2-Large.
	It was chosen because DA3MONO-LARGE is licensed under Apache-2.0, which permits commercial distribution.

	## License & attribution

	Apache-2.0, inherited from the upstream model. If you use this, please cite the original Depth Anything 3 work.