	---
	license: apache-2.0
	tags:
	- depth-estimation
	- monocular-depth
	- core-ai
	- coreai
	- apple
	- on-device
	- depth-anything
	pipeline_tag: depth-estimation
	base_model:
	- depth-anything/DA3-SMALL
	- depth-anything/DA3-BASE
	library_name: coreai
	---

	> Mirror of [`mlboydaisuke/Depth-Anything-3-CoreAI`](https://huggingface.co/mlboydaisuke/Depth-Anything-3-CoreAI) — the canonical repo ([CoreAI Model Zoo](https://github.com/john-rocky/coreai-model-zoo)). Updates land there first.


	# Depth Anything 3 — Core AI

	The [coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)'s first depth model.
	Monocular (single-image) relative depth estimation running fully on-device on Apple's Core AI
	runtime, as a single static `.aimodel`. A conversion of ByteDance's
	[Depth Anything 3](https://github.com/ByteDance-Seed/depth-anything-3)
	([`depth-anything/DA3-SMALL`](https://huggingface.co/depth-anything/DA3-SMALL) /
	[`DA3-BASE`](https://huggingface.co/depth-anything/DA3-BASE), Apache-2.0): a DINOv2 ViT backbone +
	DPT-style head. Drop in an RGB image, get a depth map (and a confidence map). No NMS, no sampling:
	host post-processing is just a resize and a colormap.

	<!-- gen-cards:use-it begin id=depth-anything-3-small (managed by scripts/gen-cards — edit cards.json / QuickStart.swift, not this block) -->
	## Use it

	▶️ Run it (source) — the [DepthCamera runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/DepthCamera)
	(live camera depth, one app for every depth model in the catalog):

	```bash
	git clone https://github.com/john-rocky/coreai-kit
	open coreai-kit/Examples/DepthCamera/DepthCamera.xcodeproj
	# → Run, then pick "Depth Anything 3 Small" in the model picker

	# agents / headless (macOS):
	cd coreai-kit/Examples/DepthCamera
	swift run depth-cli --model depth-anything-3-small --image sample.jpg --output depth.png
	```

	💻 Build with it: a complete snippet. The glue code is kit API, so it runs as pasted:

	```swift
	import CoreAIKitVision
	import Foundation

	// Example path (an assumption): point this at any image file on disk.
	let imageURL = URL(fileURLWithPath: "sample.jpg")

	let estimator = try await DepthEstimator(catalog: "depth-anything-3-small")
	let image = try ImageFile.load(imageURL) // any image file → CGImage + EXIF orientation
	let depth = try await estimator.estimateDepth(for: image.cgImage)
	// depth: DepthMap; .cgImage() renders it, .values are the raw floats
	```

	The same code lives in [`Examples/DepthCamera/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/DepthCamera/Sources/QuickStart.swift)
	as one typed function with no UI. The CLI is a thin argument wrapper over it, and
	the GUI runs the same estimator on every camera frame (`CameraFeed`, ~10 lines).
	For live camera input, `CameraFeed` (kit API) streams frames; feed each one to
	`estimateDepth(for:)`. The camera permission prompt is left to your app.

	Integration checklist

	- SPM: `https://github.com/john-rocky/coreai-kit` → product CoreAIKitVision
	- Info.plist: `NSCameraUsageDescription` — only for the live camera; the snippet needs none
	- Entitlements: none needed
	- First run downloads the model — 0.1 GB (Mac) / 0.1 GB (iPhone) — then it loads from the
	local cache (Application Support; progress via the `downloadProgress` callback)
	- Measure in Release: Debug is ~3× slower on per-frame host work
	<!-- gen-cards:use-it end -->

	## Bundles

	| path | variant | params | dtype | size | M4 Max GPU throughput |
	|---|---|---|---|---|---|
	| `small/da3-small_float16.aimodel` | ViT-S | 34.3M | fp16 | 54 MB | 65.7 FPS |
	| `small/da3-small_float32.aimodel` | ViT-S | 34.3M | fp32 | 105 MB | 56.5 FPS |
	| `base/da3-base_float16.aimodel` | ViT-B | 135.4M | fp16 | 202 MB | 26.5 FPS |
	| `base/da3-base_float32.aimodel` | ViT-B | 135.4M | fp32 | 402 MB | 23.0 FPS |

	`small · fp16` is the recommended on-device bundle: 54 MB and 65.7 FPS at 504 × 504 on an M4 Max GPU,
	comfortably real-time on iPhone-class GPUs. Each `.aimodel` is a directory bundle (`main.mlirb` + `metadata.json`).
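
To relate the throughput column to a camera frame budget, the FPS figures can be converted to per-frame latency. The sketch below only redoes arithmetic on the table's M4 Max numbers; the 30 FPS camera rate and the helper names `frame_ms` and `fits_budget` are assumptions for illustration, not part of the zoo or the kit.

```python
# FPS figures copied from the Bundles table (M4 Max GPU, 504 x 504 input).
FPS = {
    "small/fp16": 65.7,
    "small/fp32": 56.5,
    "base/fp16": 26.5,
    "base/fp32": 23.0,
}


def frame_ms(fps: float) -> float:
    """Mean time per frame in milliseconds for a given throughput."""
    return 1000.0 / fps


def fits_budget(fps: float, camera_fps: float = 30.0) -> bool:
    """True if one inference fits inside one camera frame interval.

    The 30 FPS default is an assumed camera rate, not a figure from the table.
    """
    return frame_ms(fps) <= 1000.0 / camera_fps


for name, fps in FPS.items():
    print(f"{name}: {frame_ms(fps):.1f} ms/frame, fits 30 FPS budget: {fits_budget(fps)}")
```

On these numbers the small bundles take roughly 15 to 18 ms per frame and fit a 33.3 ms budget, while the base bundles take roughly 38 to 43 ms and do not. Other devices will differ, so treat this as a way to read the table rather than a prediction.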

	## I/O contract

	```
	input : image      [1, 3, 504, 504]  RGB, raw [0, 1] (ImageNet normalization is folded into the graph)
	output: depth      [1, 504, 504]     relative depth (exp-activated; larger = nearer)
	        depth_conf [1, 504, 504]     confidence
	```
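
One way to guard the input side of this contract on the host is a small packing helper. This is a sketch, assuming the image has already been resized to 504 × 504 and is held as an HWC `uint8` array; `to_model_input` is a hypothetical name, not a kit or runtime API.

```python
import numpy as np

SIDE = 504  # fixed input resolution from the I/O contract


def to_model_input(rgb_u8: np.ndarray, dtype=np.float16) -> np.ndarray:
    """Pack an HWC uint8 RGB image (504 x 504) into a [1, 3, 504, 504] array in raw [0, 1]."""
    if rgb_u8.dtype != np.uint8 or rgb_u8.shape != (SIDE, SIDE, 3):
        raise ValueError(
            f"expected uint8 array of shape ({SIDE}, {SIDE}, 3), "
            f"got {rgb_u8.dtype} {rgb_u8.shape}"
        )
    # Scale to [0, 1] only. No mean/std step here: per the contract,
    # ImageNet normalization is folded into the graph.
    x = rgb_u8.astype(dtype) / 255.0
    # HWC -> CHW, then add the batch axis; make the result contiguous in memory.
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None])
```

Pass `dtype=np.float32` for the fp32 bundles; the fp16 default matches the usage snippet later in this card.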

	Host: resize the RGB image to 504 × 504 (e.g. cv2 `INTER_AREA`), feed raw [0, 1], run, then resize
	the depth map back to the original H × W. For display, the DA3 convention is inverse-depth →
	percentile 2–98 normalize → `Spectral` colormap.
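
The display steps named above (inverse-depth, then percentile 2–98 normalization) can be sketched in NumPy as follows. This is an approximation of those steps, not DA3's reference visualizer: the `eps` guard against division by zero, the handling of a constant map, and the name `normalize_for_display` are assumptions. The sketch stops at a `[0, 1]` array and leaves the colormap to a plotting library.

```python
import numpy as np


def normalize_for_display(depth: np.ndarray, lo_pct: float = 2.0,
                          hi_pct: float = 98.0, eps: float = 1e-6) -> np.ndarray:
    """Invert a depth map and rescale it to [0, 1] between two percentiles."""
    # Inverse depth; clamp first so zeros or tiny values do not blow up (assumed guard).
    inv = 1.0 / np.maximum(depth.astype(np.float32), eps)
    lo, hi = np.percentile(inv, [lo_pct, hi_pct])
    if hi - lo < eps:
        # Constant (or near-constant) map: nothing to stretch, return all zeros.
        return np.zeros_like(inv)
    # Values outside the percentile window are clipped to the ends of the range.
    return np.clip((inv - lo) / (hi - lo), 0.0, 1.0)
```

The result can then be passed through a colormap, for example matplotlib's `Spectral`, to produce the final image.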

	## Fidelity

	- Numerically faithful conversion: the Core AI engine matches the PyTorch reference at cosine similarity 1.000000
	(per-pixel error ≤ ~1e-5 for fp32 and ≤ ~1e-2 for fp16) on both CPU and GPU, at any fixed input shape.
	- Against the official DA3 viewer: mean Pearson r ≈ 0.98 across diverse aspect ratios (r = 1.000 on
	square inputs). That gap is within DA3's own resolution sensitivity: its 504-vs-518 outputs differ by
	r ≈ 0.975–0.984.
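
The three metrics quoted above (cosine similarity, per-pixel error, Pearson r) can be computed between any two depth maps of the same shape with a few lines of NumPy. The card does not say exactly how its numbers were produced, so this is a sketch under assumptions: per-pixel error is taken as the maximum absolute difference, and `fidelity` is a hypothetical helper name.

```python
import numpy as np


def fidelity(reference: np.ndarray, candidate: np.ndarray) -> dict:
    """Compare two depth maps: cosine similarity, Pearson r, max absolute error."""
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    # Flatten and accumulate in float64 so fp16 inputs do not lose precision here.
    a = reference.astype(np.float64).ravel()
    b = candidate.astype(np.float64).ravel()
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pearson = float(np.corrcoef(a, b)[0, 1])
    max_abs = float(np.max(np.abs(a - b)))
    return {"cos": cos, "pearson": pearson, "max_abs_err": max_abs}
```

Pearson r is invariant to an affine rescaling of either map, while cosine similarity and the absolute error are not, which is one reason to report them separately.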

	## Usage (CoreAIKit / coreai.runtime)

	```python
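	import asyncio

	import coreai.runtime as rt
	import numpy as np
	from PIL import Image

	async def main():
	    # Load the bundle, specialized for the GPU.
	    options = rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu())
	    m = await rt.AIModel.load("small/da3-small_float16.aimodel", options)
	    fn = m.load_function("main")

	    # Resize to the fixed 504 x 504 input, then scale to raw [0, 1] in NCHW layout.
	    img = np.asarray(Image.open("photo.jpg").convert("RGB").resize((504, 504)))
	    x = (img.astype(np.float16) / 255.0).transpose(2, 0, 1)[None]
	    out = await fn({"image": rt.NDArray(x)})
	    return out["depth"].numpy().reshape(504, 504)

	# A bare top-level `await` only works in a notebook or async REPL, so run via asyncio.
	depth = asyncio.run(main())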
	```
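
Per the I/O contract, the output of the call above also carries `depth_conf`. One possible use, sketched here, is to blank out the least confident pixels before further processing. The card does not document the confidence scale or a recommended threshold, so the quantile-based cutoff, the `keep_fraction` parameter, and the name `mask_low_confidence` are all assumptions.

```python
import numpy as np


def mask_low_confidence(depth: np.ndarray, conf: np.ndarray,
                        keep_fraction: float = 0.9) -> np.ndarray:
    """Return a float32 copy of `depth` with the least confident pixels set to NaN."""
    if depth.shape != conf.shape:
        raise ValueError(f"shape mismatch: {depth.shape} vs {conf.shape}")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Relative cutoff: keep roughly the top `keep_fraction` of pixels by confidence.
    threshold = np.quantile(conf, 1.0 - keep_fraction)
    masked = depth.astype(np.float32, copy=True)
    masked[conf < threshold] = np.nan
    return masked
```

A relative cutoff avoids committing to absolute confidence values, at the cost of always discarding some pixels even in easy scenes.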

	## Links

	- Conversion script + model card: [coreai-model-zoo `zoo/depth-anything-3.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/depth-anything-3.md)
	- Source: [Depth Anything 3](https://github.com/ByteDance-Seed/depth-anything-3) · Apache-2.0

	---

	*On-device ML / Core ML / Core AI model porting — get in touch: open an issue on the
	[zoo](https://github.com/john-rocky/coreai-model-zoo).*