Instructions to use mlboydaisuke/Depth-Anything-3-Small-LiteRT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- LiteRT
How to use mlboydaisuke/Depth-Anything-3-Small-LiteRT with LiteRT:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
license: apache-2.0
library_name: litert
pipeline_tag: depth-estimation
tags:
- litert
- tflite
- depth-estimation
- monocular-depth
- on-device
- gpu
- depth-anything
base_model: depth-anything/DA3-SMALL
Depth Anything 3 (Small) — LiteRT GPU, monocular depth
On-device LiteRT / TFLite conversion of Depth Anything 3 — Small
(ByteDance-Seed, Apache-2.0) for monocular depth, running fully on the mobile GPU via the LiteRT
CompiledModel API (ML Drift delegate). No CPU fallback ops — the whole graph is GPU-compatible.
| Task | Monocular depth (single RGB → depth) |
| Backbone | DINOv2 ViT-S + RoPE, DPT/DualDPT depth head |
| Input | [1, 3, 896, 504] NCHW float32, ImageNet-normalized, native portrait aspect |
| Output | [1, 1, 896, 504] depth |
| Precision / size | FP16, 55 MB |
| Device | Pixel 8a, LiteRT GPU (Accelerator.GPU), ~1.8 s / image |
| Fidelity | Pearson corr 0.99948 vs the official PyTorch DA3-Small pipeline |
Why a fixed 896×504 (native aspect, not square)
DA3 processes images at their native aspect ratio (upper_bound_resize, longer side → 896, multiple of 14).
Forcing a square 896×896 and letterbox-padding drops the match to corr 0.977 (the black padding leaks into
the content through global attention). Converting at the native rectangle restores corr 0.9994 and is also
faster (fewer tokens). This checkpoint is built for portrait ~9:16. For another aspect, re-convert at that
shape (or your camera's fixed aspect) with the script below.
Preprocessing (must match)
resize to 504×896 (W×H) → x/255 → (x - mean) / std
mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225] # ImageNet, RGB, NCHW
GPU-clean conversion (what was patched)
Converted with litert-torch. DA3 is not GPU-clean out of the box; the following exact, GPU-clean rewrites
were applied (all numerically faithful unless noted):
- checkpoint
model.key-prefix strip (load fix) - RoPE
max_position = int(positions.max())+1→ constant (torch.export data-dependent) - fused-QKV attention → 3 separate Linears + 4D attention (avoids 5D RESHAPE; exact, 1e-6)
- LayerScale
gammafolded intoattn.proj/mlp.fc2(the LayerScale MUL otherwise mis-lays-out the token dim on the GPU delegate:fully_connected {1,1,N,C} vs {N,1,1,C}) pos_embedbicubic interpolation baked to a constant (the interpolate of a constant emitsGATHER_NDon desktop andRESIZE_BILINEARwith 0 runtime inputs on device)- ConvTranspose2d(k=s,stride=s) → zero-stuff (nearest-upsample × top-left mask) +
Conv2d(flipped weight) — exact equivalent (~1e-7), because the Pixel-8a GPU rejectsTRANSPOSE_CONVand the conv+ depth-to-space alternative needs >4D - DPT-head
custom_interpolatealign_corners=True → False(GPU bansalign_corners=Trueresize) — the only non-exact rewrite; source of the residual ~0.05 % vs the official model - head UV pos-embed-again disabled (its
make_sincosbroadcast emitsBROADCAST_TO; ratio-0.1 refinement) - camera-token insertion
x[:, :, 0] = cam_token→torch.cat(in-place index-assign →SELECT_V2)
Net result: GATHER_ND = 0, no >4D tensors, no TRANSPOSE_CONV / BROADCAST_TO / banned ops.
Fidelity note (honest)
corr 0.99948 vs the official FP32 PyTorch pipeline. FP16 is not a factor (FP32≡FP16, corr 1.0). The
residual ~0.05 % is the align_corners=True→False change in (7), which the mobile GPU forces — an irreducible
hardware constraint, not a conversion error. Structure and edge sharpness are visually identical.
Usage (Android / LiteRT CompiledModel)
val model = CompiledModel.create(context.assets, "da3_small_gpu_fp16.tflite",
CompiledModel.Options(Accelerator.GPU), null)
// input: [1,3,896,504] NCHW, ImageNet-normalized; output: [1,1,896,504] depth
License
Apache-2.0, inherited from the upstream Depth Anything 3.