---
license: apache-2.0
library_name: litert
pipeline_tag: depth-estimation
tags:
- litert
- tflite
- depth-estimation
- monocular-depth
- on-device
- gpu
- depth-anything
base_model: depth-anything/DA3-SMALL
---
# Depth Anything 3 (Small): LiteRT GPU, monocular depth
On-device **LiteRT / TFLite** conversion of [**Depth Anything 3 (Small)**](https://huggingface.co/depth-anything/DA3-SMALL)
(ByteDance-Seed, Apache-2.0) for **monocular depth**, running fully on the mobile **GPU** via the LiteRT
`CompiledModel` API (ML Drift delegate). There are no CPU fallback ops: the whole graph is GPU-compatible.
| | |
|---|---|
| Task | Monocular depth (single RGB → depth) |
| Backbone | DINOv2 ViT-S + RoPE, DPT/DualDPT depth head |
| Input | `[1, 3, 896, 504]` NCHW float32, ImageNet-normalized, **native portrait aspect** |
| Output | `[1, 1, 896, 504]` depth |
| Precision / size | FP16, **55 MB** |
| Device | Pixel 8a, LiteRT GPU (`Accelerator.GPU`), **~0.9 s / image** (FP16, CompiledModel.Run) |
| Fidelity | **corr 0.99948** vs official PyTorch; on-device **GPU-vs-CPU cos 0.99993** (re-verified, see below) |
## Why a fixed 896×504 (native aspect, not square)
DA3 processes images at their **native aspect ratio** (`upper_bound_resize`, longer side → 896, multiple of 14).
Forcing a square `896×896` and letterbox-padding drops the match to corr **0.977** (the black padding leaks into
the content through global attention). Converting at the native rectangle restores **corr 0.9994** and is also
faster (fewer tokens). This checkpoint is built for **portrait ~9:16**. For another aspect, re-convert at that
shape (or your camera's fixed aspect) with the conversion script.
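
To make the shape choice concrete, here is a minimal sketch of how a native-aspect target size and its patch-token count could be derived. It is not the upstream `upper_bound_resize` code: the helper names `target_size` and `token_count` are hypothetical, and rounding each side to the nearest multiple of 14 is an assumption.

```python
PATCH = 14       # patch size implied by "multiple of 14" above
LONG_SIDE = 896  # the longer image side is scaled to this value


def target_size(width, height, long_side=LONG_SIDE, patch=PATCH):
    """Scale so the longer side hits `long_side`, then snap both sides to the patch grid.

    Assumption: nearest-multiple rounding; the upstream resize may round differently.
    """
    scale = long_side / max(width, height)

    def snap(value):
        return max(patch, round(value * scale / patch) * patch)

    return snap(width), snap(height)


def token_count(width, height, patch=PATCH):
    """Number of patch tokens for an input of this size (extra tokens not counted)."""
    return (width // patch) * (height // patch)


portrait = target_size(1080, 1920)  # a 9:16 camera frame -> (504, 896)
print(portrait, token_count(*portrait))       # 36 x 64 patches
print((896, 896), token_count(896, 896))      # 64 x 64 patches for the square variant
```

Under these assumptions the portrait rectangle carries 2304 patch tokens against 4096 for the padded square, which is where the speed advantage comes from.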
## Preprocessing (must match)
```
resize to 504×896 (W×H) → x/255 → (x - mean) / std
mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225] # ImageNet, RGB, NCHW
```
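
The normalization and layout steps above can be sketched in plain Python as follows. This assumes the image has already been resized to 504×896 and is given as nested `H x W x 3` lists of 0..255 RGB values; the function name `normalize_to_chw` is hypothetical, and a real pipeline would use an array library instead of Python lists.

```python
MEAN = (0.485, 0.456, 0.406)  # ImageNet mean, RGB order
STD = (0.229, 0.224, 0.225)   # ImageNet std, RGB order


def normalize_to_chw(pixels_hwc):
    """Convert H x W x 3 pixels (0..255) to a flat, channel-first float list.

    Output order is all R values, then all G, then all B (the CHW part of NCHW),
    each computed as (x / 255 - mean) / std.
    """
    height = len(pixels_hwc)
    width = len(pixels_hwc[0])
    out = []
    for c in range(3):
        for y in range(height):
            for x in range(width):
                out.append((pixels_hwc[y][x][c] / 255.0 - MEAN[c]) / STD[c])
    return out


# Tiny 1 x 2 image: one pure-red pixel and one black pixel.
tiny = [[[255, 0, 0], [0, 0, 0]]]
flat = normalize_to_chw(tiny)
print(len(flat))  # 3 channels * 1 row * 2 columns = 6 values
```

Writing the result in channel-first order matters here because the model input is NCHW, while most camera and bitmap APIs deliver interleaved (HWC) pixels.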
## GPU-clean conversion (what was patched)
Converted with `litert-torch`. DA3 is not GPU-clean out of the box; the following exact, GPU-clean rewrites
were applied (all numerically faithful unless noted):
1. checkpoint `model.` key-prefix strip (load fix)
2. RoPE `max_position = int(positions.max())+1` → constant (torch.export data-dependent)
3. fused-QKV attention → 3 separate Linears + 4D attention (avoids 5D RESHAPE; exact, 1e-6)
4. **LayerScale** `gamma` folded into `attn.proj` / `mlp.fc2` (the LayerScale MUL otherwise mis-lays-out the
token dim on the GPU delegate: `fully_connected {1,1,N,C} vs {N,1,1,C}`)
5. `pos_embed` bicubic interpolation **baked** to a constant (the interpolate of a constant emits `GATHER_ND`
on desktop and `RESIZE_BILINEAR` with 0 runtime inputs on device)
6. **ConvTranspose2d(k=s,stride=s)** → zero-stuff (nearest-upsample × top-left mask) + `Conv2d` (flipped
weight), an exact equivalent (~1e-7), because the Pixel-8a GPU rejects `TRANSPOSE_CONV` and the conv +
depth-to-space alternative needs >4D
7. DPT-head `custom_interpolate` `align_corners=True → False` (GPU bans `align_corners=True` resize): **the
only non-exact rewrite**; source of the residual ~0.05 % vs the official model
8. head UV pos-embed-again disabled (its `make_sincos` broadcast emits `BROADCAST_TO`; ratio-0.1 refinement)
9. camera-token insertion `x[:, :, 0] = cam_token` → `torch.cat` (in-place index-assign → `SELECT_V2`)
Net result: `GATHER_ND = 0`, no `>4D` tensors, no `TRANSPOSE_CONV` / `BROADCAST_TO` / banned ops.
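
Rewrite (6) is the least obvious of the list, so here is a 1D, single-channel illustration of why it is exact. It is a sketch of the idea only, not the conversion code: both function names are hypothetical, the real rewrite operates on 2D multi-channel tensors, and placing the `k-1` zeros of padding on the left is an assumption about how the equivalent convolution is aligned.

```python
def conv_transpose_1d(x, w, stride):
    """Reference transposed conv with kernel == stride: y[i*s + j] = x[i] * w[j]."""
    assert len(w) == stride
    y = [0.0] * (len(x) * stride)
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            y[i * stride + j] += xi * wj
    return y


def zero_stuff_then_conv_1d(x, w, stride):
    """Same result using only upsample, multiply and an ordinary convolution."""
    k = len(w)
    # Nearest-upsample, then mask so only the first sample of each group survives.
    upsampled = [xi for xi in x for _ in range(stride)]
    mask = [1.0 if n % stride == 0 else 0.0 for n in range(len(upsampled))]
    stuffed = [u * m for u, m in zip(upsampled, mask)]
    # Ordinary (cross-correlation) conv with the flipped kernel and k-1 left zeros.
    flipped = w[::-1]
    padded = [0.0] * (k - 1) + stuffed
    return [
        sum(padded[n + m] * flipped[m] for m in range(k))
        for n in range(len(stuffed))
    ]


x = [1.0, 2.0, -3.0]
w = [0.5, -1.0]
print(conv_transpose_1d(x, w, stride=2))
print(zero_stuff_then_conv_1d(x, w, stride=2))
```

Because kernel and stride are equal, output windows never overlap, so each output sample receives exactly one non-zero product in both formulations.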
## Fidelity note (honest)
corr **0.99948** vs the official FP32 PyTorch pipeline. FP16 is **not** a factor (FP32≡FP16, corr 1.0). The
residual ~0.05 % is the `align_corners=True→False` change in (7), which the mobile GPU forces: an irreducible
hardware constraint, not a conversion error. Structure and edge sharpness are visually identical.
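
To illustrate why the `align_corners` change cannot be made exact, the sketch below resizes a 1D signal with both coordinate conventions. It assumes linear sampling and uses the standard source-coordinate formulas for the two modes; `resize_linear_1d` is a hypothetical helper, not the DPT head's `custom_interpolate`.

```python
def resize_linear_1d(values, out_len, align_corners):
    """Linearly resample `values` to `out_len` samples under either convention."""
    in_len = len(values)
    out = []
    for d in range(out_len):
        if align_corners:
            # End points of input and output grids coincide.
            src = d * (in_len - 1) / (out_len - 1) if out_len > 1 else 0.0
        else:
            # Pixel centres are aligned; negative coordinates clamp to 0.
            src = max(0.0, (d + 0.5) * in_len / out_len - 0.5)
        lo = min(int(src), in_len - 1)
        hi = min(lo + 1, in_len - 1)
        frac = src - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out


signal = [0.0, 10.0]
print(resize_linear_1d(signal, 4, align_corners=True))
print(resize_linear_1d(signal, 4, align_corners=False))
```

The two modes sample the input at different positions, so interior values differ even though the end points agree in this example; no weight rewrite can absorb that shift.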
## On-device GPU verification (re-confirmed)
Re-verified on a Pixel 8a with the official LiteRT C++ runtime + ML Drift accelerator: the model compiles
to **`Replacing 1460 out of 1460 node(s) with delegate (LITERT_CL)`** (full residency, single partition,
no XNNPACK CPU fallback), and the **on-device GPU output matches the CPU/XNNPACK reference at cos 0.99993 /
Pearson 0.99975** for the same input. That is, the GPU result is numerically faithful, not merely "resident"
(GPU full residency does not by itself guarantee a correct result).
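
For readers who want to repeat this kind of comparison, here is a minimal sketch of the two metrics quoted above, applied to flattened depth maps. The function names are hypothetical and the exact evaluation script used for the numbers in this card is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson(a, b):
    """Pearson correlation: cosine similarity after removing each vector's mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


gpu_depth = [1.0, 2.0, 3.0, 4.0]
cpu_depth = [1.1, 2.0, 2.9, 4.2]
print(cosine_similarity(gpu_depth, cpu_depth), pearson(gpu_depth, cpu_depth))
```

Cosine similarity is sensitive to a constant offset between the two outputs while Pearson is not, which is why it is useful to report both.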
## Usage (Android / LiteRT CompiledModel)
```kotlin
val model = CompiledModel.create(context.assets, "da3_small_gpu_fp16.tflite",
    CompiledModel.Options(Accelerator.GPU), null)
// input: [1,3,896,504] NCHW, ImageNet-normalized; output: [1,1,896,504] depth
```
## Training data & PII
Depth Anything 3 was trained by ByteDance-Seed on a large-scale collection of monocular-depth data: a mix
of **synthetic depth datasets and real images with pseudo-labelled depth** (the Depth Anything line scales
to tens of millions of images). No new training was performed for this conversion; it is a weights-faithful
(corr ≈ 1.0) format change of the public `depth-anything/DA3-SMALL` checkpoint. Because the source data
includes real-world indoor/outdoor scenes, it may incidentally contain people, faces, vehicles, signage and
other PII; no PII was deliberately collected and this conversion adds none. Apply your own content/PII
filtering as appropriate. See the original [Depth Anything 3](https://github.com/ByteDance-Seed/depth-anything-3)
release and [paper](https://arxiv.org/abs/2511.10647) for full dataset details.
## License
Apache-2.0, inherited from the upstream [Depth Anything 3](https://github.com/ByteDance-Seed/depth-anything-3).
This is a format conversion; all credit to the original authors (ByteDance-Seed).