Metric3D v2 (ViT-S) - LiteRT (on-device, fully-GPU metric depth)
Metric3D v2 (CVPR/TPAMI 2024) is a monocular metric-depth model (absolute depth,
in meters), converted here to LiteRT and running fully on the CompiledModel GPU backend (ML Drift) on
Android. Unlike relative-depth models (MiDaS, Depth Anything), Metric3D predicts depth in meters. The
DINOv2 ViT-S encoder and the RAFT-DPT decoder both run on the GPU delegate, with no CPU/ONNX fallback.
On-device (Pixel 8a, Tensor G3, verified)
| Metric | Value |
|---|---|
| nodes on GPU | 2447 / 2447 LITERT_CL (full residency) |
| compile | ~2.2 s (one-time) |
| inference | ~44 ms (model); ~335 ms full app pipeline |
| size | 78 MB (fp16) |
| accuracy | depth correlation 0.96 vs the original Metric3D (0.96–0.98 across indoor 0.7–4 m / mid 4–17 m / outdoor 11–200 m) |
image[1,3,448,448] (ImageNet-normalized) → [GPU: DINOv2 ViT-S → RAFT-DPT (4 iters)] → depth[1,1,448,448] (meters)
The model outputs depth for a canonical camera (focal length 1000 at the canonical resolution). For a
calibrated camera, multiply the output by fx / 1000 (the de-canonical transform), where fx is the camera's
focal length in pixels. With no intrinsics, the output is already in meters and qualitatively correct.
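The de-canonical transform above is a single per-pixel multiplication. The sketch below illustrates it in Kotlin. The names `deCanonicalize`, `scaleFocal` and `CANONICAL_FOCAL` are hypothetical, not part of LiteRT or the sample app. It also assumes that `fx` should be expressed at the resolution the model sees, so a focal length calibrated on the full-size image is rescaled by the resize factor first; check that assumption against the upstream Metric3D code before relying on it.

```kotlin
// Canonical focal length (pixels) stated in this model card.
const val CANONICAL_FOCAL = 1000f

/**
 * Rescales canonical-camera depth to a real camera: depth * fx / 1000.
 * Assumption: `fx` is the focal length in pixels at the model input resolution.
 */
fun deCanonicalize(canonicalDepth: FloatArray, fx: Float): FloatArray {
    require(fx > 0f) { "fx must be positive" }
    val scale = fx / CANONICAL_FOCAL
    return FloatArray(canonicalDepth.size) { i -> canonicalDepth[i] * scale }
}

/**
 * Hypothetical helper: a center crop leaves the focal length unchanged,
 * while resizing a square crop of side `cropSide` to `modelSide` scales it.
 */
fun scaleFocal(fxOriginal: Float, cropSide: Int, modelSide: Int = 448): Float =
    fxOriginal * modelSide / cropSide
```

For example, a camera calibrated with fx = 1500 px on a 1920×1080 frame would use `scaleFocal(1500f, 1080)` as the `fx` passed to `deCanonicalize`, under the assumption stated above.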
Preprocessing
Center-crop to square, resize to 448×448, then apply ImageNet normalization on the 0–255 scale:
(px - [123.675, 116.28, 103.53]) / [58.395, 57.12, 57.375], packed as NCHW planar.
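As an illustration of these steps, here is a minimal pure-Kotlin sketch that works on a packed ARGB pixel array (such as one obtained from an Android `Bitmap`). The function name `preprocess` is hypothetical, and nearest-neighbor sampling is used only to keep the example short; a real app would more likely resize with bilinear filtering, and the official sample may do this differently.

```kotlin
const val SIDE = 448
val MEAN = floatArrayOf(123.675f, 116.28f, 103.53f) // RGB means, 0-255 scale
val STD = floatArrayOf(58.395f, 57.12f, 57.375f)    // RGB std devs, 0-255 scale

/**
 * Center-crops an ARGB image to a square, resizes it to 448x448 with
 * nearest-neighbor sampling (a simplification), normalizes each channel,
 * and packs the result as planar CHW floats: all R, then all G, then all B.
 */
fun preprocess(argb: IntArray, width: Int, height: Int): FloatArray {
    require(argb.size == width * height) { "pixel buffer does not match width*height" }
    val crop = minOf(width, height)
    val x0 = (width - crop) / 2   // left edge of the centered square
    val y0 = (height - crop) / 2  // top edge of the centered square
    val plane = SIDE * SIDE
    val chw = FloatArray(3 * plane)
    for (y in 0 until SIDE) {
        val sy = y0 + y * crop / SIDE
        for (x in 0 until SIDE) {
            val sx = x0 + x * crop / SIDE
            val px = argb[sy * width + sx]
            val r = (px shr 16) and 0xFF
            val g = (px shr 8) and 0xFF
            val b = px and 0xFF
            val i = y * SIDE + x
            chw[i] = (r - MEAN[0]) / STD[0]
            chw[plane + i] = (g - MEAN[1]) / STD[1]
            chw[2 * plane + i] = (b - MEAN[2]) / STD[2]
        }
    }
    return chw
}
```

The returned array has 3 × 448 × 448 floats and matches the [1,3,448,448] input layout described above.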
Usage (Android, LiteRT CompiledModel)
val model = CompiledModel.create(modelPath, CompiledModel.Options(Accelerator.GPU), null)
val input = model.createInputBuffers()
val output = model.createOutputBuffers()
input[0].writeFloat(chw) // [1,3,448,448] ImageNet-normalized
model.run(input, output)
val depth = output[0].readFloat() // [448*448] meters
A complete Android sample (image picker + depth colormap) is in the official
google-ai-edge/litert-samples repo under
compiled_model_api/metric_depth.
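For a quick look at the flat depth array returned by `readFloat()`, it can be mapped to a grayscale image. The sketch below is a hypothetical helper, not the colormap used in the official sample. It scales log-depth between the frame's minimum and maximum so that near pixels are bright and far pixels are dark; the 1 mm clamp is an arbitrary choice to keep the logarithm finite.

```kotlin
import kotlin.math.ln

/**
 * Maps a flat depth map (meters) to opaque grayscale ARGB pixels using
 * min-max scaling of log-depth: nearest = white, farthest = black.
 */
fun depthToGray(depth: FloatArray): IntArray {
    require(depth.isNotEmpty()) { "depth map is empty" }
    // Clamp to 1 mm so non-positive values do not break the logarithm.
    val logs = FloatArray(depth.size) { i -> ln(maxOf(depth[i], 1e-3f)) }
    var lo = Float.MAX_VALUE
    var hi = -Float.MAX_VALUE
    for (v in logs) {
        if (v < lo) lo = v
        if (v > hi) hi = v
    }
    val range = if (hi > lo) hi - lo else 1f // avoid dividing by zero on flat maps
    return IntArray(depth.size) { i ->
        val t = (logs[i] - lo) / range
        val g = ((1f - t) * 255f).toInt().coerceIn(0, 255)
        (0xFF shl 24) or (g shl 16) or (g shl 8) or g
    }
}
```

On Android the resulting array could be handed to `Bitmap.createBitmap(pixels, 448, 448, Bitmap.Config.ARGB_8888)` for display.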
How it converts (litert-torch)
The input size is fixed at 448×448. Encoder: the DINOv2 ViT-S suite (fused-QKV → 4D attention, LayerScale folded into Linear, baked pos-embed). The RAFT-DPT decoder needs three fixes that only the on-device run reveals (on desktop, fp16 stays at 0.9999):
- Convex upsample → depth-to-space via ZeroStuffConvT2d: the naive "nearest-upsample + in-block mask" is exact on desktop but 0.57 on Mali (RESIZE_NEAREST differs at non-stride positions); ZeroStuffConvT2d masks only stride-aligned positions and the conv kernel supplies the offset.
- GELU → accurate tanh approximation (POW-free): x·sigmoid(1.702x) collapses far-depth to 0.51 over the 0.1–200 m log-depth bins; tanh restores 0.96.
- nn.ReLU(inplace=True) mutates the DPT ConvBlock residual (relu(x)+convs) → replicated exactly.
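To make the GELU item concrete, the two approximations it names can be written side by side. This is a scalar sketch of the standard formulas, with hypothetical function names; it is not the conversion code. The cubic term is written as `x * x * x`, which is one way to stay POW-free. It shows only the pointwise difference between the two curves and does not reproduce the on-device correlation figures.

```kotlin
import kotlin.math.PI
import kotlin.math.exp
import kotlin.math.sqrt
import kotlin.math.tanh

val SQRT_2_OVER_PI = sqrt(2.0 / PI).toFloat()

/** Tanh-based GELU approximation; the cube is expanded to avoid a POW op. */
fun geluTanh(x: Float): Float {
    val inner = SQRT_2_OVER_PI * (x + 0.044715f * x * x * x)
    return 0.5f * x * (1f + tanh(inner))
}

/** Sigmoid-based GELU approximation: x * sigmoid(1.702 * x). */
fun geluSigmoid(x: Float): Float = x / (1f + exp(-1.702f * x))
```

At x = 1 the tanh form gives about 0.8412 and the sigmoid form about 0.8458, against an exact GELU value of about 0.8413.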
Conversion scripts: in the litert-samples sample's conversion/ directory.
License
BSD-2-Clause (Metric3D); the DINOv2 backbone is Apache-2.0. Upstream: YvanYin/Metric3D.
