MiDaS small — LiteRT (fp16, NHWC, GPU-clean)

midas_small_256_fp16.tflite is MiDaS v2.1 small (MiDaS_small, the CNN variant with an EfficientNet-Lite3 backbone, not the DPT/ViT variants) converted to LiteRT for on-device monocular depth estimation. Given one RGB image, it predicts a per-pixel inverse-depth map (near = bright, far = dark).

It is the model used by the LiteRT compiled_model_api/depth_estimation Android sample.

Files

| File | Precision | Size |
|---|---|---|
| `midas_small_256_fp16.tflite` | fp16 weights | ~33 MB |

Specs


| | |
|---|---|
| Task | Monocular depth estimation |
| Source | `torch.hub.load("intel-isl/MiDaS", "MiDaS_small")` |
| Input | `1 x 256 x 256 x 3` float32, RGB, ImageNet-normalized, NHWC (interleaved) |
| Output | `1 x 256 x 256` float32, relative inverse depth |

Pre-processing: resize to 256×256, normalize with ImageNet stats (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225] on [0,1] pixels), write as interleaved NHWC RGB float32.
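A minimal sketch of the pre-processing above, using NumPy and Pillow. The helper names `normalize` and `load_input` are hypothetical, and bilinear resampling is an assumption, since the card does not say which interpolation the sample app uses.

```python
import numpy as np

# ImageNet statistics quoted in the card, applied to pixels scaled to [0, 1].
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)
SIZE = 256  # model input height and width


def normalize(rgb_u8: np.ndarray) -> np.ndarray:
    """Turn an HxWx3 uint8 RGB array into a 1xHxWx3 float32 NHWC tensor."""
    x = rgb_u8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD        # broadcasts over the trailing channel axis
    return x[np.newaxis, ...]   # add the batch dimension


def load_input(path: str) -> np.ndarray:
    """Load an image file, resize it to 256x256 and normalize it."""
    # Imported here so normalize() can be used without Pillow installed.
    from PIL import Image

    # Assumption: bilinear resampling; the sample app may resize differently.
    img = Image.open(path).convert("RGB").resize((SIZE, SIZE), Image.BILINEAR)
    return normalize(np.asarray(img))
```

Because the array is already channel-last, it can be copied into the model's input buffer without a transpose.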

Post-processing: min-max normalize the output and map through a color LUT (the sample uses inferno).
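The post-processing step can be sketched roughly as follows. The helper names are hypothetical, the guard for a constant map is an added assumption, and matplotlib's inferno colormap stands in for whatever LUT the Android sample ships.

```python
import numpy as np


def minmax_normalize(depth, eps: float = 1e-6) -> np.ndarray:
    """Rescale a relative inverse-depth map to [0, 1] (near = 1, far = 0)."""
    d = np.asarray(depth, dtype=np.float32).squeeze()  # 1xHxW -> HxW
    lo, hi = float(d.min()), float(d.max())
    if hi - lo < eps:
        # Constant map: avoid dividing by zero and return all zeros.
        return np.zeros_like(d)
    return (d - lo) / (hi - lo)


def colorize(depth) -> np.ndarray:
    """Map a depth output to an HxWx3 uint8 image via the inferno colormap."""
    # Imported here so minmax_normalize() works without matplotlib installed.
    from matplotlib import colormaps

    rgba = colormaps["inferno"](minmax_normalize(depth))  # floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)         # drop alpha
```

Since the output is relative inverse depth, the normalization is per image: colors are comparable within one frame but not across frames.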

Why this conversion

The graph lowers entirely to GPU-clean builtins — no attention, no Flex/Custom ops, no GATHER_ND, no >4D reshapes:

CONV_2D x73, ADD x27, DEPTHWISE_CONV_2D x24, RELU x7, RESIZE_BILINEAR x5, RESHAPE x1

Channel-last I/O (to_channel_last_io): the model takes NHWC 1x256x256x3 directly, matching the interleaved RGB the app writes, so no input transpose is needed.
fp16 via AI Edge Quantizer FLOAT_CASTING: half the size, and it runs natively on the GPU delegate. Dynamic-range int8 is intentionally avoided because it favors the CPU/XNNPACK path, not the GPU delegate.

Fidelity

Converted fp32 vs. original PyTorch (real image): corr 1.0000, max|diff| ~1.6e-3.
fp16 vs. fp32: corr 0.9999998 (≈0.27 % of the depth range).
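For readers who want to run a similar check on their own images, here is a hedged sketch of how such numbers could be computed. It is not the script that produced the figures above; the function name `compare_outputs` is hypothetical, and expressing the maximum difference as a percentage of the reference output range is an assumption about what "of the depth range" means.

```python
import numpy as np


def compare_outputs(ref, test) -> dict:
    """Compare two depth outputs of identical shape.

    Returns the Pearson correlation, the maximum absolute difference, and
    that difference as a percentage of the reference output range.
    """
    a = np.asarray(ref, dtype=np.float64).ravel()
    b = np.asarray(test, dtype=np.float64).ravel()
    corr = float(np.corrcoef(a, b)[0, 1])
    max_abs = float(np.max(np.abs(a - b)))
    value_range = float(a.max() - a.min())
    pct = 100.0 * max_abs / value_range if value_range > 0 else float("nan")
    return {"corr": corr, "max_abs_diff": max_abs, "pct_of_range": pct}
```

Pass the reference output (for example PyTorch or fp32) as `ref` and the converted output as `test`, both produced from the same pre-processed input.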

On-device (Pixel 8a, verified)

On the LiteRT GPU delegate (LITERT_CL), 234 / 234 nodes of the fp16 model are delegated, so the model is fully GPU-resident with no CPU fallback. Latency is ~1–3 ms per inference (best 1.1 ms). RESIZE_BILINEAR with align_corners=True is GPU-supported as-is; no model change is needed.

Training data & PII

This is a weights-exact format conversion of Intel ISL's MiDaS v2.1 small; no new training was performed. MiDaS was trained for monocular depth on a mix of ~10 public depth datasets (e.g. ReDWeb, DIML, MegaDepth, WSVD, 3D Movies). These contain photos of real scenes that may incidentally include people and other PII; none was deliberately collected and this conversion adds none. The model outputs a relative-depth map only and performs no identification. Apply your own content/PII filtering before deployment. See the original MiDaS repo for dataset details.

License & attribution

MiDaS weights: MIT (Intel ISL).
EfficientNet-Lite3 backbone: Apache-2.0.

Original work: Ranftl et al., "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer" (MiDaS), https://github.com/isl-org/MiDaS.

Reproducing the conversion

A self-contained converter (litert-torch + ai-edge-quantizer) lives in the sample under compiled_model_api/depth_estimation/conversion/:

pip install litert-torch ai-edge-quantizer torch timm matplotlib pillow
python convert_midas_litert.py out 256
