MiDaS small: LiteRT (fp16, NHWC, GPU-clean)
midas_small_256_fp16.tflite is MiDaS v2.1 small (MiDaS_small, the CNN MiDaS
with an EfficientNet-Lite3 backbone, not the DPT/ViT variants) converted to
LiteRT for on-device monocular depth estimation. Given one RGB image it
predicts a per-pixel inverse-depth map (near = bright, far = dark).
It is the model used by the LiteRT compiled_model_api/depth_estimation Android
sample.
Files
| File | Precision | Size |
|---|---|---|
| midas_small_256_fp16.tflite | fp16 weights | ~33 MB |
Specs
| Property | Value |
|---|---|
| Task | Monocular depth estimation |
| Source | torch.hub.load("intel-isl/MiDaS", "MiDaS_small") |
| Input | 1 x 256 x 256 x 3 float32, RGB, ImageNet-normalized, NHWC (interleaved) |
| Output | 1 x 256 x 256 float32, relative inverse depth |
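As an illustration of the I/O contract in the table above, here is a minimal desktop sketch that runs the model from Python. It assumes the `ai-edge-litert` pip package and its `Interpreter` class, which are not mentioned in this card; the helper names `load_interpreter` and `predict_inverse_depth` are hypothetical. It runs on the CPU and does not reproduce the Android sample's GPU path.

```python
import numpy as np

MODEL_PATH = "midas_small_256_fp16.tflite"  # file name from the Files table


def load_interpreter(model_path=MODEL_PATH):
    """Create a CPU interpreter (assumes the `ai-edge-litert` package is installed)."""
    from ai_edge_litert.interpreter import Interpreter  # assumed dependency

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    return interpreter


def predict_inverse_depth(interpreter, nhwc_input):
    """Run one inference on a 1x256x256x3 float32 tensor; return a 256x256 map."""
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    x = np.asarray(nhwc_input, dtype=np.float32)
    if tuple(x.shape) != tuple(inp["shape"]):
        raise ValueError(f"expected input shape {tuple(inp['shape'])}, got {x.shape}")
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    # Output is 1 x 256 x 256; drop the batch dimension.
    return interpreter.get_tensor(out["index"])[0]
```

Typical use would be `depth = predict_inverse_depth(load_interpreter(), x)`, where `x` is an already pre-processed NHWC tensor.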
Pre-processing: resize to 256×256, normalize with ImageNet stats
(mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225] on [0,1] pixels),
write as interleaved NHWC RGB float32.
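The pre-processing steps can be sketched in NumPy as follows. The function names are hypothetical, and the bilinear resize filter in `load_and_preprocess` is an assumption, since the card does not say which filter the sample uses; Pillow is only needed for that optional helper.

```python
import numpy as np

# ImageNet statistics quoted above, applied to pixels scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_rgb(rgb_uint8):
    """HxWx3 uint8 RGB -> 1xHxWx3 float32, ImageNet-normalized, NHWC."""
    rgb = np.asarray(rgb_uint8)
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("expected an HxWx3 RGB array")
    x = rgb.astype(np.float32) / 255.0          # scale to [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD      # per-channel normalization
    return x[np.newaxis, ...]                   # add the batch dimension


def load_and_preprocess(path, size=256):
    """Load an image file, resize to size x size, and normalize it."""
    from PIL import Image  # Pillow; resize filter below is an assumption

    img = Image.open(path).convert("RGB").resize((size, size), Image.BILINEAR)
    return normalize_rgb(np.asarray(img))
```

Because the array is already channel-last, no transpose is needed before handing it to the model.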
Post-processing: min-max normalize the output and map through a color LUT
(the sample uses inferno).
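A minimal sketch of this post-processing, assuming NumPy and (for the color step only) Matplotlib 3.5 or newer. The function names and the handling of a constant map are illustrative choices, not taken from the sample.

```python
import numpy as np


def minmax_normalize(depth, eps=1e-6):
    """Scale a depth map to [0, 1]; a (near-)constant map becomes all zeros."""
    d = np.asarray(depth, dtype=np.float32)
    lo, hi = float(d.min()), float(d.max())
    if hi - lo < eps:
        return np.zeros_like(d)
    return (d - lo) / (hi - lo)


def colorize(depth01, cmap_name="inferno"):
    """Map values in [0, 1] through a Matplotlib colormap to HxWx3 uint8 RGB."""
    import matplotlib  # assumed dependency, only needed for this step

    rgba = matplotlib.colormaps[cmap_name](depth01)  # HxWx4 floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)
```

With inverse depth as the input, near pixels normalize toward 1 and therefore land at the bright end of the LUT.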
Why this conversion
The graph lowers entirely to GPU-clean builtins (no attention, no Flex/Custom
ops, no GATHER_ND, no >4D reshapes):
CONV_2D x73, ADD x27, DEPTHWISE_CONV_2D x24, RELU x7, RESIZE_BILINEAR x5, RESHAPE x1
- Channel-last I/O (to_channel_last_io) so the model takes NHWC 1x256x256x3 directly, matching the interleaved RGB the app writes (no input transpose).
- fp16 via AI Edge Quantizer FLOAT_CASTING: half the size, runs natively on the GPU delegate. Dynamic-range int8 is intentionally avoided (it favors the CPU/XNNPACK path, not the GPU delegate).
Fidelity
- Converted fp32 vs. original PyTorch (real image): corr 1.0000, max|diff| ~1.6e-3.
- fp16 vs. fp32: corr 0.9999998 (≈0.27 % of the depth range).
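Figures of this kind can be computed with a few lines of NumPy. The sketch below assumes that "corr" means the Pearson correlation over all pixels and that the percentage is the maximum absolute difference relative to the reference map's value range; the card does not state the exact definitions, and the function name is hypothetical.

```python
import numpy as np


def fidelity(reference, candidate):
    """Compare two depth maps of the same shape (e.g. fp32 vs. fp16 output)."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError("depth maps must have the same number of elements")
    corr = float(np.corrcoef(ref, cand)[0, 1])          # Pearson correlation
    max_abs_diff = float(np.max(np.abs(ref - cand)))
    value_range = float(ref.max() - ref.min())
    # Assumed definition: worst-case error as a fraction of the reference range.
    rel = max_abs_diff / value_range if value_range > 0 else float("nan")
    return {"corr": corr, "max_abs_diff": max_abs_diff, "rel_to_range": rel}
```

Both maps should come from the same pre-processed input so that the comparison isolates the conversion or quantization step.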
On-device (Pixel 8a, verified)
The fp16 model compiles to 234 / 234 nodes on the LiteRT GPU delegate
(LITERT_CL), giving full GPU residency with no CPU fallback, at ~1 to 3 ms per
inference (best 1.1 ms). RESIZE_BILINEAR align_corners=True is GPU-supported
as-is; no model change needed.
Training data & PII
This is a weights-exact format conversion of Intel ISL's MiDaS v2.1 small; no new training was performed. MiDaS was trained for monocular depth on a mix of ~10 public depth datasets (e.g. ReDWeb, DIML, MegaDepth, WSVD, 3D Movies). These contain photos of real scenes that may incidentally include people and other PII; none was deliberately collected and this conversion adds none. The model outputs a relative-depth map only and performs no identification. Apply your own content/PII filtering before deployment. See the original MiDaS repo for dataset details.
License & attribution
- MiDaS weights: MIT (Intel ISL).
- EfficientNet-Lite3 backbone: Apache-2.0.
Original work: Ranftl et al., "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer" (MiDaS), https://github.com/isl-org/MiDaS.
Reproducing the conversion
A self-contained converter (litert-torch + ai-edge-quantizer) lives in the
sample under compiled_model_api/depth_estimation/conversion/:
pip install litert-torch ai-edge-quantizer torch timm matplotlib pillow
python convert_midas_litert.py out 256