| license: mit | |
| library_name: litert | |
| pipeline_tag: depth-estimation | |
| base_model: Ruicheng/moge-2-vits-normal | |
| tags: | |
| - litert | |
| - tflite | |
| - on-device | |
| - android | |
| - monocular-geometry | |
| - depth-estimation | |
| - surface-normals | |
| - point-cloud | |
| - dinov2 | |
| # MoGe-2 ViT-S — LiteRT (TFLite) GPU | |
| On-device [LiteRT](https://ai.google.dev/edge/litert) (`.tflite`) conversion of | |
| **[MoGe-2](https://github.com/microsoft/MoGe)** (CVPR'25 Oral) monocular geometry | |
| estimation, converted from [`Ruicheng/moge-2-vits-normal`](https://huggingface.co/Ruicheng/moge-2-vits-normal) | |
| (DINOv2 ViT-S backbone, 35M params). | |
| A single forward pass turns one RGB image into an **affine 3D point map**, | |
| **surface normals**, a **confidence mask**, and a **metric scale** — enabling | |
| depth, surface normals, and a rotatable 3D point cloud on a phone. | |
| The model runs **fully on the LiteRT `CompiledModel` GPU accelerator** (ML Drift): | |
| all 836 ops are GPU-native, no CPU fallback, no Flex ops. | |
| ## Files | |
| | File | Size | Description | | |
| |------|------|-------------| | |
| | `moge.tflite` | 136 MB | FP32 single-graph model, GPU-compatible | | |
| ## I/O | |
| - **Input**: `[1, 3, 448, 448]` float32, **NCHW**, RGB scaled to `[0, 1]` | |
| (ImageNet mean/std normalization is applied *inside* the graph, so do not apply it again). | |
| - **Outputs** (4, channel-last): | |
| - `points` `[1, 448, 448, 3]` — affine point map (`exp` remap: `[xy·exp(z), exp(z)]`) | |
| - `normal` `[1, 448, 448, 3]` — L2-normalized surface normals | |
| - `mask` `[1, 448, 448, 1]` — sigmoid confidence (> 0.5 = valid) | |
| - `scale` `[1, 1, 1, 1]` — metric scale factor | |
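
The input layout above can be produced with a small packing routine. The following is a minimal sketch, assuming the image has already been resized to 448×448 and is available as ARGB_8888 integers (for example from `Bitmap.getPixels`); the helper name `packNchw` is hypothetical and not part of LiteRT.

```kotlin
// Hypothetical helper (not part of LiteRT): packs ARGB_8888 pixels into a
// planar NCHW float array scaled to [0, 1]. No mean/std is applied here,
// because the model card states that normalization happens inside the graph.
fun packNchw(argbPixels: IntArray, width: Int = 448, height: Int = 448): FloatArray {
    require(argbPixels.size == width * height) { "expected ${width * height} pixels" }
    val plane = width * height
    val out = FloatArray(3 * plane)
    for (i in 0 until plane) {
        val p = argbPixels[i]
        out[i] = ((p shr 16) and 0xFF) / 255f          // R plane
        out[plane + i] = ((p shr 8) and 0xFF) / 255f   // G plane
        out[2 * plane + i] = (p and 0xFF) / 255f       // B plane
    }
    return out
}
```

The returned array has `3 * 448 * 448` elements with all red values first, then green, then blue, which matches the `[1, 3, 448, 448]` input shape.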
| ## Usage (Android, LiteRT CompiledModel) | |
| ```kotlin | |
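| import com.google.ai.edge.litert.Accelerator | |
| import com.google.ai.edge.litert.CompiledModel | |
| // Load the model from the app's assets and compile it for the GPU accelerator. | |
| val model = CompiledModel.create( | |
|     context.assets, "moge.tflite", | |
|     CompiledModel.Options(Accelerator.GPU), null | |
| ) | |
| val inputs = model.createInputBuffers() | |
| val outputs = model.createOutputBuffers() | |
| // nchwFloatArray: FloatArray of 1*3*448*448 values, RGB scaled to [0, 1]. | |
| inputs[0].writeFloat(nchwFloatArray) | |
| model.run(inputs, outputs) | |
| // Read each output the same way; identify the 4 outputs by element count + range. | |
| val points = outputs[0].readFloat() | |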
| ``` | |
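
The comment about identifying outputs by element count and range can be sketched as follows. This is one possible heuristic, not the author's method: `scale` is the only single-element output, `mask` is the only one with one value per pixel, and of the two three-channel outputs the L2-normalized `normal` should have a mean vector length close to 1. The names `MoGeOutputs`, `identifyOutputs`, and `meanVectorNorm` are hypothetical.

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

// Hypothetical container for the four named outputs.
class MoGeOutputs(
    val points: FloatArray, val normal: FloatArray,
    val mask: FloatArray, val scale: Float
)

// Mean Euclidean length of the xyz triples stored in channel-last order.
fun meanVectorNorm(xyz: FloatArray): Float {
    val n = xyz.size / 3
    var sum = 0.0
    for (i in 0 until n) {
        val x = xyz[3 * i]; val y = xyz[3 * i + 1]; val z = xyz[3 * i + 2]
        sum += sqrt((x * x + y * y + z * z).toDouble())
    }
    return (sum / n).toFloat()
}

// Hypothetical helper: sorts the raw output arrays into named fields,
// regardless of the order in which the runtime returns them.
fun identifyOutputs(raw: List<FloatArray>, pixels: Int = 448 * 448): MoGeOutputs {
    val scale = raw.single { it.size == 1 }[0]
    val mask = raw.single { it.size == pixels }
    val triples = raw.filter { it.size == 3 * pixels }
    require(triples.size == 2) { "expected two [H, W, 3] outputs" }
    // Assumption: unit-length normals give a mean norm nearest to 1.
    val (normal, points) = triples.sortedBy { abs(meanVectorNorm(it) - 1f) }
    return MoGeOutputs(points, normal, mask, scale)
}
```

With the buffers from the snippet above, this could be called as `identifyOutputs(outputs.map { it.readFloat() })`. The norm test is a heuristic and could misfire on an image whose point map happens to have unit mean length, so a fixed index mapping is preferable once the output order of a given build is known.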
| A complete Android sample (gallery → normal map + depth) is available in | |
| [google-ai-edge/litert-samples](https://github.com/google-ai-edge/litert-samples). | |
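
For readers who want to render a normal map without the full sample, the sketch below shows one common visualization convention: each normal component in `[-1, 1]` is mapped to a color channel in `[0, 255]`, and pixels whose mask value is not above 0.5 are drawn black. This is an illustrative assumption and may differ from what the linked sample does; `normalsToArgb` and `toChannel` are hypothetical names.

```kotlin
// Maps one normal component from [-1, 1] to an 8-bit channel value.
fun toChannel(v: Float): Int =
    ((v.coerceIn(-1f, 1f) * 0.5f + 0.5f) * 255f + 0.5f).toInt()

// Hypothetical helper: converts channel-last normals [H*W*3] and a
// confidence mask [H*W] into ARGB_8888 pixels.
fun normalsToArgb(normal: FloatArray, mask: FloatArray): IntArray {
    require(normal.size == 3 * mask.size) { "normal must hold 3 values per mask pixel" }
    return IntArray(mask.size) { i ->
        if (mask[i] <= 0.5f) {
            0xFF000000.toInt()                      // invalid pixel: opaque black
        } else {
            val r = toChannel(normal[3 * i])        // x component
            val g = toChannel(normal[3 * i + 1])    // y component
            val b = toChannel(normal[3 * i + 2])    // z component
            (0xFF shl 24) or (r shl 16) or (g shl 8) or b
        }
    }
}
```

On Android, the resulting array can be passed to `Bitmap.createBitmap(colors, 448, 448, Bitmap.Config.ARGB_8888)` for display.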
| ## Performance | |
| - ~522 ms per frame on a Pixel 8a GPU (Mali-G715). | |
| ## Conversion notes | |
| Converted with [litert-torch](https://github.com/google-ai-edge/ai-edge-torch) | |
| (NCHW preserved — required for ViT attention accuracy). Making DINOv2 + the | |
| ConvStack decoder fully GPU-compatible required nine graph rewrites | |
| (LayerScale bake, fused-qkv decomposition, position-embedding bake, | |
| ConvTranspose → bilinear+1×1, etc.). Verified: all ops GPU-native, output | |
| correlation ≈ 1.0 vs. the PyTorch reference. | |
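
A correlation check of this kind can be reproduced on-device or on a desktop JVM. The card does not say which measure was used; the sketch below assumes the Pearson correlation coefficient between a flattened LiteRT output and the corresponding flattened PyTorch reference output. The function name `pearson` is hypothetical.

```kotlin
import kotlin.math.sqrt

// Pearson correlation between two equally sized float arrays, accumulated
// in double precision. Undefined (NaN) if either array is constant.
fun pearson(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size && a.isNotEmpty()) { "arrays must be non-empty and equal in size" }
    var meanA = 0.0
    var meanB = 0.0
    for (i in a.indices) { meanA += a[i]; meanB += b[i] }
    meanA /= a.size
    meanB /= b.size
    var cov = 0.0
    var varA = 0.0
    var varB = 0.0
    for (i in a.indices) {
        val da = a[i] - meanA
        val db = b[i] - meanB
        cov += da * db
        varA += da * da
        varB += db * db
    }
    return cov / sqrt(varA * varB)
}
```

A value near 1.0 indicates that the two outputs vary together almost perfectly; it does not by itself detect a constant offset or a global scale difference, so comparing maximum absolute error as well is a reasonable complement.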
| ## License & attribution | |
| - Model: **MIT** (original [microsoft/MoGe](https://github.com/microsoft/MoGe/blob/main/LICENSE)). | |
| - DINOv2 backbone components: Apache-2.0. | |
| - This is a format conversion of `Ruicheng/moge-2-vits-normal`; all credit to the | |
| original authors (Microsoft Research). | |