---
license: mit
library_name: litert
pipeline_tag: depth-estimation
base_model: Ruicheng/moge-2-vits-normal
tags:
- litert
- tflite
- on-device
- android
- monocular-geometry
- depth-estimation
- surface-normals
- point-cloud
- dinov2
---

# MoGe-2 ViT-S: LiteRT (TFLite) GPU

On-device [LiteRT](https://ai.google.dev/edge/litert) (`.tflite`) conversion of **[MoGe-2](https://github.com/microsoft/MoGe)** (CVPR'25 Oral), a monocular geometry estimation model. It was converted from [`Ruicheng/moge-2-vits-normal`](https://huggingface.co/Ruicheng/moge-2-vits-normal) (DINOv2 ViT-S backbone, 35M params).

A single forward pass turns one RGB image into an **affine 3D point map**, **surface normals**, a **confidence mask**, and a **metric scale**. Together these are enough to produce depth, a normal map, and a rotatable 3D point cloud on a phone.

The model runs **fully on the LiteRT `CompiledModel` GPU accelerator** (ML Drift): all 836 ops are GPU-native, with no CPU fallback and no Flex ops.

## Files

| File | Size | Description |
|------|------|-------------|
| `moge.tflite` | 136 MB | FP32 single-graph model, GPU-compatible |

## I/O

- **Input**: `[1, 3, 448, 448]` float32, **NCHW**, RGB scaled to `[0, 1]`. ImageNet mean/std normalization is applied *inside* the graph, so do not apply it beforehand.
- **Outputs** (4):
  - `points` `[1, 448, 448, 3]`: affine point map (`exp` remap: `[xy·exp(z), exp(z)]`)
  - `normal` `[1, 448, 448, 3]`: L2-normalized surface normals
  - `mask` `[1, 448, 448, 1]`: sigmoid confidence (> 0.5 = valid)
  - `scale` `[1, 1, 1, 1]`: metric scale factor

## Usage (Android, LiteRT CompiledModel)

```kotlin
import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

// Load the model from the app's assets and compile it for the GPU accelerator.
val model = CompiledModel.create(
    context.assets, "moge.tflite",
    CompiledModel.Options(Accelerator.GPU), null
)
val inputs = model.createInputBuffers()
val outputs = model.createOutputBuffers()

inputs[0].writeFloat(nchwFloatArray)  // [1,3,448,448], RGB in [0,1]
model.run(inputs, outputs)

// Identify each of the 4 outputs by element count and value range
// rather than relying on its index.
val points = outputs[0].readFloat()
```

A complete Android sample (gallery → normal map + depth) is available in [google-ai-edge/litert-samples](https://github.com/google-ai-edge/litert-samples).

## Performance

- ~522 ms / frame on a Pixel 8a (Mali-G715) GPU.

## Conversion notes

Converted with [litert-torch](https://github.com/google-ai-edge/ai-edge-torch). The NCHW layout is preserved, which is required for ViT attention accuracy.

Making DINOv2 and the ConvStack decoder fully GPU-compatible required nine graph rewrites, including the LayerScale bake, fused-qkv decomposition, position-embedding bake, and ConvTranspose → bilinear + 1×1.

Verified: all ops are GPU-native, and output correlation is ≈ 1.0 vs. the PyTorch reference.

## License & attribution

- Model: **MIT** (original [microsoft/MoGe](https://github.com/microsoft/MoGe/blob/main/LICENSE)).
- DINOv2 backbone components: Apache-2.0.
- This is a format conversion of `Ruicheng/moge-2-vits-normal`; all credit goes to the original authors (Microsoft Research).
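
## Pre- and post-processing sketch

The usage snippet takes an `nchwFloatArray` as given and leaves output identification to the caller. The following is a minimal sketch of both steps, not the code used in the official sample. The helper names `packNchw`, `identifyOutput`, `looksUnitLength`, and `OutputKind` are hypothetical. The sketch assumes the pixels arrive as row-major ARGB_8888 integers (for example from `Bitmap.getPixels` on a 448×448 bitmap) and that normals can be told apart from points by checking whether sampled triplets have unit length, which follows from the I/O description above but is only a heuristic.

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

// Side length of the model input described in the I/O section.
const val SIDE = 448

/**
 * Packs row-major ARGB_8888 pixels into a planar NCHW float array
 * [1, 3, H, W] with values in [0, 1]. No mean/std normalization is done
 * here because the graph applies it internally.
 */
fun packNchw(argb: IntArray, width: Int = SIDE, height: Int = SIDE): FloatArray {
    val plane = width * height
    require(argb.size == plane) { "expected $plane pixels, got ${argb.size}" }
    val out = FloatArray(3 * plane)
    for (i in 0 until plane) {
        val p = argb[i]
        out[i] = ((p shr 16) and 0xFF) / 255f          // R plane
        out[plane + i] = ((p shr 8) and 0xFF) / 255f   // G plane
        out[2 * plane + i] = (p and 0xFF) / 255f       // B plane
    }
    return out
}

enum class OutputKind { POINTS, NORMAL, MASK, SCALE, UNKNOWN }

/** Heuristic: true if sampled interleaved xyz triplets all have L2 norm near 1. */
fun looksUnitLength(xyz: FloatArray, samples: Int = 64, tol: Float = 0.05f): Boolean {
    val pixels = xyz.size / 3
    val step = maxOf(1, pixels / samples)
    var i = 0
    while (i < pixels) {
        val x = xyz[3 * i]
        val y = xyz[3 * i + 1]
        val z = xyz[3 * i + 2]
        if (abs(sqrt(x * x + y * y + z * z) - 1f) > tol) return false
        i += step
    }
    return true
}

/**
 * Classifies one output buffer by element count, then by value range:
 * 1 element is the scale, H*W elements is the mask, and 3*H*W elements is
 * either the normals (unit-length triplets) or the point map.
 */
fun identifyOutput(data: FloatArray, pixels: Int = SIDE * SIDE): OutputKind =
    when (data.size) {
        1 -> OutputKind.SCALE
        pixels -> OutputKind.MASK
        3 * pixels -> if (looksUnitLength(data)) OutputKind.NORMAL else OutputKind.POINTS
        else -> OutputKind.UNKNOWN
    }
```

With these helpers, the input line in the usage snippet could become `inputs[0].writeFloat(packNchw(pixels))`, and each `outputs[i].readFloat()` result could be passed through `identifyOutput` before use. The unit-length check samples only a subset of pixels to keep the cost low; a stricter check would scan every triplet.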