---
license: mit
library_name: litert
pipeline_tag: depth-estimation
base_model: Ruicheng/moge-2-vits-normal
tags:
  - litert
  - tflite
  - on-device
  - android
  - monocular-geometry
  - depth-estimation
  - surface-normals
  - point-cloud
  - dinov2
---

# MoGe-2 ViT-S — LiteRT (TFLite) GPU

On-device [LiteRT](https://ai.google.dev/edge/litert) (`.tflite`) conversion of
**[MoGe-2](https://github.com/microsoft/MoGe)** (CVPR'25 Oral) monocular geometry
estimation, converted from [`Ruicheng/moge-2-vits-normal`](https://huggingface.co/Ruicheng/moge-2-vits-normal)
(DINOv2 ViT-S backbone, 35M params).

A single forward pass turns one RGB image into an **affine 3D point map**,
**surface normals**, a **confidence mask**, and a **metric scale** — enabling
depth, surface normals, and a rotatable 3D point cloud on a phone.

The model runs **fully on the LiteRT `CompiledModel` GPU accelerator** (ML Drift):
all 836 ops execute on the GPU, with no CPU fallback and no Flex ops.

## Files

| File | Size | Description |
|------|------|-------------|
| `moge.tflite` | 136 MB | FP32 single-graph model, GPU-compatible |

## I/O

- **Input**: `[1, 3, 448, 448]` float32, **NCHW**, RGB scaled to `[0, 1]`.
  Do not apply ImageNet mean/std yourself; that normalization happens *inside* the graph.
- **Outputs** (4):
  - `points` `[1, 448, 448, 3]` — affine point map (`exp` remap: `[xy·exp(z), exp(z)]`)
  - `normal` `[1, 448, 448, 3]` — L2-normalized surface normals
  - `mask`   `[1, 448, 448, 1]` — sigmoid confidence (> 0.5 = valid)
  - `scale`  `[1, 1, 1, 1]`     — metric scale factor
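
As a minimal sketch of the input layout described above, the helper below packs ARGB pixels into a planar NCHW float array scaled to `[0, 1]`. The function name `argbToNchw` is hypothetical, and it assumes the image has already been resized to 448×448 and read out as packed `0xAARRGGBB` integers (the layout `Bitmap.getPixels` produces on Android).

```kotlin
// Hypothetical helper, not part of LiteRT: converts packed ARGB pixels
// (row-major, one Int per pixel) into a planar NCHW float array in [0, 1].
// No ImageNet mean/std is applied, since the graph does that internally.
fun argbToNchw(pixels: IntArray, side: Int = 448): FloatArray {
    val plane = side * side
    require(pixels.size == plane) { "expected $plane pixels, got ${pixels.size}" }
    val out = FloatArray(3 * plane)
    for (i in 0 until plane) {
        val p = pixels[i]
        out[i] = ((p shr 16) and 0xFF) / 255f            // R plane
        out[plane + i] = ((p shr 8) and 0xFF) / 255f     // G plane
        out[2 * plane + i] = (p and 0xFF) / 255f         // B plane
    }
    return out
}
```

The result can be passed directly to `writeFloat` on the first input buffer.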

## Usage (Android, LiteRT CompiledModel)

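```kotlin
import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

// Load the model from the APK assets and compile it for the GPU accelerator.
val model = CompiledModel.create(
    context.assets, "moge.tflite",
    CompiledModel.Options(Accelerator.GPU), null
)
val inputs = model.createInputBuffers()
val outputs = model.createOutputBuffers()
inputs[0].writeFloat(nchwFloatArray)   // [1,3,448,448], RGB in [0,1]
model.run(inputs, outputs)
// Identify the 4 outputs by element count (points/normal: 448*448*3,
// mask: 448*448, scale: 1) and by value range (normals have unit length).
val points = outputs[0].readFloat()
```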
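
One way to identify the outputs by element count and range is sketched below. `MogeOutputs`, `unitLengthError`, and `identifyOutputs` are hypothetical names, and the sketch assumes that normals can be told apart from the point map because their per-pixel vector length is close to 1.

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

// Hypothetical container for the four decoded output tensors.
class MogeOutputs(
    val points: FloatArray,
    val normal: FloatArray,
    val mask: FloatArray,
    val scale: Float,
)

// Mean deviation of per-pixel vector length from 1 for an [H, W, 3] tensor.
// Near 0 for L2-normalized normals, larger for the point map.
fun unitLengthError(hw3: FloatArray): Float {
    val n = hw3.size / 3
    var sum = 0.0
    for (i in 0 until n) {
        val x = hw3[3 * i]
        val y = hw3[3 * i + 1]
        val z = hw3[3 * i + 2]
        sum += abs(sqrt((x * x + y * y + z * z).toDouble()) - 1.0)
    }
    return (sum / n).toFloat()
}

// Sorts the raw output arrays into named fields, whatever order they arrive in.
fun identifyOutputs(buffers: List<FloatArray>, side: Int = 448): MogeOutputs {
    val plane = side * side
    val scale = buffers.single { it.size == 1 }[0]
    val mask = buffers.single { it.size == plane }
    val threeChannel = buffers.filter { it.size == 3 * plane }
    require(threeChannel.size == 2) { "expected two [H, W, 3] outputs" }
    // The tensor whose vectors are closest to unit length is taken as the normals.
    val (normal, points) = threeChannel.sortedBy { unitLengthError(it) }
    return MogeOutputs(points, normal, mask, scale)
}
```

With the buffers from the snippet above, this would be called as `identifyOutputs(outputs.map { it.readFloat() })`.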

A complete Android sample (gallery → normal map + depth) is available in
[google-ai-edge/litert-samples](https://github.com/google-ai-edge/litert-samples).
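
For readers who want a quick visualization without the sample app, the sketch below turns the raw outputs into a normal-map image and a relative depth image. It is not taken from the sample: the function names are hypothetical, and the depth view simply min-max normalizes the z channel of the affine point map over valid pixels, so it shows relative depth only, not metric depth.

```kotlin
// Maps unit normals in [-1, 1] to opaque ARGB colors.
// Pixels whose mask value is at or below the threshold become black.
fun normalsToArgb(normal: FloatArray, mask: FloatArray, threshold: Float = 0.5f): IntArray {
    require(normal.size == 3 * mask.size) { "normal must be [H, W, 3] and mask [H, W, 1]" }
    fun channel(v: Float): Int = ((v.coerceIn(-1f, 1f) + 1f) * 0.5f * 255f + 0.5f).toInt()
    return IntArray(mask.size) { i ->
        if (mask[i] > threshold) {
            val r = channel(normal[3 * i])
            val g = channel(normal[3 * i + 1])
            val b = channel(normal[3 * i + 2])
            (0xFF shl 24) or (r shl 16) or (g shl 8) or b
        } else {
            0xFF shl 24
        }
    }
}

// Min-max normalizes the z channel of the point map over valid pixels to [0, 1].
// Invalid pixels are set to 0.
fun relativeDepth(points: FloatArray, mask: FloatArray, threshold: Float = 0.5f): FloatArray {
    require(points.size == 3 * mask.size) { "points must be [H, W, 3] and mask [H, W, 1]" }
    var lo = Float.POSITIVE_INFINITY
    var hi = Float.NEGATIVE_INFINITY
    for (i in mask.indices) {
        if (mask[i] > threshold) {
            val z = points[3 * i + 2]
            if (z < lo) lo = z
            if (z > hi) hi = z
        }
    }
    val range = if (hi > lo) hi - lo else 1f
    return FloatArray(mask.size) { i ->
        if (mask[i] > threshold) (points[3 * i + 2] - lo) / range else 0f
    }
}
```

On Android, the `IntArray` from `normalsToArgb` can be loaded into an `ARGB_8888` bitmap with `Bitmap.setPixels`.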

## Performance

- ~522 ms / frame on a Pixel 8a GPU (Mali-G715).
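
The measurement procedure behind this figure is not stated. To reproduce a comparable number on another device, one common approach is to discard a few warm-up runs and report the median of the timed ones. The helper below is a hypothetical sketch of that approach, not the method used for the figure above.

```kotlin
// Runs `block` `warmup` times untimed, then returns the median wall-clock
// latency in milliseconds over `runs` timed calls.
fun medianLatencyMs(warmup: Int = 5, runs: Int = 20, block: () -> Unit): Double {
    require(runs > 0) { "runs must be positive" }
    repeat(warmup) { block() }
    val samples = DoubleArray(runs) {
        val start = System.nanoTime()
        block()
        (System.nanoTime() - start) / 1e6
    }
    samples.sort()
    return if (runs % 2 == 1) samples[runs / 2]
    else (samples[runs / 2 - 1] + samples[runs / 2]) / 2.0
}
```

For example, `medianLatencyMs { model.run(inputs, outputs) }` times inference alone; include the input write and output read inside the block if the per-frame figure should cover them.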

## Conversion notes

Converted with [litert-torch](https://github.com/google-ai-edge/ai-edge-torch)
(NCHW preserved — required for ViT attention accuracy). Making DINOv2 + the
ConvStack decoder fully GPU-compatible required nine graph rewrites
(LayerScale bake, fused-qkv decomposition, position-embedding bake,
ConvTranspose → bilinear+1×1, etc.). Verified: all ops GPU-native, output
correlation ≈ 1.0 vs. the PyTorch reference.
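
The correlation metric is not specified here. Assuming it is the Pearson correlation between a flattened LiteRT output and the matching flattened PyTorch output (in the same memory layout), a check could look like the following sketch. The function name `pearson` is hypothetical.

```kotlin
import kotlin.math.sqrt

// Pearson correlation coefficient between two equally sized tensors,
// accumulated in double precision. Returns NaN if either input is constant.
fun pearson(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size && a.isNotEmpty()) { "arrays must be non-empty and equal length" }
    var meanA = 0.0
    var meanB = 0.0
    for (i in a.indices) {
        meanA += a[i]
        meanB += b[i]
    }
    meanA /= a.size
    meanB /= b.size
    var cov = 0.0
    var varA = 0.0
    var varB = 0.0
    for (i in a.indices) {
        val da = a[i] - meanA
        val db = b[i] - meanB
        cov += da * db
        varA += da * da
        varB += db * db
    }
    return cov / sqrt(varA * varB)
}
```

Correlation is insensitive to a constant offset or gain between the two tensors, so a maximum absolute difference check alongside it gives a stricter comparison.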

## License & attribution

- Model: **MIT** (original [microsoft/MoGe](https://github.com/microsoft/MoGe/blob/main/LICENSE)).
- DINOv2 backbone components: Apache-2.0.
- This is a format conversion of `Ruicheng/moge-2-vits-normal`; all credit to the
  original authors (Microsoft Research).