---
license: mit
library_name: litert
pipeline_tag: depth-estimation
base_model: Ruicheng/moge-2-vits-normal
tags:
- litert
- tflite
- on-device
- android
- monocular-geometry
- depth-estimation
- surface-normals
- point-cloud
- dinov2
---
# MoGe-2 ViT-S — LiteRT (TFLite) GPU
On-device [LiteRT](https://ai.google.dev/edge/litert) (`.tflite`) build of
**[MoGe-2](https://github.com/microsoft/MoGe)** (CVPR'25 Oral) monocular geometry
estimation, converted from [`Ruicheng/moge-2-vits-normal`](https://huggingface.co/Ruicheng/moge-2-vits-normal)
(DINOv2 ViT-S backbone, 35M params).

A single forward pass turns one RGB image into an **affine 3D point map**,
**surface normals**, a **confidence mask**, and a **metric scale**. Together these
enable depth, surface normals, and a rotatable 3D point cloud on a phone.

The model runs **fully on the LiteRT `CompiledModel` GPU accelerator** (ML Drift):
all 836 ops are GPU-native, with no CPU fallback and no Flex ops.
## Files
| File | Size | Description |
|------|------|-------------|
| `moge.tflite` | 136 MB | FP32 single-graph model, GPU-compatible |
## I/O
- **Input**: `[1, 3, 448, 448]` float32, **NCHW**, RGB normalized to `[0, 1]`
  (ImageNet mean/std is applied *inside* the graph).
- **Outputs** (4):
  - `points` `[1, 448, 448, 3]`: affine point map (`exp` remap: `[xy·exp(z), exp(z)]`)
  - `normal` `[1, 448, 448, 3]`: L2-normalized surface normals
  - `mask` `[1, 448, 448, 1]`: sigmoid confidence (> 0.5 = valid)
  - `scale` `[1, 1, 1, 1]`: metric scale factor
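
The input layout above can be sketched in plain Kotlin. This is a minimal illustration, assuming the image has already been resized to 448×448 and read out as packed ARGB integers (for example via `Bitmap.getPixels`). The helper name `argbToNchw` is hypothetical and not part of LiteRT or the sample app.

```kotlin
/**
 * Packs row-major ARGB pixels into a planar NCHW float array:
 * all R values first, then all G, then all B, each scaled to [0, 1].
 * No mean/std normalization is applied here, since the graph does that itself.
 */
fun argbToNchw(pixels: IntArray, width: Int = 448, height: Int = 448): FloatArray {
    val plane = width * height
    require(pixels.size == plane) { "expected $plane pixels, got ${pixels.size}" }
    val out = FloatArray(3 * plane)
    for (i in 0 until plane) {
        val p = pixels[i]
        out[i] = ((p shr 16) and 0xFF) / 255f          // R plane
        out[plane + i] = ((p shr 8) and 0xFF) / 255f   // G plane
        out[2 * plane + i] = (p and 0xFF) / 255f       // B plane
    }
    return out
}
```

The resulting array has `3 * 448 * 448` elements and can be written to the single input buffer.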
## Usage (Android, LiteRT CompiledModel)
```kotlin
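import com.google.ai.edge.litert.Accelerator
import com.google.ai.edge.litert.CompiledModel

// Compile the model for the GPU accelerator.
val model = CompiledModel.create(
    context.assets, "moge.tflite",
    CompiledModel.Options(Accelerator.GPU), null
)
val inputs = model.createInputBuffers()
val outputs = model.createOutputBuffers()

inputs[0].writeFloat(nchwFloatArray)  // [1,3,448,448], RGB in [0,1]
model.run(inputs, outputs)

// Read each output as a flat FloatArray, then identify the 4 outputs
// by element count and value range.
val points = outputs[0].readFloat()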
```
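
One way to identify the four outputs by element count and value range is sketched below. It assumes each output has already been read into a flat `FloatArray`. `scale` and `mask` are unique by size. `points` and `normal` share a size, so the sketch picks as `normal` the array whose per-pixel vectors are closest to unit length. The names `MogeOutputs`, `meanVectorNorm`, and `identifyOutputs` are hypothetical, and the heuristic is an assumption rather than the author's method.

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

class MogeOutputs(val points: FloatArray, val normal: FloatArray, val mask: FloatArray, val scale: Float)

/** Mean L2 norm over a strided sample of the xyz triplets in an interleaved [N, 3] array. */
fun meanVectorNorm(a: FloatArray, samples: Int = 1024): Float {
    val n = a.size / 3
    val step = maxOf(1, n / samples)
    var sum = 0.0
    var count = 0
    var i = 0
    while (i < n) {
        val x = a[3 * i]; val y = a[3 * i + 1]; val z = a[3 * i + 2]
        sum += sqrt((x * x + y * y + z * z).toDouble())
        count++
        i += step
    }
    return (sum / count).toFloat()
}

fun identifyOutputs(buffers: List<FloatArray>, pixels: Int = 448 * 448): MogeOutputs {
    val scale = buffers.single { it.size == 1 }[0]       // [1,1,1,1]
    val mask = buffers.single { it.size == pixels }      // [1,H,W,1]
    val triples = buffers.filter { it.size == 3 * pixels }
    require(triples.size == 2) { "expected two [1,H,W,3] outputs" }
    // Normals are L2-normalized, so their mean vector norm should be closest to 1.
    val (normal, points) = triples.sortedBy { abs(meanVectorNorm(it) - 1f) }
    return MogeOutputs(points, normal, mask, scale)
}
```

With the buffers from the snippet above, a call might look like `identifyOutputs(outputs.map { it.readFloat() })`.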
A complete Android sample (gallery → normal map + depth) is available in
[google-ai-edge/litert-samples](https://github.com/google-ai-edge/litert-samples).
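
For the point-cloud use case, the `mask` output can be used to drop invalid pixels before rendering. The sketch below assumes `points` is the interleaved `[448, 448, 3]` array and `mask` the matching `[448, 448]` array, and uses the > 0.5 threshold from the I/O section. The helper name `validPoints` is hypothetical. The kept points remain affine, because the sketch does not apply the metric `scale` or any other alignment.

```kotlin
/** Keeps the xyz triplets whose mask confidence exceeds [threshold]; returns a packed [M, 3] array. */
fun validPoints(points: FloatArray, mask: FloatArray, threshold: Float = 0.5f): FloatArray {
    require(points.size == 3 * mask.size) { "points must hold one xyz triplet per mask value" }
    val count = mask.count { it > threshold }
    val out = FloatArray(3 * count)
    var j = 0
    for (i in mask.indices) {
        if (mask[i] > threshold) {
            // Copy x, y, z of pixel i into slot j of the output.
            points.copyInto(out, destinationOffset = 3 * j, startIndex = 3 * i, endIndex = 3 * i + 3)
            j++
        }
    }
    return out
}
```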
## Performance
- ~522 ms per frame on a Pixel 8a GPU (Tensor G3, Mali-G715).
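
This README does not say how the figure was measured. A minimal sketch for reproducing a comparable number is shown below: it times a block over several runs after a warm-up and reports the median. The helper name `medianLatencyMs` and the default run counts are assumptions chosen for illustration.

```kotlin
/** Runs [block] [warmup] times untimed, then [runs] times timed, and returns the median in milliseconds. */
fun medianLatencyMs(warmup: Int = 5, runs: Int = 20, block: () -> Unit): Double {
    require(runs > 0) { "runs must be positive" }
    repeat(warmup) { block() }
    val samples = DoubleArray(runs) {
        val start = System.nanoTime()
        block()
        (System.nanoTime() - start) / 1e6
    }
    samples.sort()
    return if (runs % 2 == 1) samples[runs / 2]
    else (samples[runs / 2 - 1] + samples[runs / 2]) / 2.0
}
```

On device, the timed block could wrap `model.run(inputs, outputs)` plus one output read, so that the measurement covers the point at which results are actually available.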
## Conversion notes
Converted with [litert-torch](https://github.com/google-ai-edge/ai-edge-torch)
(NCHW preserved — required for ViT attention accuracy). Making DINOv2 + the
ConvStack decoder fully GPU-compatible required nine graph rewrites
(LayerScale bake, fused-qkv decomposition, position-embedding bake,
ConvTranspose → bilinear+1×1, etc.). Verified: all ops GPU-native, output
correlation ≈ 1.0 vs. the PyTorch reference.
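
The correlation measure is not specified above. As one plausible reading, the sketch below computes the Pearson correlation between a flattened LiteRT output and the matching flattened PyTorch reference output. The function name `pearson` is hypothetical, and exporting the reference tensor to a `FloatArray` is assumed to happen elsewhere.

```kotlin
import kotlin.math.sqrt

/** Pearson correlation coefficient between two equally sized arrays, accumulated in double precision. */
fun pearson(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size && a.isNotEmpty()) { "arrays must be non-empty and the same size" }
    val n = a.size
    var meanA = 0.0
    var meanB = 0.0
    for (i in 0 until n) { meanA += a[i]; meanB += b[i] }
    meanA /= n
    meanB /= n
    var cov = 0.0
    var varA = 0.0
    var varB = 0.0
    for (i in 0 until n) {
        val da = a[i] - meanA
        val db = b[i] - meanB
        cov += da * db
        varA += da * da
        varB += db * db
    }
    return cov / sqrt(varA * varB)
}
```

A value close to 1.0 for each of `points`, `normal`, and `mask` would match the verification claim. A constant array makes the denominator zero, so the single-element `scale` output is better compared directly.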
## License & attribution
- Model: **MIT** (original [microsoft/MoGe](https://github.com/microsoft/MoGe/blob/main/LICENSE)).
- DINOv2 backbone components: Apache-2.0.
- This is a format conversion of `Ruicheng/moge-2-vits-normal`; all credit to the
original authors (Microsoft Research).