mlboydaisuke
/

MoGe-2-LiteRT


	---
	license: mit
	library_name: litert
	pipeline_tag: depth-estimation
	base_model: Ruicheng/moge-2-vits-normal
	tags:
	- litert
	- tflite
	- on-device
	- android
	- monocular-geometry
	- depth-estimation
	- surface-normals
	- point-cloud
	- dinov2
	---

	# MoGe-2 ViT-S — LiteRT (TFLite) GPU

	On-device [LiteRT](https://ai.google.dev/edge/litert) (`.tflite`) conversion of
	[MoGe-2](https://github.com/microsoft/MoGe) (CVPR'25 Oral) monocular geometry
	estimation, converted from [`Ruicheng/moge-2-vits-normal`](https://huggingface.co/Ruicheng/moge-2-vits-normal)
	(DINOv2 ViT-S backbone, 35M params).

	A single forward pass turns one RGB image into an affine 3D point map,
	surface normals, a confidence mask, and a metric scale — enabling
	depth, surface normals, and a rotatable 3D point cloud on a phone.

	The model runs fully on the LiteRT `CompiledModel` GPU accelerator (ML Drift):
	all 836 ops are GPU-native, no CPU fallback, no Flex ops.

	## Files

	| File | Size | Description |
	|------|------|-------------|
	| `moge.tflite` | 136 MB | FP32 single-graph model, GPU-compatible |

	## I/O

	- Input: `[1, 3, 448, 448]` float32, NCHW, RGB scaled to `[0, 1]`
	  (ImageNet mean/std normalization is applied inside the graph).
	- Outputs (4):
	  - `points` `[1, 448, 448, 3]`: affine point map (`exp` remap: `[xy·exp(z), exp(z)]`)
	  - `normal` `[1, 448, 448, 3]`: L2-normalized surface normals
	  - `mask` `[1, 448, 448, 1]`: sigmoid confidence (> 0.5 = valid)
	  - `scale` `[1, 1, 1, 1]`: metric scale factor
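
The input layout above can be produced from packed ARGB pixels, for example the array filled by Android's `Bitmap.getPixels` on a 448×448 bitmap. The following is a minimal sketch; `argbToNchw` is a hypothetical helper name, not part of LiteRT or of this repository. No mean/std normalization is applied, since the model card states that it happens inside the graph.

```kotlin
// Hypothetical helper (not part of LiteRT): packs row-major ARGB_8888 pixels
// into a planar NCHW float array with each channel scaled to [0, 1].
fun argbToNchw(pixels: IntArray, width: Int, height: Int): FloatArray {
    require(pixels.size == width * height) {
        "expected ${width * height} pixels, got ${pixels.size}"
    }
    val plane = width * height
    val out = FloatArray(3 * plane)
    for (i in 0 until plane) {
        val p = pixels[i]
        out[i] = ((p shr 16) and 0xFF) / 255f         // R plane
        out[plane + i] = ((p shr 8) and 0xFF) / 255f  // G plane
        out[2 * plane + i] = (p and 0xFF) / 255f      // B plane
    }
    return out
}
```

For a 448×448 bitmap the result has `3 * 448 * 448` elements and can be passed to `writeFloat` on the first input buffer.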

	## Usage (Android, LiteRT CompiledModel)

	```kotlin
	import com.google.ai.edge.litert.Accelerator
	import com.google.ai.edge.litert.CompiledModel

	// Load the model from assets and compile it for the GPU accelerator.
	val model = CompiledModel.create(
	    context.assets, "moge.tflite",
	    CompiledModel.Options(Accelerator.GPU), null
	)
	val inputs = model.createInputBuffers()
	val outputs = model.createOutputBuffers()
	inputs[0].writeFloat(nchwFloatArray) // [1,3,448,448], RGB in [0,1]
	model.run(inputs, outputs)
	// Identify the 4 outputs by element count + value range.
	val points = outputs[0].readFloat()
	```
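
Identifying outputs "by element count + range" could look like the sketch below. By the shapes in the I/O section, `scale` has 1 element and `mask` has 448×448, while `points` and `normal` share the same element count, so those two are separated here by testing whether the 3-vectors are close to unit length. `OutputKind`, `classifyOutput`, and `looksUnitLength` are hypothetical names, and the 5% tolerance and 90% threshold are assumptions chosen for illustration.

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

enum class OutputKind { POINTS, NORMAL, MASK, SCALE }

// Hypothetical helper: guesses which model output a flat float array holds,
// using the element counts listed in the I/O section.
fun classifyOutput(data: FloatArray, side: Int = 448): OutputKind {
    val pixels = side * side
    return when (data.size) {
        1 -> OutputKind.SCALE
        pixels -> OutputKind.MASK
        3 * pixels ->
            if (looksUnitLength(data)) OutputKind.NORMAL else OutputKind.POINTS
        else -> throw IllegalArgumentException("unexpected element count ${data.size}")
    }
}

// Samples roughly 1000 pixels (interleaved xyz per pixel is assumed) and
// reports whether at least 90% of the sampled vectors have length close to 1.
fun looksUnitLength(data: FloatArray, tolerance: Float = 0.05f): Boolean {
    val pixels = data.size / 3
    val step = maxOf(1, pixels / 1024)
    var unit = 0
    var total = 0
    var i = 0
    while (i < pixels) {
        val x = data[3 * i]
        val y = data[3 * i + 1]
        val z = data[3 * i + 2]
        if (abs(sqrt(x * x + y * y + z * z) - 1f) < tolerance) unit++
        total++
        i += step
    }
    return unit * 10 >= total * 9
}
```

Calling `classifyOutput(outputs[k].readFloat())` for each of the four buffers then gives a mapping that does not depend on output order. This is a heuristic, so it is worth checking once against known data.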

	A complete Android sample (gallery → normal map + depth) is available in
	[google-ai-edge/litert-samples](https://github.com/google-ai-edge/litert-samples).
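
For orientation, a depth map could be derived from the outputs roughly as follows. This is a sketch under stated assumptions, not the sample's implementation: it assumes the third channel of `points` can be read as depth, that multiplying by `scale` brings it to metric units, and that pixels with `mask` ≤ 0.5 should be discarded. Any shift or focal recovery that the upstream MoGe pipeline performs on the affine point map is omitted. `maskedDepth` is a hypothetical helper name.

```kotlin
// Hypothetical helper: takes the z channel of the interleaved `points` output,
// multiplies it by `scale`, and marks low-confidence pixels as NaN.
fun maskedDepth(points: FloatArray, mask: FloatArray, scale: Float): FloatArray {
    require(points.size == 3 * mask.size) {
        "points must hold 3 floats per mask pixel"
    }
    return FloatArray(mask.size) { i ->
        if (mask[i] > 0.5f) points[3 * i + 2] * scale else Float.NaN
    }
}
```

The NaN entries make invalid pixels easy to skip when colorizing the depth map or building a point cloud.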

	## Performance

	- ~522 ms / frame on a Pixel 8a (Mali-G715) GPU.

	## Conversion notes

	Converted with [litert-torch](https://github.com/google-ai-edge/ai-edge-torch)
	(NCHW preserved — required for ViT attention accuracy). Making DINOv2 + the
	ConvStack decoder fully GPU-compatible required nine graph rewrites
	(LayerScale bake, fused-qkv decomposition, position-embedding bake,
	ConvTranspose → bilinear+1×1, etc.). Verified: all ops GPU-native, output
	correlation ≈ 1.0 vs. the PyTorch reference.
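
The notes do not say which correlation measure was used for the check against PyTorch. Assuming Pearson correlation over the flattened output tensors, a comparison could be sketched as below; `pearson` is a hypothetical helper, and the reference array would have to be exported from the PyTorch model separately.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper: Pearson correlation between two flattened outputs.
// Returns NaN if either input is constant (zero variance).
fun pearson(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size && a.isNotEmpty()) {
        "inputs must be non-empty and equally sized"
    }
    val meanA = a.sumOf { it.toDouble() } / a.size
    val meanB = b.sumOf { it.toDouble() } / b.size
    var cov = 0.0
    var varA = 0.0
    var varB = 0.0
    for (i in a.indices) {
        val da = a[i] - meanA
        val db = b[i] - meanB
        cov += da * db
        varA += da * da
        varB += db * db
    }
    return cov / sqrt(varA * varB)
}
```

A value close to 1.0 for each of `points`, `normal`, and `mask` would be consistent with the result reported above. Correlation is insensitive to a constant scale or offset, so an absolute-error check is a useful complement.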

	## License & attribution

	- Model: MIT (original [microsoft/MoGe](https://github.com/microsoft/MoGe/blob/main/LICENSE)).
	- DINOv2 backbone components: Apache-2.0.
	- This is a format conversion of `Ruicheng/moge-2-vits-normal`; all credit to the
	original authors (Microsoft Research).