	---
	license: mit
	library_name: mlx
	tags:
	- mlx
	- hand-pose-estimation
	- apple-silicon
	- wilor
	- mano
	datasets:
	- hand-pose
	base_model: warmshao/WiLoR-mini
	---

	# WiLoR-MLX: Hand Pose Estimation on Apple Silicon

	MLX port of [WiLoR-mini](https://github.com/warmshao/WiLoR-mini) for native Apple Silicon inference. Complete pipeline: ViT-H/16 backbone + MANO hand model + RefineNet refinement.

	Code: [github.com/lyonsno/wilor-mlx](https://github.com/lyonsno/wilor-mlx)

	## Available Weights

	| Variant | File | Size | Precision | Notes |
	|---|---|---|---|---|
	| float32 | `wilor-mlx.safetensors` | 2.4 GB | Full | Reference quality, recommended |
	| int4 | `wilor-mlx-int4.safetensors` | 490 MB | 4-bit quantized | ~5x smaller download, same speed |

	Both variants run at near-identical inference speed on Apple Silicon (see benchmarks below), so choose based on download size and precision needs.

	These weights contain only ViT backbone, RefineNet, and learned embedding parameters — no MANO data is bundled or rehosted. `WiLoR.from_pretrained()` handles MANO automatically by fetching upstream [WiLoR-mini](https://huggingface.co/warmshao/WiLoR-mini) assets and converting locally on your machine. The [MANO hand model](https://mano.is.tue.mpg.de/) is separately licensed by the Max Planck Institute.

	## Performance

	Apple M4 Max, single-image (1×256×256×3), float32:

	### Stable live sidecar window (embedded in Perceptasia hand tracking)

	| Backend | Model p50 | Model p90 | Model p95 | Model p99 |
	|---|---|---|---|---|
	| MLX (wilor-mlx) | ~61 ms | ~62 ms | ~63 ms | ~66 ms |
	| PyTorch MPS (2.5.0) | ~85 ms | ~144 ms | ~238 ms | ~427 ms |

	MLX latency stays flat at ~61 ms with virtually no tail: p99 is only 8% above p50. The MLX figures cover 500 consecutive frames during stable operation; the MPS figures come from a 102K-frame manifest history, so the two samples differ in size. Live numbers are from [Perceptasia](https://github.com/lyonsno/perceptasia).
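The 8% figure is the relative gap between p50 and p99. A minimal sketch of that arithmetic, using the rounded values from the table above (the helper name `tail_spread` is illustrative and not part of wilor-mlx):

```python
def tail_spread(p50_ms: float, p99_ms: float) -> float:
    """Relative gap between median and p99 latency, as a fraction of p50."""
    return (p99_ms - p50_ms) / p50_ms


# Rounded values from the live sidecar table.
mlx_spread = tail_spread(61, 66)
mps_spread = tail_spread(85, 427)

print(f"MLX: {mlx_spread:.0%}")  # MLX: 8%
print(f"MPS: {mps_spread:.0%}")  # MPS: 402%
```

Because the table values are approximate, the results are only good to about one percentage point for MLX.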

	### Isolated model benchmark

	| Backend | p50 | p90 | min | FPS |
	|---|---|---|---|---|
	| MLX (wilor-mlx) | 36 ms | 36 ms | 36 ms | 28 |
	| PyTorch MPS (2.5.0) | 50 ms | 51 ms | 49 ms | 20 |

	MLX is 1.4x faster in pure model compute (36 ms vs 50 ms p50). Both backends received the same deterministic input and were timed over 100 iterations after 30 warmup iterations, using batched timing.
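A warmup-then-measure loop of this shape can be sketched with the standard library alone. This is not the benchmark script used for the table: the function name `benchmark` is hypothetical, and the sketch times each call individually, whereas the exact meaning of "batched timing" is not spelled out here.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 30, iters: int = 100) -> Dict[str, float]:
    """Run `fn` untimed `warmup` times, then time `iters` calls (results in ms)."""
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p50 = statistics.median(samples_ms)
    # quantiles(n=10) returns nine cut points; the last one is the 90th percentile.
    p90 = statistics.quantiles(samples_ms, n=10)[-1]
    fps = 1000.0 / p50 if p50 > 0 else float("inf")
    return {"p50": p50, "p90": p90, "min": samples_ms[0], "fps": fps}


# Stand-in workload so the sketch runs without a model.
stats = benchmark(lambda: sum(range(10_000)), warmup=3, iters=20)
print({name: round(value, 3) for name, value in stats.items()})
```

When timing an MLX model, the callable has to force evaluation, for example `lambda: mx.eval(model(image))`, because MLX builds computations lazily and would otherwise only measure graph construction.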

	The advantage was also reproduced on a lower-bandwidth M2 Pro validation box: across 80 archived hand-positive camera frames, MLX model-call p50/p90/p95 was 252/355/418 ms versus 358/490/571 ms for PyTorch MPS. A reversed-order audit (PyTorch MPS running first) confirmed the result.
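To compare the two machines on the same footing, the speedup can be computed per percentile as MPS latency divided by MLX latency. A small sketch using only the numbers quoted above (variable names are illustrative):

```python
# Latencies in ms as (MLX, PyTorch MPS), copied from the text above.
m4_max_isolated_p50 = (36, 50)
m2_pro = {"p50": (252, 358), "p90": (355, 490), "p95": (418, 571)}


def speedup(mlx_ms: float, mps_ms: float) -> float:
    """How many times faster MLX is than MPS at the same percentile."""
    return mps_ms / mlx_ms


print(f"M4 Max p50: {speedup(*m4_max_isolated_p50):.2f}x")  # 1.39x
for name, pair in m2_pro.items():
    print(f"M2 Pro {name}: {speedup(*pair):.2f}x")  # 1.42x, 1.38x, 1.37x
```

The ratio stays between roughly 1.37x and 1.42x on both machines, even though absolute latency on the M2 Pro is several times higher.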

	### Quantization impact on speed

	| Variant | p50 | FPS | Notes |
	|---|---|---|---|
	| float32 | 36 ms | 28 | Reference |
	| float16 | 36 ms | 28 | Equal ALU throughput on M4 Max |
	| int4 | 37 ms | 27 | Dequant overhead ≈ bandwidth savings |

	On Apple Silicon, float16 and int4 do not improve latency at this model size (210 tokens × 1280 dim): the GPU is compute-overhead-bound, not bandwidth-bound. The int4 variant's only benefit is the smaller download (2.4 GB → 490 MB).
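As a rough sanity check on the file sizes, they can be converted to effective bits per parameter. The sketch below assumes the approximate 610M parameter count listed under Architecture and treats GB and MB as decimal units; both are assumptions, and the helper name is illustrative.

```python
PARAMS = 610e6  # approximate total parameter count (assumed from the Architecture section)


def effective_bits_per_param(file_bytes: float, params: float = PARAMS) -> float:
    """Average storage cost per parameter, in bits."""
    return file_bytes * 8 / params


fp32_bits = effective_bits_per_param(2.4e9)  # float32 file, 2.4 GB
int4_bits = effective_bits_per_param(490e6)  # int4 file, 490 MB
size_ratio = 2.4e9 / 490e6

print(f"float32: {fp32_bits:.1f} bits/param")
print(f"int4:    {int4_bits:.1f} bits/param")
print(f"download ratio: {size_ratio:.1f}x")
```

This gives about 31.5 bits per parameter for float32 and about 6.4 for int4. A value above 4 bits is expected for a quantized file, since group-wise quantization stores scales and offsets alongside the packed weights and some tensors may be left unquantized; which tensors those are is not stated here.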

	## Numerical Accuracy

	Compared against PyTorch WiLoR-mini on identical float32 inputs:

	| Variant | pred_vertices max diff | pred_keypoints_3d max diff |
	|---|---|---|
	| float32 | 0.006 (sub-mm) | 0.006 (sub-mm) |
	| int4 | 0.061 (< 1mm) | 0.059 (< 1mm) |

	Both are within visual tolerance for real-time hand tracking.
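A "max diff" of this kind is the largest absolute elementwise difference between the reference output and the ported output. The following is a sketch of how such a check could be computed with NumPy; it uses synthetic arrays rather than real model outputs, and `max_abs_diff` is a hypothetical helper, not part of wilor-mlx.

```python
import numpy as np


def max_abs_diff(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Largest absolute elementwise difference between two same-shaped arrays."""
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    diff = reference.astype(np.float64) - candidate.astype(np.float64)
    return float(np.max(np.abs(diff)))


# Synthetic stand-ins shaped like pred_vertices, (1, 778, 3).
rng = np.random.default_rng(0)
ref_vertices = rng.standard_normal((1, 778, 3)).astype(np.float32)
port_vertices = ref_vertices.copy()
port_vertices[0, 10, 2] += 0.006  # inject one known deviation

print(f"{max_abs_diff(ref_vertices, port_vertices):.3f}")
```

With real outputs, both arrays would first be converted to NumPy (for example `np.array(result['pred_vertices'])` on the MLX side) before comparison.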

	## Quick Start

	```python
	from wilor_mlx import WiLoR
	import mlx.core as mx
	import numpy as np

	# Everything downloads and caches automatically
	# First run requires torch for one-time MANO conversion; after that, torch is not used
	model = WiLoR.from_pretrained()

	# Placeholder input: replace with your own 256x256 hand crop
	your_256x256_hand_crop = np.zeros((1, 256, 256, 3), dtype=np.uint8)

	# Inference
	image = mx.array(your_256x256_hand_crop) # (1, 256, 256, 3) uint8
	result = model(image)
	mx.eval(result) # MLX is lazy; this forces the computation to run

	keypoints = result['pred_keypoints_3d'] # (1, 21, 3)
	vertices = result['pred_vertices'] # (1, 778, 3)
	```
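The snippet above expects a `(1, 256, 256, 3)` uint8 array. One way to produce that shape from a camera frame and a hand bounding box is sketched below with plain NumPy and nearest-neighbour sampling. The function name, the box format, and the resampling method are all assumptions for illustration; the preprocessing wilor-mlx actually expects (colour order, padding, aspect-ratio handling) should be taken from the repository documentation.

```python
import numpy as np


def crop_to_model_input(frame: np.ndarray, box: tuple, size: int = 256) -> np.ndarray:
    """Crop `box` = (x0, y0, x1, y1) from an (H, W, 3) uint8 frame and
    resample it to (1, size, size, 3) uint8 using nearest-neighbour sampling."""
    x0, y0, x1, y1 = box
    height, width = frame.shape[:2]
    # Evenly spaced sample coordinates inside the box, clamped to the frame.
    xs = np.clip(np.linspace(x0, x1 - 1, size).round().astype(int), 0, width - 1)
    ys = np.clip(np.linspace(y0, y1 - 1, size).round().astype(int), 0, height - 1)
    crop = frame[ys[:, None], xs[None, :]]  # (size, size, 3)
    return crop[None].astype(np.uint8)  # add the batch dimension


# Synthetic 640x480 frame standing in for a camera image.
frame = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
batch = crop_to_model_input(frame, box=(100, 50, 400, 350))
print(batch.shape, batch.dtype)
```

The resulting `batch` can be passed to `mx.array(...)` in place of `your_256x256_hand_crop`.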

	See [github.com/lyonsno/wilor-mlx](https://github.com/lyonsno/wilor-mlx) for full documentation.

	## Architecture

	- ViT-H/16 backbone: 1280 embed dim, 32 layers, 16 heads, 210 tokens (192 patches + 18 learnable)
	- MANO hand model: 778 vertices, 16 joints, Linear Blend Skinning with kinematic chain
	- RefineNet: Multi-scale deconvolution + bilinear grid sampling + MANO parameter refinement
	- Total parameters: ~610M
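The token count in the list above can be checked with a little arithmetic. With 16×16 patches, 192 patches corresponds to a 16×12 patch grid, which is what a 256×192 region would give; that the backbone sees such a region rather than the full 256×256 crop is an assumption made here, not something this card states.

```python
PATCH = 16
EXTRA_TOKENS = 18  # learnable tokens, from the list above


def num_patches(height: int, width: int, patch: int = PATCH) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch) * (width // patch)


full_square = num_patches(256, 256)     # full 256x256 crop
assumed_region = num_patches(256, 192)  # assumed 256x192 region
total_tokens = assumed_region + EXTRA_TOKENS

print(full_square, assumed_region, total_tokens)  # 256 192 210
```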

	## Citation

	```bibtex
	@article{potamias2024wilor,
	title={WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild},
	author={Potamias, Rolandos Alexandros and Zhang, Jinglei and Deng, Jiankang and Zafeiriou, Stefanos},
	year={2024}
	}
	```

	## License

	The wilor-mlx code and these weight files are MIT licensed. The weights contain only ViT backbone, RefineNet, and learned embedding parameters — no MANO data is bundled or rehosted.

	The [MANO hand model](https://mano.is.tue.mpg.de/) is separately licensed by the Max Planck Institute. `WiLoR.from_pretrained()` fetches upstream [WiLoR-mini](https://huggingface.co/warmshao/WiLoR-mini) assets and converts MANO data locally on your machine. You can also supply your own MANO data via `mano_path=...`.