| license: mit | |
| library_name: mlx | |
| tags: | |
| - mlx | |
| - hand-pose-estimation | |
| - apple-silicon | |
| - wilor | |
| - mano | |
| datasets: | |
| - hand-pose | |
| base_model: warmshao/WiLoR-mini | |
| # WiLoR-MLX: Hand Pose Estimation on Apple Silicon | |
| MLX port of [WiLoR-mini](https://github.com/warmshao/WiLoR-mini) for native Apple Silicon inference. Complete pipeline: ViT-H/16 backbone + MANO hand model + RefineNet refinement. | |
| **Code:** [github.com/lyonsno/wilor-mlx](https://github.com/lyonsno/wilor-mlx) | |
| ## Available Weights | |
| | Variant | File | Size | Precision | Notes | | |
| |---|---|---|---|---| | |
| | **float32** | `wilor-mlx.safetensors` | 2.4 GB | Full | Reference quality, recommended | | |
| | **int4** | `wilor-mlx-int4.safetensors` | 490 MB | 4-bit quantized | 5x smaller download, same speed | | |
| Both variants produce near-identical inference speed on Apple Silicon (see benchmarks below). Choose based on download size and precision needs. | |
| These weights contain only ViT backbone, RefineNet, and learned embedding parameters; no MANO data is bundled or rehosted. `WiLoR.from_pretrained()` handles MANO automatically by fetching upstream [WiLoR-mini](https://huggingface.co/warmshao/WiLoR-mini) assets and converting them locally on your machine. The [MANO hand model](https://mano.is.tue.mpg.de/) is separately licensed by the Max Planck Institute. | |
| ## Performance | |
| **Apple M4 Max, single-image (1×256×256×3), float32:** | |
| ### Stable live sidecar window (embedded in Perceptasia hand tracking) | |
| | Backend | Model p50 | Model p90 | Model p95 | Model p99 | | |
| |---|---|---|---|---| | |
| | **MLX (wilor-mlx)** | **~61 ms** | **~62 ms** | **~63 ms** | **~66 ms** | | |
| | PyTorch MPS (2.5.0) | ~85 ms | ~144 ms | ~238 ms | ~427 ms | | |
| **Flat ~61 ms with virtually no tail:** only about 8% spread from p50 to p99. The MLX numbers cover 500 consecutive frames during stable operation; the MPS numbers come from a 102K-frame manifest history. Live numbers are from [Perceptasia](https://github.com/lyonsno/perceptasia). | |
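The spread figure can be checked directly from the table above. The following is a minimal sketch; the helper name `tail_spread` is illustrative and not part of wilor-mlx, and the inputs are the approximate table values.

```python
def tail_spread(p50_ms: float, p99_ms: float) -> float:
    """Relative gap between median and p99 latency, as a fraction of the median."""
    return (p99_ms - p50_ms) / p50_ms

# Approximate values from the live sidecar table
mlx_spread = tail_spread(61, 66)
mps_spread = tail_spread(85, 427)

print(f"MLX p50 to p99 spread: {mlx_spread:.1%}")
print(f"MPS p50 to p99 spread: {mps_spread:.1%}")
```

This gives roughly 8% for MLX and roughly 400% for PyTorch MPS. The two rows are drawn from windows of different sizes, so the comparison is indicative rather than controlled.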
| ### Isolated model benchmark | |
| | Backend | p50 | p90 | min | FPS | | |
| |---|---|---|---|---| | |
| | **MLX (wilor-mlx)** | **36 ms** | **36 ms** | **36 ms** | **28** | | |
| | PyTorch MPS (2.5.0) | 50 ms | 51 ms | 49 ms | 20 | | |
| MLX is about 1.4x faster in pure model compute (36 ms vs. 50 ms p50). Both backends used the same deterministic input, with 100 timed iterations after 30 warmup iterations and batched timing. | |
| The advantage was also reproduced on a lower-bandwidth M2 Pro validation box: across 80 archived hand-positive camera frames, MLX model-call p50/p90/p95 was 252/355/418 ms versus 358/490/571 ms for PyTorch MPS. A reversed-order audit (PyTorch MPS running first) confirmed the result. | |
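The isolated protocol (30 warmup iterations, then 100 timed iterations) can be approximated with a small harness. This is a sketch only, not the script used for the numbers above: the `benchmark` helper is hypothetical, it times each call individually rather than in batches, and it uses a nearest-rank percentile.

```python
import time

def benchmark(fn, warmup: int = 30, iters: int = 100) -> dict:
    """Time fn() and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup runs are discarded
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()

    def pct(q: int) -> float:
        # Nearest-rank percentile on the sorted samples (integer arithmetic)
        rank = (q * len(samples) + 99) // 100
        return samples[max(rank - 1, 0)]

    p50 = pct(50)
    return {"p50": p50, "p90": pct(90), "min": samples[0], "fps": 1000.0 / p50}

# Stand-in workload. With MLX, fn must force evaluation (for example by calling
# mx.eval on the result), because MLX arrays are computed lazily.
stats = benchmark(lambda: sum(range(10_000)), warmup=3, iters=20)
print(stats)
```

The FPS column in the table follows the same conversion: 1000 / 36 ms is about 28 and 1000 / 50 ms is 20.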
| ### Quantization impact on speed | |
| | Variant | p50 | FPS | Notes | | |
| |---|---|---|---| | |
| | float32 | 36 ms | 28 | Reference | | |
| | float16 | 36 ms | 28 | Equal ALU throughput on M4 Max | | |
| | int4 | 37 ms | 27 | Dequant overhead ≈ bandwidth savings | |
| On Apple Silicon, float16 and int4 do not improve latency for this model size (210 tokens × 1280 dim). The GPU is compute-overhead-bound, not bandwidth-bound. Int4's value is purely download size reduction (2.4 GB → 490 MB). | |
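The size figures can be sanity-checked with simple arithmetic. The sketch below assumes decimal units (1 GB = 1e9 bytes) and uses the approximate parameter count from the Architecture section.

```python
FLOAT32_BYTES = 2.4e9   # 2.4 GB, from the weights table
INT4_BYTES = 490e6      # 490 MB, from the weights table
PARAMS = 610e6          # ~610M parameters, from the Architecture section

ratio = FLOAT32_BYTES / INT4_BYTES
bytes_per_param_fp32 = FLOAT32_BYTES / PARAMS
bits_per_param_int4 = INT4_BYTES * 8 / PARAMS

print(f"download ratio: {ratio:.1f}x")
print(f"float32: {bytes_per_param_fp32:.2f} bytes/param")
print(f"int4 file: {bits_per_param_int4:.1f} bits/param")
```

The ratio comes out near 4.9x, matching the "5x smaller" note. The int4 file averages more than 4 bits per parameter. One plausible explanation, which is an assumption rather than something stated here, is that quantization scales and any unquantized layers are stored at higher precision.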
| ## Numerical Accuracy | |
| Compared against PyTorch WiLoR-mini on identical float32 inputs: | |
| | Variant | pred_vertices max diff | pred_keypoints_3d max diff | | |
| |---|---|---| | |
| | float32 | 0.006 (sub-mm) | 0.006 (sub-mm) | | |
| | int4 | 0.061 (< 1mm) | 0.059 (< 1mm) | | |
| Both are within visual tolerance for real-time hand tracking. | |
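A "max diff" of this kind is typically the largest element-wise absolute difference between the two backends' outputs. The sketch below shows that metric on synthetic arrays; `max_abs_diff` is a hypothetical helper, and the arrays are stand-ins rather than real model outputs.

```python
import numpy as np

def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Largest element-wise absolute difference between two same-shaped arrays."""
    if a.shape != b.shape:
        raise ValueError("outputs must have matching shapes")
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))

# Synthetic stand-ins for pred_vertices from two backends, shape (1, 778, 3)
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 778, 3)).astype(np.float32)
candidate = reference.copy()
candidate[0, 10, 2] += 0.006  # inject one known deviation

diff = max_abs_diff(reference, candidate)
print(f"max abs diff: {diff:.4f}")
```

With real outputs, `reference` would come from PyTorch WiLoR-mini and `candidate` from wilor-mlx (converted with `np.array(...)`), both run on the same input.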
| ## Quick Start | |
| ```python | |
| from wilor_mlx import WiLoR | |
| import mlx.core as mx | |
| # Everything downloads and caches automatically | |
| # First run requires torch for one-time MANO conversion; after that, torch is not used | |
| model = WiLoR.from_pretrained() | |
| # Inference | |
| image = mx.array(your_256x256_hand_crop) # (1, 256, 256, 3) uint8 | |
| result = model(image) | |
| mx.eval(result) | |
| keypoints = result['pred_keypoints_3d'] # (1, 21, 3) | |
| vertices = result['pred_vertices'] # (1, 778, 3) | |
| ``` | |
| See [github.com/lyonsno/wilor-mlx](https://github.com/lyonsno/wilor-mlx) for full documentation. | |
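The quick start expects a `(1, 256, 256, 3)` uint8 array but leaves the cropping step to the caller. The following sketch shows one way to shape an arbitrary hand crop into that layout using only NumPy. The `to_model_input` helper, the black padding, and the nearest-neighbor resize are all assumptions for illustration; the upstream pipeline may crop and resize differently.

```python
import numpy as np

def to_model_input(crop: np.ndarray, size: int = 256) -> np.ndarray:
    """Pad an (H, W, 3) uint8 crop to square, resize to size x size, add a batch axis."""
    h, w, _ = crop.shape
    side = max(h, w)
    # Center the crop on a black square canvas so the aspect ratio is preserved
    canvas = np.zeros((side, side, 3), dtype=np.uint8)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = crop
    # Nearest-neighbor resize via index lookup (keeps the sketch dependency-free)
    idx = np.arange(size) * side // size
    resized = canvas[idx][:, idx]
    return resized[None, ...]  # (1, size, size, 3)

crop = np.full((120, 90, 3), 200, dtype=np.uint8)  # synthetic stand-in for a hand crop
batch = to_model_input(crop)
print(batch.shape, batch.dtype)
```

The result can then be wrapped with `mx.array(batch)` and passed to the model as in the quick start.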
| ## Architecture | |
| - **ViT-H/16 backbone:** 1280 embed dim, 32 layers, 16 heads, 210 tokens (192 patches + 18 learnable) | |
| - **MANO hand model:** 778 vertices, 16 joints, Linear Blend Skinning with kinematic chain | |
| - **RefineNet:** Multi-scale deconvolution + bilinear grid sampling + MANO parameter refinement | |
| - **Total parameters:** ~610M | |
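The token and head dimensions listed above follow from simple arithmetic. In the sketch below, the 16×12 patch grid is an assumption: narrowing the 256×256 crop to 256×192 before patch embedding is one way to arrive at 192 patches, but that step is not stated here.

```python
PATCH = 16
EMBED_DIM = 1280
HEADS = 16
LEARNABLE_TOKENS = 18

# Assumption: the 256x256 crop is narrowed to 256x192 before patch embedding
grid_h, grid_w = 256 // PATCH, 192 // PATCH
patches = grid_h * grid_w
tokens = patches + LEARNABLE_TOKENS
head_dim = EMBED_DIM // HEADS

print(patches, tokens, head_dim)
```

Under that assumption the sequence has 192 patch tokens plus 18 learnable tokens, for 210 in total, and each of the 16 attention heads works on 80 dimensions.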
| ## Citation | |
| ```bibtex | |
| @article{potamias2024wilor, | |
| title={WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild}, | |
| author={Potamias, Rolandos Alexandros and others}, | |
| year={2024} | |
| } | |
| ``` | |
| ## License | |
| The wilor-mlx code and these weight files are MIT licensed. The weights contain only ViT backbone, RefineNet, and learned embedding parameters; no MANO data is bundled or rehosted. | |
| The [MANO hand model](https://mano.is.tue.mpg.de/) is separately licensed by the Max Planck Institute. `WiLoR.from_pretrained()` fetches upstream [WiLoR-mini](https://huggingface.co/warmshao/WiLoR-mini) assets and converts MANO data locally on your machine. You can also supply your own MANO data via `mano_path=...`. | |