--- license: other license_name: nvidia-license license_link: https://huggingface.co/nvidia/LocateAnything-3B pipeline_tag: object-detection tags: - coreml - vision - object-detection - image-localization - apple-silicon --- # LocateAnything-3B CoreML CoreML packages and a lightweight Python runner for image localization on Apple hardware. Author: devin-lai ## CoreML Inference Performance This CoreML build is tuned for local macOS inference, where fast repeat runs matter more than model startup time. On an M5 Mac with 32GB memory, the optimized CoreML path improves the post-load inference workflow and makes the biggest jump in the decoder stage. Benchmark setup: - Device: macOS M5, 32GB memory - Model: LocateAnything-3B - Input image: 1536x1024 - Categories: person, car - Comparison focus: inference time after model loading | Metric | CoreML Optimized | PyTorch MPS bf16 | Improvement | |---|---:|---:|---:| | Post-load inference time | 11.7s | 12.7s | ~1.1x faster | | Generation time | 7.64s | 12.56s | ~1.6x faster | | Prefill time | 1.72s | 7.97s | ~4.6x faster | | Tokens per second | 17.55 TPS | 10.35 TPS | ~1.7x higher throughput | For anyone running vision-language localization directly on a Mac, the practical win is lower wait time after the packages are loaded. CoreML reduces post-load inference from 12.7s to 11.7s, while the generation path improves from 12.56s to 7.64s. The standout improvement is prefill: 7.97s drops to 1.72s, a roughly 4.6x speedup. Throughput also rises from 10.35 TPS to 17.55 TPS, making local decoding noticeably smoother for repeated image queries. In short, this CoreML version delivers faster local inference, much faster prefill, and higher decoding throughput for macOS users who want the model running close to the metal. ## Contents - `LocateAnything-vision.mlpackage` - image encoder package - `LocateAnything-embed.mlpackage` - token embedding package - `LocateAnything-decoder.mlpackage` - decoder package - `LocateAnything-assets/` - tokenizer and runtime configuration - `run_locateanything_image_coreml.py` - still-image runner - `test.png` - sample input ## Setup ```bash pip install -r requirements.txt ``` ## Example ```bash python run_locateanything_image_coreml.py \ --input test.png \ --categories "person,car" ``` By default, the script writes: - `test.coreml.annotated.png` - `test.coreml.detections.json` ## Notes The packages are configured for the image grid stored in the vision package metadata. Use the bundled assets directory with these packages so token ids and runtime limits stay aligned. The license follows the upstream NVIDIA LocateAnything-3B terms linked in the metadata above.