ONNX/TFLite: the mobile inference formats

#30
by 3morixd - opened

We test models in both GGUF (llama.cpp) and ONNX/TFLite formats on our phone farm.

Findings: ONNX Runtime is faster for small models (<500M parameters) on Snapdragon, while GGUF/llama.cpp works better for larger models (1B+ parameters) thanks to memory-mapped loading.
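As a rough illustration, the rule of thumb above could be written down like this. It is only a sketch: the function name `suggest_format` and the constants are made up for this example, the cut-offs simply mirror the <500M and 1B+ figures quoted above, and the post reports nothing for the range in between, so the sketch returns "benchmark-both" there.

```python
from typing import Literal

Choice = Literal["onnx", "gguf", "benchmark-both"]

# Thresholds taken from the findings above; names are hypothetical.
SMALL_MAX_PARAMS = 500_000_000    # below this: ONNX Runtime was faster
LARGE_MIN_PARAMS = 1_000_000_000  # at or above this: GGUF/llama.cpp did better


def suggest_format(num_params: int) -> Choice:
    """Return a starting-point format for a model with num_params parameters."""
    if num_params < 0:
        raise ValueError("num_params must be non-negative")
    if num_params < SMALL_MAX_PARAMS:
        return "onnx"
    if num_params >= LARGE_MIN_PARAMS:
        return "gguf"
    # The post gives no result for 500M to 1B, so measure instead of guessing.
    return "benchmark-both"
```

For example, `suggest_format(350_000_000)` gives `"onnx"` and `suggest_format(3_000_000_000)` gives `"gguf"`. Treat the result as a first guess to confirm on the target device, not as a substitute for measuring.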

The choice of format matters as much as the choice of model. We benchmark both at dispatchAI.
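For anyone who wants to compare formats on their own device, a minimal timing harness might look like the sketch below. This is not our actual benchmark code; `median_latency_ms` and its parameters are hypothetical names, and `run_once` stands for whatever callable wraps a single inference in the runtime you are testing.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(
    run_once: Callable[[], object],
    warmup: int = 3,
    iters: int = 20,
) -> float:
    """Time run_once() and return the median latency in milliseconds."""
    if iters < 1:
        raise ValueError("iters must be at least 1")

    # Warm-up calls are not timed, so one-off setup cost is excluded.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(samples)
```

Run it once per format with the same prompt and the same device state, then compare the two medians. Phones throttle under heat, so it helps to leave a pause between runs.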

  • Dispatch AI (FZE), Sharjah UAE
