ONNX/TFLite — the mobile inference formats

#30

by 3morixd - opened 5 days ago

We test models in both GGUF (llama.cpp) and ONNX/TFLite formats on our phone farm.

Findings: ONNX Runtime is faster for small models (<500M parameters) on Snapdragon, while GGUF/llama.cpp is better for larger models (1B+ parameters) because of memory-mapped loading.
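
To illustrate what memory-mapped loading buys you, here is a minimal sketch using only Python's standard `mmap` module. It is not llama.cpp's loader; the helper names `make_dummy_weights` and `read_slice_mmap` are hypothetical, and the dummy file stands in for a real weights file. The point is that a mapped file lets the OS page in only the bytes you touch, rather than copying the whole file into memory first.

```python
import mmap
import os
import tempfile


def make_dummy_weights(size_bytes: int) -> str:
    """Write a placeholder 'weights' file (hypothetical stand-in for a model file)."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        # Repeating 0..255 pattern so any slice has a predictable value.
        f.write(bytes(i % 256 for i in range(size_bytes)))
    return path


def read_slice_mmap(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at `offset` via a read-only memory map.

    Only the pages backing the requested range need to be resident;
    the rest of the file is never read into process memory by this call.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


if __name__ == "__main__":
    weights_path = make_dummy_weights(4096)
    try:
        print(read_slice_mmap(weights_path, 256, 4))
    finally:
        os.remove(weights_path)
```

With a multi-gigabyte model the same idea means startup does not have to wait for a full read of the file, which is one plausible reason the larger models fare better in GGUF.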

The choice of format matters as much as the choice of model. We benchmark both at Dispatch AI.
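
As a rough way to encode the findings above, here is a small sketch of a format picker. The function name `suggest_runtime` and the returned labels are hypothetical, and the thresholds are just the ones quoted in this post for Snapdragon. The post does not say what happens between 500M and 1B parameters, so the sketch returns "benchmark-both" for that range rather than guessing.

```python
def suggest_runtime(param_count: int) -> str:
    """Suggest a starting runtime from model size (thresholds taken from this post).

    Returns one of the hypothetical labels "onnxruntime", "llama.cpp",
    or "benchmark-both". This is a rule of thumb, not a measured result.
    """
    if param_count <= 0:
        raise ValueError("param_count must be positive")
    if param_count < 500_000_000:
        # Small models: ONNX Runtime was faster in the reported tests.
        return "onnxruntime"
    if param_count >= 1_000_000_000:
        # Larger models: GGUF/llama.cpp benefited from memory-mapped loading.
        return "llama.cpp"
    # 500M to 1B is not covered by the findings, so measure on the target device.
    return "benchmark-both"


if __name__ == "__main__":
    for n in (135_000_000, 700_000_000, 3_000_000_000):
        print(n, suggest_runtime(n))
```

Treat the output as a first guess only; the actual winner depends on the device, the quantization, and the model architecture.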

Dispatch AI (FZE), Sharjah UAE
