Request
#14
by mlboydaisuke - opened
We're running Gemma 4 E2B on the iPhone Neural Engine via CoreML at
28 tok/s decode (99.78% of ops on the ANE, verified via MLComputePlan).
MTP (multi-token prediction) heads would let us implement speculative
decoding and push toward 40-50+ tok/s on-device.
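For context on where a 40-50+ tok/s figure could come from, here is a back-of-envelope model of speculative decoding throughput. The acceptance probabilities are illustrative guesses, not measurements, and it ignores draft-head and verification overhead: with k draft tokens each accepted with probability p, every base-model pass yields one token plus the expected accepted prefix of the drafts.

```python
# Back-of-envelope speedup estimate for MTP-based speculative decoding.
# Assumptions (illustrative, not measured): a verification pass costs
# about one ordinary decode step, draft-head cost is negligible, and
# each of the k drafted tokens is accepted independently with
# probability p. Accepted drafts form a prefix, so the expected number
# of tokens per base-model pass is 1 + p + p^2 + ... + p^k.

def expected_tokens_per_pass(k: int, p: float) -> float:
    prefix = sum(p ** i for i in range(1, k + 1))
    return 1.0 + prefix

base_rate = 28.0  # tok/s, the measured decode rate above
for k, p in [(2, 0.5), (3, 0.4)]:
    rate = base_rate * expected_tokens_per_pass(k, p)
    print(f"k={k}, p={p}: ~{rate:.0f} tok/s")
```

With these (assumed) acceptance rates, 2-3 heads land in the 45-50 tok/s range, which is roughly the target above.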
The MTP architecture (lightweight prediction heads on top of the
final hidden state) adds only ~1-3% parameter overhead and is
critical for practical on-device deployment, where every tok/s
matters for UX.
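For anyone unfamiliar with the scheme, here is a minimal draft-and-verify sketch in Python. It assumes, per the description above, that each MTP head is a lightweight projection of the base model's final hidden state; the trunk and all weights are random stand-ins, and a real implementation would verify all drafts in one batched forward pass rather than the sequential loop used here for clarity.

```python
import numpy as np

HIDDEN, VOCAB, NUM_HEADS = 64, 256, 2

rng = np.random.default_rng(0)
W_lm = rng.standard_normal((HIDDEN, VOCAB)) * 0.02             # base LM head
W_mtp = rng.standard_normal((NUM_HEADS, HIDDEN, VOCAB)) * 0.02 # MTP heads

def hidden_state(tokens: list[int]) -> np.ndarray:
    # Deterministic stand-in for the transformer trunk.
    h = np.zeros(HIDDEN)
    for t in tokens:
        h = np.tanh(h + np.eye(HIDDEN)[t % HIDDEN])
    return h

def greedy(logits: np.ndarray) -> int:
    return int(np.argmax(logits))

def speculative_step(tokens: list[int]) -> list[int]:
    # 1. One base pass: next token + hidden state for the draft heads.
    h = hidden_state(tokens)
    t0 = greedy(h @ W_lm)
    # 2. Draft NUM_HEADS future tokens from the same hidden state.
    drafts = [greedy(h @ W_mtp[i]) for i in range(NUM_HEADS)]
    # 3. Verify: accept the longest draft prefix that matches the base
    #    model's own greedy choices (sequential here; batched in practice).
    accepted = [t0]
    ctx = tokens + [t0]
    for d in drafts:
        if greedy(hidden_state(ctx) @ W_lm) != d:
            break
        accepted.append(d)
        ctx.append(d)
    return accepted

print(speculative_step([1, 2, 3]))
```

Because rejected drafts fall back to the base model's token, the output is identical to plain greedy decoding; the heads only change how many tokens each base pass yields.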
What we need
- The trained MTP head weights (however many heads were used)
- Any associated config (number of heads, architecture details)
Google's LiteRT deployment uses these heads internally, so they
exist; they're just not in the public release. Releasing them would
significantly benefit the on-device inference community.
Reference
- Our project: https://github.com/john-rocky/CoreML-LLM
- iPhone ANE deployment: 28 tok/s, ~1GB memory, 99.78% ANE