Request

#14
by mlboydaisuke

We're running Gemma 4 E2B on iPhone Neural Engine via CoreML at
28 tok/s decode (99.78% of ops on ANE, verified via MLComputePlan).
MTP heads would let us implement speculative decoding and push
toward 40-50+ tok/s on-device.
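
For context, the decode loop we have in mind is standard self-speculative decoding: the MTP heads cheaply draft k tokens, and the base model verifies them, keeping the longest correct prefix. A minimal sketch with toy stand-in functions (nothing here is a Gemma or CoreML API):

```python
# Toy sketch of MTP-style speculative decoding. `base_forward` stands in
# for the full model (one verified token per call) and `mtp_draft` for
# the lightweight MTP heads (k cheap guesses per call). Both operate on
# integer "tokens"; in a real implementation the verification of all k
# drafts is one batched forward pass, which is where the speedup comes from.

def base_forward(ctx):
    # Fake "full model": next token is (last token + 1) mod 100.
    return (ctx[-1] + 1) % 100

def mtp_draft(ctx, k):
    # Fake "MTP heads": agree with the base model except at every 7th
    # context length, to exercise the rejection path.
    out, c = [], list(ctx)
    for _ in range(k):
        t = base_forward(c)
        if len(c) % 7 == 0:
            t = (t + 3) % 100  # deliberately wrong draft
        out.append(t)
        c.append(t)
    return out

def speculative_decode(ctx, n_new, k=4):
    ctx = list(ctx)
    produced = 0
    while produced < n_new:
        for t in mtp_draft(ctx, k):
            if produced >= n_new:
                break
            verified = base_forward(ctx)
            produced += 1
            if t == verified:
                ctx.append(t)         # draft accepted
            else:
                ctx.append(verified)  # draft rejected: keep verified token
                break                 # re-draft from the corrected context
    return ctx

print(speculative_decode([1], 10))  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Each accepted draft saves a full-model decode step, so throughput scales with the heads' acceptance rate, which is why the trained weights (rather than freshly initialized heads) matter.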

The MTP architecture (lightweight prediction heads on top of the
final hidden state) adds only ~1-3% parameter overhead and is
critical for practical on-device deployment, where every tok/s
matters for UX.
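
To make the overhead figure concrete, here is a back-of-envelope calculation under assumed dimensions. The hidden size, total parameter count, and per-head structure below are illustrative guesses, not confirmed Gemma numbers; the one structural assumption that seems forced is that the heads share the base model's unembedding:

```python
# Back-of-envelope for the ~1-3% overhead claim. All numbers are
# assumptions: a ~2e9-parameter ("E2B"-class) model with hidden size
# 2048, where each MTP head is a lightweight projection block of
# roughly 4*h^2 parameters. The unembedding must be shared with the
# base model: an unshared large-vocab unembedding alone would blow
# past a 1-3% budget.
hidden = 2048
total_params = 2.0e9
per_head = 4 * hidden ** 2  # ~16.8M params per head (assumed shape)
for n_heads in (1, 2, 4):
    print(f"{n_heads} head(s): {n_heads * per_head / total_params:.1%}")
```

Under these assumptions, 1-4 heads land at roughly 0.8-3.4% of total parameters, consistent with the ~1-3% range quoted above.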

What we need

  • The trained MTP head weights (however many heads were used)
  • Any associated config (number of heads, architecture details)

Google's LiteRT deployment uses these heads internally, so they
exist; they just aren't in the public release. Releasing them would
significantly benefit the on-device inference community.
