Request
#14
by mlboydaisuke - opened
We're running Gemma 4 E2B on the iPhone Neural Engine via CoreML at
28 tok/s decode (99.78% of ops on the ANE, verified via MLComputePlan).
MTP (multi-token prediction) heads would let us implement speculative
decoding and push toward 40-50+ tok/s on-device.
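For context on where a 40-50+ tok/s figure could come from, here is a back-of-envelope model of speculative decoding throughput. The acceptance probabilities are illustrative guesses, not measurements, and it ignores draft-head and verification overhead: with k draft tokens each accepted with probability p, every base-model pass yields one token plus the expected accepted prefix of the drafts.

```python
# Back-of-envelope speedup estimate for MTP-based speculative decoding.
# Assumptions (illustrative, not measured): a verification pass costs
# about one ordinary decode step, draft-head cost is negligible, and
# each of the k drafted tokens is accepted independently with
# probability p. Accepted drafts form a prefix, so the expected number
# of tokens per base-model pass is 1 + p + p^2 + ... + p^k.

def expected_tokens_per_pass(k: int, p: float) -> float:
    prefix = sum(p ** i for i in range(1, k + 1))
    return 1.0 + prefix

base_rate = 28.0  # tok/s, the measured decode rate above
for k, p in [(2, 0.5), (3, 0.4)]:
    rate = base_rate * expected_tokens_per_pass(k, p)
    print(f"k={k}, p={p}: ~{rate:.0f} tok/s")
```

With these (assumed) acceptance rates, 2-3 heads land in the 45-50 tok/s range, which is roughly the target above.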
The MTP architecture (lightweight prediction heads on top of the
final hidden state) adds only ~1-3% parameter overhead and is
critical for practical on-device deployment, where every tok/s
matters for UX.
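For anyone unfamiliar with the scheme, here is a minimal draft-and-verify sketch in Python. It assumes, per the description above, that each MTP head is a lightweight projection of the base model's final hidden state; the trunk and all weights are random stand-ins, and a real implementation would verify all drafts in one batched forward pass rather than the sequential loop used here for clarity.

```python
import numpy as np

HIDDEN, VOCAB, NUM_HEADS = 64, 256, 2

rng = np.random.default_rng(0)
W_lm = rng.standard_normal((HIDDEN, VOCAB)) * 0.02             # base LM head
W_mtp = rng.standard_normal((NUM_HEADS, HIDDEN, VOCAB)) * 0.02 # MTP heads

def hidden_state(tokens: list[int]) -> np.ndarray:
    # Deterministic stand-in for the transformer trunk.
    h = np.zeros(HIDDEN)
    for t in tokens:
        h = np.tanh(h + np.eye(HIDDEN)[t % HIDDEN])
    return h

def greedy(logits: np.ndarray) -> int:
    return int(np.argmax(logits))

def speculative_step(tokens: list[int]) -> list[int]:
    # 1. One base pass: next token + hidden state for the draft heads.
    h = hidden_state(tokens)
    t0 = greedy(h @ W_lm)
    # 2. Draft NUM_HEADS future tokens from the same hidden state.
    drafts = [greedy(h @ W_mtp[i]) for i in range(NUM_HEADS)]
    # 3. Verify: accept the longest draft prefix that matches the base
    #    model's own greedy choices (sequential here; batched in practice).
    accepted = [t0]
    ctx = tokens + [t0]
    for d in drafts:
        if greedy(hidden_state(ctx) @ W_lm) != d:
            break
        accepted.append(d)
        ctx.append(d)
    return accepted

print(speculative_step([1, 2, 3]))
```

Because rejected drafts fall back to the base model's token, the output is identical to plain greedy decoding; the heads only change how many tokens each base pass yields.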
What we need
- The trained MTP head weights (however many heads were used)
- Any associated config (number of heads, architecture details)
Google's LiteRT deployment uses these heads internally, so they
exist; they're just not in the public release. Releasing them would
significantly benefit the on-device inference community.
Reference
- Our project: https://github.com/john-rocky/CoreML-LLM
- iPhone ANE deployment: 28 tok/s, ~1GB memory, 99.78% ANE