Is MTP the missing piece for Apple Silicon LLM inference?

#34
by ak959 - opened

We've been experimenting with Multi-Token Prediction (MTP) on Apple Silicon and have seen some surprisingly large performance gains with recent models such as Gemma 4.

This got us wondering:

How much of the current inference bottleneck on Apple Silicon is actually due to the decoding process itself?

MLX already does an excellent job of leveraging unified memory and Metal acceleration, but autoregressive decoding, which produces one token per forward pass, remains one of the most expensive stages of LLM inference. MTP and speculative-style decoding, which aim to produce several tokens per pass, seem promising, especially as models become larger and context windows continue to grow.
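
To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding in plain Python. It does not use MLX or any real model: `toy_target`, `toy_draft`, and the `NextToken` type are hypothetical stand-ins, and the verification loop stands in for what a real runtime would do in a single batched forward pass of the large model.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The target model checks each proposal. A real runtime would score
    #    all k positions in one forward pass; this loop is only illustrative.
    accepted: List[int] = []
    ctx = list(tokens)
    for t in proposed:
        expected = target(ctx)
        if t != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Generate max_new tokens using repeated speculative steps."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
    return tokens[: len(prompt) + max_new]


def greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time greedy decoding, for comparison."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except at every third context length.
    return 0 if len(ctx) % 3 == 0 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [1, 2], max_new=7))
    print(greedy(toy_target, [1, 2], max_new=7))
```

With exact greedy verification like this, the output matches what the target model would have produced on its own; the draft only changes how many target passes are needed, not which tokens come out. Sampling-based variants need a different acceptance rule, which this sketch does not cover.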

Some questions I'd love to hear opinions on:

Have you experimented with MTP or speculative decoding on MLX?
Which models benefit the most?

  • Gemma 4
  • Qwen 3.6
  • DeepSeek
  • Llama family

Do you think future Apple Silicon inference stacks should prioritize:

  • Better quantization
  • MTP / speculative decoding
  • KV cache optimization
  • Multi-GPU / distributed inference

For those running M3 Ultra or M4 Max systems, what are your current throughput numbers for Gemma 4 or Qwen 3.6?
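
To keep reported numbers comparable, it may help to separate time-to-first-token (dominated by prompt processing) from the steady decode rate. Below is a small sketch of a timing helper, assuming your runtime exposes generation as any Python iterable that yields one token at a time; `measure_throughput` and its `clock` parameter are hypothetical names, not part of MLX.

```python
import time
from typing import Callable, Dict, Iterable


def measure_throughput(token_stream: Iterable,
                       clock: Callable[[], float] = time.perf_counter
                       ) -> Dict[str, float]:
    """Time a token stream and report time-to-first-token and decode tokens/s."""
    start = clock()
    first = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        count += 1
    end = clock()

    if count == 0:
        raise ValueError("token stream produced no tokens")

    decode_time = end - first
    # Decode rate excludes the first token, whose latency includes the prompt.
    decode_tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {
        "tokens": float(count),
        "time_to_first_token_s": first - start,
        "decode_tokens_per_s": decode_tps,
    }


if __name__ == "__main__":
    def fake_stream(n: int = 20, delay: float = 0.01):
        # Stand-in for a real generator; sleeps to simulate per-token latency.
        for i in range(n):
            time.sleep(delay)
            yield i

    print(measure_throughput(fake_stream()))
```

Reporting the model, quantization, prompt length, and whether MTP or speculative decoding was enabled alongside these two numbers would make results easier to compare across machines.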

I'm particularly interested in real-world experiences from people building inference runtimes on top of MLX.

It feels like there is still significant headroom left in Apple Silicon inference that isn't being fully explored yet.

Curious to hear what others are seeing.

MLX Community org

Not everyone is interested in MTP and speculative decoding, because of concerns that the model loses some of its capabilities...

Or at least that's what a lot of people have thought, more or less...
