Spaces: mlx-community / README

Is MTP the missing piece for Apple Silicon LLM inference?

#34

by ak959 - opened Jun 10

Discussion

ak959

Jun 10

edited Jun 10

We've been experimenting with Multi-Token Prediction (MTP) on Apple Silicon and have seen some surprisingly large performance gains with recent models such as Gemma 4.

This got us wondering:

How much of the current inference bottleneck on Apple Silicon is actually due to the decoding process itself?

MLX has already done an excellent job leveraging unified memory and Metal acceleration, but decoding remains one of the most expensive stages for LLM inference. MTP and speculative-style decoding approaches seem promising, especially as models become larger and context windows continue to grow.
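A back-of-envelope sketch of why decoding is costly and what MTP or speculative decoding could buy. It assumes single-stream decoding is memory-bandwidth bound (every generated token streams all weights from memory once) and that each drafted token is accepted independently with a fixed probability. Both are simplifications, the function names are made up for illustration, and the numbers in the demo are illustrative rather than measurements. Draft or MTP-head cost and KV cache reads are ignored.

```python
def decode_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Rough ceiling on single-stream decode speed.

    Assumes each new token requires reading all model weights from memory once,
    so throughput is bounded by bandwidth / model size.
    """
    return bandwidth_bytes_per_s / weight_bytes


def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`, and that the target contributes one token of its own
    per pass (the correction or the bonus token).
    """
    a, k = acceptance_rate, num_draft_tokens
    if a >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - a ** (k + 1)) / (1.0 - a)


if __name__ == "__main__":
    # Illustrative assumptions only: a 16 GB quantized model, 400 GB/s of bandwidth.
    baseline = decode_tokens_per_second(16e9, 400e9)
    gain = expected_tokens_per_step(acceptance_rate=0.8, num_draft_tokens=4)
    print(f"baseline ceiling: {baseline:.1f} tok/s")
    print(f"tokens per target pass with drafting: {gain:.2f}")
    print(f"idealized ceiling with drafting: {baseline * gain:.1f} tok/s")
```

The real gain is lower than the idealized figure because drafting is not free and acceptance rates vary by model and prompt.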

Some questions I'd love to hear opinions on:

Have you experimented with MTP or speculative decoding on MLX?
Which models benefit the most?

Gemma 4
Qwen 3.6
DeepSeek
Llama family

Do you think future Apple Silicon inference stacks should prioritize:

Better quantization
MTP / speculative decoding
KV cache optimization
Multi-GPU / distributed inference
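To put the KV cache option in perspective, here is a minimal sizing sketch. It assumes standard multi-head or grouped-query attention with a full-length cache; architectures with sliding windows or compressed caches will differ. The model shape in the demo is hypothetical, not taken from any of the models listed above.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_element: int = 2,  # 2 bytes for fp16/bf16
    batch_size: int = 1,
) -> int:
    """Size of a full-length KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * bytes_per_element * batch_size
    )


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 32k context, fp16.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)
    print(f"KV cache: {size / 1024**3:.1f} GiB")
```

Since the cache grows linearly with context length, it competes with the weights for unified memory as contexts get longer.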

For those running M3 Ultra or M4 Max systems, what are your current throughput numbers for Gemma 4 or Qwen 3.6?
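To make shared numbers comparable, it helps to report time to first token and decode speed separately. The sketch below is one way to do that for any runtime that can stream tokens. `measure_decode_throughput` and `fake_stream` are hypothetical helpers written for this example, and the fake stream stands in for a real model.

```python
import time
from typing import Callable, Dict, Iterable


def measure_decode_throughput(stream_tokens: Callable[[], Iterable[object]]) -> Dict[str, float]:
    """Time a token stream, separating time-to-first-token from decode speed."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens():
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()

    if first_token_at is None:  # the stream produced nothing
        return {"tokens": 0.0, "time_to_first_token_s": float("nan"), "decode_tokens_per_s": 0.0}

    decode_time = end - first_token_at
    # The first token is attributed to prefill, so only count - 1 tokens are decode work.
    decode_tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {
        "tokens": float(count),
        "time_to_first_token_s": first_token_at - start,
        "decode_tokens_per_s": decode_tps,
    }


def fake_stream():
    """Stand-in for a real runtime: a 'prefill' pause, then 20 evenly spaced tokens."""
    time.sleep(0.05)
    for i in range(20):
        time.sleep(0.002)
        yield i


if __name__ == "__main__":
    print(measure_decode_throughput(fake_stream))
```

To use it with a real runtime, pass a zero-argument callable that returns that runtime's token iterator.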

I'm particularly interested in real-world experiences from people building inference runtimes on top of MLX.

It feels like there is still significant headroom left in Apple Silicon inference that isn't being fully explored yet.

Curious to hear what others are seeing.

usermma

Jun 10

Not everyone is interested in MTP and speculative decoding, because of the perceived loss of model capability...

or at least that's what a lot of people thought, more or less...
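On the capability concern, a toy sketch may help. It assumes greedy decoding with exact verification, where every emitted token is the one the target model itself would pick, so the draft only changes speed. The toy "models" are made-up deterministic functions, not real networks. With sampling, exactness depends on a rejection-sampling acceptance rule that is not shown here, and a particular runtime may use lossy variants.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    num_draft_tokens: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (new tokens, verification steps)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    steps = 0
    while len(tokens) < limit:
        steps += 1
        # 1. The draft proposes a short continuation.
        proposal: List[int] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each position. A real runtime would score all
        #    positions in one batched forward pass; this loop emulates that.
        for guess in proposal:
            correct = target(tokens)
            tokens.append(correct)  # always emit the target's own choice
            if correct != guess or len(tokens) >= limit:
                break  # first mismatch (or length limit) ends this step
        else:
            # Every guess was accepted, so the target adds one bonus token.
            if len(tokens) < limit:
                tokens.append(target(tokens))
    return tokens[len(prompt):], steps


def toy_target(ctx: Sequence[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: Sequence[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 3.
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3, 4]
    reference = greedy_decode(toy_target, prompt, 12)
    fast, steps = speculative_greedy_decode(toy_target, toy_draft, prompt, 12)
    print("identical output:", fast == reference)
    print("target verification steps:", steps, "for", len(fast), "tokens")
```

Under these assumptions the output is identical by construction. Quality differences reported in practice would have to come from something else, such as a lossy acceptance rule or numerical differences in batched verification.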

datagram

MLX Community org Jun 11

Chips are scarce, so from an open-access perspective (the little guy), multi-GPU / distributed inference should be the last priority. Priority should be on 'better quantization' and 'key value caching'. MTP improves AI lifestyle (e.g., better prediction = better thinking and output). If you are an F1000 corp, you can buy anything, so multi-GPU / distributed is the priority, but maybe you are just buying proprietary "frontier" models and their kickers, like OAI for healthcare or Anthropic for mines and minerals. Final answer = all 4.

usermma

Jun 11

Is this like that AI that said:

user: who are you?

assistant: I'm just to help out. How can I assist if anyone doesn't understand what my goal is, and not have an actual purpose on this task so it's easier for me than everyone else in the whole universe that could ever find anything better! So here am: myself (or more likely) or any person of you with a deep heart to answer your questions.

me: why did you say at the end "all 4" and what do you mean by it?

chips are scarce, so from an open access perspective (the little guy), Multi-GPU / distributed inference should be the last priority. Priority should be on 'better quantization' and 'key value caching,' MTP improves AI lifestyle (e.g., better predict = better thinking and output). If you are a F1000 corp, you can buy anything so multi gpu/distributed is the priority, but maybe you are just buying proprietary "frontier" models and their kickers like oai for healthcare or anthropic for mines and minerals. final answer = all 4.

datagram

MLX Community org Jun 11

I just meant that all four matter, but how should the 1-4 the OP asks about be prioritized? ...IMHO it depends on whether you or someone you care about has the resources to buy multiple GPUs, and on whether you consider universal access to compute an important principle. If so, optimize for one machine/resources FIRST.

If not, so be it: optimize for chip clusters and parallelization FIRST, e.g., big-business SOCs or mom-and-pop inference providers in your hometown that stack and rack 10 Mac minis and serve MLX-based inference. In my experience, I don't know that many people with Mac mini racks or multiple Mac Studios in parallel, IRL.

usermma

Jun 11

Yeah, one device is always better. If they can merge them into one, it's much better than running a cluster, because one device is much faster than a lot of devices working together....

ak959

Jun 11

> just meant that all four matter, but prioritize 1-4 the OP asks? ...imho it depends on if you or someone you care about has resources to buy multiple GPUs or not-- if you consider universal access to compute an important principle. if so, optimize for one machine/resources FIRST.
>
> if not, so be it, optimize for chip clusters and parallelization FIRST. e.g., big business SOCs or mom n pop inference providers in your hometown, stack & rack 10 mac minis and serve the mlx-based inference. my experience is i don't know that many peeps with mac mini racks or multiple macstudios in parallel, irl.

ak959 changed discussion status to closed Jun 11

ak959

Jun 11

> not all people interested in MTP and speculative decoding due to loss of capabilities of the model...
>
> or at least thats what alot of people thought more over of...

MLX-LM has not fully utilized the GPU.
Test https://github.com/defai-digital/ax-engine
It is much faster, even in direct mode.

ak959

Jun 11

> chips are scarce, so from an open access perspective (the little guy), Multi-GPU / distributed inference should be the last priority. Priority should be on 'better quantization' and 'key value caching,' MTP improves AI lifestyle (e.g., better predict = better thinking and output). If you are a F1000 corp, you can buy anything so multi gpu/distributed is the priority, but maybe you are just buying proprietary "frontier" models and their kickers like oai for healthcare or anthropic for mines and minerals. final answer = all 4.

Yes, you are correct.
But I think MLX on the Mac is doing something:
https://www.youtube.com/watch?v=wykPErJ8M-8

ak959

Jun 11

> just meant that all four matter, but prioritize 1-4 the OP asks? ...imho it depends on if you or someone you care about has resources to buy multiple GPUs or not-- if you consider universal access to compute an important principle. if so, optimize for one machine/resources FIRST.
>
> if not, so be it, optimize for chip clusters and parallelization FIRST. e.g., big business SOCs or mom n pop inference providers in your hometown, stack & rack 10 mac minis and serve the mlx-based inference. my experience is i don't know that many peeps with mac mini racks or multiple macstudios in parallel, irl.

Thank you for your advice. Can you please share your parallelization experience?

ak959 changed discussion status to open Jun 11
