---
tags:
- specprefill
- llm-inference
- apple-silicon
- mlx
- sparse-prefill
- ttft-optimization
license: mit
---
# SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon
Author: David Green (@Thump604)
Paper: specprefill.pdf | Source: specprefill.tex
Related: vllm-mlx PR #180 | Issue #179
## Abstract
Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B (MoE, 10B active parameters) takes 7 minutes before the first token appears.
SpecPrefill uses a small draft model to score prompt token importance via attention patterns, then sparse-prefills only the selected tokens into the target model while preserving original positional encoding via manual RoPE. On Apple Silicon's unified memory, draft scoring incurs zero data-movement cost, reducing the system to a pure FLOP-ratio problem.
On M2 Ultra (128 GB unified memory), SpecPrefill with a 2B draft model reduces TTFT by 3.71–5.45× across 8K–128K tokens on Qwen3.5-122B, cutting 128K prefill from 19.3 minutes to 3.5 minutes. Cross-architecture validation on Nemotron-H 120B (Mamba-2/Attention hybrid, 2.10–2.19×) and GPT-OSS 120B (sliding-window MoE, 1.24–1.28×) confirms the ratio thesis: the draft-to-target FLOP ratio is the dominant predictor of speedup on unified memory.
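The draft-scoring and selection step described above can be sketched as follows. This is a minimal, self-contained illustration of the idea, not the repository's implementation: the scoring rule (mean attention mass received from the last few query positions), the function names, and the toy shapes are all assumptions.

```python
import random

def score_tokens(attn, query_window=64):
    """Score each prompt position by the mean attention it receives from the
    last `query_window` draft-model queries, averaged over heads.
    attn[h][q][k] holds draft attention probabilities (assumed scoring rule)."""
    heads, q_len, k_len = len(attn), len(attn[0]), len(attn[0][0])
    w = min(query_window, q_len)
    return [
        sum(attn[h][q][k] for h in range(heads) for q in range(q_len - w, q_len))
        / (heads * w)
        for k in range(k_len)
    ]

def select_tokens(scores, keep_rate=0.2):
    """Keep the top `keep_rate` fraction of tokens, returned in ascending
    *original* prompt order, so the target model can apply RoPE at each
    token's original position index rather than a compacted one."""
    k = max(1, int(len(scores) * keep_rate))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy example: 2 heads, 10-token prompt, row-normalized random "attention".
random.seed(0)
heads, n = 2, 10
attn = []
for _ in range(heads):
    rows = []
    for _ in range(n):
        row = [random.random() for _ in range(n)]
        s = sum(row)
        rows.append([x / s for x in row])
    attn.append(rows)

scores = score_tokens(attn, query_window=4)
kept = select_tokens(scores, keep_rate=0.3)
print(kept)  # 3 ascending original position indices
```

Returning original position indices (rather than re-indexing the kept tokens from zero) is what lets the target model's RoPE see the tokens at their true offsets, as the abstract's "manual RoPE" step requires.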
## Key Results
| Model | Draft | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|
| Qwen3.5-122B (MoE) | 2B | 3.71× | 4.11× | 4.23× | 4.50× | 5.45× |
| Qwen3.5-35B (MoE) | 4B | 1.81× | 1.86× | 1.85× | 1.84× | — |
| Nemotron-H 120B (hybrid) | Nano-4B | 2.10× | 2.17× | 2.19× | 2.19× | — |
| GPT-OSS 120B (MoE) | 20B | 1.24× | 1.28× | — | — | — |
All measurements use a 20% keep rate on an M2 Ultra (128 GB). Qwen3.5-122B numbers are means over 5 trials.
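A back-of-envelope version of the FLOP-ratio thesis: the draft pays a full prefill, the target only a sparse one, so to first order speedup ≈ 1 / (r + keep), where r is the draft-to-target per-token FLOP ratio. This linear model is an illustration, not the paper's formula; it ignores attention's quadratic term, which sparse prefill also shrinks, so measured long-context speedups can exceed it.

```python
def predicted_speedup(draft_flops_per_tok, target_flops_per_tok, keep_rate):
    """First-order model: total cost = draft full prefill (r * C) plus
    target sparse prefill (keep_rate * C), versus baseline cost C.
    Ignores the quadratic attention term, so it underestimates gains
    at long context."""
    r = draft_flops_per_tok / target_flops_per_tok
    return 1.0 / (r + keep_rate)

# Rough proxy: ~2B draft vs ~10B active target parameters, 20% keep rate.
print(round(predicted_speedup(2, 10, 0.2), 2))  # 2.5
```

Under this crude model the Qwen3.5-122B pairing predicts ~2.5×; the measured 3.71–5.45× (growing with context length) is consistent with the extra savings from the quadratic attention term at long context.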
## Citation
If you use this work, please cite:
```bibtex
@misc{green2026specprefill,
  title={SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon},
  author={David Green},
  year={2026},
  doi={10.5281/zenodo.19120919},
  url={https://doi.org/10.5281/zenodo.19120919}
}
```