SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon

Author: David Green (@Thump604)

Paper: specprefill.pdf | Source: specprefill.tex

DOI: 10.5281/zenodo.19120919

Related: vllm-mlx PR #180 | Issue #179

Abstract

Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B (MoE, 10B active parameters) takes 7 minutes before the first token appears.

SpecPrefill uses a small draft model to score prompt token importance via attention patterns, then sparse-prefills only the selected tokens into the target model while preserving original positional encoding via manual RoPE. On Apple Silicon's unified memory, draft scoring incurs zero data-movement cost, reducing the system to a pure FLOP-ratio problem.
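The selection-and-reposition step can be sketched as follows. This is a hedged illustration under assumed details, not the paper's implementation: the mean-attention scoring rule and the helper names `select_important_tokens` and `rope_angles` are hypothetical stand-ins.

```python
import numpy as np

def select_important_tokens(attn, keep_rate=0.2):
    """Score each prompt token by the attention mass it receives in the
    draft model (here: averaged over heads and query positions -- an
    assumed aggregation), then keep the top `keep_rate` fraction."""
    # attn: [heads, seq, seq] draft-model attention weights
    scores = attn.mean(axis=(0, 1))          # per-token importance
    k = max(1, int(scores.shape[0] * keep_rate))
    kept = np.sort(np.argsort(scores)[-k:])  # keep tokens in original order
    return kept

def rope_angles(positions, head_dim, base=10000.0):
    """Manual RoPE: rotation angles computed from the ORIGINAL positions
    of the kept tokens, so the target model sees the same positional
    encoding a dense prefill would have produced."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)     # [n_kept, head_dim // 2]

# Toy example with random "attention" over a 16-token prompt.
rng = np.random.default_rng(0)
seq, heads, head_dim = 16, 4, 8
attn = rng.random((heads, seq, seq))
kept = select_important_tokens(attn, keep_rate=0.25)
angles = rope_angles(kept, head_dim)
print(kept)          # original indices of the kept tokens
print(angles.shape)  # (4, 4)
```

The key detail is in `rope_angles`: the kept tokens are rotated by their original position indices, not by their compacted indices after pruning, which is what "preserving original positional encoding via manual RoPE" refers to.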

On an M2 Ultra (128 GB unified memory), SpecPrefill with a 2B draft model reduces TTFT by 3.71×–5.45× across 8K–128K tokens on Qwen3.5-122B, cutting 128K prefill from 19.3 minutes to 3.5 minutes. Cross-architecture validation on Nemotron-H 120B (Mamba-2/attention hybrid, 2.10×–2.19×) and GPT-OSS 120B (sliding-window MoE, 1.24×–1.28×) confirms the ratio thesis: the draft-to-target FLOP ratio is the dominant predictor of speedup on unified memory.

Key Results

| Model | Draft | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|
| Qwen3.5-122B (MoE) | 2B | 3.71× | 4.11× | 4.23× | 4.50× | 5.45× |
| Qwen3.5-35B (MoE) | 4B | 1.81× | 1.86× | 1.85× | 1.84× | – |
| Nemotron-H 120B (hybrid) | Nano-4B | 2.10× | 2.17× | 2.19× | 2.19× | – |
| GPT-OSS 120B (sliding-window MoE) | 20B | 1.24× | 1.28× | – | – | – |

All measurements at 20% keep rate on M2 Ultra 128 GB. Qwen3.5-122B numbers are 5-trial means.
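As a rough sanity check on the ratio thesis, a first-order cost model (a back-of-envelope sketch, not taken from the paper) predicts speedup from the draft-to-target FLOP ratio `r` and the keep rate `p` alone:

```python
def predicted_speedup(flop_ratio, keep_rate):
    """First-order model (assumed, not from the paper): dense prefill
    costs T per token; the sparse pipeline pays the draft (r * T) on
    every token plus the target on the kept fraction p. Speedup is
    then ~ 1 / (r + p). This ignores the quadratic attention term,
    which grows the sparse-prefill advantage at long context."""
    return 1.0 / (flop_ratio + keep_rate)

# Hypothetical numbers: 2B draft vs ~10B active target -> r ~ 0.2,
# at the paper's 20% keep rate.
print(round(predicted_speedup(0.2, 0.2), 2))  # 2.5
```

Under these assumed ratios the linear model predicts about 2.5×; the larger measured gains at 64K–128K are consistent with the quadratic attention cost that this simple model leaves out.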

Citation

If you use this work, please cite:

@misc{green2026specprefill,
  title={SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon},
  author={David Green},
  year={2026},
  doi={10.5281/zenodo.19120919},
  url={https://doi.org/10.5281/zenodo.19120919}
}