SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon

Author: David Green (@Thump604)

Paper: specprefill.pdf | Source: specprefill.tex

DOI: 10.5281/zenodo.19120919

Related: vllm-mlx PR #180 | Issue #179

Abstract

Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B (MoE, 10B active parameters) takes 7 minutes before the first token appears.

SpecPrefill uses a small draft model to score prompt token importance via attention patterns, then sparse-prefills only the selected tokens into the target model while preserving original positional encoding via manual RoPE. On Apple Silicon's unified memory, draft scoring incurs zero data-movement cost, reducing the system to a pure FLOP-ratio problem.
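The selection-and-reposition step can be sketched as follows. This is a hedged illustration under assumed details, not the paper's implementation: the mean-attention scoring rule and the helper names `select_important_tokens` and `rope_angles` are hypothetical stand-ins.

```python
import numpy as np

def select_important_tokens(attn, keep_rate=0.2):
    """Score each prompt token by the attention mass it receives in the
    draft model (here: averaged over heads and query positions -- an
    assumed aggregation), then keep the top `keep_rate` fraction."""
    # attn: [heads, seq, seq] draft-model attention weights
    scores = attn.mean(axis=(0, 1))          # per-token importance
    k = max(1, int(scores.shape[0] * keep_rate))
    kept = np.sort(np.argsort(scores)[-k:])  # keep tokens in original order
    return kept

def rope_angles(positions, head_dim, base=10000.0):
    """Manual RoPE: rotation angles computed from the ORIGINAL positions
    of the kept tokens, so the target model sees the same positional
    encoding a dense prefill would have produced."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)     # [n_kept, head_dim // 2]

# Toy example with random "attention" over a 16-token prompt.
rng = np.random.default_rng(0)
seq, heads, head_dim = 16, 4, 8
attn = rng.random((heads, seq, seq))
kept = select_important_tokens(attn, keep_rate=0.25)
angles = rope_angles(kept, head_dim)
print(kept)          # original indices of the kept tokens
print(angles.shape)  # (4, 4)
```

The key detail is in `rope_angles`: the kept tokens are rotated by their original position indices, not by their compacted indices after pruning, which is what "preserving original positional encoding via manual RoPE" refers to.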

On an M2 Ultra (128 GB unified memory), SpecPrefill with a 2B draft model reduces TTFT by 3.71×–5.45× across 8K–128K tokens on Qwen3.5-122B, cutting 128K prefill from 19.3 minutes to 3.5 minutes. Cross-architecture validation on Nemotron-H 120B (Mamba-2/attention hybrid, 2.10×–2.19×) and GPT-OSS 120B (sliding-window MoE, 1.24×–1.28×) confirms the ratio thesis: the draft-to-target FLOP ratio is the dominant predictor of speedup on unified memory.

Key Results

| Model | Draft | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|
| Qwen3.5-122B (MoE) | 2B | 3.71× | 4.11× | 4.23× | 4.50× | 5.45× |
| Qwen3.5-35B (MoE) | 4B | 1.81× | 1.86× | 1.85× | 1.84× | – |
| Nemotron-H 120B (hybrid) | Nano-4B | 2.10× | 2.17× | 2.19× | 2.19× | – |
| GPT-OSS 120B (sliding-window MoE) | 20B | 1.24× | 1.28× | – | – | – |

All measurements at 20% keep rate on M2 Ultra 128 GB. Qwen3.5-122B numbers are 5-trial means.
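As a rough sanity check on the ratio thesis, a first-order cost model (a back-of-envelope sketch, not taken from the paper) predicts speedup from the draft-to-target FLOP ratio `r` and the keep rate `p` alone:

```python
def predicted_speedup(flop_ratio, keep_rate):
    """First-order model (assumed, not from the paper): dense prefill
    costs T per token; the sparse pipeline pays the draft (r * T) on
    every token plus the target on the kept fraction p. Speedup is
    then ~ 1 / (r + p). This ignores the quadratic attention term,
    which grows the sparse-prefill advantage at long context."""
    return 1.0 / (flop_ratio + keep_rate)

# Hypothetical numbers: 2B draft vs ~10B active target -> r ~ 0.2,
# at the paper's 20% keep rate.
print(round(predicted_speedup(0.2, 0.2), 2))  # 2.5
```

Under these assumed ratios the linear model predicts about 2.5×; the larger measured gains at 64K–128K are consistent with the quadratic attention cost that this simple model leaves out.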

Citation

If you use this work, please cite:

@misc{green2026specprefill,
  title={SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon},
  author={David Green},
  year={2026},
  doi={10.5281/zenodo.19120919},
  url={https://doi.org/10.5281/zenodo.19120919}
}