# SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon
Author: David Green (@Thump604)
Paper: specprefill.pdf | Source: specprefill.tex
Related: vllm-mlx PR #180 | Issue #179
## Abstract
Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B (MoE, 10B active parameters) takes 7 minutes before the first token appears.
SpecPrefill uses a small draft model to score prompt token importance via attention patterns, then sparse-prefills only the selected tokens into the target model while preserving original positional encoding via manual RoPE. On Apple Silicon's unified memory, draft scoring incurs zero data-movement cost, reducing the system to a pure FLOP-ratio problem.
On M2 Ultra (128 GB unified memory), SpecPrefill with a 2B draft model reduces TTFT by 3.71–5.45× across 8K–128K tokens on Qwen3.5-122B, cutting 128K prefill from 19.3 minutes to 3.5 minutes. Cross-architecture validation on Nemotron-H 120B (Mamba-2/Attention hybrid, 2.10–2.19×) and GPT-OSS 120B (sliding-window MoE, 1.24–1.28×) confirms the ratio thesis: the draft-to-target FLOP ratio is the dominant predictor of speedup on unified memory.
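The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the attention-aggregation heuristic, and the `keep_last` parameter are all assumptions. The key property it preserves is that selected tokens are returned at their *original* indices, so the target model can apply RoPE at the original positions during sparse prefill.

```python
import numpy as np

def select_prompt_tokens(attn, keep_rate=0.2, keep_last=10):
    """Score prompt tokens by attention received in the draft model and
    return the original indices of the tokens to keep.

    attn: draft attention probabilities, shape (layers, heads, seq, seq).
    Returns sorted original positions, so the target model can prefill
    the kept tokens with their original RoPE positions (illustrative
    sketch, not the paper's API).
    """
    seq = attn.shape[-1]
    # Importance = attention mass each token receives, averaged over
    # layers and heads, summed over query positions.
    scores = attn.mean(axis=(0, 1)).sum(axis=0)  # shape (seq,)
    # Always keep the most recent tokens; they anchor generation.
    scores[-keep_last:] = np.inf
    k = max(keep_last, int(seq * keep_rate))
    keep = np.argsort(scores)[-k:]
    return np.sort(keep)  # original indices, ascending
```

A usage sketch: run the small draft model over the full prompt once, collect its attention maps, call `select_prompt_tokens`, then prefill only those tokens into the target model while feeding RoPE the returned original indices rather than `0..k-1`.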
## Key Results
| Model | Draft | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|
| Qwen3.5-122B (MoE) | 2B | 3.71× | 4.11× | 4.23× | 4.50× | 5.45× |
| Qwen3.5-35B (MoE) | 4B | 1.81× | 1.86× | 1.85× | 1.84× | — |
| Nemotron-H 120B (hybrid) | Nano-4B | 2.10× | 2.17× | 2.19× | 2.19× | — |
| GPT-OSS 120B (sliding-window MoE) | 20B | 1.24× | 1.28× | — | — | — |
All measurements at 20% keep rate on M2 Ultra 128 GB. Qwen3.5-122B numbers are 5-trial means.
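The "pure FLOP-ratio problem" framing admits a back-of-envelope speedup model (illustrative only, not the paper's formula): treat prefill cost as linear in token count and use active-parameter count as a proxy for FLOPs per token, ignoring the quadratic attention term, which in practice makes measured speedups grow with context length.

```python
def estimated_speedup(draft_flops_per_tok, target_flops_per_tok, keep_rate):
    """Back-of-envelope TTFT speedup on unified memory.

    Baseline cost: full target prefill over L tokens.
    SpecPrefill cost: full draft prefill plus target prefill over only
    keep_rate * L tokens. With r = draft/target per-token FLOP ratio,
    speedup = 1 / (r + keep_rate). Linear-in-tokens model: it ignores
    the quadratic attention term, so it is a rough lower-end estimate.
    """
    r = draft_flops_per_tok / target_flops_per_tok
    return 1.0 / (r + keep_rate)
```

For example, a 2B draft against 10B active target parameters at a 20% keep rate gives `1 / (0.2 + 0.2) = 2.5×` under this model; the measured 3.71–5.45× exceeds it because the skipped attention cost is quadratic in sequence length.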
## Citation
If you use this work, please cite:
```bibtex
@misc{green2026specprefill,
  title={SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon},
  author={David Green},
  year={2026},
  doi={10.5281/zenodo.19120919},
  url={https://doi.org/10.5281/zenodo.19120919}
}
```