TurboPrefill
Multi-GPU prefill acceleration for llama.cpp.
TurboPrefill is an experimental scheduling modification for llama.cpp designed to improve long-context prefill throughput in multi-GPU layer-split configurations.
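This is a simplified illustration of what a "layer-split configuration" means. In such a setup, each GPU owns a contiguous block of transformer layers. For a single batch of prompt tokens, activations pass from one GPU's block to the next in order. The helper name `split_layers`, the rounding rule, and the use of relative per-GPU weights are assumptions chosen for the example. They loosely mirror the proportions llama.cpp's `--tensor-split` option expresses, but this is not llama.cpp's allocation code and not TurboPrefill's scheduling logic.

```python
from typing import List, Sequence


def split_layers(n_layers: int, weights: Sequence[float]) -> List[range]:
    """Assign each GPU a contiguous block of layer indices (hypothetical helper).

    Block sizes follow the relative `weights` (for example, per-GPU VRAM).
    Boundaries are rounded down, and the last GPU takes whatever remains.
    """
    if n_layers < 0:
        raise ValueError("n_layers must be non-negative")
    if not weights or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative with a positive sum")

    total = float(sum(weights))
    ranges: List[range] = []
    start = 0
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if i == len(weights) - 1:
            end = n_layers  # last GPU takes the remainder
        else:
            # Proportional boundary, clamped so rounding never overshoots.
            end = min(n_layers, int(n_layers * cumulative / total))
        ranges.append(range(start, end))
        start = end
    return ranges


if __name__ == "__main__":
    # Hypothetical 36-layer model spread over 8 equal GPUs. A batch visits
    # GPU 0's layers first, then GPU 1's, and so on.
    for gpu, block in enumerate(split_layers(36, [1] * 8)):
        print(f"GPU {gpu}: layers {block.start}-{block.stop - 1}")
```

The same helper can describe the 4-, 8-, and 10-GPU systems listed below by changing the length of `weights`.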
Key Results
- Up to 2.23× faster prefill
- Tested with GPT-OSS-120B
- No changes to model outputs
- Decode path remains unchanged
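A prefill speedup figure such as "2.23×" is presumably a ratio of prefill throughputs measured on the same prompt length. The sketch below assumes you have wall-clock prefill times from a baseline llama.cpp build and from a TurboPrefill build. The function names are hypothetical, and the timings in the example are synthetic, not measured results.

```python
def prefill_throughput(n_prompt_tokens: int, prefill_seconds: float) -> float:
    """Prompt tokens processed per second during the prefill phase."""
    if n_prompt_tokens <= 0 or prefill_seconds <= 0:
        raise ValueError("token count and time must be positive")
    return n_prompt_tokens / prefill_seconds


def prefill_speedup(n_prompt_tokens: int, baseline_seconds: float,
                    patched_seconds: float) -> float:
    """Throughput ratio (patched / baseline) for the same prompt length.

    With identical prompts, this reduces to baseline_seconds / patched_seconds.
    """
    return (prefill_throughput(n_prompt_tokens, patched_seconds)
            / prefill_throughput(n_prompt_tokens, baseline_seconds))


if __name__ == "__main__":
    # Synthetic timings for illustration only.
    s = prefill_speedup(8192, baseline_seconds=12.0, patched_seconds=6.0)
    print(f"speedup: {s:.2f}x")
```

Because decode is unchanged, only the prefill portion of a request's latency is expected to shrink.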
Tested Multi-GPU Platforms
TurboPrefill is based on general multi-GPU scheduling principles and has been tested across multiple NVIDIA GPU generations and cluster sizes.
- 8× NVIDIA RTX 5060 Ti 16GB (Blackwell architecture, 2025)
- 4× NVIDIA RTX 3090 (Ampere architecture, 2020)
- 10× NVIDIA P104-100 (Pascal architecture, 2016)
TurboPrefill has been successfully tested across three NVIDIA GPU generations (Pascal, Ampere, and Blackwell), spanning nearly a decade of hardware development.
Additional Validation
Results were also reproduced on Pascal-generation hardware using multi-GPU P104-100 systems.
Project Status
Public release v1.0.0.
TurboPrefill is an experimental open-source optimization for llama.cpp focused on accelerating long-context multi-GPU prefill workloads.
GitHub Repository
https://github.com/sergey-automation/TurboPrefill
Industrial Systems Architect: Serhii Trykhlieb

