TurboPrefill

Multi-GPU prefill acceleration for llama.cpp.

TurboPrefill is an experimental scheduling modification for llama.cpp designed to improve long-context prefill throughput in multi-GPU layer-split configurations.
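In a layer-split configuration, each GPU holds a contiguous block of transformer layers, and the prompt's activations pass from one GPU to the next in layer order. The sketch below shows one way to compute such a proportional, contiguous assignment. It is a minimal illustration under stated assumptions: `split_layers` is a hypothetical helper, not part of llama.cpp or TurboPrefill, and llama.cpp's own assignment logic may round differently.

```python
from typing import List, Sequence, Tuple


def split_layers(n_layers: int, weights: Sequence[float]) -> List[Tuple[int, int]]:
    """Assign each GPU a contiguous [start, end) range of layers.

    Hypothetical helper for illustration only: ranges are sized roughly in
    proportion to `weights` (e.g. relative VRAM per GPU). This is not
    llama.cpp's or TurboPrefill's actual code.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    if not weights or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative with a positive sum")

    total = float(sum(weights))
    ranges: List[Tuple[int, int]] = []
    start = 0
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        # Round the cumulative boundary so ranges stay contiguous; the last
        # GPU always ends at n_layers so every layer is assigned exactly once.
        if i == len(weights) - 1:
            end = n_layers
        else:
            end = round(n_layers * cumulative / total)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    # Hypothetical 36-layer model split evenly across 4 GPUs.
    for gpu, (lo, hi) in enumerate(split_layers(36, [1, 1, 1, 1])):
        print(f"GPU {gpu}: layers {lo}..{hi - 1}")
```

In stock llama.cpp, this mode is selected with `--split-mode layer`, and `--tensor-split` sets the relative share of the model placed on each GPU.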

Key Results

Up to 2.23× faster prefill
Tested with GPT-OSS-120B
No changes to model outputs
Decode path remains unchanged
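Reading "up to 2.23× faster" as a throughput ratio (the usual convention, assumed here), the figure compares prefill speeds on the same prompt. For a fixed prompt length, the throughput ratio equals the inverse ratio of prefill times, so a 2.23× speedup cuts prefill time to about 1/2.23 ≈ 45% of the baseline. The helpers below use hypothetical names, and the timings are made-up inputs chosen to show the arithmetic, not TurboPrefill measurements.

```python
def prefill_throughput(prompt_tokens: int, seconds: float) -> float:
    """Prefill throughput in tokens per second for one prompt."""
    if prompt_tokens <= 0 or seconds <= 0:
        raise ValueError("prompt_tokens and seconds must be positive")
    return prompt_tokens / seconds


def speedup(baseline_tps: float, new_tps: float) -> float:
    """How many times faster the new run is (ratio of throughputs)."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return new_tps / baseline_tps


def time_with_speedup(baseline_seconds: float, factor: float) -> float:
    """Prefill time after applying a speedup factor to a baseline time."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return baseline_seconds / factor


if __name__ == "__main__":
    # Hypothetical timings for the same 32,768-token prompt (not measured data).
    base = prefill_throughput(32768, 40.0)
    fast = prefill_throughput(32768, 20.0)
    print(f"speedup: {speedup(base, fast):.2f}x")
    # A 2.23x speedup shortens a hypothetical 100 s prefill to about 44.8 s.
    print(f"{time_with_speedup(100.0, 2.23):.1f} s")
```

Because the prompt length cancels out, comparing wall-clock prefill times and comparing tokens-per-second figures give the same speedup, provided both runs process the same prompt.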

Tested Multi-GPU Platforms

TurboPrefill is based on general multi-GPU scheduling principles and has been tested across multiple NVIDIA GPU generations and cluster sizes.

8× NVIDIA RTX 5060 Ti 16GB (Blackwell architecture, 2025)
4× NVIDIA RTX 3090 (Ampere architecture, 2020)
10× NVIDIA P104-100 (Pascal architecture, 2016)

These platforms span three NVIDIA GPU generations and nearly a decade of hardware development.
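The card attributes TurboPrefill's portability to general multi-GPU scheduling principles but does not describe its scheduler. As background only, one such general principle is pipelining: in a layer split, if the prompt is processed so that only one GPU works at a time, the other GPUs sit idle, whereas splitting the prompt into micro-batches lets different GPUs work on different micro-batches at once. The idealized model below assumes G equal-cost stages (one per GPU), M micro-batches, and a fixed per-stage step time; the function names are hypothetical, and this is not a description of how TurboPrefill works.

```python
def _check(stages: int, microbatches: int) -> None:
    if stages < 1 or microbatches < 1:
        raise ValueError("stages and microbatches must be at least 1")


def sequential_time(stages: int, microbatches: int, step: float) -> float:
    """Idealized layer-split prefill time when only one GPU works at a time."""
    _check(stages, microbatches)
    return stages * microbatches * step


def pipelined_time(stages: int, microbatches: int, step: float) -> float:
    """Idealized time when micro-batches overlap across GPUs (fill + drain)."""
    _check(stages, microbatches)
    return (microbatches + stages - 1) * step


def ideal_speedup(stages: int, microbatches: int) -> float:
    """Upper bound from overlap alone: M*G / (M + G - 1)."""
    return sequential_time(stages, microbatches, 1.0) / pipelined_time(stages, microbatches, 1.0)


if __name__ == "__main__":
    # GPU counts from the tested platforms, with a hypothetical 16 micro-batches.
    for gpus in (4, 8, 10):
        print(f"{gpus} GPUs, 16 micro-batches: ideal {ideal_speedup(gpus, 16):.2f}x")
```

These bounds ignore inter-GPU transfer time, uneven per-GPU stage times, and the lower kernel efficiency of small micro-batches, all of which reduce real gains. They illustrate why layer-split prefill leaves headroom for scheduling improvements; they do not predict or explain the reported 2.23× figure.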

Additional Validation

Results were also reproduced on Pascal-generation hardware using multi-GPU P104-100 systems.

Project Status

Public release v1.0.0.

TurboPrefill is an experimental open-source optimization for llama.cpp focused on accelerating long-context multi-GPU prefill workloads.

GitHub Repository

https://github.com/sergey-automation/TurboPrefill

Industrial Systems Architect: Serhii Trykhlieb
