TurboPrefill

[Image: TurboPrefill on RTX 5060 Ti]

Multi-GPU prefill acceleration for llama.cpp.

TurboPrefill is an experimental scheduling modification for llama.cpp designed to improve long-context prefill throughput in multi-GPU layer-split configurations.
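To illustrate why layer-split prefill leaves room for scheduling gains, the sketch below uses a generic pipeline model. It assumes each GPU holds a contiguous block of layers, that every micro-batch costs one time step per GPU, and that transfer overhead is zero. The function names are hypothetical, and the model does not describe TurboPrefill's or llama.cpp's actual scheduler.

```python
# Toy model of prefill over a layer-split pipeline (illustrative only).
# Assumption: one micro-batch occupies one GPU for exactly one time step.

def sequential_steps(num_gpus: int, num_microbatches: int) -> int:
    """Time steps when only one GPU is busy at any moment."""
    return num_gpus * num_microbatches


def pipelined_steps(num_gpus: int, num_microbatches: int) -> int:
    """Time steps when micro-batches overlap across GPUs.

    The first micro-batch needs num_gpus steps to traverse the pipeline;
    each later micro-batch finishes one step after the previous one.
    """
    return num_gpus + num_microbatches - 1


def ideal_speedup(num_gpus: int, num_microbatches: int) -> float:
    """Upper bound on the speedup from overlapping, under the assumptions above."""
    return sequential_steps(num_gpus, num_microbatches) / pipelined_steps(
        num_gpus, num_microbatches
    )


if __name__ == "__main__":
    # Hypothetical configuration: 8 GPUs, a long prompt cut into 32 micro-batches.
    print(f"ideal speedup: {ideal_speedup(8, 32):.2f}x")
```

Real speedups are lower than this bound because of inter-GPU transfers, uneven layer costs, and attention cost that grows with context length.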

Key Results

  • Up to 2.23× faster prefill
  • Tested with GPT-OSS-120B
  • No changes to model outputs
  • Decode path remains unchanged
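A speedup figure of this kind is normally the ratio of prefill throughputs for the same prompt on the same hardware. The helper below is a minimal sketch of that arithmetic. The function names and the timings in the example are placeholders, not measurements from this project.

```python
# Minimal sketch: derive a prefill speedup from two timed runs.
# All numbers used below are hypothetical placeholders.

def prefill_throughput(prompt_tokens: int, seconds: float) -> float:
    """Prompt tokens processed per second during prefill."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return prompt_tokens / seconds


def speedup(baseline_tps: float, patched_tps: float) -> float:
    """Ratio of patched to baseline throughput (values above 1.0 mean faster)."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return patched_tps / baseline_tps


if __name__ == "__main__":
    # Placeholder timings for one prompt, run with and without the patch.
    baseline = prefill_throughput(prompt_tokens=32_000, seconds=64.0)
    patched = prefill_throughput(prompt_tokens=32_000, seconds=32.0)
    print(f"speedup: {speedup(baseline, patched):.2f}x")
```

Comparing throughput rather than wall-clock time keeps the figure valid when the two runs use prompts of different lengths.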

Tested Multi-GPU Platforms

TurboPrefill is based on general multi-GPU scheduling principles and has been tested across multiple NVIDIA GPU generations and cluster sizes.

  • 8× NVIDIA RTX 5060 Ti 16GB (Blackwell architecture, 2025)
  • 4× NVIDIA RTX 3090 (Ampere architecture, 2020)
  • 10× NVIDIA P104-100 (Pascal architecture, 2016)

Together, these platforms cover three NVIDIA GPU generations spanning nearly a decade of hardware development.

Additional Validation

[Image: TurboPrefill on Pascal]

Results were also reproduced on Pascal-generation hardware using multi-GPU P104-100 systems.

Project Status

Public release v1.0.0.

TurboPrefill is an experimental open-source optimization for llama.cpp focused on accelerating long-context multi-GPU prefill workloads.

GitHub Repository

https://github.com/sergey-automation/TurboPrefill

Industrial Systems Architect: Serhii Trykhlieb
