Qwen 2.5 Coder 7B Private AI Engine

A private inference service that pairs the llama.cpp C++ engine with a FastAPI server to serve Qwen 2.5 Coder 7B Instruct (GGUF) at low latency.

🚀 Key Features

  • Ultra-Low Latency: Optimized context sizes and thread scheduling tailored for vCPU containers.
  • SSE Token Streaming: Tokens are streamed over Server-Sent Events, with first-token response times under 50 ms.
  • FIM Autocomplete: Inline completions under 100ms.
  • Safe I/O: Redirects process output to DEVNULL so that a full pipe buffer cannot freeze the server.
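
The SSE streaming feature can be illustrated with a minimal sketch of how tokens might be framed as Server-Sent Events. The function name `sse_events`, the JSON payload shape, and the `[DONE]` end marker are assumptions made for illustration, not the engine's confirmed wire format.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one SSE event (hypothetical payload shape)."""
    for tok in tokens:
        # An SSE event is a "data:" line followed by a blank line.
        yield f"data: {json.dumps({'token': tok})}\n\n"
    # Assumed end-of-stream sentinel; the real server may use something else.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for event in sse_events(["def", " add", "("]):
        print(event, end="")
```

In a FastAPI app, a generator like this would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`, so the client receives each token as soon as it is produced.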
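
For FIM autocomplete, the sketch below shows how a fill-in-the-middle prompt could be assembled from the text before and after the cursor. It uses the FIM special tokens documented for Qwen 2.5 Coder; the helper name `build_fim_prompt` is hypothetical, and the prompt this engine actually sends may differ.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt (hypothetical helper).

    The model is expected to generate the text that belongs between
    `prefix` and `suffix` after the final <|fim_middle|> token.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"


if __name__ == "__main__":
    before_cursor = "def add(a, b):\n    return "
    after_cursor = "\n\nprint(add(1, 2))\n"
    print(build_fim_prompt(before_cursor, after_cursor))
```

Keeping the prefix and suffix short is one plausible way to stay within a tight latency budget, since prompt processing time grows with prompt length.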
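
The Safe I/O point can be sketched as follows, assuming the output being discarded belongs to a child process started from Python. The helper `launch_quiet` and the noisy stand-in command are hypothetical; only `subprocess.Popen` and `subprocess.DEVNULL` are standard library names.

```python
import subprocess
import sys


def launch_quiet(cmd):
    """Start a child process with all standard streams discarded.

    If stdout were a PIPE that nobody reads, the child would block once
    the OS pipe buffer filled up. DEVNULL avoids that failure mode.
    """
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


if __name__ == "__main__":
    # Stand-in for a chatty backend: writes about 1 MB to stdout.
    noisy_child = [sys.executable, "-c", "print('x' * 1_000_000)"]
    proc = launch_quiet(noisy_child)
    print("exit code:", proc.wait(timeout=30))
```

The trade-off is that discarded output is gone for good, so logs that matter would need to go to a file instead.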