Any tips to speed up inference?
Hi all,
and thanks Microsoft for this amazing model!
I'm using the Hugging Face pipeline to run inference with Phi-4.
However, it feels really slow for a 14B model.
I'm using 4 Tesla V100 GPUs (32 GB each) to distribute the model at inference time.
Is there a quick and easy way to make inference faster?
It would be awesome if it were just some kind of parameter in the pipeline function.
Thanks!
Lino Hong.
Using llama.cpp is an excellent way to speed up inference for large language models like Phi-4, especially if you want to run the model efficiently on CPUs or even GPUs with minimal overhead. llama.cpp is optimized for inference and supports quantization, which can significantly reduce the model size and improve speed without a large drop in accuracy.
Here’s how you can use llama.cpp with your Phi-4 model:
1. Convert the Model to GGUF Format
llama.cpp uses the GGUF format for models. You need to convert your Hugging Face model to this format.
Steps:
- Clone the `llama.cpp` repository:
  ```bash
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  ```
- Install dependencies:
  ```bash
  make
  ```
- Convert the Hugging Face model to GGUF format:
  - First, install the required Python dependencies:
    ```bash
    pip install torch transformers
    ```
  - Use the `convert-hf-to-gguf.py` script provided in `llama.cpp`, passing the model directory as a positional argument:
    ```bash
    python3 convert-hf-to-gguf.py /path/to/phi-4 --outfile /path/to/phi-4-gguf
    ```
    Replace `/path/to/phi-4` with the path to your Hugging Face model and `/path/to/phi-4-gguf` with the desired output path.
2. Quantize the Model (Optional but Recommended)
Quantization reduces the model size and speeds up inference. llama.cpp supports several quantization levels (e.g., q4_0, q4_1, q5_0, etc.).
Steps:
- Run the quantization tool:
  ```bash
  ./quantize /path/to/phi-4-gguf /path/to/phi-4-gguf-q4_0 q4_0
  ```
  This will create a quantized version of the model at `/path/to/phi-4-gguf-q4_0`.
3. Run Inference with llama.cpp
Once the model is converted and optionally quantized, you can run inference using llama.cpp.
Steps:
- Run the `main` executable:
  ```bash
  ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time"
  ```
  - `-m`: Path to the GGUF model.
  - `-p`: Prompt for inference.
- For GPU acceleration (if supported), use the `--n-gpu-layers` flag:
  ```bash
  ./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --n-gpu-layers 20
  ```
  Replace `20` with the number of layers you want to offload to the GPU.
4. Advanced Options
- Batch Size: Use the `-b` flag to set the batch size.
- Threads: Use the `-t` flag to specify the number of CPU threads.
- Temperature and Top-p Sampling: Use `--temp` and `--top-p` for better control over text generation.

Example:
```bash
./main -m /path/to/phi-4-gguf-q4_0 -p "Once upon a time" --temp 0.7 --top-p 0.9 -t 8 -b 512 --n-gpu-layers 20
```
5. Benchmarking
To measure performance, use the `llama-bench` tool that is built alongside the other llama.cpp binaries:
```bash
./llama-bench -m /path/to/phi-4-gguf-q4_0
```
6. Using llama.cpp in Python
If you prefer to use llama.cpp in a Python script, you can use the llama-cpp-python package.
Steps:
- Install the package:
  ```bash
  pip install llama-cpp-python
  ```
- Load and run the model:
  ```python
  from llama_cpp import Llama

  # Load the model
  llm = Llama(model_path="/path/to/phi-4-gguf-q4_0", n_gpu_layers=20)

  # Run inference
  output = llm("Once upon a time", max_tokens=50)
  print(output["choices"][0]["text"])
  ```
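Since Phi-4 is an instruction-tuned chat model, you will usually get better results from the chat-completion interface, which applies the chat template for you. A minimal sketch, assuming the converted GGUF carries Phi-4's chat template metadata:

```python
from llama_cpp import Llama

# Load the quantized model (path and layer count reused from the steps above)
llm = Llama(
    model_path="/path/to/phi-4-gguf-q4_0",
    n_gpu_layers=20,
    n_ctx=4096,
)

# create_chat_completion applies the model's built-in chat template
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GGUF quantization in two sentences."},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```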
Benefits of Using llama.cpp
- Efficiency: Optimized for CPU and GPU inference.
- Quantization: Reduces model size and speeds up inference.
- Portability: Runs on a wide range of hardware, including CPUs and GPUs.
- Minimal Dependencies: Lightweight and easy to set up.
By using llama.cpp, you can achieve faster inference times and lower resource usage compared to running the model directly through Hugging Face Transformers.
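For comparison, if you want to stay on the Transformers pipeline that the question uses, the quickest wins are loading the weights in half precision and letting `device_map="auto"` shard the model across the four V100s. A minimal sketch of that baseline (the generation settings are illustrative, not tuned):

```python
import torch
from transformers import pipeline

# float16 halves memory traffic on V100s; device_map="auto" (via accelerate)
# spreads the layers across all visible GPUs.
generator = pipeline(
    "text-generation",
    model="microsoft/phi-4",
    torch_dtype=torch.float16,
    device_map="auto",
)

output = generator(
    "Once upon a time",
    max_new_tokens=100,  # keep outputs short if you do not need long text
    do_sample=False,
)
print(output[0]["generated_text"])
```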
A few tips for faster inference:
- Use a hosted API instead of running locally — providers optimize for throughput with batching and better hardware
- Reduce max_tokens if you do not need long outputs
- Use streaming to get first tokens faster (see the sketch after this list)
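To make the streaming point concrete for the local llama.cpp route, llama-cpp-python can stream tokens as they are generated, so the first words appear almost immediately instead of after the full completion. A minimal sketch, reusing the quantized model path from the steps above:

```python
from llama_cpp import Llama

llm = Llama(model_path="/path/to/phi-4-gguf-q4_0", n_gpu_layers=20)

# stream=True yields partial results as tokens are produced
for chunk in llm("Once upon a time", max_tokens=100, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()
```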
If you want to try Phi-4 via API without managing infrastructure, Crazyrouter provides access to 600+ models including Phi-4 through an OpenAI-compatible endpoint. Typically faster than self-hosting on consumer hardware.
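If you do try a hosted, OpenAI-compatible endpoint, the standard `openai` Python client works against it; only the base URL and model identifier change. A rough sketch, where the base URL, key, and model name below are placeholders you would replace with your provider's actual values:

```python
from openai import OpenAI

# Placeholder endpoint and credentials: substitute your provider's values
client = OpenAI(
    base_url="https://api.example.com/v1",
    api_key="YOUR_API_KEY",
)

# stream=True so the first tokens arrive as soon as they are generated
stream = client.chat.completions.create(
    model="phi-4",  # exact model ID depends on the provider
    messages=[{"role": "user", "content": "Once upon a time"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```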