New in llama.cpp: Model Management

Community Article · Published December 11, 2025

llama.cpp server now ships with router mode, which lets you dynamically load, unload, and switch between multiple models without restarting.

Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.

This feature answers a popular request: bringing Ollama-style model management to llama.cpp. It uses a multi-process architecture in which each model runs in its own process, so if one model crashes, the others are unaffected.

Quick Start

Start the server in router mode by not specifying a model:

llama-server

This auto-discovers models from your llama.cpp cache (LLAMA_CACHE or ~/.cache/llama.cpp). If you've previously downloaded models via llama-server -hf user/model, they'll be available automatically.
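
For example, a one-off download like this places the model in the cache, where router mode will pick it up on the next start (using the same model as the chat example below):

llama-server -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M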

You can also point to a local directory of GGUF files:

llama-server --models-dir ./my-models

Features

  1. Auto-discovery: Scans your llama.cpp cache (default) or a custom --models-dir folder for GGUF files
  2. On-demand loading: Models load automatically when first requested
  3. LRU eviction: When you hit --models-max (default: 4), the least-recently-used model is unloaded (see the sketch after this list)
  4. Request routing: The model field in your request determines which model handles it
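
As a minimal sketch combining the options above: serve a local directory but keep at most two models resident, letting LRU eviction handle the rest:

llama-server --models-dir ./my-models --models-max 2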

Examples

Chat with a specific model

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

On the first request, the server automatically loads the model into memory (loading time depends on model size). Subsequent requests to the same model skip that step, since it is already loaded.

List available models

curl http://localhost:8080/models

Returns all discovered models with their status (loaded, loading, or unloaded).
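
The exact response shape isn't reproduced here, but it is roughly along these lines (the field names below are an assumption for illustration; the status values are the ones listed above):

# illustrative output only; actual field names may differ
{
  "data": [
    { "id": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M", "status": "loaded" },
    { "id": "my-model.gguf", "status": "unloaded" }
  ]
}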

Manually load a model

curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'

Unload a model to free VRAM

curl -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'

Key Options

Flag                  Description
--models-dir PATH     Directory containing your GGUF files
--models-max N        Max models loaded simultaneously (default: 4)
--no-models-autoload  Disable auto-loading; require explicit /models/load calls
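
If you prefer explicit control, a rough workflow with auto-loading disabled might look like this, reusing the /models/load endpoint shown earlier:

# start the router without auto-loading (in one terminal)
llama-server --models-dir ./my-models --no-models-autoload

# then load a model explicitly (from another terminal)
curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'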

All model instances inherit settings from the router:

llama-server --models-dir ./models -c 8192 -ngl 99

All loaded models will use an 8192-token context and full GPU offload. You can also define per-model settings using presets:

llama-server --models-preset config.ini

where config.ini contains:

[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temp = 0.7
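
Assuming the section name acts as the model alias (a reasonable reading of the request-routing behavior above), a request can then select those settings by name:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}]}'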

Also available in the Web UI

The built-in web UI also supports model switching. Just select a model from the dropdown and it loads automatically.

Join the Conversation

We hope this feature makes it easier to A/B test different model versions, run multi-tenant deployments, or simply switch models during development without restarting the server.
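
For instance, a quick A/B sketch: send the same prompt to two models and let the router load and route as needed (the model names here are placeholders):

for m in model-a.gguf model-b.gguf; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$m\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"
done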

Have questions or feedback? Drop a comment below or open an issue on GitHub.
