<think> block not generated / requires manual chat template modification
The model fails to generate the <think> block header as expected. Initially suspected this was related to quantization, but switching quantization formats did not resolve the issue.
Environment:
Image:
- ghcr.io/ggml-org/llama.cpp:server-cuda13-b8589
- Also tested the latest ghcr.io/ggml-org/llama.cpp:server-cuda13, but I am unsure whether the issue is related to recent changes (possibly around Gemma-4 template handling). I switched to an older image (pre-Gemma-4) to rule that out.
GPU: CUDA GB10
Quantizations tested:
- UD-IQ3_XXS
- UD-Q2_K_XL
Reproduction Steps:
Run the server using the following configuration:
docker run -d --rm \
--name minimax-m25 \
--gpus all \
-p 8080:8080 \
-v /raid/models:/models \
ghcr.io/ggml-org/llama.cpp:server-cuda13-b8589 \
-m /models/MiniMax-M2.5-UD-IQ3_XXS/MiniMax-M2.5-UD-IQ3_XXS-00001-of-00003.gguf \
--alias minimax-m25 \
--n-gpu-layers all \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--ctx-size 196608 \
--batch-size 8192 \
--ubatch-size 2048 \
--threads 20 \
--threads-batch 20 \
--numa isolate \
--no-mmap \
--flash-attn on \
--host 0.0.0.0
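Once the server is up, a request against llama.cpp's OpenAI-compatible /v1/chat/completions endpoint on port 8080 shows the missing header. A minimal sketch of the request body only (the model name matches the --alias passed above; sending it with curl or any HTTP client is assumed):

```python
import json

def build_chat_request(prompt: str, model: str = "minimax-m25") -> str:
    """Build the JSON body for a POST to
    http://localhost:8080/v1/chat/completions.
    'minimax-m25' matches the --alias flag used when starting the server."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    })

print(build_chat_request("Why is the sky blue?"))
```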
Observed Behavior:
- The <think> block header is not generated during inference.
- Changing quantization from UD-IQ3_XXS to UD-Q2_K_XL does not resolve the issue.
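A quick way to check for the missing header programmatically (a hypothetical helper, not part of llama.cpp):

```python
def has_think_header(completion: str) -> bool:
    """Return True if the completion opens with a '<think>' reasoning block."""
    return completion.lstrip().startswith("<think>")

# What the model is expected to produce:
print(has_think_header("<think>\nreasoning steps\n</think>\nfinal answer"))  # True
# What is actually observed with the images above:
print(has_think_header("final answer without a reasoning block"))  # False
```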
Workaround / Fix:
Manually enabling and overriding the chat template resolves the issue:
--jinja \
--chat-template-file /models/MiniMax-M2.5-UD-IQ3_XXS/chat_template.jinja
Modification made in chat_template.jinja (lines ~157–159):
Original:
{%- if add_generation_prompt -%}
{{- ']~b]ai' ~ '\n' ~ '<think>' ~ '\n' }}
{%- endif -%}
Modified:
{%- if add_generation_prompt -%}
{{- ']~b]ai' ~ '\n' }}
{%- endif -%}
After removing <think> from the template, generation behaves as expected.
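For clarity, here is a plain-Python sketch of what the rendered generation prompt looks like before and after the edit (assuming the `]~b]ai` role header and the `add_generation_prompt` flag behave as in the template excerpt above):

```python
def generation_prompt(add_generation_prompt: bool, include_think: bool) -> str:
    """Mimic the tail of chat_template.jinja: emit the ']~b]ai' role
    header, optionally followed by a pre-filled '<think>' line."""
    if not add_generation_prompt:
        return ""
    prompt = "]~b]ai\n"
    if include_think:  # original template behavior
        prompt += "<think>\n"
    return prompt

# Original template: '<think>' is already part of the prompt, so it is
# never part of the generated text the client sees.
print(repr(generation_prompt(True, include_think=True)))   # ']~b]ai\n<think>\n'
# Modified template: the model is left to generate '<think>' itself.
print(repr(generation_prompt(True, include_think=False)))  # ']~b]ai\n'
```

This suggests why the workaround helps: with the original template, the `<think>` token is consumed as part of the prompt rather than generated, so it never appears in the output.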
Notes:
- Unclear if this is model-specific, template-specific, or a broader issue with how <think> is handled.
- Reporting in case others encounter similar behavior or if this indicates a mismatch between model expectations and template defaults.
Thanks for sharing the fix! I used the model for a while without any problems. I think the bug was introduced recently; llama.cpp added some changes to the template processing.
Indeed, it seems like a problem with llama.cpp.
I also opened an issue on their GitHub repository:
https://github.com/ggml-org/llama.cpp/issues/21465
Here's the link if you want to follow up.