<think> block not generated / requires manual chat template modification

#11
by AiverAiva - opened

The model fails to generate the <think> block header as expected. I initially suspected this was related to quantization, but switching quantization formats did not resolve the issue.

Environment:

  • Image:

    • ghcr.io/ggml-org/llama.cpp:server-cuda13-b8589
    • I also tested the latest ghcr.io/ggml-org/llama.cpp:server-cuda13, but was unsure whether the issue stemmed from recent changes (possibly around Gemma-4 template handling), so I switched to an older image (pre-Gemma-4) to rule that out.
  • GPU: CUDA GB10

  • Quantizations tested:

    • UD-IQ3_XXS
    • UD-Q2_K_XL

Reproduction Steps:
Run the server using the following configuration:

docker run -d --rm \
  --name minimax-m25 \
  --gpus all \
  -p 8080:8080 \
  -v /raid/models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda13-b8589 \
  -m /models/MiniMax-M2.5-UD-IQ3_XXS/MiniMax-M2.5-UD-IQ3_XXS-00001-of-00003.gguf \
  --alias minimax-m25 \
  --n-gpu-layers all \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 196608 \
  --batch-size 8192 \
  --ubatch-size 2048 \
  --threads 20 \
  --threads-batch 20 \
  --numa isolate \
  --no-mmap \
  --flash-attn on \
  --host 0.0.0.0
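
With the server up, a minimal chat completion request is enough to observe the behavior (hypothetical example prompt; the port and alias match the configuration above):

```shell
# Send a minimal chat request and inspect the reply: with the default
# template, the response text is missing the leading <think> header.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "minimax-m25",
        "messages": [{"role": "user", "content": "What is 2+2?"}]
      }'
```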

Observed Behavior:

  • The <think> block header is not generated during inference.
  • Changing quantization from UD-IQ3_XXS to UD-Q2_K_XL does not resolve the issue.

Workaround / Fix:
Manually enabling the Jinja template engine and overriding the chat template resolves the issue:

--jinja \
--chat-template-file /models/MiniMax-M2.5-UD-IQ3_XXS/chat_template.jinja

Modification made in chat_template.jinja (lines ~157–159):

Original:

{%- if add_generation_prompt -%}
{{- ']~b]ai' ~ '\n' ~ '<think>' ~ '\n' }}
{%- endif -%}

Modified:

{%- if add_generation_prompt -%}
{{- ']~b]ai' ~ '\n' }}
{%- endif -%}

After removing <think> from the template, generation behaves as expected.
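
If it helps with debugging, the effect of the edit can be checked offline by rendering just those template lines with the jinja2 Python package (an assumption on my side; the two fragments below are copied verbatim from the snippets above). With the original fragment, the <think> header is baked into the prompt by the template, so the model never emits it itself; after the edit, the model is free to generate it:

```python
# Offline check of the template change (assumes the jinja2 package is
# installed; fragments copied verbatim from the diff above).
from jinja2 import Template

# Original generation prompt: the template itself appends <think>,
# so it never appears in the model's own output.
original = (
    "{%- if add_generation_prompt -%}"
    "{{- ']~b]ai' ~ '\\n' ~ '<think>' ~ '\\n' }}"
    "{%- endif -%}"
)
# Modified generation prompt: header removed, leaving it to the model.
modified = (
    "{%- if add_generation_prompt -%}"
    "{{- ']~b]ai' ~ '\\n' }}"
    "{%- endif -%}"
)

print(repr(Template(original).render(add_generation_prompt=True)))
# ']~b]ai\n<think>\n'
print(repr(Template(modified).render(add_generation_prompt=True)))
# ']~b]ai\n'
```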

Notes:

  • Unclear if this is model-specific, template-specific, or a broader issue with how <think> is handled.
  • Reporting in case others encounter similar behavior or if this indicates a mismatch between model expectations and template defaults.

Thanks for sharing the fix! I used the model for a while without any problems, so I think the bug was introduced recently; llama.cpp made some changes to its template processing.

Indeed, it seems like a problem with llama.cpp. I also created an issue on their GitHub repository:
https://github.com/ggml-org/llama.cpp/issues/21465
Here's the link if you want to follow up.