Thinking mode and Effort

#3
by vkhaitan1 - opened

Sarvam models have options for enable_thinking and reasoning_effort, but they don't work with this GGUF with the patch you created.
Can you find out why? Any template issue?

Fixed and updated all the quants. Thanks for flagging.

Unfortunately, I tried the new quants, but it still doesn't work.
-rea off still shows thinking happening!
I have my own filters set in OpenWebUI for enable_thinking, and they work for Qwen/GLM models but not for this one. So some other fix is still required somewhere.

Forgot to mention that I rebased your PR onto the latest master commit.

Owner
β€’
edited Mar 18

Thanks for testing the new quants and reporting back!

This is actually an upstream issue with the base sarvam-30b model itself, not something introduced by the GGUF conversion or quantization. The same behavior has been reported on the original model repo; see sarvamai/sarvam-30b#11. Even when using enable_thinking=False with vLLM/Transformers on the original safetensors weights, the model still produces <think> tokens.

For context, -rea off in llama.cpp only controls how the server parses thinking tokens in the output; it doesn't prevent the model from generating them. To actually suppress thinking, the chat template needs to handle enable_thinking=False properly (like Qwen3 does), and the model needs to have been trained to respect that toggle. Looking at Sarvam's model card, all their examples use enable_thinking=True, which suggests a no-think mode may not be supported yet.
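To make the template side of this concrete, here is a minimal sketch (not Sarvam's or Qwen's actual template) of how a chat template can honor an enable_thinking toggle the Qwen3 way: when thinking is disabled, the template pre-fills an empty think block so the model treats the reasoning phase as already finished. All token names below are illustrative.

```python
# Toy chat-template renderer demonstrating the enable_thinking mechanism.
# Token names (<|im_start|>, <think>, etc.) are illustrative, not Sarvam's.

def render_prompt(messages, enable_thinking=True):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    if not enable_thinking:
        # Pre-filled empty think block: the model sees reasoning as "done"
        # and (if trained for it) proceeds straight to the answer.
        parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)

msgs = [{"role": "user", "content": "Hi"}]
print(render_prompt(msgs, enable_thinking=False))
```

The key point: the toggle only works if the model was trained on prompts shaped this way. A template can insert `<|nothink|>` or an empty think block, but an untrained model will simply generate `<think>` tokens anyway, which is exactly the upstream issue here.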

https://huggingface.co/sarvamai/sarvam-30b/discussions/11

Okay, but reasoning_effort should work, right? That is part of their API. Did you check whether it works? This parameter is difficult to test because the amount of reasoning also depends on the prompt; there is no hard and fast rule. So the practical check would be whether the template actually passes this parameter through to the model at all.

enable_thinking=false: the template correctly inserts the <|nothink|> token after the user message, but the model still generates content anyway. This is an upstream model training issue, not a template/GGUF problem. Same issue as sarvamai/sarvam-30b#11.

reasoning_effort: this parameter is not implemented in Sarvam's official chat template. It's simply not referenced anywhere in the Jinja2 template, so it gets silently ignored. You can verify this yourself in the official chat_template.jinja (https://huggingface.co/sarvamai/sarvam-30b/blob/main/chat_template.jinja). The only thinking-related parameters the template supports are enable_thinking and reasoning_content.
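If you want to verify this kind of claim quickly without reading the whole template, a simple check is to scan the template source for the variable names in question. The snippet below uses a toy stand-in string for the real chat_template.jinja contents; to test the real file, read it from disk instead.

```python
import re

def referenced_vars(template_text, candidates):
    """Return the subset of candidate variable names that appear
    as whole-word identifiers in the template source."""
    found = set()
    for name in candidates:
        if re.search(rf"\b{re.escape(name)}\b", template_text):
            found.add(name)
    return found

# Toy stand-in for the real chat_template.jinja contents:
template = "{% if enable_thinking %}{% else %}<|nothink|>{% endif %}"
print(referenced_vars(template, ["enable_thinking", "reasoning_effort"]))
```

A parameter that never appears in the template source cannot influence the rendered prompt, which is why passing reasoning_effort through apply_chat_template is silently ignored.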

Both the enable_thinking=false and reasoning_effort issues need to be addressed by the Sarvam team in the base model/template.

Right, but that basically means that Sarvam's API docs (https://docs.sarvam.ai/api-reference-docs/api-guides-tutorials/chat-completion/how-to/adjust-the-models-thinking-level)
offer the reasoning_effort level as API-only. They didn't release this mechanism to the public? Maybe the template could have that parameter, but they just hid it from the public chat template?

Yes, you're probably right. Looking at Sarvam's API docs, reasoning_effort is passed as a parameter to their hosted API (client.chat.completions()), but they don't document how it's implemented under the hood. It could be:

- A modified/internal chat template that they didn't release publicly
- Server-side system prompt injection (e.g. "Think briefly" vs "Think step by step")
- Sampling parameter adjustments (temperature, max thinking tokens, etc.)

Either way, it's not in the public chat template, so there's no way for us to replicate it in llama.cpp. This would need the Sarvam team to either update the public template or document how reasoning_effort works so the community can implement it.
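Purely for illustration, the prompt-injection possibility could be sketched like this. Nothing here reflects Sarvam's real implementation; the effort levels and hint wording are invented.

```python
# Speculative sketch of "server-side system prompt injection": map a
# reasoning_effort value to an instruction prepended as a system message
# before the chat template is applied. Levels and wording are invented.

EFFORT_HINTS = {
    "low": "Answer directly with minimal reasoning.",
    "medium": "Think briefly before answering.",
    "high": "Think step by step in detail before answering.",
}

def inject_effort(messages, reasoning_effort="medium"):
    hint = EFFORT_HINTS.get(reasoning_effort, EFFORT_HINTS["medium"])
    return [{"role": "system", "content": hint}] + list(messages)

msgs = inject_effort([{"role": "user", "content": "Explain RSA."}], "low")
print(msgs[0]["content"])
```

An approach like this would work with any model and template unchanged, which is one reason hosted APIs often prefer it over shipping a modified template.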
