How to turn off thinking mode
I know there are three thinking modes, but in some scenarios even the Low mode is still too slow for my use case.
Is there a way to get something like Qwen's no_think mode?
There is no way to turn off reasoning; however, you can control the amount of effort by specifying the reasoning effort, which can be either low, medium, or high.
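For context, the Harmony chat format encodes the effort level as a plain "Reasoning: <level>" line inside the system message. A minimal sketch (the exact surrounding wording of the system message is an assumption):

```python
# Sketch of how the effort level appears in the Harmony system message.
# Only the "Reasoning: <level>" line is what the chat template varies;
# the rest of the wording here is an assumption.
def harmony_system_prompt(effort: str = "medium") -> str:
    if effort not in {"low", "medium", "high"}:
        raise ValueError("effort must be low, medium, or high")
    return ("You are ChatGPT, a large language model trained by OpenAI.\n"
            f"Reasoning: {effort}")

print(harmony_system_prompt("low"))
```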
A tricky workaround to disable reasoning mode is to edit the chat_template.jinja file, changing the add_generation_prompt from:
"<|start|>assistant"
to:
"<|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant"
That's a hit!
How does it impact the model performance?
Yeah, that’s the hacky way to fully kill reasoning mode in GPT-OSS. But just keep in mind, in their paper they mention the model was post-trained with CoT-RL, which means it always uses reasoning (with variable effort: low/med/high). So turning it off like this isn’t really how the model was designed to work.
I haven’t benchmarked it myself, but I think it will decrease performance a lot — not just on reasoning-heavy stuff (math, coding, logic), but even on simpler tasks, since the model was never trained to operate without reasoning.
I have uploaded that version to my profile, fully ready to install!
I'd appreciate it if you want to try it 😊
I think it will just show the final message/response from the model and hide the reasoning, so I don't think it will affect performance.
Hi. I'm happy to join this thread.
I fine-tuned GPT-OSS 20B with a technique that helps maintain its native reasoning capability. It worked.
In my case, I need to process 570k prompts and, with native thinking enabled, it takes about 147 s (mean) to complete a generation (2 NVIDIA H100s, MXFP4 quantization enabled, batched inference with batch size 8).
Timings drop drastically to ~30 s when I apply @anhnmt's workaround. Then I looked at the outputs. I did not set up any quantitative benchmark, but I saw consistent performance degradation.
In the fine-tuning scenario, an idea could be to train the model while also computing the loss on reasoning tokens (which is done by default).
With consistent training, this would lead to direct generation and the removal of thinking. I see a caveat: what would happen in this hypothetical scenario is that the LoRA adapter (assuming training follows the PEFT approach) would be trained to accomplish two things:
- Produce the expected output, which is useful in my scenario.
- Forget the capability to reason.
BUT! Since the base model weights are frozen, the LoRA adapter would suffer from having to learn these two things together instead of focusing on the main task only.
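As a sketch of the alternative (masking the reasoning span in the labels so the adapter focuses on the final answer only), using the usual Hugging Face convention of -100 as the ignore index; the span indices are assumed to come from your own tokenization step:

```python
IGNORE_INDEX = -100  # labels with this value are excluded from the loss

def mask_reasoning(input_ids, reasoning_start, reasoning_end):
    """Copy input_ids into labels, masking out the reasoning span.

    reasoning_start/reasoning_end delimit the analysis-channel tokens
    in the tokenized example (half-open interval).
    """
    labels = list(input_ids)
    for i in range(reasoning_start, reasoning_end):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: tokens at indices 2..4 are the reasoning span.
print(mask_reasoning([10, 11, 12, 13, 14, 15, 16], 2, 5))
# -> [10, 11, -100, -100, -100, 15, 16]
```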
TL;DR: I decided to accept longer processing times, prioritising quality over speed.
I would like to hear more on this topic. I'm interested in finding the sweet spot between effectiveness and efficiency.
Hi,
Just a quick note on my latest post here. I set up a proper benchmark to quantify the performance of my fine-tuned GPT-OSS 20B with and without thinking.
Surprisingly, I was wrong. I didn't notice any degradation. Indeed, my model kept the same performance.
Probably this is due to the fine-tuning setup. I don't know whether these considerations still hold in a base-model inference scenario.
Anyway, I thought it was a good idea to share this insight.
How do I set this workaround when using vLLM? How do I use the edited chat_template.jinja with it?
Would the following work?
Edit this part: https://huggingface.co/openai/gpt-oss-20b/blob/main/chat_template.jinja#L330
Then read the chat template and call apply_chat_template:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the edited template and override the tokenizer's default.
with open("chat_template.jinja", "r", encoding="utf-8") as f:
    tokenizer.chat_template = f.read()

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="low",
)
what did you benchmark?
You cannot disable reasoning mode in vLLM by editing chat_template.jinja or by calling apply_chat_template() like you would with standard Hugging Face models.
vLLM does not use apply_chat_template() for models such as OpenAI’s GPT-OSS series. Instead, it relies on the Harmony framework internally to construct prompts and parse outputs.
If you want to disable reasoning mode in vLLM, you must modify the vLLM source code directly.
For example, I am using vLLM 0.10.2, and the relevant file is:
vllm/entrypoints/harmony_utils.py
1. Modify render_for_completion
This function generates the token_ids that are passed to the model. You can append custom tokens to bypass reasoning:
def render_for_completion(messages: list[Message]) -> list[int]:
    # a tricky way to bypass reasoning
    extended_token_ids = [200005, 35644, 200008, 200007, 200006, 173781]
    conversation = Conversation.from_messages(messages)
    token_ids = get_encoding().render_conversation_for_completion(
        conversation, Role.ASSISTANT)
    token_ids.extend(extended_token_ids)
    return token_ids
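For reference, the appended ids appear to spell out the same suffix as the chat-template trick. The special-token ids (200005-200008) match the o200k_harmony vocabulary; mapping the two remaining text tokens to "analysis" and "assistant" is an assumption:

```python
# Hypothesized decomposition of the appended suffix, which would render as
# "<|channel|>analysis<|message|><|end|><|start|>assistant":
SUFFIX = [
    (200005, "<|channel|>"),
    (35644, "analysis"),    # assumption: plain-text token
    (200008, "<|message|>"),
    (200007, "<|end|>"),
    (200006, "<|start|>"),
    (173781, "assistant"),  # assumption: plain-text token
]
extended_token_ids = [tid for tid, _ in SUFFIX]
print(extended_token_ids)
```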
2. Update parse_chat_output
You also need to modify how the output is parsed to correctly extract the final content:
def parse_chat_output(
        token_ids: Sequence[int]) -> tuple[Optional[str], Optional[str], bool]:
    parser = parse_output_into_messages(token_ids)
    output_msgs = parser.messages
    is_tool_call = False  # TODO: update this when tool call is supported
    if len(output_msgs) == 0:
        # The generation has stopped during reasoning.
        reasoning_content = parser.current_content
        final_content = None
    elif len(output_msgs) == 1:
        # The generation has stopped during the final message.
        # reasoning_content = output_msgs[0].content[0].text
        # final_content = parser.current_content
        # With reasoning bypassed, the generation has final content only.
        reasoning_content = None
        final_content = output_msgs[0].content[0].text
    else:
        reasoning_msgs = output_msgs[:-1]
        final_msg = output_msgs[-1]
        reasoning_content = "\n".join(
            msg.content[0].text for msg in reasoning_msgs)
        final_content = final_msg.content[0].text
    return reasoning_content, final_content, is_tool_call
After that, it will work as expected!
Hi @anhnmt, a few follow-up questions:
- Are you saying you only modified the server code?
- Which vLLM endpoint do you call?
- Do you have a fully working example (client code)?
- Can't you use the following vLLM endpoint? It should accept token_ids:
(APIServer pid=1) INFO 02-26 19:56:17 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
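For what it's worth, a minimal client-side sketch: assuming only the server's harmony_utils.py was patched, the client could keep using the standard OpenAI-compatible /v1/chat/completions route unchanged (the model name and request fields below are assumptions):

```python
import json

def build_chat_request(user_text: str,
                       model: str = "openai/gpt-oss-20b") -> str:
    """Build a JSON body for POST /v1/chat/completions on a vLLM server."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 512,
    })

print(build_chat_request("Hello"))
```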
Unfortunately, I cannot disclose it because it's a company benchmark under NDA. I'm working in information extraction, and maybe the task itself is already "closed" enough that thinking does not contribute significantly to better performance in this restricted domain.