How to turn off thinking mode
I know there are three thinking modes, but in some scenarios even the Low mode is still too slow for my use case.
Is there a way to get something like Qwen's no_think mode?
There is no way to turn off reasoning; however, you can control the amount of effort by specifying the reasoning effort, which can be either low, medium, or high.
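For context, the Harmony chat format encodes the effort level as a plain "Reasoning: <level>" line inside the system message. A minimal sketch (the exact surrounding wording of the system message is an assumption):

```python
# Sketch of how the effort level appears in the Harmony system message.
# Only the "Reasoning: <level>" line is what the chat template varies;
# the rest of the wording here is an assumption.
def harmony_system_prompt(effort: str = "medium") -> str:
    if effort not in {"low", "medium", "high"}:
        raise ValueError("effort must be low, medium, or high")
    return ("You are ChatGPT, a large language model trained by OpenAI.\n"
            f"Reasoning: {effort}")

print(harmony_system_prompt("low"))
```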
A tricky workaround to disable reasoning mode is to edit the chat_template.jinja file, changing the add_generation_prompt from:
"<|start|>assistant"
to:
"<|start|>assistant<|channel|>analysis<|message|><|end|><|start|>assistant"
That's a hit!
How does it impact the model performance?
Yeah, that’s the hacky way to fully kill reasoning mode in GPT-OSS. But just keep in mind, in their paper they mention the model was post-trained with CoT-RL, which means it always uses reasoning (with variable effort: low/med/high). So turning it off like this isn’t really how the model was designed to work.
I haven’t benchmarked it myself, but I think it will decrease performance a lot — not just on reasoning-heavy stuff (math, coding, logic), but even on simpler tasks, since the model was never trained to operate without reasoning.
I have uploaded that version to my profile, fully ready to install!
I'd appreciate it if you want to try it 😊
I think it will just show the final message/response from the model and hide the reasoning, so I don't think it will affect performance.
Hi. I'm happy to join this thread.
I fine-tuned GPT-OSS 20B with a technique that helps maintain its native reasoning capability. It worked.
In my case, I need to process 570k prompts and, with native thinking enabled, it takes about 147 s (mean) to complete a generation (2 NVIDIA H100s, MXFP4 quantization enabled, batched inference with batch size 8).
Timings drop drastically to ~30 s when I apply @anhnmt's workaround. Then I looked at the outputs. I did not set up any quantitative benchmark, but I saw consistent performance degradation.
In the fine-tuning scenario, an idea could be to train the model while also computing the loss on reasoning tokens (which is done by default).
With consistent training, this would lead to direct generation and the removal of thinking. I see a caveat: what would happen in this hypothetical scenario is that the LoRA adapter (assuming training follows the PEFT approach) would be trained to accomplish two things:
- Produce the expected output, which is useful in my scenario.
- Forget the capability to reason.
BUT! Since the base model weights are frozen, the LoRA adapter would suffer from having to learn these two things together instead of focusing on the main task only.
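As a sketch of the alternative (masking the reasoning span in the labels so the adapter focuses on the final answer only), using the usual Hugging Face convention of -100 as the ignore index; the span indices are assumed to come from your own tokenization step:

```python
IGNORE_INDEX = -100  # labels with this value are excluded from the loss

def mask_reasoning(input_ids, reasoning_start, reasoning_end):
    """Copy input_ids into labels, masking out the reasoning span.

    reasoning_start/reasoning_end delimit the analysis-channel tokens
    in the tokenized example (half-open interval).
    """
    labels = list(input_ids)
    for i in range(reasoning_start, reasoning_end):
        labels[i] = IGNORE_INDEX
    return labels

# Toy example: tokens at indices 2..4 are the reasoning span.
print(mask_reasoning([10, 11, 12, 13, 14, 15, 16], 2, 5))
# -> [10, 11, -100, -100, -100, 15, 16]
```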
TL;DR: I decided to accept longer processing times, prioritising quality over speed.
I would like to hear more on this topic. I'm interested in finding the sweet spot between effectiveness and efficiency.
Hi,
Just a quick note on my latest post here. I set up a proper benchmark to quantify the performance of my fine-tuned GPT-OSS 20B with and without thinking.
Surprisingly, I was wrong. I didn't notice any degradation. Indeed, my model kept the same performance.
Probably this is due to the fine-tuning setup. I don't know whether these considerations still hold in a base-model inference scenario.
Anyway, I thought it was a good idea to share this insight.
How do I set this workaround when using vLLM? How do I use the edited chat_template.jinja with it?
Would the following work?
Edit this part: https://huggingface.co/openai/gpt-oss-20b/blob/main/chat_template.jinja#L330
Then read the chat template and call apply_chat_template:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the edited template and override the tokenizer's default.
with open("chat_template.jinja", "r", encoding="utf-8") as f:
    tokenizer.chat_template = f.read()

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="low",
)
what did you benchmark?
You cannot disable reasoning mode in vLLM by editing chat_template.jinja or by calling apply_chat_template() like you would with standard Hugging Face models.
vLLM does not use apply_chat_template() for models such as OpenAI’s GPT-OSS series. Instead, it relies on the Harmony framework internally to construct prompts and parse outputs.
If you want to disable reasoning mode in vLLM, you must modify the vLLM source code directly.
For example, I am using vLLM 0.10.2, and the relevant file is:
vllm/entrypoints/harmony_utils.py
1. Modify render_for_completion
This function generates the token_ids that are passed to the model. You can append custom tokens to bypass reasoning:
def render_for_completion(messages: list[Message]) -> list[int]:
    # a tricky way to bypass reasoning
    extended_token_ids = [200005, 35644, 200008, 200007, 200006, 173781]
    conversation = Conversation.from_messages(messages)
    token_ids = get_encoding().render_conversation_for_completion(
        conversation, Role.ASSISTANT)
    token_ids.extend(extended_token_ids)
    return token_ids
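For reference, the appended ids appear to spell out the same suffix as the chat-template trick. The special-token ids (200005-200008) match the o200k_harmony vocabulary; mapping the two remaining text tokens to "analysis" and "assistant" is an assumption:

```python
# Hypothesized decomposition of the appended suffix, which would render as
# "<|channel|>analysis<|message|><|end|><|start|>assistant":
SUFFIX = [
    (200005, "<|channel|>"),
    (35644, "analysis"),    # assumption: plain-text token
    (200008, "<|message|>"),
    (200007, "<|end|>"),
    (200006, "<|start|>"),
    (173781, "assistant"),  # assumption: plain-text token
]
extended_token_ids = [tid for tid, _ in SUFFIX]
print(extended_token_ids)
```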
2. Update parse_chat_output
You also need to modify how the output is parsed to correctly extract the final content:
def parse_chat_output(
        token_ids: Sequence[int]) -> tuple[Optional[str], Optional[str], bool]:
    parser = parse_output_into_messages(token_ids)
    output_msgs = parser.messages
    is_tool_call = False  # TODO: update this when tool call is supported
    if len(output_msgs) == 0:
        # The generation has stopped during reasoning.
        reasoning_content = parser.current_content
        final_content = None
    elif len(output_msgs) == 1:
        # The generation has stopped during the final message.
        # reasoning_content = output_msgs[0].content[0].text
        # final_content = parser.current_content
        # With reasoning bypassed, the generation has final content only.
        reasoning_content = None
        final_content = output_msgs[0].content[0].text
    else:
        reasoning_msgs = output_msgs[:-1]
        final_msg = output_msgs[-1]
        reasoning_content = "\n".join(
            msg.content[0].text for msg in reasoning_msgs)
        final_content = final_msg.content[0].text
    return reasoning_content, final_content, is_tool_call
After that, it will work as expected!
Hi @anhnmt, a few follow-up questions:
- Are you saying you only modified the server code?
- Which vLLM endpoint do you call?
- Do you have a fully working example (client code)?
- Can't you use the following vLLM endpoint? It should accept token_ids:
(APIServer pid=1) INFO 02-26 19:56:17 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
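For what it's worth, a minimal client-side sketch: assuming only the server's harmony_utils.py was patched, the client could keep using the standard OpenAI-compatible /v1/chat/completions route unchanged (the model name and request fields below are assumptions):

```python
import json

def build_chat_request(user_text: str,
                       model: str = "openai/gpt-oss-20b") -> str:
    """Build a JSON body for POST /v1/chat/completions on a vLLM server."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": 512,
    })

print(build_chat_request("Hello"))
```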
Unfortunately, I cannot disclose it because it's a company benchmark under NDA. I'm working in information extraction, and maybe the task itself is already "closed" enough that thinking does not contribute significantly to better performance in this restricted domain.