Instructions to use nvidia/OpenReasoning-Nemotron-32B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use nvidia/OpenReasoning-Nemotron-32B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/OpenReasoning-Nemotron-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/OpenReasoning-Nemotron-32B")
model = AutoModelForCausalLM.from_pretrained("nvidia/OpenReasoning-Nemotron-32B")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use nvidia/OpenReasoning-Nemotron-32B with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "nvidia/OpenReasoning-Nemotron-32B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/OpenReasoning-Nemotron-32B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- SGLang
How to use nvidia/OpenReasoning-Nemotron-32B with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "nvidia/OpenReasoning-Nemotron-32B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/OpenReasoning-Nemotron-32B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "nvidia/OpenReasoning-Nemotron-32B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/OpenReasoning-Nemotron-32B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use nvidia/OpenReasoning-Nemotron-32B with Docker Model Runner:
docker model run hf.co/nvidia/OpenReasoning-Nemotron-32B
long outputs
Could you comment on the issue of the long outputs generated by the latest reasoning models? Are they expected to produce thousands of tokens for each prompt?
Yes, these models are expected to think for many tokens before finalizing the answer. We recommend allowing up to 64K output tokens. It should be possible to make them more efficient in token usage, or even to add a controllable token budget with a separate round of RL, but we haven't done that yet.
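For example, when calling one of the OpenAI-compatible servers shown above, the output budget can be set per request with max_tokens. A minimal sketch using the openai Python client; the port assumes the vLLM example above, and the 65536 value is simply one way to apply the 64K recommendation:

# Minimal sketch: assumes the vLLM server from the example above is running on port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/OpenReasoning-Nemotron-32B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=65536,  # leave room for the long reasoning trace before the final answer
)
print(response.choices[0].message.content)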
I see that you changed the eos_token. Will it affect this behaviour?
No, it should only affect things if you create a finetuned version of the model. The current models' behavior should stay the same.
Could you help clarify what impact this has on generation?
If the model was originally trained to emit 151643 as the eos token, but the runtime now expects 151645, wouldn't that cause a mismatch, where generation might not stop unless the new token happens to be emitted?
Does the model actually emit 151645 under current weights, or was it trained to use 151643? (which EOS token is actually the "correct" one from the model’s point of view?)
Also, as I understand it, this change would require re-exporting the model to GGUF, since the llama.cpp converter reads these config files. So even though the model weights remain unchanged, a new GGUF would need to be generated to reflect the updated eos_token and its ID.
The model would always end with <|im_end|>\n<|endoftext|>, which corresponds to [151645, 198, 151643]. So basically, the new change will stop generation two tokens earlier, which shouldn't really matter in most situations. But if you finetune this model on new data without <|endoftext|>, it will not stop properly without this PR we merged.
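For reference, the mapping between these IDs and token strings can be checked directly with the tokenizer. A minimal sketch; the exact strings printed by convert_ids_to_tokens depend on the tokenizer's byte-level encoding:

# Minimal sketch: check which strings the IDs discussed above map to
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/OpenReasoning-Nemotron-32B")

# 151645 = <|im_end|>, 198 = "\n", 151643 = <|endoftext|> per the explanation above
print(tokenizer.convert_ids_to_tokens([151645, 198, 151643]))
print(repr(tokenizer.decode([198])))                # the newline between the two stop tokens
print(tokenizer.eos_token, tokenizer.eos_token_id)  # the EOS token the runtime will stop on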
Thank you