Instructions to use nvidia/OpenReasoning-Nemotron-32B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use nvidia/OpenReasoning-Nemotron-32B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/OpenReasoning-Nemotron-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/OpenReasoning-Nemotron-32B")
model = AutoModelForCausalLM.from_pretrained("nvidia/OpenReasoning-Nemotron-32B")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use nvidia/OpenReasoning-Nemotron-32B with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "nvidia/OpenReasoning-Nemotron-32B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/OpenReasoning-Nemotron-32B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- SGLang
How to use nvidia/OpenReasoning-Nemotron-32B with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "nvidia/OpenReasoning-Nemotron-32B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/OpenReasoning-Nemotron-32B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "nvidia/OpenReasoning-Nemotron-32B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/OpenReasoning-Nemotron-32B",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use nvidia/OpenReasoning-Nemotron-32B with Docker Model Runner:
docker model run hf.co/nvidia/OpenReasoning-Nemotron-32B
long outputs
Could you comment on the issue of the long outputs generated by the latest reasoning models? Are they expected to produce thousands of tokens for each prompt?
Yes, these models are expected to think for many tokens before finalizing the answer. We recommend allowing up to 64K output tokens. It should be possible to make them more efficient in token usage, or even to add a controllable token budget with a separate round of RL, but we haven't done that yet.
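For example, when calling one of the OpenAI-compatible servers shown above, the output budget can be set per request with max_tokens. A minimal sketch using the openai Python client; the port assumes the vLLM example above, and the 65536 value is simply one way to apply the 64K recommendation:

# Minimal sketch: assumes the vLLM server from the example above is running on port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/OpenReasoning-Nemotron-32B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=65536,  # leave room for the long reasoning trace before the final answer
)
print(response.choices[0].message.content)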
I see that you changed the eos_token. Will it affect this behaviour?
No, it should only affect things if you create a finetuned version of the model. The current models' behavior should stay the same.
Could you help clarify what impact this has on generation?
If the model was originally trained to emit 151643 as the eos token, but the runtime now expects 151645, wouldn't that cause a mismatch, where generation might not stop unless the new token happens to be emitted?
Does the model actually emit 151645 under current weights, or was it trained to use 151643? (which EOS token is actually the "correct" one from the model’s point of view?)
Also, as I understand it, this change would require re-exporting the model to GGUF, since the llama.cpp converter reads these config files. So even though the model weights remain unchanged, a new GGUF would need to be generated to reflect the updated eos_token and its ID.
The model would always end with <|im_end|>\n<|endoftext|>, which corresponds to [151645, 198, 151643]. So basically, the new change will stop generation two tokens earlier, which shouldn't really matter in most situations. But if you finetune this model on new data without <|endoftext|>, it will not stop properly without this PR we merged.
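For reference, the mapping between these IDs and token strings can be checked directly with the tokenizer. A minimal sketch; the exact strings printed by convert_ids_to_tokens depend on the tokenizer's byte-level encoding:

# Minimal sketch: check which strings the IDs discussed above map to
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/OpenReasoning-Nemotron-32B")

# 151645 = <|im_end|>, 198 = "\n", 151643 = <|endoftext|> per the explanation above
print(tokenizer.convert_ids_to_tokens([151645, 198, 151643]))
print(repr(tokenizer.decode([198])))                # the newline between the two stop tokens
print(tokenizer.eos_token, tokenizer.eos_token_id)  # the EOS token the runtime will stop on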
Thank you