Instructions to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="nvidia/Nemotron-Cascade-8B-Intermediate-ckpts")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("nvidia/Nemotron-Cascade-8B-Intermediate-ckpts", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/nvidia/Nemotron-Cascade-8B-Intermediate-ckpts
- SGLang
How to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with Docker Model Runner:
docker model run hf.co/nvidia/Nemotron-Cascade-8B-Intermediate-ckpts
Nemotron-Cascade-8B Intermediate ckpts
Introduction
This repository releases the intermediate checkpoints produced during the development of Nemotron-Cascade-8B. Nemotron-Cascade-8B is a general-purpose model trained using a sequential, domain-wise reinforcement learning pipeline, illustrated in the figure below.
We release checkpoints corresponding to each major stage of training:
- Nemotron-Cascade-8B-SFT (completed multi-stage SFT)
- Nemotron-Cascade-8B-RLHF (completed RLHF)
- Nemotron-Cascade-8B-IFRL (completed instruction following RL)
- Nemotron-Cascade-8B-MathRL (completed Math RL)
- Nemotron-Cascade-8B-CodeRL (completed Code RL)
The final model, Nemotron-Cascade-8B, is obtained after the concluding SWE RL stage.
Usage Recommendations
We recommend using RoPE scaling with the YaRN method to better support contexts longer than 32K. This can be enabled by updating the model’s config.json as shown below:
{
...,
"rope_scaling": {
"rope_type": "yarn",
"factor": 2.0,
"original_max_position_embeddings": 32768
}
}
Results
Same as Nemotron-Cascade-8B, we use a maximum output length of 64K tokens for evaluation, with the temperature set to 0.6 and top-p to 0.95. We also apply RoPE scaling using the YaRN method with a scaling factor of 2.0.
| Benchmark Metric: Pass@1 |
Nemotron- Cascade-8B-SFT |
Nemotron- Cascade-8B-RLHF |
Nemotron- Cascade-8B-IFRL |
Nemotron- Cascade-8B-MathRL |
Nemotron- Cascade-8B-CodeRL |
Nemotron- Cascade-8B |
|---|---|---|---|---|---|---|
| Knowledge Reasoning | ||||||
| MMLU | 83.0 | 83.1 | 83.4 | 83.4 | 83.7 | 83.7 |
| MMLU Pro | 74.4 | 77.8 | 74.5 | 75.0 | 75.3 | 75.7 |
| GPQA-Diamond | 63.5 | 66.8 | 66.1 | 65.7 | 67.4 | 66.5 |
| Alignment | ||||||
| ArenaHard | 70.0 | 90.1 | 88.0 | 87.0 | 87.8 | 87.9 |
| IFEval (Strict Prompt) | 70.8 | 50.1 | 90.4 | 92.1 | 90.7 | 90.2 |
| IFBench | 21.2 | 24.5 | 40.5 | 40.4 | 38.1 | 40.8 |
| Math | ||||||
| AIME 2024 | 83.6 | 86.1 | 86.2 | 90.2 | 89.1 | 89.5 |
| AIME 2025 | 72.8 | 75.0 | 75.2 | 81.9 | 80.5 | 80.1 |
| Code | ||||||
| LCB v5 (08/24-02/25) | 59.2 | 70.2 | 70.2 | 70.6 | 75.3 | 74.3 |
| LCB v6 (08/24-05/25) | 56.7 | 67.2 | 66.7 | 67.4 | 71.5 | 71.1 |
| SWE Verified (Agentless) | 26.1 | 28.2 | 28.3 | 30.6 | 31.6 | 37.2 |
Chat Template
All intermediate checkpoints use the same chat template as Nemotron-Cascade-8B. Each is a unified model supporting both thinking and instruct (non-reasoning) modes. To switch between these two modes, simply append the " /think" (for thinking) or the " /no_think" (for instruct) tag to the end of the user input. See Nemotron-Cascade-8B for additional details.
Release Date
Dec 19, 2025
License
Your use of this model is governed by the NVIDIA Open Model License.
Citation
@article{Nemotron_Cascade_Scaling_Cascaded_Reinforcement_Learning,
title={Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models},
author={Wang, Boxin and Lee, Chankyu and Lee, Nayeon and Lin, Sheng-Chieh and Dai, Wenliang and Chen, Yang and Chen, Yangyi and Yang, Zhuolin and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
year={2025}
}