Instructions to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/Nemotron-Cascade-8B-Intermediate-ckpts")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/Nemotron-Cascade-8B-Intermediate-ckpts", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/nvidia/Nemotron-Cascade-8B-Intermediate-ckpts

SGLang

How to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-8B-Intermediate-ckpts",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use nvidia/Nemotron-Cascade-8B-Intermediate-ckpts with Docker Model Runner:
```
docker model run hf.co/nvidia/Nemotron-Cascade-8B-Intermediate-ckpts
```

Nemotron-Cascade-8B Intermediate ckpts

Introduction

This repository releases the intermediate checkpoints produced during the development of Nemotron-Cascade-8B. Nemotron-Cascade-8B is a general-purpose model trained using a sequential, domain-wise reinforcement learning pipeline, illustrated in the figure below.

We release checkpoints corresponding to each major stage of training:

Nemotron-Cascade-8B-SFT (completed multi-stage SFT)
Nemotron-Cascade-8B-RLHF (completed RLHF)
Nemotron-Cascade-8B-IFRL (completed instruction following RL)
Nemotron-Cascade-8B-MathRL (completed Math RL)
Nemotron-Cascade-8B-CodeRL (completed Code RL)

The final model, Nemotron-Cascade-8B, is obtained after the concluding SWE RL stage.

Usage Recommendations

We recommend using RoPE scaling with the YaRN method to better support contexts longer than 32K. This can be enabled by updating the model’s config.json as shown below:

  {
    ...,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 2.0,
        "original_max_position_embeddings": 32768
    }
  }

Results

Same as Nemotron-Cascade-8B, we use a maximum output length of 64K tokens for evaluation, with the temperature set to 0.6 and top-p to 0.95. We also apply RoPE scaling using the YaRN method with a scaling factor of 2.0.

Benchmark Metric: Pass@1	Nemotron- Cascade-8B-SFT	Nemotron- Cascade-8B-RLHF	Nemotron- Cascade-8B-IFRL	Nemotron- Cascade-8B-MathRL	Nemotron- Cascade-8B-CodeRL	Nemotron- Cascade-8B
Knowledge Reasoning
MMLU	83.0	83.1	83.4	83.4	83.7	83.7
MMLU Pro	74.4	77.8	74.5	75.0	75.3	75.7
GPQA-Diamond	63.5	66.8	66.1	65.7	67.4	66.5
Alignment
ArenaHard	70.0	90.1	88.0	87.0	87.8	87.9
IFEval (Strict Prompt)	70.8	50.1	90.4	92.1	90.7	90.2
IFBench	21.2	24.5	40.5	40.4	38.1	40.8
Math
AIME 2024	83.6	86.1	86.2	90.2	89.1	89.5
AIME 2025	72.8	75.0	75.2	81.9	80.5	80.1
Code
LCB v5 (08/24-02/25)	59.2	70.2	70.2	70.6	75.3	74.3
LCB v6 (08/24-05/25)	56.7	67.2	66.7	67.4	71.5	71.1
SWE Verified (Agentless)	26.1	28.2	28.3	30.6	31.6	37.2

Chat Template

All intermediate checkpoints use the same chat template as Nemotron-Cascade-8B. Each is a unified model supporting both thinking and instruct (non-reasoning) modes. To switch between these two modes, simply append the " /think" (for thinking) or the " /no_think" (for instruct) tag to the end of the user input. See Nemotron-Cascade-8B for additional details.

Release Date

Dec 19, 2025

License

Your use of this model is governed by the NVIDIA Open Model License.

Citation

@article{Nemotron_Cascade_Scaling_Cascaded_Reinforcement_Learning,
  title={Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models},
  author={Wang, Boxin and Lee, Chankyu and Lee, Nayeon and Lin, Sheng-Chieh and Dai, Wenliang and Chen, Yang and Chen, Yangyi and Yang, Zhuolin and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  year={2025}
}

Downloads last month: -; Downloads are not tracked for this model. How to track

Collection including nvidia/Nemotron-Cascade-8B-Intermediate-ckpts

Nemotron-Cascade

Collection

Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models • 14 items • Updated 24 days ago • 55

Papers for nvidia/Nemotron-Cascade-8B-Intermediate-ckpts

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Paper • 2512.13607 • Published Dec 15, 2025 • 40

YaRN: Efficient Context Window Extension of Large Language Models

Paper • 2309.00071 • Published Aug 31, 2023 • 85