# KnowRL-Nemotron-1.5B

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
## Model Summary
KnowRL-Nemotron-1.5B is a 1.5B-parameter math reasoning model trained with reinforcement learning (DAPO/GRPO) under minimal-sufficient knowledge point (KP) guidance. It is fine-tuned from nvidia/OpenMath-Nemotron-1.5B and achieves state-of-the-art results among 1.5B-scale models on competition-level math benchmarks.
Instead of injecting long solution hints or full reasoning templates, KnowRL decomposes guidance into atomic knowledge points (KPs) and identifies the minimal subset required to unlock reward learning — achieving more with less.
## Key Highlights
- 74.16 average accuracy (CSS) across 8 competition-level math benchmarks — new SOTA at 1.5B scale
- 70.08 average accuracy even without KP hints at inference, demonstrating genuine policy improvement (+9.63 over baseline)
- Trained with ~38% fewer KPs than full-KP injection via the CSS (Constrained Subset Search) selection strategy
- Reward sparsity reduced from 41.21% zero-correct to 13.00% during training
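The zero-correct rate above can be sketched as follows: for group-relative algorithms like GRPO, a prompt whose entire rollout group is wrong yields zero advantage and thus no gradient signal, so the fraction of such "dead" groups measures reward sparsity. This is an illustrative sketch of that (assumed) definition, not code from the KnowRL pipeline.

```python
def zero_correct_rate(rollout_correctness: list[list[bool]]) -> float:
    """Percentage of prompts whose rollout group contains no correct answer.

    rollout_correctness[i][j] is True iff rollout j for prompt i is correct.
    Groups with zero correct rollouts contribute no learning signal under
    group-relative advantage estimation (e.g., GRPO).
    """
    if not rollout_correctness:
        return 0.0
    dead = sum(1 for group in rollout_correctness if not any(group))
    return 100.0 * dead / len(rollout_correctness)

# Toy example: 4 prompts, 8 rollouts each; one group is entirely wrong.
groups = [
    [True] * 8,
    [False] * 8,            # zero-correct group: no gradient signal
    [True, False] * 4,
    [False] * 7 + [True],
]
print(zero_correct_rate(groups))  # 25.0
```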
## Results
| Benchmark | w/o KP | CBRS | CSS |
|---|---|---|---|
| AIME 2024 | 69.79 | 75.52 | 74.58 |
| AIME 2025 | 64.69 | 65.00 | 65.21 |
| BRUMO 2025 | 69.48 | 78.33 | 78.12 |
| HMMT 2025 | 41.04 | 45.00 | 48.75 |
| AMC 2023 | 95.55 | 95.78 | 95.70 |
| CMIMC 2025 | 44.14 | 49.22 | 52.19 |
| MATH-500 | 95.70 | 96.45 | 96.20 |
| Olympiad Bench | 80.23 | 82.34 | 82.44 |
| Average | 70.08 | 73.46 | 74.16 |
w/o KP: No knowledge point hints at inference. CBRS / CSS: KP hints selected by the respective strategy are prepended to the prompt at inference.
## Usage

### Basic Inference (without KP hints)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HasuerYu/KnowRL-Nemotron-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

problem = "Find the sum of all positive integers n such that n^2 - 19n + 99 is a perfect square."
prompt = f"{problem}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature/top_p to take effect.
outputs = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.6, top_p=0.95)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
### Inference with KP Hints
For best performance, prepend selected knowledge points as a hint section in the prompt:
```python
knowledge_points = [
    "If n^2 - 19n + 99 = m^2, then (2n - 19)^2 - 4m^2 = -35.",
]
hint = "## Hint\n" + "\n".join(f"- {kp}" for kp in knowledge_points)
prompt = f"{problem}\n{hint}\nPlease reason step by step, and put your final answer within \\boxed{{}}."
```
### vLLM Serving

```shell
vllm serve HasuerYu/KnowRL-Nemotron-1.5B \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | nvidia/OpenMath-Nemotron-1.5B |
| Algorithm | DAPO / GRPO |
| Framework | verl + Ray |
| Learning rate | 1e-6 |
| Batch size | 256 |
| Max prompt length | 8,192 |
| Max response length | 32,768 |
| Samples per prompt | 8 |
| Total training steps | 2,960 |
| Hardware | 8× NVIDIA H100 nodes (64 GPUs) |
An entropy annealing strategy is applied: after step 2,590, the clip upper bound is reduced from 0.28 to 0.26 to encourage the policy to shift from exploration to exploitation, contributing +0.74 average accuracy.
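The annealing step above can be sketched as a scalar version of a decoupled-clipping (DAPO-style "clip-higher") objective, where the upper clip bound drops from 0.28 to 0.26 after step 2,590. The function shape and the lower bound `eps_low` are assumptions for illustration; only the upper-bound schedule is stated in this card.

```python
def dapo_clip_loss(ratio: float, advantage: float, step: int,
                   eps_low: float = 0.2, anneal_step: int = 2590) -> float:
    """Scalar sketch of a PPO-style clipped objective with a decoupled,
    annealed upper bound. The upper bound shrinks from 0.28 to 0.26 after
    the annealing step, reducing room for exploratory probability increases.
    eps_low = 0.2 is an assumed value, not taken from this card.
    """
    eps_high = 0.28 if step <= anneal_step else 0.26
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high) * advantage
    return -min(ratio * advantage, clipped)

# A ratio of 1.5 is clipped to 1.28 before annealing, 1.26 after.
print(dapo_clip_loss(1.5, 1.0, step=1000))  # -1.28
print(dapo_clip_loss(1.5, 1.0, step=3000))  # -1.26
```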
## How KnowRL Works

1. KP Extraction: Decompose solution guidance into atomic knowledge points (KPs)
2. KP Selection: Apply selection strategies (CSS, CBRS) to identify the minimal-sufficient subset of KPs per problem
3. RL Training: Train with DAPO/GRPO, injecting selected KPs as hints in the prompt during rollout
4. Inference: The trained model can be used with or without KP hints; even without hints, it significantly outperforms the baseline
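The KP Selection step can be pictured as a minimal-subset search. The sketch below is a hypothetical greedy backward-elimination pass, not the paper's exact CSS algorithm; it treats sufficiency as a black-box check (in practice, e.g., whether the policy produces at least one correct rollout when the subset is prepended as a hint).

```python
from typing import Callable, Sequence

def constrained_subset_search(
    kps: Sequence[str],
    is_sufficient: Callable[[list[str]], bool],
) -> list[str]:
    """Hypothetical sketch of minimal-sufficient KP selection.

    Starting from the full KP set, try dropping each KP; keep the drop if
    the remaining subset is still sufficient. `is_sufficient` is a black
    box standing in for a rollout-based check.
    """
    if not is_sufficient(list(kps)):
        return list(kps)  # even the full set is insufficient; keep everything
    subset = list(kps)
    for kp in list(kps):
        trial = [k for k in subset if k != kp]
        if is_sufficient(trial):  # kp is redundant given the rest; drop it
            subset = trial
    return subset

# Toy check: sufficiency here just means the subset contains "KP2".
print(constrained_subset_search(["KP1", "KP2", "KP3"], lambda s: "KP2" in s))  # ['KP2']
```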
## Related Resources
| Resource | Link |
|---|---|
| KnowRL Collection | HasuerYu/knowrl |
| Training Data | HasuerYu/KnowRL-Train-Data |
| KP Annotations | HasuerYu/KnowRL-KP-Annotations |
## Limitations
- Optimized for competition-level math reasoning; performance on other domains is not evaluated
- KP hint quality at inference depends on upstream KP extraction and selection pipelines
- The model inherits limitations from the base model (nvidia/OpenMath-Nemotron-1.5B)
## Citation
If you find this model helpful, please cite:
```bibtex
@misc{yu2026knowrlboostingllmreasoning,
  title={KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance},
  author={Linhao Yu and Tianmeng Yang and Siyu Ding and Renren Jin and Naibin Gu and Xiangzhao Hao and Shuaiyi Nie and Deyi Xiong and Weichong Yin and Yu Sun and Hua Wu},
  year={2026},
  eprint={2604.12627},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2604.12627},
}
```
## License
This model is released under the Apache 2.0 License.