KnowRL-Nemotron-1.5B

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance


Model Summary

KnowRL-Nemotron-1.5B is a 1.5B-parameter math reasoning model trained with reinforcement learning (DAPO/GRPO) under minimal-sufficient knowledge point (KP) guidance. It is fine-tuned from nvidia/OpenMath-Nemotron-1.5B and achieves state-of-the-art results among 1.5B-scale models on competition-level math benchmarks.

Instead of injecting long solution hints or full reasoning templates, KnowRL decomposes guidance into atomic knowledge points (KPs) and identifies the minimal subset required to unlock reward learning — achieving more with less.

Key Highlights

  • 74.16 average accuracy (CSS) across 8 competition-level math benchmarks — new SOTA at 1.5B scale
  • 70.08 average accuracy even without KP hints at inference, demonstrating genuine policy improvement (+9.63 over baseline)
  • Trained with ~38% fewer KPs than full-KP injection via the CSS (Constrained Subset Search) selection strategy
  • Reward sparsity reduced from 41.21% zero-correct to 13.00% during training

Results

| Benchmark | w/o KP | CBRS | CSS |
|---|---:|---:|---:|
| AIME 2024 | 69.79 | 75.52 | 74.58 |
| AIME 2025 | 64.69 | 65.00 | 65.21 |
| BRUMO 2025 | 69.48 | 78.33 | 78.12 |
| HMMT 2025 | 41.04 | 45.00 | 48.75 |
| AMC 2023 | 95.55 | 95.78 | 95.70 |
| CMIMC 2025 | 44.14 | 49.22 | 52.19 |
| MATH-500 | 95.70 | 96.45 | 96.20 |
| Olympiad Bench | 80.23 | 82.34 | 82.44 |
| **Average** | **70.08** | **73.46** | **74.16** |

w/o KP: No knowledge point hints at inference. CBRS / CSS: KP hints selected by the respective strategy are prepended to the prompt at inference.

Usage

Basic Inference (without KP hints)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HasuerYu/KnowRL-Nemotron-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

problem = "Find the sum of all positive integers n such that n^2 - 19n + 99 is a perfect square."
prompt = f"{problem}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.6, top_p=0.95)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Inference with KP Hints

For best performance, prepend the selected knowledge points as a hint section in the prompt (reusing `problem` from the snippet above):

knowledge_points = [
    "If n^2 - 19n + 99 = m^2, then (2n - 19)^2 - 4m^2 = -15.",
]
hint = "## Hint\n" + "\n".join(f"- {kp}" for kp in knowledge_points)
prompt = f"{problem}\n{hint}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

vLLM Serving

vllm serve HasuerYu/KnowRL-Nemotron-1.5B \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
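Once the server is up, it exposes vLLM's standard OpenAI-compatible chat-completions API. A minimal request sketch (assuming the default port 8000; only payload construction runs without a live server):

```python
# Build a chat-completions request for the vLLM server started above.
# The endpoint path and payload shape follow vLLM's OpenAI-compatible API;
# the URL assumes the default host/port.
import json
import urllib.request

payload = {
    "model": "HasuerYu/KnowRL-Nemotron-1.5B",
    "messages": [
        {
            "role": "user",
            "content": (
                "Find the sum of all positive integers n such that "
                "n^2 - 19n + 99 is a perfect square.\n"
                "Please reason step by step, and put your final answer "
                "within \\boxed{}."
            ),
        }
    ],
    # Sampling settings matching the card's recommended values.
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 32768,
}

request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(request)  # requires a running server
```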

Training Details

| Parameter | Value |
|---|---|
| Base model | nvidia/OpenMath-Nemotron-1.5B |
| Algorithm | DAPO / GRPO |
| Framework | verl + Ray |
| Learning rate | 1e-6 |
| Batch size | 256 |
| Max prompt length | 8,192 |
| Max response length | 32,768 |
| Samples per prompt | 8 |
| Total training steps | 2,960 |
| Hardware | 8× NVIDIA H100 nodes (64 GPUs) |

An entropy annealing strategy is applied: after step 2,590, the clip upper bound is reduced from 0.28 to 0.26 to encourage the policy to shift from exploration to exploitation, contributing +0.74 average accuracy.
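The annealing schedule can be sketched as a simple step function over training steps. This is an illustrative helper, not code from the KnowRL repository; the step threshold and clip bounds come from the description above:

```python
# Hypothetical sketch of the entropy-annealing schedule: the DAPO clip
# upper bound is stepped down late in training so the policy shifts from
# exploration to exploitation. The function name is illustrative.
def clip_ratio_high(step: int,
                    anneal_step: int = 2590,
                    before: float = 0.28,
                    after: float = 0.26) -> float:
    """Return the clip upper bound in effect at a given training step."""
    return before if step < anneal_step else after

# Early training keeps the looser bound; after step 2,590 it tightens.
print(clip_ratio_high(1000))  # 0.28
print(clip_ratio_high(2800))  # 0.26
```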

How KnowRL Works

  1. KP Extraction: Decompose solution guidance into atomic knowledge points (KPs)
  2. KP Selection: Apply selection strategies (CSS, CBRS) to identify the minimal-sufficient subset of KPs per problem
  3. RL Training: Train with DAPO/GRPO, injecting selected KPs as hints in the prompt during rollout
  4. Inference: The trained model can be used with or without KP hints — even without hints, it significantly outperforms the baseline
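Step 2 can be illustrated with a toy greedy search; this is NOT the paper's actual CSS algorithm, just a sketch of the idea of growing a KP subset until a probe reports that hinted rollouts yield reward (i.e., at least one correct answer):

```python
# Illustrative sketch of minimal-sufficient KP selection (hypothetical,
# not the paper's CSS): greedily add KPs until `unlocks_reward(subset)`
# reports that rollouts with that hint are no longer all-incorrect.
from typing import Callable, Sequence

def select_minimal_kps(
    kps: Sequence[str],
    unlocks_reward: Callable[[Sequence[str]], bool],
) -> list[str]:
    """Greedily grow a KP subset until the reward probe succeeds."""
    subset: list[str] = []
    for kp in kps:
        if unlocks_reward(subset):
            break
        subset.append(kp)
    return subset

# Toy probe: pretend the reward unlocks once the key identity is hinted.
kps = [
    "complete the square",
    "(2n-19)^2 - 4m^2 = -15",
    "factor a difference of squares",
]
chosen = select_minimal_kps(kps, lambda s: "(2n-19)^2 - 4m^2 = -15" in s)
print(chosen)  # ['complete the square', '(2n-19)^2 - 4m^2 = -15']
```

A real implementation would evaluate the probe by sampling rollouts per candidate subset and checking answer correctness, which is far more expensive than this toy membership check.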

Related Resources

| Resource | Link |
|---|---|
| KnowRL Collection | HasuerYu/knowrl |
| Training Data | HasuerYu/KnowRL-Train-Data |
| KP Annotations | HasuerYu/KnowRL-KP-Annotations |

Limitations

  • Optimized for competition-level math reasoning; performance on other domains is not evaluated
  • KP hint quality at inference depends on upstream KP extraction and selection pipelines
  • The model inherits limitations from the base model (nvidia/OpenMath-Nemotron-1.5B)

Citation

If you find this model helpful, please cite:

@misc{yu2026knowrlboostingllmreasoning,
      title={KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance}, 
      author={Linhao Yu and Tianmeng Yang and Siyu Ding and Renren Jin and Naibin Gu and Xiangzhao Hao and Shuaiyi Nie and Deyi Xiong and Weichong Yin and Yu Sun and Hua Wu},
      year={2026},
      eprint={2604.12627},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.12627}, 
}

License

This model is released under the Apache 2.0 License.
