# KnowRL-Nemotron-1.5B

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
## Model Summary
KnowRL-Nemotron-1.5B is a 1.5B-parameter math reasoning model trained with reinforcement learning (DAPO/GRPO) under minimal-sufficient knowledge point (KP) guidance. It is fine-tuned from nvidia/OpenMath-Nemotron-1.5B and achieves state-of-the-art results among 1.5B-scale models on competition-level math benchmarks.
Instead of injecting long solution hints or full reasoning templates, KnowRL decomposes guidance into atomic knowledge points (KPs) and identifies the minimal subset required to unlock reward learning — achieving more with less.
## Key Highlights
- 74.16 average accuracy (CSS) across 8 competition-level math benchmarks — new SOTA at 1.5B scale
- 70.08 average accuracy even without KP hints at inference, demonstrating genuine policy improvement (+9.63 over baseline)
- Trained with ~38% fewer KPs than full-KP injection via the CSS (Constrained Subset Search) selection strategy
- Reward sparsity reduced from 41.21% zero-correct to 13.00% during training
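The zero-correct rate above can be sketched as follows: for group-relative algorithms like GRPO, a prompt whose entire rollout group is wrong yields zero advantage and thus no gradient signal, so the fraction of such "dead" groups measures reward sparsity. This is an illustrative sketch of that (assumed) definition, not code from the KnowRL pipeline.

```python
def zero_correct_rate(rollout_correctness: list[list[bool]]) -> float:
    """Percentage of prompts whose rollout group contains no correct answer.

    rollout_correctness[i][j] is True iff rollout j for prompt i is correct.
    Groups with zero correct rollouts contribute no learning signal under
    group-relative advantage estimation (e.g., GRPO).
    """
    if not rollout_correctness:
        return 0.0
    dead = sum(1 for group in rollout_correctness if not any(group))
    return 100.0 * dead / len(rollout_correctness)

# Toy example: 4 prompts, 8 rollouts each; one group is entirely wrong.
groups = [
    [True] * 8,
    [False] * 8,            # zero-correct group: no gradient signal
    [True, False] * 4,
    [False] * 7 + [True],
]
print(zero_correct_rate(groups))  # 25.0
```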
## Results
| Benchmark | w/o KP | CBRS | CSS |
|---|---|---|---|
| AIME 2024 | 69.79 | 75.52 | 74.58 |
| AIME 2025 | 64.69 | 65.00 | 65.21 |
| BRUMO 2025 | 69.48 | 78.33 | 78.12 |
| HMMT 2025 | 41.04 | 45.00 | 48.75 |
| AMC 2023 | 95.55 | 95.78 | 95.70 |
| CMIMC 2025 | 44.14 | 49.22 | 52.19 |
| MATH-500 | 95.70 | 96.45 | 96.20 |
| Olympiad Bench | 80.23 | 82.34 | 82.44 |
| Average | 70.08 | 73.46 | 74.16 |
w/o KP: No knowledge point hints at inference. CBRS / CSS: KP hints selected by the respective strategy are prepended to the prompt at inference.
## Usage

### Basic Inference (without KP hints)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HasuerYu/KnowRL-Nemotron-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

problem = "Find the sum of all positive integers n such that n^2 - 19n + 99 is a perfect square."
prompt = f"{problem}\nPlease reason step by step, and put your final answer within \\boxed{{}}."

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature/top_p to take effect.
outputs = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.6, top_p=0.95)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
### Inference with KP Hints
For best performance, prepend selected knowledge points as a hint section in the prompt:
```python
knowledge_points = [
    "If n^2 - 19n + 99 = m^2, then (2n - 19)^2 - 4m^2 = -35.",
]
hint = "## Hint\n" + "\n".join(f"- {kp}" for kp in knowledge_points)
prompt = f"{problem}\n{hint}\nPlease reason step by step, and put your final answer within \\boxed{{}}."
```
### vLLM Serving

```shell
vllm serve HasuerYu/KnowRL-Nemotron-1.5B \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | nvidia/OpenMath-Nemotron-1.5B |
| Algorithm | DAPO / GRPO |
| Framework | verl + Ray |
| Learning rate | 1e-6 |
| Batch size | 256 |
| Max prompt length | 8,192 |
| Max response length | 32,768 |
| Samples per prompt | 8 |
| Total training steps | 2,960 |
| Hardware | 8× NVIDIA H100 nodes (64 GPUs) |
An entropy annealing strategy is applied: after step 2,590, the clip upper bound is reduced from 0.28 to 0.26 to encourage the policy to shift from exploration to exploitation, contributing +0.74 average accuracy.
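The annealing step above can be sketched as a scalar version of a decoupled-clipping (DAPO-style "clip-higher") objective, where the upper clip bound drops from 0.28 to 0.26 after step 2,590. The function shape and the lower bound `eps_low` are assumptions for illustration; only the upper-bound schedule is stated in this card.

```python
def dapo_clip_loss(ratio: float, advantage: float, step: int,
                   eps_low: float = 0.2, anneal_step: int = 2590) -> float:
    """Scalar sketch of a PPO-style clipped objective with a decoupled,
    annealed upper bound. The upper bound shrinks from 0.28 to 0.26 after
    the annealing step, reducing room for exploratory probability increases.
    eps_low = 0.2 is an assumed value, not taken from this card.
    """
    eps_high = 0.28 if step <= anneal_step else 0.26
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high) * advantage
    return -min(ratio * advantage, clipped)

# A ratio of 1.5 is clipped to 1.28 before annealing, 1.26 after.
print(dapo_clip_loss(1.5, 1.0, step=1000))  # -1.28
print(dapo_clip_loss(1.5, 1.0, step=3000))  # -1.26
```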
## How KnowRL Works

1. KP Extraction: Decompose solution guidance into atomic knowledge points (KPs)
2. KP Selection: Apply selection strategies (CSS, CBRS) to identify the minimal-sufficient subset of KPs per problem
3. RL Training: Train with DAPO/GRPO, injecting selected KPs as hints in the prompt during rollout
4. Inference: The trained model can be used with or without KP hints; even without hints, it significantly outperforms the baseline
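The KP Selection step can be pictured as a minimal-subset search. The sketch below is a hypothetical greedy backward-elimination pass, not the paper's exact CSS algorithm; it treats sufficiency as a black-box check (in practice, e.g., whether the policy produces at least one correct rollout when the subset is prepended as a hint).

```python
from typing import Callable, Sequence

def constrained_subset_search(
    kps: Sequence[str],
    is_sufficient: Callable[[list[str]], bool],
) -> list[str]:
    """Hypothetical sketch of minimal-sufficient KP selection.

    Starting from the full KP set, try dropping each KP; keep the drop if
    the remaining subset is still sufficient. `is_sufficient` is a black
    box standing in for a rollout-based check.
    """
    if not is_sufficient(list(kps)):
        return list(kps)  # even the full set is insufficient; keep everything
    subset = list(kps)
    for kp in list(kps):
        trial = [k for k in subset if k != kp]
        if is_sufficient(trial):  # kp is redundant given the rest; drop it
            subset = trial
    return subset

# Toy check: sufficiency here just means the subset contains "KP2".
print(constrained_subset_search(["KP1", "KP2", "KP3"], lambda s: "KP2" in s))  # ['KP2']
```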
## Related Resources
| Resource | Link |
|---|---|
| KnowRL Collection | HasuerYu/knowrl |
| Training Data | HasuerYu/KnowRL-Train-Data |
| KP Annotations | HasuerYu/KnowRL-KP-Annotations |
## Limitations
- Optimized for competition-level math reasoning; performance on other domains is not evaluated
- KP hint quality at inference depends on upstream KP extraction and selection pipelines
- The model inherits limitations from the base model (nvidia/OpenMath-Nemotron-1.5B)
## Citation
If you find this model helpful, please cite:
```bibtex
@misc{yu2026knowrlboostingllmreasoning,
  title={KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance},
  author={Linhao Yu and Tianmeng Yang and Siyu Ding and Renren Jin and Naibin Gu and Xiangzhao Hao and Shuaiyi Nie and Deyi Xiong and Weichong Yin and Yu Sun and Hua Wu},
  year={2026},
  eprint={2604.12627},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2604.12627},
}
```
## License
This model is released under the Apache 2.0 License.