EntroPIC-Nemotron-1.5B: Stable Long-Term Training of Reasoning LLMs
🚀 Stable Exploration for Reasoning Models
Entropy Stabilization with Proportional-Integral Control | State-of-the-Art 1.5B Math Reasoner
Model Description
EntroPIC-Nemotron-1.5B is a specialized reasoning model fine-tuned from OpenReasoning-Nemotron-1.5B using a novel Reinforcement Learning (RL) technique called EntroPIC.
In standard RL training for reasoning models (like RLVR), models often suffer from entropy collapse, leading to sub-optimal deterministic behaviors and loss of exploration. EntroPIC addresses this by applying Proportional-Integral (PI) Control to the entropy of the policy. By dynamically adjusting the loss coefficients of positive and negative samples, EntroPIC locks the training entropy to a desired target, enabling stable, long-term training and superior performance on complex mathematical reasoning tasks.
- Developed by: Tencent Hunyuan & HKUST
- Base Model: OpenReasoning-Nemotron-1.5B
- Training Method: EntroPIC (Entropy Stabilization with PI Control)
- Language(s): English
- License: Apache-2.0
The EntroPIC Method
The training process involves a mix of positive and negative samples, which affect entropy in opposing ways: positive samples decrease it, while negative samples increase it. EntroPIC introduces a feedback mechanism:
- PI Controller: Monitors the difference between current policy entropy and a target entropy.
- Adaptive Coefficients: Dynamically tunes the influence of high-probability tokens in the loss function.
- Result: Prevents premature convergence while avoiding instability, allowing the model to continuously explore and improve reasoning paths.
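The feedback loop described above can be sketched as a standard discrete PI controller. Note that the update rule, gains, and class name below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
class EntropyPIController:
    """Illustrative PI controller that nudges a loss coefficient so that
    measured policy entropy tracks a fixed target.

    NOTE: gains and update rule are assumptions for illustration only.
    """

    def __init__(self, target_entropy: float, kp: float = 0.5, ki: float = 0.1):
        self.target = target_entropy
        self.kp = kp          # proportional gain: reacts to the current error
        self.ki = ki          # integral gain: corrects accumulated drift
        self.integral = 0.0   # running sum of entropy errors

    def step(self, measured_entropy: float) -> float:
        """Return a coefficient adjustment for the current training step."""
        error = self.target - measured_entropy
        self.integral += error
        return self.kp * error + self.ki * self.integral
```

When measured entropy falls below the target, the controller returns a positive adjustment (upweighting entropy-raising terms in the loss); the integral term removes steady-state drift that a purely proportional controller would leave behind.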
Experimental Results
EntroPIC-Nemotron-1.5B demonstrates state-of-the-art performance among 1.5B parameter models.
1. Mathematical Reasoning Performance
Comparison against the base model and other RL fine-tuning methods (QuestA, JustRL) across nine benchmarks.
EntroPIC achieves the highest overall performance (65.4% Pass@1), showing significant gains in challenging out-of-distribution tasks like Minerva and HMMT compared to baselines.
Each cell reports pass@1 / pass@N.

| Model | Math | AMC | AIME24 | AIME25 | Olympiad | Minerva | HMMT | BRUMO | CMIMC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron-1.5B | 88.7 / 95.4 | 86.6 / 100 | 51.7 / 83.3 | 46.4 / 73.3 | 62.1 / 75.3 | 25.5 / 36.4 | 30.9 / 76.7 | 49.6 / 83.3 | 26.8 / 72.5 | 52.0 / 77.4 |
| QuestA-Nemotron | 93.2 / 96.8 | 94.1 / 100 | 72.5 / 83.3 | 63.1 / 83.3 | 71.1 / 78.5 | 25.3 / 32.7 | 42.1 / 73.3 | 70.0 / 96.7 | 42.1 / 75.0 | 63.7 / 79.9 |
| JustRL-Nemotron | 94.2 / 97.6 | 95.4 / 100 | 69.6 / 86.7 | 61.5 / 83.3 | 70.5 / 77.9 | 23.9 / 31.3 | 37.5 / 63.3 | 67.2 / 90.0 | 39.2 / 72.5 | 62.1 / 78.1 |
| EntroPIC-Nemotron | 93.2 / 96.8 | 96.4 / 100 | 74.9 / 90.0 | 68.3 / 93.3 | 70.1 / 78.3 | 36.4 / 46.7 | 42.7 / 76.7 | 63.8 / 93.3 | 43.0 / 77.5 | 65.4 / 83.6 |
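The Overall column is consistent with the unweighted mean of the nine per-benchmark scores, e.g. for EntroPIC-Nemotron's pass@1:

```python
# EntroPIC-Nemotron per-benchmark pass@1 scores, taken from the table above.
entropic_pass1 = [93.2, 96.4, 74.9, 68.3, 70.1, 36.4, 42.7, 63.8, 43.0]

# Unweighted mean across the nine benchmarks.
overall = sum(entropic_pass1) / len(entropic_pass1)
print(round(overall, 1))  # 65.4, matching the Overall column
```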
2. Robust Generalization
A common challenge in RL fine-tuning is the "alignment tax"—catastrophic forgetting of general capabilities. EntroPIC effectively mitigates this.
While other RL methods (QuestA, JustRL) suffer severe degradation in general reasoning (MMLU-Pro) and coding (LiveCodeBench), EntroPIC not only preserves but improves upon the base model's capabilities, demonstrating a superior balance between specialization and generalization.
Quickstart
You can use this model with the standard Hugging Face transformers library. The model is trained to generate Chain-of-Thought (CoT) reasoning.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yangkaiSIGS/EntroPIC-Nemotron-1.5B"

# 1. Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Prepare prompt (standard math problem)
problem = "Let f(x) = (x - 18)(x - 72)(x - 98)(x - k) / x. Find the sum of all positive real values of k such that f has exactly two local minima."
messages = [
    {"role": "user", "content": problem}
]

# 3. Apply the chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# 4. Generate reasoning
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

# 5. Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
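The response contains the full chain-of-thought; for evaluation you usually want only the final answer. A minimal post-processing sketch, assuming the model follows the common convention of wrapping its final answer in a LaTeX `\boxed{...}` expression (verify against actual outputs):

```python
import re


def extract_boxed_answer(response):
    """Return the content of the last \\boxed{...} in the response, or None.

    Simple sketch: does not handle nested braces inside the box.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None
```

For example, `extract_boxed_answer("... so the sum is \\boxed{240}.")` returns `"240"`.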
Citation
If you find this model or the EntroPIC method useful in your research, please cite our paper:
```bibtex
@article{yang2025entropic,
  title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
  author={Yang, Kai and Xu, Xin and Chen, Yangkun and Liu, Weijie and Lyu, Jiafei and Lin, Zichuan and Ye, Deheng and Yang, Saiyong},
  journal={arXiv preprint arXiv:2511.15248},
  year={2025}
}
```