EntroPIC-Nemotron-1.5B: Stable Long-Term Training of Reasoning LLMs


🚀 Stable Exploration for Reasoning Models

Entropy Stabilization with Proportional-Integral Control | State-of-the-Art 1.5B Math Reasoner

Paper | GitHub | Website


Model Description

EntroPIC-Nemotron-1.5B is a specialized reasoning model fine-tuned from OpenReasoning-Nemotron-1.5B using a novel Reinforcement Learning (RL) technique called EntroPIC.

In standard RL training for reasoning models (like RLVR), models often suffer from entropy collapse, leading to sub-optimal deterministic behaviors and loss of exploration. EntroPIC addresses this by applying Proportional-Integral (PI) Control to the entropy of the policy. By dynamically adjusting the loss coefficients of positive and negative samples, EntroPIC locks the training entropy to a desired target, enabling stable, long-term training and superior performance on complex mathematical reasoning tasks.

  • Developed by: Tencent Hunyuan & HKUST
  • Base Model: OpenReasoning-Nemotron-1.5B
  • Training Method: EntroPIC (Entropy Stabilization with PI Control)
  • Language(s): English
  • License: Apache-2.0

The EntroPIC Method

The training process involves a mix of positive and negative samples, which affect entropy in opposing ways: positive samples decrease it, while negative samples increase it. EntroPIC introduces a feedback mechanism, sketched in code after the list below:

(Figure: EntroPIC method overview)

  1. PI Controller: Monitors the difference between current policy entropy and a target entropy.
  2. Adaptive Coefficients: Dynamically tunes the influence of high-probability tokens in the loss function.
  3. Result: Prevents premature convergence while avoiding instability, allowing the model to continuously explore and improve reasoning paths.
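
The control loop can be illustrated with a short Python sketch. This is not the paper's implementation: the gains kp and ki, the target entropy value, and the exact way the control signal rescales the positive/negative sample coefficients are illustrative assumptions; refer to the paper for the precise formulation.

import torch

# Illustrative PI controller that steers policy entropy toward a target.
# Gains and the coefficient update rule are assumptions for illustration only.
class EntropyPIController:
    def __init__(self, target_entropy: float, kp: float = 0.1, ki: float = 0.01):
        self.target = target_entropy
        self.kp = kp          # proportional gain
        self.ki = ki          # integral gain
        self.integral = 0.0   # accumulated entropy error

    def update(self, current_entropy: float) -> float:
        # Positive signal when entropy is below target (policy too deterministic).
        error = self.target - current_entropy
        self.integral += error
        return self.kp * error + self.ki * self.integral

def weighted_policy_loss(logp_pos, adv_pos, logp_neg, adv_neg, control):
    # Positive samples tend to lower entropy, negative samples raise it.
    # When entropy is too low (control > 0), down-weight positive samples
    # and up-weight negative ones; the opposite when entropy is too high.
    c_pos = max(0.0, 1.0 - control)
    c_neg = max(0.0, 1.0 + control)
    loss_pos = -(c_pos * adv_pos * logp_pos).mean()
    loss_neg = -(c_neg * adv_neg * logp_neg).mean()
    return loss_pos + loss_neg

# Example step: entropy (0.7) is below target (1.0), so weight shifts toward
# the entropy-raising negative samples.
controller = EntropyPIController(target_entropy=1.0)
signal = controller.update(current_entropy=0.7)
loss = weighted_policy_loss(
    logp_pos=torch.tensor([-0.2, -0.1]), adv_pos=torch.tensor([1.0, 1.0]),
    logp_neg=torch.tensor([-1.5, -2.0]), adv_neg=torch.tensor([-1.0, -1.0]),
    control=signal,
)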

Experimental Results

EntroPIC-Nemotron-1.5B demonstrates state-of-the-art performance among 1.5B parameter models.

1. Mathematical Reasoning Performance

Comparison against the base model and other RL fine-tuning methods (QuestA, JustRL) across nine math benchmarks.

EntroPIC achieves the highest overall performance (65.4% Pass@1), showing significant gains in challenging out-of-distribution tasks like Minerva and HMMT compared to baselines.

Each cell reports pass@1 / pass@N.

| Model | Math | AMC | AIME24 | AIME25 | Olympiad | Minerva | HMMT | BRUMO | CMIMC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron-1.5B | 88.7 / 95.4 | 86.6 / 100 | 51.7 / 83.3 | 46.4 / 73.3 | 62.1 / 75.3 | 25.5 / 36.4 | 30.9 / 76.7 | 49.6 / 83.3 | 26.8 / 72.5 | 52.0 / 77.4 |
| QuestA-Nemotron | 93.2 / 96.8 | 94.1 / 100 | 72.5 / 83.3 | 63.1 / 83.3 | 71.1 / 78.5 | 25.3 / 32.7 | 42.1 / 73.3 | 70.0 / 96.7 | 42.1 / 75.0 | 63.7 / 79.9 |
| JustRL-Nemotron | 94.2 / 97.6 | 95.4 / 100 | 69.6 / 86.7 | 61.5 / 83.3 | 70.5 / 77.9 | 23.9 / 31.3 | 37.5 / 63.3 | 67.2 / 90.0 | 39.2 / 72.5 | 62.1 / 78.1 |
| EntroPIC-Nemotron | 93.2 / 96.8 | 96.4 / 100 | 74.9 / 90.0 | 68.3 / 93.3 | 70.1 / 78.3 | 36.4 / 46.7 | 42.7 / 76.7 | 63.8 / 93.3 | 43.0 / 77.5 | 65.4 / 83.6 |

2. Robust Generalization

A common challenge in RL fine-tuning is the "alignment tax"—catastrophic forgetting of general capabilities. EntroPIC effectively mitigates this.

While other RL methods (QuestA, JustRL) suffer severe degradation in general reasoning (MMLU-Pro) and coding (LiveCodeBench), EntroPIC not only preserves but improves upon the base model's capabilities, demonstrating a superior balance between specialization and generalization.

Quickstart

You can use this model with the standard Hugging Face transformers library. The model is trained to generate Chain-of-Thought (CoT) reasoning.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yangkaiSIGS/EntroPIC-Nemotron-1.5B" 

# 1. Load Model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Prepare Prompt (Standard Math Problem)
problem = "Let f(x) = (x - 18)(x - 72)(x - 98)(x - k) / x. Find the sum of all positive real values of k such that f has exactly two local minima."
messages = [
    {"role": "user", "content": problem}
]

# 3. Apply the chat template to format the prompt
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# 4. Generate Reasoning
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    do_sample=True
)

# 5. Decode Output
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
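
If, as is typical for math reasoning models, the final answer appears inside \boxed{...} at the end of the chain-of-thought, a simple post-processing step can extract it. The regex below is a heuristic assumption (it does not handle nested braces) and is not part of any official tooling for this model.

import re

# Grab the last \boxed{...} occurrence as the final answer (simple heuristic).
boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
final_answer = boxed[-1] if boxed else None
print("Final answer:", final_answer)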

Citation

If you find this model or the EntroPIC method useful in your research, please cite our paper:

@article{yang2025entropic,
  title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
  author={Yang, Kai and Xu, Xin and Chen, Yangkun and Liu, Weijie and Lyu, Jiafei and Lin, Zichuan and Ye, Deheng and Yang, Saiyong},
  journal={arXiv preprint arXiv:2511.15248},
  year={2025}
}