EntroPIC-Nemotron-1.5B: Stable Long-Term Training of Reasoning LLMs
🚀 Stable Exploration for Reasoning Models
Entropy Stabilization with Proportional-Integral Control | State-of-the-Art 1.5B Math Reasoner
Model Description
EntroPIC-Nemotron-1.5B is a specialized reasoning model fine-tuned from OpenReasoning-Nemotron-1.5B using a novel Reinforcement Learning (RL) technique called EntroPIC.
In standard RL training for reasoning models (like RLVR), models often suffer from entropy collapse, leading to sub-optimal deterministic behaviors and loss of exploration. EntroPIC addresses this by applying Proportional-Integral (PI) Control to the entropy of the policy. By dynamically adjusting the loss coefficients of positive and negative samples, EntroPIC locks the training entropy to a desired target, enabling stable, long-term training and superior performance on complex mathematical reasoning tasks.
- Developed by: Tencent Hunyuan & HKUST
- Base Model: OpenReasoning-Nemotron-1.5B
- Training Method: EntroPIC (Entropy Stabilization with PI Control)
- Language(s): English
- License: Apache-2.0
The EntroPIC Method
The training process involves a mix of positive and negative samples, which affect entropy in opposing ways: positive samples decrease it, while negative samples increase it. EntroPIC introduces a feedback mechanism:
- PI Controller: Monitors the difference between current policy entropy and a target entropy.
- Adaptive Coefficients: Dynamically tunes the influence of high-probability tokens in the loss function.
- Result: Prevents premature convergence while avoiding instability, allowing the model to continuously explore and improve reasoning paths.
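The feedback loop described above can be sketched as a standard discrete PI controller. Note that the update rule, gains, and class name below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
class EntropyPIController:
    """Illustrative PI controller that nudges a loss coefficient so that
    measured policy entropy tracks a fixed target.

    NOTE: gains and update rule are assumptions for illustration only.
    """

    def __init__(self, target_entropy: float, kp: float = 0.5, ki: float = 0.1):
        self.target = target_entropy
        self.kp = kp          # proportional gain: reacts to the current error
        self.ki = ki          # integral gain: corrects accumulated drift
        self.integral = 0.0   # running sum of entropy errors

    def step(self, measured_entropy: float) -> float:
        """Return a coefficient adjustment for the current training step."""
        error = self.target - measured_entropy
        self.integral += error
        return self.kp * error + self.ki * self.integral
```

When measured entropy falls below the target, the controller returns a positive adjustment (upweighting entropy-raising terms in the loss); the integral term removes steady-state drift that a purely proportional controller would leave behind.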
Experimental Results
EntroPIC-Nemotron-1.5B demonstrates state-of-the-art performance among 1.5B parameter models.
1. Mathematical Reasoning Performance
Comparison against the base model and other RL fine-tuning methods (QuestA, JustRL) across nine benchmarks.
EntroPIC achieves the highest overall performance (65.4% Pass@1), showing significant gains in challenging out-of-distribution tasks like Minerva and HMMT compared to baselines.
Each cell reports pass@1 / pass@N.

| Model | Math | AMC | AIME24 | AIME25 | Olympiad | Minerva | HMMT | BRUMO | CMIMC | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Nemotron-1.5B | 88.7 / 95.4 | 86.6 / 100 | 51.7 / 83.3 | 46.4 / 73.3 | 62.1 / 75.3 | 25.5 / 36.4 | 30.9 / 76.7 | 49.6 / 83.3 | 26.8 / 72.5 | 52.0 / 77.4 |
| QuestA-Nemotron | 93.2 / 96.8 | 94.1 / 100 | 72.5 / 83.3 | 63.1 / 83.3 | 71.1 / 78.5 | 25.3 / 32.7 | 42.1 / 73.3 | 70.0 / 96.7 | 42.1 / 75.0 | 63.7 / 79.9 |
| JustRL-Nemotron | 94.2 / 97.6 | 95.4 / 100 | 69.6 / 86.7 | 61.5 / 83.3 | 70.5 / 77.9 | 23.9 / 31.3 | 37.5 / 63.3 | 67.2 / 90.0 | 39.2 / 72.5 | 62.1 / 78.1 |
| EntroPIC-Nemotron | 93.2 / 96.8 | 96.4 / 100 | 74.9 / 90.0 | 68.3 / 93.3 | 70.1 / 78.3 | 36.4 / 46.7 | 42.7 / 76.7 | 63.8 / 93.3 | 43.0 / 77.5 | 65.4 / 83.6 |
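The Overall column is consistent with the unweighted mean of the nine per-benchmark scores, e.g. for EntroPIC-Nemotron's pass@1:

```python
# EntroPIC-Nemotron per-benchmark pass@1 scores, taken from the table above.
entropic_pass1 = [93.2, 96.4, 74.9, 68.3, 70.1, 36.4, 42.7, 63.8, 43.0]

# Unweighted mean across the nine benchmarks.
overall = sum(entropic_pass1) / len(entropic_pass1)
print(round(overall, 1))  # 65.4, matching the Overall column
```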
2. Robust Generalization
A common challenge in RL fine-tuning is the "alignment tax"—catastrophic forgetting of general capabilities. EntroPIC effectively mitigates this.
While other RL methods (QuestA, JustRL) suffer severe degradation in general reasoning (MMLU-Pro) and coding (LiveCodeBench), EntroPIC not only preserves but improves upon the base model's capabilities, demonstrating a superior balance between specialization and generalization.
Quickstart
You can use this model with the standard Hugging Face transformers library. The model is trained to generate Chain-of-Thought (CoT) reasoning.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yangkaiSIGS/EntroPIC-Nemotron-1.5B"

# 1. Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Prepare prompt (standard math problem)
problem = "Let f(x) = (x - 18)(x - 72)(x - 98)(x - k) / x. Find the sum of all positive real values of k such that f has exactly two local minima."
messages = [
    {"role": "user", "content": problem}
]

# 3. Apply the chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# 4. Generate reasoning
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

# 5. Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
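The response contains the full chain-of-thought; for evaluation you usually want only the final answer. A minimal post-processing sketch, assuming the model follows the common convention of wrapping its final answer in a LaTeX `\boxed{...}` expression (verify against actual outputs):

```python
import re


def extract_boxed_answer(response):
    """Return the content of the last \\boxed{...} in the response, or None.

    Simple sketch: does not handle nested braces inside the box.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None
```

For example, `extract_boxed_answer("... so the sum is \\boxed{240}.")` returns `"240"`.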
Citation
If you find this model or the EntroPIC method useful in your research, please cite our paper:
```bibtex
@article{yang2025entropic,
  title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
  author={Yang, Kai and Xu, Xin and Chen, Yangkun and Liu, Weijie and Lyu, Jiafei and Lin, Zichuan and Ye, Deheng and Yang, Saiyong},
  journal={arXiv preprint arXiv:2511.15248},
  year={2025}
}
```