Nexus-1.5B

Nexus-1.5B is a 1.54-billion-parameter mathematical reasoning model developed by Neuriton, trained via Length-Penalized Reward Optimization (LPRO) — a novel reinforcement learning alignment method that improves both accuracy and response conciseness simultaneously.

Built on top of Qwen2.5-Math-1.5B-Instruct, Nexus-1.5B achieves 80.2 on MATH-500 and 85.2 on GSM8K (CoT), surpassing its base model by +4.4 points on MATH-500 while reducing average response length by 14%.


What is LPRO?

Standard GRPO (Group Relative Policy Optimization) suffers from two key problems:

  1. Length bias — short responses receive disproportionately large gradient signals, implicitly penalizing long correct derivations.
  2. Entropy collapse — symmetric probability-ratio clipping causes the policy to converge to a narrow set of solution patterns, limiting further improvement.

LPRO fixes both with three targeted modifications:

| Component | What it does |
| --- | --- |
| Asymmetric clipping | Decouples the lower and upper clip bounds (ε_low=0.20, ε_high=0.28) to preserve policy entropy |
| Token-level normalization | Replaces the per-response weight 1/G with a global weight 1/Σᵢ\|oᵢ\| over all tokens in the group |
| Length-penalized advantage | Adds a group-standardized length penalty: Aᵢ = (rᵢ - μᵣ)/(σᵣ + ε) - λ·(Lᵢ - μ_L)/(σ_L + ε) |
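
As a concrete illustration of the length-penalized advantage, here is a minimal NumPy sketch; the function and argument names are hypothetical, not taken from a released implementation:

import numpy as np

def lpro_advantages(rewards, lengths, lam=0.1, eps=1e-8):
    """Group-standardized advantage: z-scored reward minus lam times
    the z-scored response length, computed within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    L = np.asarray(lengths, dtype=float)
    reward_z = (r - r.mean()) / (r.std() + eps)
    length_z = (L - L.mean()) / (L.std() + eps)
    return reward_z - lam * length_z

# Two correct responses (reward 1.1 includes the 0.1 format bonus):
# the shorter correct one receives the larger advantage.
print(lpro_advantages([1.1, 1.1, 0.1, 0.0], [180, 320, 410, 260]))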

The final objective is:

$$
\mathcal{J}_{\text{LPRO}}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G}|o_i|} \sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}_{\text{asym}}(r_{i,t}(\theta))\,\hat{A}_{i,t}\right)\right]
$$
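
A compact PyTorch sketch of this objective, assuming per-token log-probabilities and a padding mask are already computed; tensor names and shapes are illustrative rather than a released API:

import torch

def lpro_loss(logp_new, logp_old, advantages, mask,
              eps_low=0.20, eps_high=0.28):
    """Asymmetrically clipped surrogate with token-level normalization.
    logp_new, logp_old: (G, T) per-token log-probs under the new/old policy.
    advantages: (G, 1) group advantages, broadcast over tokens.
    mask: (G, T) with 1 for real response tokens, 0 for padding."""
    ratio = torch.exp(logp_new - logp_old)
    # Asymmetric clipping: the wider upper bound preserves policy entropy.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # Token-level normalization: divide by the total token count across
    # the whole group (1 / Σᵢ|oᵢ|), not per response.
    return -(surrogate * mask).sum() / mask.sum()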


Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-Math-1.5B-Instruct |
| Parameters | 1.54B |
| Architecture | Transformer decoder (28 layers, GQA, RoPE, SwiGLU, RMSNorm) |
| Context length | 8,192 tokens |
| Vocabulary size | 128,256 |
| Training method | LPRO (RL fine-tuning, no distillation) |
| Training data | 100 difficulty-filtered problems from MATH-500 |
| Group size G | 4 |
| Length penalty λ | 0.10 |
| Learning rate | 1e-6 |
| PPO epochs per iteration | 4 |

Benchmark Results

Chain-of-Thought (CoT)

| Model | GSM8K | MATH-500 | MMLU-STEM | CMATH | GaoKao Cloze | GaoKao QA |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Math-1.5B-Instruct | 84.2 | 69.4 | 54.9 | 79.6 | 59.7 | 50.7 |
| Qwen2.5-Math-1.5B-Instruct | 84.8 | 75.8 | 57.5 | 83.0 | 65.5 | 54.1 |
| Nexus-1.5B | 85.2 | 80.2 | 60.3 | 83.5 | 67.2 | 56.9 |

Tool-Integrated Reasoning (TIR)

| Model | MATH-500 | Minerva Math | GaoKao 2023 EN | Olympiad Bench | College Math |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-1.5B-Instruct | 80.0 | 34.0 | 68.0 | 49.0 | 54.0 |
| Nexus-1.5B | 84.0 | 40.0 | 74.0 | 56.0 | 57.0 |

Ablation: Effect of Length Penalty (λ)

| λ | MATH-500 Acc. | Avg. Response Length |
| --- | --- | --- |
| 0.0 (GRPO baseline) | 77.4 | 312 tokens |
| 0.1 (Nexus-1.5B) | 80.2 | 268 tokens |
| 0.3 (over-penalized) | 78.0 | 201 tokens |

Key insight: At λ=0.1, accuracy and conciseness improve simultaneously. The length penalty acts as a de-noising regularizer — discouraging redundant steps rather than suppressing genuinely long derivations.


How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Dat1710/nexus-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Chain-of-Thought prompt
system_prompt = "Please reason step by step, and put your final answer within \\boxed{}."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Find all functions f: ℝ⁺ → ℝ⁺ such that for each x ∈ ℝ⁺, there is exactly one y ∈ ℝ⁺ satisfying xf(y) + yf(x) ≤ 2."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048,
    temperature=0.7,
    do_sample=True,
)

# Strip the prompt tokens so only the newly generated solution is decoded
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Tool-Integrated Reasoning (TIR)

system_prompt = (
    "Please integrate natural language reasoning with programs to solve the problem above, "
    "and put your final answer within \\boxed{}."
)
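
The rest of the pipeline is identical to the CoT example above: place this system prompt in `messages`, apply the chat template, and generate. Note that tool-integrated reasoning additionally requires executing the model's emitted code in a sandboxed Python interpreter at inference time (see Limitations).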

Evaluation Prompt Format

CoT (8-shot for GSM8K, 4-shot for MATH-500):

<|im_start|>system
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant

TIR (zero-shot):

<|im_start|>system
Please integrate natural language reasoning with programs to solve the problem above,
and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant

Training Details

Data Curation

Training problems are sourced from MATH-500 and filtered by difficulty using a learnable-zone criterion: a problem is retained if, among 8 sampled solutions from the base model, between 2 and 5 are correct. This yields 100 training problems that provide meaningful gradient signal — neither trivially easy nor intractably hard.
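
A sketch of this filter in Python, where `sample_solutions` and `is_correct` are hypothetical stand-ins for the actual sampling and answer-checking machinery:

def in_learnable_zone(problem, sample_solutions, is_correct,
                      n_samples=8, min_correct=2, max_correct=5):
    """Keep a problem only if the base model solves it sometimes but not
    always: between 2 and 5 correct out of 8 sampled solutions."""
    solutions = sample_solutions(problem, n=n_samples)
    n_right = sum(is_correct(s, problem) for s in solutions)
    return min_correct <= n_right <= max_correct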

Training Procedure

  1. Group sampling: For each prompt, sample G=4 responses from the current policy.
  2. Reward computation: Rule-based binary reward (correctness via symbolic answer matching) + small format bonus (α=0.1) for well-formed \boxed{} output.
  3. Advantage computation: Compute length-penalized group z-score advantages.
  4. Policy update: Maximize the LPRO objective for 4 epochs per iteration.
  5. Iterate: Set old policy ← new policy and repeat.

Reward Function

$$
r_i = \mathbf{1}[\hat{a}(o_i) = a^*] + 0.1 \cdot \mathbf{1}[\text{format}(o_i)]
$$

where $\hat{a}(o_i)$ is the extracted answer from the last \boxed{} expression, verified via symbolic equivalence.
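
A minimal Python sketch of this reward, using sympy for the symbolic equivalence check; the boxed-answer regex, the fallback string comparison, and the function names are simplifying assumptions rather than the exact evaluation code:

import re
from sympy import simplify, sympify

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in a response."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def reward(response, gold_answer, alpha=0.1):
    """Binary correctness plus a small format bonus (alpha) for a
    well-formed boxed answer, mirroring r_i above."""
    pred = extract_boxed(response)
    well_formed = pred is not None
    correct = False
    if well_formed:
        try:
            # Symbolic equivalence: pred - gold simplifies to zero.
            correct = simplify(sympify(pred) - sympify(gold_answer)) == 0
        except Exception:
            correct = pred.strip() == str(gold_answer).strip()
    return float(correct) + alpha * float(well_formed)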


Limitations

  • Scale: Nexus-1.5B operates at 1.54B parameters. Hard olympiad problems (e.g., AIME) remain challenging for models at this scale.
  • Language: Primarily optimized for English and Chinese mathematical text. Performance on other languages is not evaluated.
  • Domain: Designed for mathematical reasoning. General language understanding or instruction-following tasks are outside the model's training distribution.
  • TIR dependency: Tool-integrated reasoning requires a sandboxed Python interpreter at inference time.

Citation

If you use Nexus-1.5B in your research, please cite:

@techreport{neuriton2026nexus,
  title     = {Nexus-1.5B: Length-Penalized Reward Optimization for Robust Mathematical Reasoning},
  author    = {Neuriton Team},
  institution = {Neuriton},
  year      = {2026},
  month     = {Summer},
  note      = {Technical Report}
}

Acknowledgements

We thank the Qwen Team at Alibaba Group for open-sourcing the Qwen2.5-Math model family, and the authors of DAPO for the asymmetric clipping insight that is central to LPRO.


Developed by Neuriton · Summer 2026
