---
library_name: transformers
tags:
  - math
  - reasoning
  - reinforcement-learning
  - qwen2
  - mathematics
  - chain-of-thought
license: apache-2.0
language:
  - en
  - zh
base_model: Qwen/Qwen2.5-Math-1.5B-Instruct
pipeline_tag: text-generation
---

# Nexus-1.5B

<p align="center">
  <img src="https://img.shields.io/badge/Base%20Model-Qwen2.5--Math--1.5B--Instruct-orange" />
  <img src="https://img.shields.io/badge/Parameters-1.54B-blue" />
  <img src="https://img.shields.io/badge/Method-LPRO-green" />
  <img src="https://img.shields.io/badge/MATH--500-80.2-red" />
  <img src="https://img.shields.io/badge/GSM8K-85.2-red" />
</p>

**Nexus-1.5B** is a 1.54-billion-parameter mathematical reasoning model developed by [Neuriton](https://www.facebook.com/neuriton), trained via **Length-Penalized Reward Optimization (LPRO)** — a novel reinforcement learning alignment method that improves both accuracy and response conciseness simultaneously.

Built on top of `Qwen2.5-Math-1.5B-Instruct`, Nexus-1.5B achieves **80.2 on MATH-500** and **85.2 on GSM8K** (CoT), surpassing its base model by **+4.4 points** on MATH-500 while reducing average response length by **14%** (268 vs. 312 tokens, measured against the λ=0 GRPO baseline in the ablation below).

---

## What is LPRO?

Standard GRPO (Group Relative Policy Optimization) suffers from two key problems:

1. **Length bias** — short responses receive disproportionately large gradient signals, implicitly penalizing long correct derivations.
2. **Entropy collapse** — symmetric probability-ratio clipping causes the policy to converge to a narrow set of solution patterns, limiting further improvement.

**LPRO** fixes both with three targeted modifications:

| Component | What it does |
|---|---|
| **Asymmetric clipping** | Decouples the lower and upper clip bounds (`ε_low=0.20`, `ε_high=0.28`) to preserve policy entropy |
| **Token-level normalization** | Replaces GRPO's per-response token weight `1/(G·\|oᵢ\|)` with the global weight `1/Σ\|oᵢ\|`, so every token contributes equally regardless of response length and the gradient estimate is unbiased |
| **Length-penalized advantage** | Adds a group-standardized length penalty: `Aᵢ = (rᵢ - μᵣ)/(σᵣ + ε) - λ·(Lᵢ - μ_L)/(σ_L + ε)` |

The final objective is:

$$\mathcal{J}_{\text{LPRO}}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G}|o_i|} \sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}_{\text{asym}}(r_{i,t}(\theta))\,\hat{A}_{i,t}\right)\right]$$
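
To make the objective concrete, here is a minimal pure-Python sketch of how it could be evaluated for one group of responses. It is an illustration under stated assumptions, not the training code: the function names `clip_asym` and `lpro_objective` are hypothetical, and the per-token advantage is assumed to be the response-level advantage broadcast to every token of that response.

```python
def clip_asym(ratio, eps_low=0.20, eps_high=0.28):
    """Clip a probability ratio to [1 - eps_low, 1 + eps_high]."""
    return max(1.0 - eps_low, min(ratio, 1.0 + eps_high))


def lpro_objective(ratios, advantages, eps_low=0.20, eps_high=0.28):
    """Token-normalized clipped surrogate for one group.

    ratios:     list of G lists; ratios[i][t] is the probability ratio of
                token t in response i (new policy / old policy).
    advantages: list of G floats, one advantage per response
                (assumption: shared by all tokens of that response).
    """
    total_tokens = sum(len(token_ratios) for token_ratios in ratios)
    total = 0.0
    for token_ratios, adv in zip(ratios, advantages):
        for r in token_ratios:
            unclipped = r * adv
            clipped = clip_asym(r, eps_low, eps_high) * adv
            total += min(unclipped, clipped)
    # Divide by the total token count of the group, not by G.
    return total / total_tokens


# Toy group: a 2-token response with positive advantage and a
# 1-token response with negative advantage.
value = lpro_objective([[1.5, 1.0], [0.5]], [1.0, -1.0])
```

In this toy group, the ratio 1.5 is capped at 1.28 by the upper bound and the ratio 0.5 is raised to 0.80 by the lower bound, so the two bounds act independently of each other.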

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | `Qwen/Qwen2.5-Math-1.5B-Instruct` |
| **Parameters** | 1.54B |
| **Architecture** | Transformer Decoder (28 layers, GQA, RoPE, SwiGLU, RMSNorm) |
| **Context length** | 8,192 tokens |
| **Vocabulary size** | 151,936 |
| **Training method** | LPRO (RL fine-tuning, no distillation) |
| **Training data** | 100 difficulty-filtered problems from MATH-500 |
| **Group size G** | 4 |
| **Length penalty λ** | 0.10 |
| **Learning rate** | 1e-6 |
| **PPO epochs/iter** | 4 |

---

## Benchmark Results

### Chain-of-Thought (CoT)

| Model | GSM8K | MATH-500 | MMLU-STEM | CMATH | GaoKao Cloze | GaoKao QA |
|---|---|---|---|---|---|---|
| Qwen2-Math-1.5B-Instruct | 84.2 | 69.4 | 54.9 | 79.6 | 59.7 | 50.7 |
| Qwen2.5-Math-1.5B-Instruct | 84.8 | 75.8 | 57.5 | 83.0 | 65.5 | 54.1 |
| **Nexus-1.5B** | **85.2** | **80.2** | **60.3** | **83.5** | **67.2** | **56.9** |

### Tool-Integrated Reasoning (TIR)

| Model | MATH-500 | Minerva Math | GaoKao 2023 EN | Olympiad Bench | College Math |
|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B-Instruct | 80.0 | 34.0 | 68.0 | 49.0 | 54.0 |
| **Nexus-1.5B** | **84.0** | **40.0** | **74.0** | **56.0** | **57.0** |

### Ablation: Effect of Length Penalty (λ)

| λ | MATH-500 Acc. | Avg. Response Length |
|---|---|---|
| 0.0 (GRPO baseline) | 77.4 | 312 tokens |
| **0.1 (Nexus-1.5B)** | **80.2** | **268 tokens** |
| 0.3 (over-penalized) | 78.0 | 201 tokens |

> **Key insight:** At λ=0.1, accuracy and conciseness improve simultaneously. The length penalty acts as a de-noising regularizer — discouraging redundant steps rather than suppressing genuinely long derivations.

---

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Dat1710/nexus-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Chain-of-Thought prompt
system_prompt = "Please reason step by step, and put your final answer within \\boxed{}."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Find all functions f: ℝ⁺ → ℝ⁺ such that for each x ∈ ℝ⁺, there is exactly one y ∈ ℝ⁺ satisfying xf(y) + yf(x) ≤ 2."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048,
    temperature=0.7,
    do_sample=True,
)

generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

### Tool-Integrated Reasoning (TIR)

```python
system_prompt = (
    "Please integrate natural language reasoning with programs to solve the problem above, "
    "and put your final answer within \\boxed{}."
)
```

---

## Evaluation Prompt Format

**CoT (8-shot for GSM8K, 4-shot for MATH-500):**
```
<|im_start|>system
Please reason step by step, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant
```

**TIR (zero-shot):**
```
<|im_start|>system
Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
{problem}<|im_end|>
<|im_start|>assistant
```
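
For readers who want to see the template as a string, the sketch below assembles the same ChatML layout by hand. `build_chatml_prompt` is a hypothetical helper, and the trailing newline after `assistant` is an assumption about the chat template; in practice `tokenizer.apply_chat_template(..., add_generation_prompt=True)` from the usage section is the safer route.

```python
COT_SYSTEM = "Please reason step by step, and put your final answer within \\boxed{}."


def build_chatml_prompt(problem, system_prompt=COT_SYSTEM):
    """Assemble a single-turn ChatML prompt (hypothetical helper)."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        "<|im_start|>assistant\n"  # assumption: generation starts on a new line
    )


prompt = build_chatml_prompt("What is 2 + 2?")
```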

---

## Training Details

### Data Curation

Training problems are sourced from **MATH-500** and filtered by difficulty using a learnable-zone criterion: a problem is retained if, among 8 sampled solutions from the base model, **between 2 and 5 are correct**. This yields 100 training problems that provide meaningful gradient signal — neither trivially easy nor intractably hard.
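
The learnable-zone filter can be sketched in a few lines. This is an illustration only: the helper names and the toy `results` dictionary are hypothetical, and the bounds are assumed to be inclusive (2 ≤ correct ≤ 5).

```python
def in_learnable_zone(correct_flags, low=2, high=5):
    """True if the number of correct samples lies in [low, high] (assumed inclusive)."""
    return low <= sum(correct_flags) <= high


def filter_problems(results):
    """Keep the ids of problems whose 8 sampled solutions fall in the learnable zone.

    results maps a problem id to a list of 8 booleans, one per sampled solution.
    """
    return [pid for pid, flags in results.items() if in_learnable_zone(flags)]


# Hypothetical toy data: 8 correctness flags per problem.
results = {
    "too_easy": [True] * 8,
    "learnable": [True, True, True, False, False, False, False, False],
    "too_hard": [False] * 8,
}
kept = filter_problems(results)
```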

### Training Procedure

1. **Group sampling:** For each prompt, sample G=4 responses from the current policy.
2. **Reward computation:** Rule-based binary reward (correctness via symbolic answer matching) + small format bonus (α=0.1) for well-formed `\boxed{}` output.
3. **Advantage computation:** Compute length-penalized group z-score advantages.
4. **Policy update:** Maximize LPRO objective for 4 epochs per iteration.
5. **Iterate:** Set old policy ← new policy and repeat.
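
Step 3 can be sketched as follows, using the advantage formula `Aᵢ = (rᵢ - μᵣ)/(σᵣ + ε) - λ·(Lᵢ - μ_L)/(σ_L + ε)` from the LPRO section. The function name is hypothetical, and two details are assumptions: the standard deviation is taken as the population standard deviation over the group, and `ε = 1e-6`.

```python
from statistics import mean, pstdev


def length_penalized_advantages(rewards, lengths, lam=0.10, eps=1e-6):
    """Group z-score of the reward minus lam times the group z-score of the length."""
    mu_r, sigma_r = mean(rewards), pstdev(rewards)   # assumption: population std
    mu_l, sigma_l = mean(lengths), pstdev(lengths)
    return [
        (r - mu_r) / (sigma_r + eps) - lam * (length - mu_l) / (sigma_l + eps)
        for r, length in zip(rewards, lengths)
    ]


# One group of G=4 responses: two correct (reward 1.1), one wrong but
# well-formed (0.1), one wrong and malformed (0.0).
rewards = [1.1, 0.0, 1.1, 0.1]
lengths = [200, 300, 400, 300]
advantages = length_penalized_advantages(rewards, lengths)
```

Responses 0 and 2 earn the same reward, but the shorter one (200 tokens) ends up with the larger advantage, which is the intended effect of the penalty.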

### Reward Function

$$r_i = \mathbf{1}[\hat{a}(o_i) = a^*] + 0.1 \cdot \mathbf{1}[\text{format}(o_i)]$$

where $\hat{a}(o_i)$ is the extracted answer from the last `\boxed{}` expression, verified via symbolic equivalence.
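
A minimal sketch of this reward is shown below. All helper names are hypothetical, and one simplification is made on purpose: the symbolic-equivalence check is replaced by a whitespace-insensitive string comparison, so answers such as `0.5` and `\frac{1}{2}` would not match here as they should under a real symbolic verifier.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...}, or None if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def normalize(answer):
    """Stand-in for symbolic equivalence: drop all whitespace."""
    return "".join(answer.split())


def reward(response, reference, alpha=0.1):
    """Binary correctness plus a format bonus of alpha for a well-formed box."""
    answer = extract_last_boxed(response)
    well_formed = answer is not None
    correct = well_formed and normalize(answer) == normalize(reference)
    return float(correct) + alpha * float(well_formed)


r_correct = reward("So x = \\boxed{\\frac{1}{2}}", "\\frac{1}{2}")
r_wrong = reward("Hence \\boxed{3}", "4")
r_unboxed = reward("The answer is 4", "4")
```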

---

## Limitations

- **Scale:** Nexus-1.5B operates at 1.54B parameters. Hard olympiad problems (e.g., AIME) remain challenging for models at this scale.
- **Language:** Primarily optimized for English and Chinese mathematical text. Performance on other languages is not evaluated.
- **Domain:** Designed for mathematical reasoning. General language understanding or instruction-following tasks are outside the model's training distribution.
- **TIR dependency:** Tool-integrated reasoning requires a sandboxed Python interpreter at inference time.

---

## Citation

If you use Nexus-1.5B in your research, please cite:

```bibtex
@techreport{neuriton2026nexus,
  title     = {Nexus-1.5B: Length-Penalized Reward Optimization for Robust Mathematical Reasoning},
  author    = {Neuriton Team},
  institution = {Neuriton},
  year      = {2026},
  month     = {Summer},
  note      = {Technical Report}
}
```

---

## Acknowledgements

We thank the Qwen Team at Alibaba Group for open-sourcing the Qwen2.5-Math model family, and the authors of DAPO for the asymmetric clipping insight that is central to LPRO.

---

*Developed by [Neuriton](https://neuriton.ai) · Summer 2026*