| library_name: transformers | |
| tags: | |
| - math | |
| - reasoning | |
| - reinforcement-learning | |
| - qwen2 | |
| - mathematics | |
| - chain-of-thought | |
| license: apache-2.0 | |
| language: | |
| - en | |
| - zh | |
| base_model: Qwen/Qwen2.5-Math-1.5B-Instruct | |
| pipeline_tag: text-generation | |
| # Nexus-1.5B | |
| <p align="center"> | |
| <img src="https://img.shields.io/badge/Base%20Model-Qwen2.5--Math--1.5B--Instruct-orange" /> | |
| <img src="https://img.shields.io/badge/Parameters-1.54B-blue" /> | |
| <img src="https://img.shields.io/badge/Method-LPRO-green" /> | |
| <img src="https://img.shields.io/badge/MATH--500-80.2-red" /> | |
| <img src="https://img.shields.io/badge/GSM8K-85.2-red" /> | |
| </p> | |
| **Nexus-1.5B** is a 1.54-billion-parameter mathematical reasoning model developed by [Neuriton](https://www.facebook.com/neuriton), trained via **Length-Penalized Reward Optimization (LPRO)** — a novel reinforcement learning alignment method that improves both accuracy and response conciseness simultaneously. | |
| Built on top of `Qwen2.5-Math-1.5B-Instruct`, Nexus-1.5B achieves **80.2 on MATH-500** and **85.2 on GSM8K** (CoT), surpassing its base model by **+4.4 points** on MATH-500. Relative to the λ=0 GRPO baseline in the ablation, it also reduces average response length by **14%** (312 → 268 tokens). | |
| --- | |
| ## What is LPRO? | |
| Standard GRPO (Group Relative Policy Optimization) suffers from two key problems: | |
| 1. **Length bias** — short responses receive disproportionately large gradient signals, implicitly penalizing long correct derivations. | |
| 2. **Entropy collapse** — symmetric probability-ratio clipping causes the policy to converge to a narrow set of solution patterns, limiting further improvement. | |
| **LPRO** fixes both with three targeted modifications: | |
| | Component | What it does | | |
| |---|---| | |
| | **Asymmetric clipping** | Decouples the lower and upper clip bounds (`ε_low=0.20`, `ε_high=0.28`) to preserve policy entropy | | |
| | **Token-level normalization** | Replaces GRPO's per-response averaging (`1/G · 1/|oᵢ|`) with a single global weight `1/Σᵢ|oᵢ|`, so every token contributes equally to the gradient regardless of response length | |
| | **Length-penalized advantage** | Adds a group-standardized length penalty: `Aᵢ = (rᵢ - μᵣ)/(σᵣ + ε) - λ·(Lᵢ - μ_L)/(σ_L + ε)` | | |
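The length-penalized advantage in the table can be sketched in a few lines of plain Python. This is an illustration, not the authors' training code: the function name, the use of the population standard deviation, and the stabilizer `eps=1e-6` (the `ε` in the denominators, distinct from the clip bounds) are assumptions.

```python
from statistics import mean, pstdev


def length_penalized_advantages(rewards, lengths, lam=0.10, eps=1e-6):
    """A_i = (r_i - mu_r)/(sigma_r + eps) - lam * (L_i - mu_L)/(sigma_L + eps)."""
    mu_r, sigma_r = mean(rewards), pstdev(rewards)  # assumption: population std
    mu_l, sigma_l = mean(lengths), pstdev(lengths)
    return [
        (r - mu_r) / (sigma_r + eps) - lam * (l - mu_l) / (sigma_l + eps)
        for r, l in zip(rewards, lengths)
    ]


# Hypothetical group of G=4 responses: two correct (with format bonus), two incorrect.
rewards = [1.1, 1.1, 0.0, 0.0]
lengths = [200, 400, 300, 300]  # response lengths in tokens
advantages = length_penalized_advantages(rewards, lengths)
```

With equal rewards, the shorter response (index 0) ends up with a larger advantage than the longer one (index 1), while both correct responses stay above the incorrect ones.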
| The final objective is: | |
| $$\mathcal{J}_{\text{LPRO}}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G}|o_i|} \sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \min\!\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}_{\text{asym}}(r_{i,t}(\theta))\,\hat{A}_{i,t}\right)\right]$$ | |
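The objective can be sketched as follows. The sketch assumes $\text{clip}_{\text{asym}}(r) = \text{clip}(r,\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}})$, the DAPO-style form suggested by the component table, and that each response's advantage is broadcast to all of its tokens ($\hat{A}_{i,t} = A_i$). The function names and the nested-list input layout are hypothetical; a real implementation would operate on batched tensors.

```python
def clip_asym(ratio, eps_low=0.20, eps_high=0.28):
    """Clip the probability ratio to [1 - eps_low, 1 + eps_high]."""
    return max(1.0 - eps_low, min(ratio, 1.0 + eps_high))


def lpro_surrogate(ratios, advantages, eps_low=0.20, eps_high=0.28):
    """Token-level normalized clipped surrogate for one group.

    ratios[i][t] is r_{i,t}; advantages[i] is applied to every token of response i.
    """
    total_tokens = sum(len(resp) for resp in ratios)  # sum_i |o_i|
    total = 0.0
    for resp_ratios, adv in zip(ratios, advantages):
        for r in resp_ratios:
            total += min(r * adv, clip_asym(r, eps_low, eps_high) * adv)
    return total / total_tokens


# Hypothetical group: one 2-token response with A=+1, one 1-token response with A=-1.
value = lpro_surrogate([[1.0, 1.5], [0.5]], [1.0, -1.0])
```

In this toy group the ratio 1.5 is capped at 1.28 for the positive advantage, and the ratio 0.5 is raised to 0.80 for the negative one, so the result is (1.0 + 1.28 - 0.80) / 3.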
| --- | |
| ## Model Details | |
| | Property | Value | | |
| |---|---| | |
| | **Base model** | `Qwen/Qwen2.5-Math-1.5B-Instruct` | | |
| | **Parameters** | 1.54B | | |
| | **Architecture** | Transformer Decoder (28 layers, GQA, RoPE, SwiGLU, RMSNorm) | | |
| | **Context length** | 8,192 tokens | | |
| | **Vocabulary size** | 128,256 | | |
| | **Training method** | LPRO (RL fine-tuning, no distillation) | | |
| | **Training data** | 100 difficulty-filtered problems from MATH-500 | | |
| | **Group size G** | 4 | | |
| | **Length penalty λ** | 0.10 | | |
| | **Learning rate** | 1e-6 | | |
| | **PPO epochs/iter** | 4 | | |
| --- | |
| ## Benchmark Results | |
| ### Chain-of-Thought (CoT) | |
| | Model | GSM8K | MATH-500 | MMLU-STEM | CMATH | GaoKao Cloze | GaoKao QA | | |
| |---|---|---|---|---|---|---| | |
| | Qwen2-Math-1.5B-Instruct | 84.2 | 69.4 | 54.9 | 79.6 | 59.7 | 50.7 | | |
| | Qwen2.5-Math-1.5B-Instruct | 84.8 | 75.8 | 57.5 | 83.0 | 65.5 | 54.1 | | |
| | **Nexus-1.5B** | **85.2** | **80.2** | **60.3** | **83.5** | **67.2** | **56.9** | | |
| ### Tool-Integrated Reasoning (TIR) | |
| | Model | MATH-500 | Minerva Math | GaoKao 2023 EN | Olympiad Bench | College Math | | |
| |---|---|---|---|---|---| | |
| | Qwen2.5-Math-1.5B-Instruct | 80.0 | 34.0 | 68.0 | 49.0 | 54.0 | | |
| | **Nexus-1.5B** | **84.0** | **40.0** | **74.0** | **56.0** | **57.0** | | |
| ### Ablation: Effect of Length Penalty (λ) | |
| | λ | MATH-500 Acc. | Avg. Response Length | | |
| |---|---|---| | |
| | 0.0 (GRPO baseline) | 77.4 | 312 tokens | | |
| | **0.1 (Nexus-1.5B)** | **80.2** | **268 tokens** | | |
| | 0.3 (over-penalized) | 78.0 | 201 tokens | | |
| > **Key insight:** At λ=0.1, accuracy and conciseness improve simultaneously. The length penalty acts as a de-noising regularizer — discouraging redundant steps rather than suppressing genuinely long derivations. | |
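The relative changes behind the ablation can be recomputed directly from the table. The snippet below is only arithmetic on the reported numbers; the helper name is hypothetical.

```python
# (MATH-500 accuracy, average response length in tokens), taken from the ablation table.
ablation = {0.0: (77.4, 312), 0.1: (80.2, 268), 0.3: (78.0, 201)}


def relative_changes(table, baseline=0.0):
    """Return {lambda: (accuracy delta in points, length change in percent)} vs. the baseline row."""
    base_acc, base_len = table[baseline]
    return {
        lam: (acc - base_acc, 100.0 * (length - base_len) / base_len)
        for lam, (acc, length) in table.items()
    }


changes = relative_changes(ablation)
for lam, (acc_delta, len_pct) in changes.items():
    print(f"lambda={lam}: {acc_delta:+.1f} points, {len_pct:+.1f}% length")
```

At λ=0.1 this gives +2.8 points with responses about 14.1% shorter; at λ=0.3 the length drops by about 35.6% but the accuracy gain shrinks to +0.6 points.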
| --- | |
| ## How to Use | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model_name = "Dat1710/nexus-1.5b" | |
| # Load the tokenizer and model; dtype and device placement are chosen automatically | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     model_name, | |
|     torch_dtype="auto", | |
|     device_map="auto", | |
| ) | |
| # Chain-of-Thought prompt | |
| system_prompt = "Please reason step by step, and put your final answer within \\boxed{}." | |
| messages = [ | |
|     {"role": "system", "content": system_prompt}, | |
|     {"role": "user", "content": "Find all functions f: ℝ⁺ → ℝ⁺ such that for each x ∈ ℝ⁺, there is exactly one y ∈ ℝ⁺ satisfying xf(y) + yf(x) ≤ 2."}, | |
| ] | |
| # Render the conversation with the model's chat template | |
| text = tokenizer.apply_chat_template( | |
|     messages, | |
|     tokenize=False, | |
|     add_generation_prompt=True, | |
| ) | |
| model_inputs = tokenizer([text], return_tensors="pt").to(model.device) | |
| # Sample a response | |
| generated_ids = model.generate( | |
|     **model_inputs, | |
|     max_new_tokens=2048, | |
|     temperature=0.7, | |
|     do_sample=True, | |
| ) | |
| # Drop the prompt tokens so only the newly generated tokens are decoded | |
| generated_ids = [ | |
|     output_ids[len(input_ids):] | |
|     for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) | |
| ] | |
| response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] | |
| print(response) | |
| ``` | |
| ### Tool-Integrated Reasoning (TIR) | |
| ```python | |
| system_prompt = ( | |
| "Please integrate natural language reasoning with programs to solve the problem above, " | |
| "and put your final answer within \\boxed{}." | |
| ) | |
| ``` | |
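In TIR mode the response interleaves text with programs that must be run externally. The sketch below only extracts candidate programs. It assumes the model emits them in fenced `python` code blocks, which this card does not state. The helper name is hypothetical, and execution is deliberately left to a sandboxed interpreter of your choice.

```python
import re

# Built programmatically so the fence marker does not clash with this example's own fence.
FENCE = "`" * 3
CODE_BLOCK_RE = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_blocks(response: str) -> list[str]:
    """Return the source of every fenced python block in a model response, in order."""
    return [block.strip() for block in CODE_BLOCK_RE.findall(response)]


# Hypothetical response fragment, used only to exercise the helper.
sample = (
    "Let's compute the sum.\n"
    + FENCE + "python\nprint(1 + 1)\n" + FENCE
    + "\nThe answer is \\boxed{2}."
)
programs = extract_python_blocks(sample)
```

Each extracted program should be run in an isolated sandbox rather than with `exec` in the serving process.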
| --- | |
| ## Evaluation Prompt Format | |
| **CoT (8-shot for GSM8K, 4-shot for MATH-500):** | |
| ``` | |
| <|im_start|>system | |
| Please reason step by step, and put your final answer within \boxed{}.<|im_end|> | |
| <|im_start|>user | |
| {problem}<|im_end|> | |
| <|im_start|>assistant | |
| ``` | |
| **TIR (zero-shot):** | |
| ``` | |
| <|im_start|>system | |
| Please integrate natural language reasoning with programs to solve the problem above, | |
| and put your final answer within \boxed{}.<|im_end|> | |
| <|im_start|>user | |
| {problem}<|im_end|> | |
| <|im_start|>assistant | |
| ``` | |
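For pipelines that work on raw strings, the templates above can be reproduced with a small helper. This is a sketch with a hypothetical function name. In practice `tokenizer.apply_chat_template` is expected to produce the same layout and is the safer choice.

```python
COT_SYSTEM = "Please reason step by step, and put your final answer within \\boxed{}."
TIR_SYSTEM = (
    "Please integrate natural language reasoning with programs to solve the problem above, "
    "and put your final answer within \\boxed{}."
)


def build_chatml_prompt(problem: str, system: str = COT_SYSTEM) -> str:
    """Format a single-turn prompt in the ChatML layout shown above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt("What is 7 * 8?")
```

Passing `system=TIR_SYSTEM` yields the tool-integrated variant.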
| --- | |
| ## Training Details | |
| ### Data Curation | |
| Training problems are sourced from **MATH-500** and filtered by difficulty using a learnable-zone criterion: a problem is retained if, among 8 sampled solutions from the base model, **between 2 and 5 are correct**. This yields 100 training problems that provide meaningful gradient signal — neither trivially easy nor intractably hard. | |
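The learnable-zone criterion amounts to a simple filter over per-problem correct counts. The sketch below assumes the bounds are inclusive (2 ≤ correct ≤ 5 out of 8 samples). The function names and problem ids are hypothetical.

```python
def in_learnable_zone(num_correct: int, low: int = 2, high: int = 5) -> bool:
    """True if the base model solved the problem in between `low` and `high` of its samples."""
    return low <= num_correct <= high


def filter_problems(correct_counts: dict, low: int = 2, high: int = 5) -> list:
    """correct_counts maps a problem id to how many of the 8 base-model samples were correct."""
    return [pid for pid, n in correct_counts.items() if in_learnable_zone(n, low, high)]


# Hypothetical counts: p1 is too hard, p4 and p5 are too easy.
counts = {"p1": 0, "p2": 2, "p3": 5, "p4": 8, "p5": 6}
kept = filter_problems(counts)
```

Problems solved in every sample or in none would give all responses in a group the same reward, so their group-standardized advantage carries no correctness signal.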
| ### Training Procedure | |
| 1. **Group sampling:** For each prompt, sample G=4 responses from the current policy. | |
| 2. **Reward computation:** Rule-based binary reward (correctness via symbolic answer matching) + small format bonus (α=0.1) for well-formed `\boxed{}` output. | |
| 3. **Advantage computation:** Compute length-penalized group z-score advantages. | |
| 4. **Policy update:** Maximize LPRO objective for 4 epochs per iteration. | |
| 5. **Iterate:** Set old policy ← new policy and repeat. | |
| ### Reward Function | |
| $$r_i = \mathbf{1}[\hat{a}(o_i) = a^*] + 0.1 \cdot \mathbf{1}[\text{format}(o_i)]$$ | |
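| where $\hat{a}(o_i)$ is the answer extracted from the last `\boxed{}` expression in $o_i$, $a^*$ is the reference answer, equality is checked via symbolic equivalence, and $\text{format}(o_i)$ indicates a well-formed `\boxed{}` output. | |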
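A minimal sketch of this reward is given below. The extraction of the last `\boxed{}` follows the description above, but the correctness check is a stand-in: it compares whitespace-stripped strings, whereas the card specifies symbolic equivalence. Treating "an extractable, balanced `\boxed{}`" as the format condition is also an assumption, and both function names are hypothetical.

```python
def last_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def reward(response: str, reference: str, format_bonus: float = 0.1) -> float:
    """r = 1[answer matches reference] + format_bonus * 1[well-formed boxed answer]."""
    answer = last_boxed(response)
    well_formed = answer is not None
    # Assumption: string equality stands in for the symbolic equivalence check.
    correct = well_formed and answer.replace(" ", "") == reference.replace(" ", "")
    return float(correct) + format_bonus * float(well_formed)


r_correct = reward("So the result is \\boxed{\\frac{1}{2}}.", "\\frac{1}{2}")
r_wrong = reward("Therefore \\boxed{3}.", "4")
r_unboxed = reward("The answer is 4.", "4")
```

Under these assumptions the three calls give 1.1, 0.1, and 0.0 respectively: the format bonus is granted whenever a boxed answer is present, even if it is wrong.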
| --- | |
| ## Limitations | |
| - **Scale:** Nexus-1.5B operates at 1.54B parameters. Hard olympiad problems (e.g., AIME) remain challenging for models at this scale. | |
| - **Language:** Primarily optimized for English and Chinese mathematical text. Performance on other languages has not been evaluated. | |
| - **Domain:** Designed for mathematical reasoning. General language understanding or instruction-following tasks are outside the model's training distribution. | |
| - **TIR dependency:** Tool-integrated reasoning requires a sandboxed Python interpreter at inference time. | |
| --- | |
| ## Citation | |
| If you use Nexus-1.5B in your research, please cite: | |
| ```bibtex | |
| @techreport{neuriton2026nexus, | |
| title = {Nexus-1.5B: Length-Penalized Reward Optimization for Robust Mathematical Reasoning}, | |
| author = {Neuriton Team}, | |
| institution = {Neuriton}, | |
| year = {2026}, | |
| month = {Summer}, | |
| note = {Technical Report} | |
| } | |
| ``` | |
| --- | |
| ## Acknowledgements | |
| We thank the Qwen Team at Alibaba Group for open-sourcing the Qwen2.5-Math model family, and the authors of DAPO for the asymmetric clipping insight that is central to LPRO. | |
| --- | |
| *Developed by [Neuriton](https://neuriton.ai) · Summer 2026* |