---
language:
- en
- zh
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- small-language-model
- efficient-training
- xmodel
- xiaoduo-ai
library_name: transformers
---

# Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM

<h5 align="center">

[🤗 Hugging Face Model](https://huggingface.co/XiaoduoAILab/Xmodel-2.5) ·
[arXiv Paper](https://arxiv.org/abs/2511.19496) ·
[License](https://github.com/XiaoduoAILab/Xmodel-2.5/blob/main/LICENSE) ·
[GitHub Repository](https://github.com/XiaoduoAILab/Xmodel-2.5)

</h5>

## Model Description

Xmodel-2.5 is a 1.3-billion-parameter small language model designed as a **lightweight agent core** for complex reasoning tasks. It builds on Xmodel-2 with four key upgrades:

1. **Full μP support**: Megatron-LM extended with maximal update parameterization (μP) for reliable hyperparameter transfer
2. **Efficient tokenizer**: the 129K-token DeepSeek-v3 tokenizer, adopted for a better compression rate and faster decoding
3. **FP8 mixed precision**: E4M3 forward and E5M2 backward FP8 formats to balance precision and throughput (see the sketch after this list)
4. **Optimizer scheduling**: a switch from AdamW to Muon during the decay phase, which significantly improves downstream task performance
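
For a concrete picture of item 3, here is a minimal sketch of a hybrid E4M3-forward / E5M2-backward setup using NVIDIA Transformer Engine. Xmodel-2.5 is trained with Megatron-LM, so this snippet is an illustrative stand-in rather than the project's actual training code:

```python
# Sketch only: hybrid FP8 forward/backward with Transformer Engine.
# Format.HYBRID selects E4M3 for the forward pass and E5M2 for gradients.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

# Layer sizes chosen to match the model card's hidden/intermediate dims.
layer = te.Linear(1536, 3840).cuda()
x = torch.randn(16, 1536, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)      # forward computed in E4M3
y.sum().backward()    # gradients propagated in E5M2
```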

Trained on only 1.4T tokens, Xmodel-2.5 achieves **52.49%** average accuracy across 13 reasoning benchmarks, ranking second among 1-2B-parameter models, behind only Qwen3-1.7B (56.96%) while using 25.7× fewer training tokens.

## Model Architecture

| Hyperparameter | Value |
|----------------|-------|
| Hidden size | 1536 |
| Intermediate size | 3840 |
| Transformer layers | 48 |
| Attention heads (Q) | 24 |
| KV heads (GQA) | 8 |
| Sequence length | 3712 |
| Max position embeddings | 131072 |
| RoPE base | 500000 |
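
To sanity-check these values against the released checkpoint, you can inspect its configuration without loading weights. The attribute names below follow the common Llama-style convention and are an assumption, since the checkpoint ships a custom config class:

```python
from transformers import AutoConfig

# Load only the configuration (no weights). Attribute names are assumed
# to follow the usual Llama-style convention and may differ here.
config = AutoConfig.from_pretrained("XiaoduoAILab/Xmodel-2.5", trust_remote_code=True)
print(config.hidden_size)           # expected: 1536
print(config.num_hidden_layers)     # expected: 48
print(config.num_attention_heads)   # expected: 24
print(config.num_key_value_heads)   # expected: 8 (GQA)
```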

## Intended Uses & Limitations

### Intended Uses
- Complex reasoning tasks
- Lightweight AI agent applications
- Educational and research purposes
- Deployment in resource-constrained environments

### Limitations
- Capacity is bounded by the 1.3B-parameter scale
- May struggle in highly specialized domains
- Performance may vary on non-English languages
|
| | ## Training Details |
| |
|
| | ### Training Strategy |
| | - **Three-stage WSD curriculum**: 560k steps, 1.4T tokens |
| | - **Warmup phase**: 2k steps, linear learning rate increase |
| | - **Stable phase**: 530k steps, gradually increasing batch size |
| | - **Decay phase**: 20k steps, mixing 66.9% high-quality SFT data |
| | - **Long-context adaptation**: 10k additional steps for 16K context support |
| |
|
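
The step counts above translate into a learning-rate schedule along these lines. The peak rate and the exact ramp and decay shapes are illustrative assumptions, not values from the paper:

```python
# Hedged sketch of a WSD (warmup-stable-decay) learning-rate schedule.
# peak_lr and the linear ramp/decay shapes are assumptions for illustration.
def wsd_lr(step: int, peak_lr: float = 1e-2,
           warmup: int = 2_000, stable: int = 530_000,
           decay: int = 20_000) -> float:
    if step < warmup:                        # linear warmup over 2k steps
        return peak_lr * step / warmup
    if step < warmup + stable:               # constant plateau for 530k steps
        return peak_lr
    t = min(step - warmup - stable, decay)   # 20k-step decay to zero
    return peak_lr * (1.0 - t / decay)
```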

### Key Innovations
- **μP hyperparameter transfer**: hyperparameters tuned on a 20M-parameter proxy model transfer directly to the full model (see the scaling sketch after this list)
- **Optimizer switching**: AdamW → Muon during the decay phase for improved reasoning performance
- **FP8 mixed precision**: the hybrid FP8 format significantly improves training efficiency
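
Under μP with an Adam-family optimizer, the standard recipe scales the learning rate of matrix-like hidden weights inversely with model width, while embeddings keep the base rate. A rough illustration, with the proxy width as an assumed value:

```python
# Sketch of the muP width-scaling rule for Adam-style optimizers:
# hidden (matrix-like) weights use base_lr / width_multiplier, while
# embeddings and biases keep base_lr. The proxy width is an assumption.
PROXY_WIDTH = 256     # assumed hidden size of the 20M proxy model
TARGET_WIDTH = 1536   # hidden size of Xmodel-2.5

def mup_lr(base_lr: float, matrix_like: bool) -> float:
    width_mult = TARGET_WIDTH / PROXY_WIDTH
    return base_lr / width_mult if matrix_like else base_lr

print(mup_lr(1e-2, matrix_like=True))   # hidden weights: 1e-2 / 6
print(mup_lr(1e-2, matrix_like=False))  # embeddings/biases: 1e-2
```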

## Performance

### Comprehensive Reasoning Performance

| Model | Parameters | Training Tokens | 13-Task Average |
|-------|------------|-----------------|-----------------|
| Qwen3-1.7B | 1.7B | 36T | 56.96% |
| **Xmodel-2.5** | **1.3B** | **1.4T** | **52.49%** |
| Xmodel-2-1.2B | 1.2B | 1.5T | 50.34% |
| InternLM2.5-1.8B | 1.8B | - | 50.19% |
| MiniCPM-1B | 1B | - | 48.95% |
| SmolLM2-1.7B | 1.7B | 11T | 46.88% |
| Llama-3.2-1B | 1B | 9T | 44.72% |

### Detailed Task Performance

| Task | Xmodel-2.5 | Xmodel-2 | Improvement |
|------|------------|----------|-------------|
| ARC-Challenge | 48.89 | 46.16 | +2.73 |
| ARC-Easy | 76.94 | 76.22 | +0.72 |
| PIQA | 75.95 | 75.14 | +0.81 |
| HellaSwag | 67.24 | 64.05 | +3.19 |
| WinoGrande | 64.64 | 64.25 | +0.39 |
| BBH | 54.58 | 48.90 | +5.68 |
| MMLU | 51.81 | 49.98 | +1.83 |
| GSM8k | 58.98 | 56.56 | +2.42 |
| MATH | 28.94 | 25.64 | +3.30 |
| HumanEval | 28.66 | 29.27 | -0.61 |
| MBPP | 33.00 | 30.80 | +2.20 |
| CMMLU | 47.16 | 44.29 | +2.87 |
| C-Eval | 45.54 | 43.16 | +2.38 |

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "XiaoduoAILab/Xmodel-2.5"

# Load the model and tokenizer (trust_remote_code is required for the
# custom model/tokenizer classes shipped with the checkpoint).
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build a chat-formatted prompt.
prompt = "Explain the concept of transfer learning in machine learning."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate a response with nucleus sampling.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
output = tokenizer.decode(
    generated_ids[0][len(model_inputs.input_ids[0]):],
    skip_special_tokens=True,
)
print("Generated Response:")
print(output)
```

## Citation

If you find Xmodel-2.5 useful for your research or applications, please consider citing our work:

```bibtex
@misc{liu2025xmodel25,
  title={Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM},
  author={Yang Liu and Xiaolong Zhong and Ling Jiang},
  year={2025},
  eprint={2511.19496},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.19496},
}
```

## Contact

For questions or suggestions, please contact us through:
- GitHub Issues: [Xmodel-2.5 Issues](https://github.com/XiaoduoAILab/Xmodel-2.5/issues)
- Email: foamilu@yeah.net

## License

This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.