---
license: mit
language:
- en
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
tags:
- chat
library_name: transformers
---

# Confucius3-Math
<p align="center">
  <img width="25%" src="figures/confucius_logo.png">
</p>

## Introduction
<div style="display: flex; justify-content: center; gap: 20px;">

<a href="https://confucius.youdao.com">💬DEMO</a>

<a href="https://github.com/netease-youdao/Confucius3-Math">🌟Github</a>

<a href="https://github.com/netease-youdao/Confucius3-Math/blob/main/Confucius3-Math.pdf">🎓Paper</a>

</div>

**Confucius3-Math** is a 14B-parameter open-source reasoning LLM developed by the NetEase Youdao AI Team, specifically optimized for K-12 mathematics education. Unlike general-purpose models, Confucius3-Math offers:
- ✅ **SOTA Performance on Math Tasks**: outperforms larger models on Chinese K-12 math problems through specialized RL training
- ✅ **Cost-Effective Deployment**: runs efficiently on a single consumer-grade GPU (e.g., RTX 4090D)
- ✅ **Cultural & Curriculum Alignment**: optimized for China's national mathematics standards and problem-solving methodologies

Confucius3-Math was developed through an RL-only post-training process with a novel data scheduling policy and an improved group-relative advantage estimator. Please refer to our technical report for details.
<p></p>
<p align="center">
  <img width="85%" src="figures/benchmark.png">
</p>

## Evaluation Results

<div align="center">


| Benchmark | DeepSeek-R1 | Qwen3-14B | QwQ-32B | DeepSeek-R1-Distill-Qwen-14B | Confucius3-Math |
|-------------------|----------------------|------------|--------------|----------------|------------|
| CK12-MATH | 92.74 | 94.04 | 93.60 | 82.86 | **96.24** |
| GAOKAO-Bench (math) | 93.27 | 94.44 | 94.93 | 86.75 | **98.46** |
| MathBench (K12) | 89.99 | 96.51 | **96.57** | 88.40 | 95.10 |
| CMATH | 95.81 | 95.90 | 95.95 | 77.41 | **96.13** |
| MATH-500 | 97.30 | 96.80 | 98.00 | 93.90 | **98.80** |
| AIME 2024 | 79.80 | 79.30 | 79.50 | 69.70 | **81.15** |
| AIME 2025 | 70.00 | **70.40** | 69.50 | 42.97 | 69.95 |

</div>



## Limitations

There are some limitations that should be stated up front:
1. **Scenario Limitations**: The model is optimized only on data from the K-12 mathematics scenario, and its effectiveness has been verified only on math-related benchmarks. Its performance in non-mathematical scenarios has not been tested, so we cannot guarantee quality or effectiveness in other fields.
2. **Invalid Results**: The model may occasionally fall into circular reasoning. Since explicit identifiers separate the thinking and summary parts, outputs produced in this mode may be invalid and fail to parse.
3. **Safety and Ethics**: The model has not been optimized or tested for safety and ethical alignment. Its outputs do not represent the official positions, views, or attitudes of our company. Users should independently judge the reasonableness and applicability of the output and comply with relevant laws, regulations, and social ethics.

## Quickstart
The runtime requirements are exactly the same as those of the [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) model, so you can use Transformers or vLLM to load the model, run inference, and deploy your services.

The only thing to pay attention to is that requests should use the predefined system and user message templates below. Other templates may also work, but we have not tested them yet.
```python
SYSTEM_PROMPT_TEMPLATE = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>."""

USER_PROMPT_TEMPLATE = """{question}"""
```

Then you can create your `messages` as follows and use them to request results from the model. Just fill your instructions into the `question` field.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "netease-youdao/Confucius3-Math"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Fill in your instructions here, e.g. a K-12 math problem
question = "..."

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT_TEMPLATE},
    {'role': 'user', 'content': USER_PROMPT_TEMPLATE.format(question=question)},
]

# Apply the chat template and tokenize
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
> [!NOTE]
> **Generation Parameters**: We suggest sampling with temperature=1.0 and top_p=0.7.
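
If you prefer vLLM, the same templates and sampling settings apply. The following is a minimal offline-inference sketch rather than an official recipe; it assumes vLLM is installed and reuses the `SYSTEM_PROMPT_TEMPLATE` and `USER_PROMPT_TEMPLATE` defined above.
```python
# Minimal vLLM offline-inference sketch (assumes `pip install vllm`).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "netease-youdao/Confucius3-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)

# Suggested sampling settings: temperature=1.0, top_p=0.7
sampling_params = SamplingParams(temperature=1.0, top_p=0.7, max_tokens=32768)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT_TEMPLATE},
    {"role": "user", "content": USER_PROMPT_TEMPLATE.format(question="...")},  # your question here
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
response = outputs[0].outputs[0].text
```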

After obtaining the model results, you can parse out the "thinking" and "summary" parts as follows.
```python
import re

def parse_result_nostep(result):
    # The thinking part is wrapped in <think> ... </think>; everything after it is the summary
    think_pattern = r"<think>(.*?)</think>(.*)"

    think_list = re.findall(think_pattern, result, re.DOTALL)

    assert len(think_list) == 1, \
        f"The parsing results do not meet the expectations.\n{result}"

    think = think_list[0][0].strip()
    summary = think_list[0][1].strip()
    return think, summary

thinking, summary = parse_result_nostep(response)
```
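
As noted in the Limitations section, the model may occasionally produce output that does not match the expected `<think> ... </think>` structure, in which case `parse_result_nostep` raises an `AssertionError`. The wrapper below is a small, hypothetical sketch for handling such cases gracefully; it is not part of the official API.
```python
# Hypothetical defensive wrapper: returns (None, raw_result) when the output
# cannot be split into thinking and summary parts (see Limitation 2 above).
def try_parse_result(result):
    try:
        return parse_result_nostep(result)
    except AssertionError:
        return None, result.strip()

thinking, summary = try_parse_result(response)
if thinking is None:
    print("Output could not be parsed; consider re-sampling the question.")
```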

## Citation
If you find our work helpful, feel free to cite us.
```
@misc{confucius3-math,
  author = {NetEase Youdao Team},
  title = {Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning},
  url = {https://arxiv.org/abs/2506.18330},
  month = {June},
  year = {2025}
}
```