---
license: apache-2.0
language:
- en
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
tags:
- chat
library_name: transformers
---

# Confucius3-Math

## Introduction

**Confucius3-Math** is a reasoning model developed by the NetEase Youdao Team. It is a 14B-parameter large language model that (1) runs efficiently on a single consumer-grade GPU and (2) achieves state-of-the-art (SOTA) performance on a range of mathematical reasoning tasks, outperforming many models of significantly larger size. Built via post-training with a pure large-scale reinforcement learning (RL) process, Confucius3-Math is the first open-source model that excels at solving mainstream Chinese K-12 mathematical problems at low cost.

## Model Summary

**Selection of the Base Model**

- We selected the open-source [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) model as the starting point for training Confucius3-Math. We made this choice because the model already exhibits robust chain-of-thought capabilities and holds a greater initial edge in mathematics than other models of comparable scale. Moreover, the responses in its answer section align with our expectations for an educational model.

---

**Reinforcement Learning**

- In particular, we introduce Recent Sample Recovery, a novel data-scheduling policy, and Policy-Specific Hardness Weighting, an improved group-relative advantage estimator. Together, they significantly improve data efficiency, stabilize the RL training process, and boost performance. By integrating the DAPO algorithm with these two new techniques, we achieved SOTA results on several Chinese K-12 mathematics test sets.
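Our improved estimator is not detailed here, but the baseline group-relative advantage computation it builds on can be sketched as follows (a minimal illustration, not our training implementation; the reward values are made up):

```python
# Baseline group-relative advantage estimation (GRPO-style):
# sample a group of responses per question, then normalize each
# response's reward by the group's mean and standard deviation.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per sampled response in the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled responses to one question, reward 1.0 = correct.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct responses in a group receive positive advantages and incorrect ones negative, which is what makes relative hardness within a group observable to the trainer.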

---

**Data Formatting**

- We standardize the output format of the model as follows: the chain-of-thought process is output in the `<think></think>` block, and then the step-by-step problem-solving process is summarized in the `<answer></answer>` block.
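A response in this format can be checked with a simple regular expression (an illustrative sketch; the sample string is fabricated, not actual model output):

```python
import re

# A fabricated response in the standardized output format.
sample = "<think> 1 + 1 means adding one and one. </think> <answer> 1 + 1 = 2 </answer>"

# The chain of thought sits in <think></think>; the summarized
# solution follows in <answer></answer>.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

is_well_formed = FORMAT_RE.match(sample) is not None
```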

---

**Composition of Training Data**

- The data used for training the model comes from two major sources: open-source and proprietary. To enhance the model's mathematical capabilities, we collected a large number of open-source English mathematics datasets. For our proprietary data, we collected Chinese K-12 math questions, and their solutions, accumulated during the operation of our business. They cover mathematics problems across the domestic K-12 stages (primary, middle, and high school) and include a rich variety of question types, such as single-choice, multiple-choice, true/false, fill-in-the-blank, calculation, proof, and mixtures of multiple types.

---

**More Stringent Data Filtering**

- To ensure the quality and diversity of the training data, we implemented a rigorous preprocessing procedure. For open-source data, we execute the following workflow in sequence: exact deduplication, fuzzy deduplication, semantic deduplication, and question-type selection. For our proprietary data, we additionally apply a cleaning stage, because the data originates from large-scale automated entry with manual correction and inherently contains significant noise. After filtering, we retained approximately 540,000 examples for actual training: 210,000 from open-source data and 330,000 from proprietary data.
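As an illustration, the exact-deduplication step can be as simple as hashing a normalized form of each question (a minimal sketch under our own normalization assumptions; the later fuzzy and semantic stages require heavier tooling):

```python
import hashlib

def normalize(question: str) -> str:
    # Collapse whitespace and lowercase so trivially different
    # copies of the same question hash identically.
    return " ".join(question.lower().split())

def exact_dedup(questions):
    # Keep the first occurrence of each normalized question.
    seen, kept = set(), []
    for q in questions:
        key = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(q)
    return kept

deduped = exact_dedup(["Solve x+1=2.", "solve  x+1=2.", "Solve x+2=3."])
```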

## Evaluation and Results

For each model, we used the official system prompt provided with that model and took the question from the test set as the user content. For R1, we did not use a system prompt, because our evaluation found that this yielded higher-quality results. The maximum response length was uniformly set to 32,768 tokens. We sampled k responses per question and report pass@1. Specifically, for our model we used a sampling temperature of 1.0 and a top-p of 0.7, while for other models we used the officially recommended sampling parameters. The value of k varies by test set: for MATH-500, AIME24, and AIME25 we followed DeepSeek's setting of k = 64, and for the other test sets k is set to approximately 2,000/N, where N is the number of samples in the set.
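Averaged pass@1 over k samples can be computed as follows (an illustrative sketch; the correctness flags are made up):

```python
def pass_at_1(correct_flags):
    # With k sampled responses to one question, pass@1 is the
    # fraction of samples that are correct.
    return sum(correct_flags) / len(correct_flags)

def mean_pass_at_1(per_question_flags):
    # Average the per-question pass@1 scores over the test set.
    scores = [pass_at_1(flags) for flags in per_question_flags]
    return sum(scores) / len(scores)

# Two questions, k = 4 samples each (1 = correct, 0 = incorrect).
score = mean_pass_at_1([[1, 1, 0, 1], [0, 1, 0, 0]])
```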

<div align="center">

| Benchmark          | DeepSeek-R1 | Qwen3-14B | QwQ-32B | DeepSeek-R1-Distill-Qwen-14B | Confucius3-Math |
|--------------------|-------------|-----------|---------|------------------------------|-----------------|
| CK12-MATH          | 92.74       | 94.04     | 93.60   | 82.86                        | 96.24           |
| GAOKAO-Bench(math) | 93.27       | 94.44     | 94.93   | 86.75                        | 98.46           |
| MathBench(K12)     | 89.99       | 96.51     | 96.57   | 88.40                        | 95.10           |
| CMATH              | 95.81       | 95.90     | 95.95   | 77.41                        | 96.13           |
| MATH-500           | 97.30       | 96.80     | 98.00   | 93.90                        | 98.80           |
| AIME 2024          | 79.80       | 79.30     | 79.50   | 69.70                        | 81.15           |
| AIME 2025          | 70.00       | 70.40     | 69.50   | 42.97                        | 69.95           |

</div>

## Limitations

There are some limitations that must be stated up front:

1. **Scenario Limitations**: Our optimization targets only the K-12 mathematics scenario, and its effectiveness has been verified only on math-related benchmarks. The model's performance in non-mathematical scenarios has not been tested, so we cannot guarantee its quality or effectiveness in other fields.

2. **Invalid Results**: The model may sometimes fall into circular reasoning. Since we use explicit identifiers to divide the thinking and summary parts, when the model enters this mode it may generate invalid results that cannot be parsed.

3. **Safety and Ethics**: This model has not undergone optimization or testing for alignment at the safety and ethical levels. Any output generated by the model does not represent the official positions, views, or attitudes of our company. When using this model, users should independently judge and evaluate the rationality and applicability of the output and comply with relevant laws, regulations, and social ethics.

## Quickstart

The environment requirements for running the model are identical to those of [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct). You can therefore easily use Transformers or vLLM to load the model for inference and to deploy your services.

The only thing you need to pay attention to is to use the predefined system-message and user-message templates below when requesting the model. Other templates may also work, but we have not tested them.

```python
SYSTEM_PROMPT_TEMPLATE = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>."""

USER_PROMPT_TEMPLATE = """{question}"""
```

Then you can create your `messages` as follows and use them to request model results. You just need to fill in your instruction via the `question` field.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "netease-youdao/Confucius3-Math"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "..."  # fill in your math question here

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT_TEMPLATE},
    {'role': 'user', 'content': USER_PROMPT_TEMPLATE.format(question=question)},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
# Strip the prompt tokens, keeping only the newly generated ones.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

After obtaining the model result, you can parse out the "thinking" and "summary" parts as follows.

```python
import re

def parse_result_nostep(result):
    # The thinking lives in <think></think>; everything after the
    # closing tag is treated as the summary.
    think_pattern = r"<think>(.*?)</think>(.*)"

    think_list = re.findall(think_pattern, result, re.DOTALL)

    assert len(think_list) == 1, \
        f"The parsing results do not meet the expectations.\n{result}"

    think = think_list[0][0].strip()
    summary = think_list[0][1].strip()
    return think, summary

thinking, summary = parse_result_nostep(response)
```

## Citation

If you find our work helpful, please feel free to cite it.

```
@misc{confucius3-math,
    author = {NetEase Youdao Team},
    title = {Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning},
    url = {https://huggingface.co/netease-youdao/Confucius3-Math},
    month = {June},
    year = {2025}
}
```