GX-XinGao committed
Commit f3ee1c0 · verified · 1 Parent(s): 4d227ce

Update README.md

Files changed (1):
  1. README.md +277 -39

README.md CHANGED
@@ -1,60 +1,298 @@
  ---
  library_name: transformers
- license: other
- base_model: Qwen/Qwen2.5-7B
  tags:
- - llama-factory
- - full
- - generated_from_trainer
- model-index:
- - name: seed-42
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # seed-42

- This model is a fine-tuned version of [/mnt/shared-storage-user/caimengzhang/model/base-model/Qwen2.5-7B](https://huggingface.co//mnt/shared-storage-user/caimengzhang/model/base-model/Qwen2.5-7B) on the 450k-fail-0-08-answer-am-qwen-correct dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 24
- - total_train_batch_size: 48
- - total_eval_batch_size: 192
- - optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.03
- - num_epochs: 3.0

- ### Training results

- ### Framework versions

- - Transformers 4.49.0
- - Pytorch 2.8.0+cu128
- - Datasets 3.2.0
- - Tokenizers 0.21.0

  ---
+ base_model: Qwen/Qwen2.5-7B-Base
  library_name: transformers
+ pipeline_tag: text-generation
+ datasets:
+ - OpenDataArena/ODA-Math-460k
  tags:
+ - qwen2.5
+ - sft
+ - opendataarena
+ - oda-math
+ - math
+ - reasoning
+ license: cc-by-nc-4.0
+ language:
+ - en
+ metrics:
+ - accuracy
  ---

+ # Qwen2.5-7B-ODA-Math-460k
+ <img src="performance.png" alt="Leaderboard Performance" width="1200" />

+ Qwen2.5-7B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on top of **Qwen2.5-7B-Base**, trained with **[ODA-Math-460k](https://huggingface.co/datasets/OpenDataArena/ODA-Math-460k)**.

+ ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the *[OpenDataArena](https://opendataarena.github.io)* leaderboard) and refined through **deduplication**, **benchmark decontamination**, **LLM-based filtering**, and **verifier-backed response distillation**.
+ It targets a "**learnable but challenging**" difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.

+ ---
+
+ ## 🧠 Model Summary
+
+ - **Base Model**: `Qwen/Qwen2.5-7B-Base`
+ - **Training Data**: `OpenDataArena/ODA-Math-460k`
+ - **Domain Coverage**: Mathematics (strictly filtered)
+ - **Scale (selected training set)**: ~**460K** problems (after the selection and verification pipeline)
+ - **Goal**: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.
+
+ ---
+
+ ## ⚙️ Training Data Curation Pipeline
+
+ ODA-Math-460k is constructed from an aggregated question pool that is then progressively filtered and selected.
+
+ ### 1️⃣ Data Collection
+
+ We prioritize source datasets based on their empirical impact on downstream model performance. Using the *OpenDataArena* leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the **Qwen** and **Llama** model families. These sources form the initial pool for ODA-Math.
+
+ ### 2️⃣ Deduplication & Decontamination
+
+ We first perform **exact deduplication** over all questions to remove identical items, then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
+
+ ### 3️⃣ Question Filtering (Quality & Suitability)
+
+ A multi-stage filtering pipeline refines domain specificity and usability:
+
+ - an LLM-based **domain classifier** removes out-of-domain items such as coding and general instruction tasks;
+ - an LLM-based **validity validator** removes ill-formed questions with missing premises or undefined notation;
+ - **problem-type filtering** (via the *Big Math* toolkit) excludes proof questions and guessing-prone formats such as multiple-choice and true/false.
+
+ What remains is predominantly **free-form** problems with objectively verifiable answers.
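Conceptually, this is a sequence of predicates applied in order. In the sketch below, `is_math`, `is_well_posed`, and `problem_type` are hypothetical stand-ins for the LLM-based domain classifier, the validity validator, and the Big Math problem-type tagger:

```python
def filter_questions(questions, is_math, is_well_posed, problem_type):
    # The three callables are assumed interfaces, not the released tooling.
    kept = []
    for q in questions:
        if not is_math(q):                   # drop coding / general-instruction items
            continue
        if not is_well_posed(q):             # drop missing premises, undefined notation
            continue
        if problem_type(q) != "free_form":   # drop proofs, multiple-choice, true/false
            continue
        kept.append(q)
    return kept
```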
+
+ ### 📊 Filtration Statistics
+
+ | Pipeline Stage | Count | Percentage |
+ |---|---:|---:|
+ | Raw Collection | 11.4M | 100% |
+ | Dedup & Decontamination | 4.3M | 37.7% |
+ | Question Filtering | 3.3M | 28.9% |
+ | Stage-1 Filtering | 815.3K | 7.2% |
+ | Stage-2 Filtering | 459.6K | 4.0% |
+
+ ---
+
+ ## 🎯 Data Selection
+
+ Given the large curated pool, ODA-Math-460k retains problems that are **hard for small models** but **solvable for stronger reasoning models**.
+
+ ### Stage-1: Lower-Bound Filtering
+
+ Stage-1 removes trivial problems using **Qwen3-8B** in *non-thinking* mode: for each problem we sample **k=4** responses, compute **Pass@4** by matching each predicted final answer to **y_gt**, and keep the problem **only if** **Pass@4(x) = 0** (i.e., none of the four attempts is correct).
+
+ ### Stage-2: Upper-Bound Filtering
+
+ Stage-2 removes unsolvable or ambiguous problems using **Qwen3-30B-A3B** in *thinking* mode: we generate **k=5** reasoning traces per problem, compute **Pass@5**, and keep the problem **only if** **Pass@5(x) > 0** (i.e., at least one attempt solves it).
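Taken together, the two stages act as a band-pass filter on pass rate. A minimal sketch, treating the two sampler models as injected callables (the actual model inference and final-answer matching are not shown):

```python
def pass_at_k(problem, y_gt, sample, k):
    # sample(problem) -> one predicted final answer per call
    return sum(sample(problem) == y_gt for _ in range(k))

def two_stage_select(problems, weak_sample, strong_sample):
    """Keep problems with Pass@4 = 0 under the weak model (Stage-1)
    and Pass@5 > 0 under the strong model (Stage-2)."""
    kept = []
    for x, y_gt in problems:
        if pass_at_k(x, y_gt, weak_sample, k=4) > 0:     # too easy: drop
            continue
        if pass_at_k(x, y_gt, strong_sample, k=5) == 0:  # unsolvable/ambiguous: drop
            continue
        kept.append((x, y_gt))
    return kept
```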
+
+ ---
+
+ ## ✅ Distillation & Verification
+
+ ### 🧪 Response Synthesis
+
+ We distill solutions using **AM-Thinking-v1** as the teacher, generating **k=5** candidate reasoning traces (step-by-step solution + final answer) for each selected problem.
+
+ ### 🔍 Response Verification
+
+ We verify generated responses with **Compass-Verifier-7B**, which takes (problem **x**, generated response **y_gen**, ground-truth answer **y_gt**) and outputs a binary correctness decision (**correct** / **incorrect**). We keep only the (problem, response) pairs judged **correct** and discard the rest, so the released dataset contains **verified solutions only**.
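The distill-then-verify loop reduces to the following sketch, where `teacher` and `verifier` are hypothetical callables standing in for AM-Thinking-v1 sampling and Compass-Verifier-7B judging:

```python
def distill_and_verify(problems, teacher, verifier, k=5):
    # teacher(x) -> one sampled reasoning trace ending in a final answer
    # verifier(x, y_gen, y_gt) -> True iff the response is judged correct
    dataset = []
    for x, y_gt in problems:
        for _ in range(k):
            y_gen = teacher(x)
            if verifier(x, y_gen, y_gt):
                # Only verified (problem, response) pairs are released.
                dataset.append({"problem": x, "response": y_gen})
    return dataset
```

Note that a single problem can contribute up to k verified responses under this scheme.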
+
+ ---
+
+ ## 📚 Training Data Source Composition
+
+ ODA-Math-460k is a mixture of multiple high-quality math datasets, which avoids domination by a single style or annotation protocol. Top contributors:
+
+ | Source | Count | Percentage |
+ |---|---:|---:|
+ | ScaleQuest-Math | 87,755 | 19.09% |
+ | NuminaMath-CoT | 75,971 | 16.53% |
+ | OpenMathInstruct-2 | 65,688 | 14.29% |
+ | MegaScience (math) | 54,904 | 11.94% |
+ | OpenMathReasoning | 49,463 | 10.76% |
+ | AM-Thinking-Distilled | 38,375 | 8.35% |
+ | MiroMind-M1-SFT-719K | 23,417 | 5.09% |
+ | SCP-116K | 16,066 | 3.50% |
+ | DeepMath-309K | 11,956 | 2.60% |
+ | math-gpt-4o-200k | 8,355 | 1.82% |
+ | OpenR1-Math-220k | 7,999 | 1.74% |
+ | MathFusionQA | 6,510 | 1.42% |
+
+ ---
+
+ ## 🔬 Content Characteristics
+
+ ### 📘 Subject Distribution
+
+ <img src="math_oda_subject_distribution_pie.png" alt="Subject Distribution" width="600" />
+
+ ODA-Math-460k maintains a **more balanced** subject composition than several peers:
+ - Algebra remains substantial (**~44.8%**),
+ - Geometry accounts for roughly **20–22%**,
+ - Calculus, Discrete Math & Probability, and Number Theory each contribute around **~11%**.
+
+ This mitigates subject bias and reduces performance drops on underrepresented topics.
+
+ ### 📉 Difficulty Distribution
+
+ In addition to model-based pass rates, we adopt LLM-as-Judge difficulty estimation on a **1–10 scale**, mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings).
+
+ | Level | Equivalent Competition Tier | Description |
+ | :--- | :--- | :--- |
+ | **1** | **Elementary / Middle School** | MOEMS, AMC 8 (Early Qs). Standard word problems. |
+ | **2** | **Junior High** | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
+ | **3** | **High School Beginner** | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
+ | **4** | **High School Intermediate** | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
+ | **5** | **Advanced High School** | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
+ | **6** | **Pre-Olympiad** | AIME (Hard), USAJMO. Introductory Olympiad level. |
+ | **7** | **Olympiad (Entry)** | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
+ | **8** | **Olympiad (Medium)** | IMO (Medium/Hard). High-level competition problems. |
+ | **9** | **Olympiad (Expert)** | IMO (Hard). Expert-level constructions/proofs. |
+ | **10** | **Historically Hard** | Outliers. Exceedingly tedious or difficult even for Olympians. |
+
+ <img src="math_oda_difficulty_distribution.png" alt="Difficulty Distribution" width="600" />
+
+ ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:
+
+ - Primary mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
+ - Secondary mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
+ - Tail: a steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.
+
+ ---
+
+ ## 📈 Performance
+
+ ODA-Math-460k is evaluated as an SFT corpus for **Qwen2.5-7B-Base**.
+
+ Results show consistent gains over the base checkpoint, with particularly strong improvements on **competition-style** benchmarks.
+
+ <div style="overflow-x: auto; font-family: sans-serif; margin-bottom: 20px;">
+ <table style="width: 100%; border-collapse: collapse; text-align: center; font-size: 14px; min-width: 900px; color: inherit;">
+ <caption style="padding: 10px; font-weight: bold;">Performance Comparison. Best scores in <b>bold</b>, second-best <u>underlined</u>.</caption>
+ <thead>
+ <tr style="border-top: 2px solid currentColor; border-bottom: 1px solid currentColor;">
+ <th style="text-align: left; padding: 8px;">Dataset</th>
+ <th>Size</th>
+ <th>GSM8K</th>
+ <th>Math500</th>
+ <th>Omni-Math</th>
+ <th>Olympiad</th>
+ <th>AIME'24</th>
+ <th>AIME'25</th>
+ <th>CMIMC'25</th>
+ <th>HMMT'25</th>
+ <th>BRUMO'25</th>
+ <th style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>AVG</b></th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr style="border-top: 1px solid currentColor; background-color: rgba(128, 128, 128, 0.08); font-weight: bold;">
+ <td colspan="12" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen2.5-7B-Base</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;">Qwen2.5-7B-Base</td>
+ <td>-</td><td>80.0</td><td>50.2</td><td>26.0</td><td>35.9</td><td>6.7</td><td>6.7</td><td>10.0</td><td>0.0</td><td>20.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">26.2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/GAIR/LIMO">LIMO</a></td>
+ <td>817</td><td>92.1</td><td>66.8</td><td>21.6</td><td>34.9</td><td>4.6</td><td>1.7</td><td>0.0</td><td>0.0</td><td>5.4</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">25.2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/nvidia/OpenMathInstruct-2">OpenMathInstruct-2</a></td>
+ <td>1M</td><td>91.6</td><td>65.9</td><td>22.5</td><td>30.7</td><td>6.7</td><td>5.0</td><td>5.0</td><td>0.0</td><td>13.6</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">26.8</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/MegaScience/MegaScience">MegaScience (math)</a></td>
+ <td>414k</td><td>90.1</td><td>77.8</td><td>28.7</td><td>44.5</td><td>16.7</td><td>15.0</td><td>8.1</td><td>0.0</td><td>26.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">34.2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT">Fast-Math-R1-SFT</a></td>
+ <td>8k</td><td>90.6</td><td>80.0</td><td>35.8</td><td>50.3</td><td>23.3</td><td>26.7</td><td>7.5</td><td>8.3</td><td>31.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">39.4</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/zwhe99/DeepMath-103K">DeepMath-103K</a></td>
+ <td>103k</td><td>92.1</td><td>92.0</td><td>45.4</td><td>60.2</td><td>34.2</td><td>31.7</td><td>10.0</td><td>11.7</td><td>15.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">43.6</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/qihoo360/Light-R1-SFTData">Light-R1-SFT</a></td>
+ <td>79k</td><td>92.0</td><td>88.0</td><td>43.3</td><td>60.2</td><td>38.3</td><td>26.7</td><td>22.5</td><td>13.3</td><td>38.3</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">47.0</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified">SYNTHETIC-2 (math)</a></td>
+ <td>50k</td><td>92.1</td><td>90.0</td><td>54.5</td><td>67.4</td><td>45.0</td><td>35.0</td><td>19.7</td><td>20.0</td><td>36.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">51.2</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K">MiroMind-M1-SFT</a></td>
+ <td>719k</td><td><u>93.9</u></td><td>91.6</td><td>48.1</td><td>66.3</td><td>55.0</td><td>30.0</td><td>27.5</td><td>18.3</td><td>50.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">53.4</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/alibaba-pai/OmniThought-0528">OmniThought-0528</a></td>
+ <td>365k</td><td>93.2</td><td>89.8</td><td>54.3</td><td>68.1</td><td>50.4</td><td>40.0</td><td>25.0</td><td>28.3</td><td>45.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.9</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M">OpenThoughts3</a></td>
+ <td>1.2M</td><td>91.7</td><td>93.8</td><td>44.8</td><td>68.8</td><td><u>60.0</u></td><td>45.0</td><td>27.5</td><td>31.7</td><td>50.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">57.0</td>
+ </tr>
+ <tr>
+ <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled">AM-Thinking (math)</a></td>
+ <td>558k</td><td>92.9</td><td><b>96.2</b></td><td><u>60.6</u></td><td><b>74.2</b></td><td><b>63.3</b></td><td><u>50.0</u></td><td><u>27.8</u></td><td><u>36.7</u></td><td><b>63.3</b></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>62.8</u></td>
+ </tr>
+ <tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold; border-bottom: 2px solid currentColor;">
+ <td style="text-align: left; padding: 8px;">ODA-Math</td>
+ <td>460k</td><td><b>94.3</b></td><td><u>95.4</u></td><td><b>62.6</b></td><td><u>70.9</u></td><td>56.7</td><td><b>56.7</b></td><td><b>35.0</b></td><td><b>45.0</b></td><td><u>60.0</u></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>64.1</b></td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+
+ ---
+
+ ## 🌐 About OpenDataArena
+
+ [OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.
+
+ **Key Features:**
+ - 🏆 **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
+ - 📊 **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, and difficulty.
+ - 🧰 **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.
+
+ If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.
+
+ ---
+
+ ## 🚀 Usage
+
+ Model repo: `OpenDataArena/Qwen2.5-7B-ODA-Math-460k`. Below is a minimal runnable example for loading the model and running inference:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Math-460k"
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)
+
+ # Build a chat-formatted prompt from a single user turn.
+ messages = [
+     {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
+ ]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=512,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ---
+
+ ## 📚 Citation
+
+ ```bibtex
+ @article{cai2025opendataarena,
+   title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
+   author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
+   journal={arXiv preprint arXiv:2512.14051},
+   year={2025}
+ }
+ ```