---
base_model: Qwen/Qwen3-8B-Base
library_name: transformers
pipeline_tag: text-generation
datasets:
- OpenDataArena/ODA-Math-460k
tags:
- qwen3
- sft
- opendataarena
- oda-math
- math
- reasoning
license: cc-by-nc-4.0
language:
- en
metrics:
- accuracy
---

# Qwen3-8B-ODA-Math-460k
<img src="performance.png" alt="Leaderboard Performance" width="1200" />

Qwen3-8B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on top of **Qwen3-8B-Base**, trained with **[ODA-Math-460k](https://huggingface.co/datasets/OpenDataArena/ODA-Math-460k)**.

ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the *[OpenDataArena](https://opendataarena.github.io)* leaderboard) and refined through **deduplication**, **benchmark decontamination**, **LLM-based filtering**, and **verifier-backed response distillation**.  
It targets a “**learnable but challenging**” difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.

---

## 🧠 Model Summary

- **Base Model**: `Qwen/Qwen3-8B-Base`
- **Training Data**: `OpenDataArena/ODA-Math-460k`
- **Domain Coverage**: Mathematics (strictly filtered)
- **Scale (selected training set)**: ~**460K** problems (after selection and verification pipeline)
- **Goal**: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.

---

## ⚙️ Training Data Curation Pipeline

ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected.

### 1️⃣ Data Collection

We prioritize source datasets based on their empirical impact on downstream model performance. Using the *OpenDataArena* leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the **Qwen** and **Llama** model families. These sources form the initial pool for ODA-Math.

### 2️⃣ Deduplication & Decontamination

We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
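
The two steps can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the normalization rule (lowercasing and whitespace collapsing), the word-level n-gram overlap test, the n-gram size `n=8`, and all function names are assumptions made for the example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(questions: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized question."""
    seen: set[str] = set()
    unique = []
    for q in questions:
        key = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Word-level n-grams of the normalized text (empty if shorter than n words)."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(questions, benchmark_questions, n=8):
    """Drop questions that share any word n-gram with a benchmark question."""
    bench = set()
    for b in benchmark_questions:
        bench |= ngrams(b, n)
    return [q for q in questions if not (ngrams(q, n) & bench)]
```

Note that under this assumed rule, questions shorter than `n` words produce no n-grams and are therefore never flagged; a real pipeline would need a separate check for short items.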

### 3️⃣ Question Filtering (Quality & Suitability)

A multi-stage filtering pipeline refines domain specificity and usability. It applies an LLM-based **domain classifier** to remove out-of-domain items such as coding or general instruction tasks, an LLM-based **validity validator** to remove ill-formed questions with missing premises or undefined notation, and **problem-type filtering** (via the *Big Math* toolkit) to exclude proof questions and guessing-prone formats such as multiple-choice and true/false. What remains is predominantly **free-form** problems with objectively verifiable answers.

### 📊 Filtration Statistics

| Pipeline Stage | Count | Percentage | 
|---|---:|---:|
| Raw Collection | 11.4M | 100% | 
| Dedup & Decontamination | 4.3M | 37.7% |
| Question Filtering | 3.3M | 28.9% |
| Stage-1 Filtering | 815.3K | 7.2% |
| Stage-2 Filtering | 459.6K | 4.0% |
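
Each percentage is the stage's count divided by the raw pool size. The sketch below reproduces the column from the rounded counts in the table and also derives how much of the previous stage survives each step; the variable and function names are illustrative only.

```python
# Counts copied from the table above (rounded as published).
STAGES = [
    ("Raw Collection", 11_400_000),
    ("Dedup & Decontamination", 4_300_000),
    ("Question Filtering", 3_300_000),
    ("Stage-1 Filtering", 815_300),
    ("Stage-2 Filtering", 459_600),
]


def cumulative_pct(stages):
    """Share of the raw pool remaining after each stage, in percent."""
    raw = stages[0][1]
    return {name: round(100 * count / raw, 1) for name, count in stages}


def step_pct(stages):
    """Share of the previous stage's problems that survive each stage, in percent."""
    return {
        name: round(100 * count / prev_count, 1)
        for (_, prev_count), (name, count) in zip(stages, stages[1:])
    }


if __name__ == "__main__":
    for name, pct in cumulative_pct(STAGES).items():
        print(f"{name}: {pct}% of raw pool")
    for name, pct in step_pct(STAGES).items():
        print(f"{name}: keeps {pct}% of previous stage")
```

Read this way, Stage-1 Filtering is the most selective step: it keeps roughly a quarter of the questions that reach it.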

---

## 🎯 Data Selection

Given the large curated pool, ODA-Math-460k retains problems that are **hard for small models** but **solvable for stronger reasoning models**.

### Stage-1: Lower-Bound Filtering

Stage-1 removes trivial problems using **Qwen3-8B** in *non-thinking* mode. For each problem **x** we sample **k=4** responses, compute **Pass@4** by matching each predicted final answer against the ground-truth answer **y_gt**, and keep the problem **only if** **Pass@4(x) = 0** (i.e., none of the four attempts is correct).

### Stage-2: Upper-Bound Filtering

Stage-2 removes unsolvable or ambiguous problems using **Qwen3-30B-A3B** in *thinking* mode: we generate **k=5** reasoning traces per problem, compute **Pass@5**, and keep the problem **only if** **Pass@5(x) > 0** (i.e., at least one attempt solves it).
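
The two-sided selection rule can be sketched as below. The model calls are left out: the function takes already-sampled final answers as input. The answer matcher (stripped string equality) and all names are assumptions for illustration; the actual matching procedure is not specified here.

```python
from typing import Callable, Sequence


def default_match(pred: str, y_gt: str) -> bool:
    """Assumed matcher: exact string equality after stripping whitespace."""
    return pred.strip() == y_gt.strip()


def pass_at_k(
    predictions: Sequence[str],
    y_gt: str,
    is_match: Callable[[str, str], bool] = default_match,
) -> float:
    """Return 1.0 if any of the k predictions matches y_gt, else 0.0."""
    return float(any(is_match(p, y_gt) for p in predictions))


def keep_problem(
    weak_preds: Sequence[str],    # k=4 answers from the weaker model (Stage-1)
    strong_preds: Sequence[str],  # k=5 answers from the stronger model (Stage-2)
    y_gt: str,
) -> bool:
    """Keep a problem only if the weak model always fails and the strong one succeeds once."""
    not_trivial = pass_at_k(weak_preds, y_gt) == 0   # Stage-1: Pass@4(x) = 0
    solvable = pass_at_k(strong_preds, y_gt) > 0     # Stage-2: Pass@5(x) > 0
    return not_trivial and solvable
```

For example, `keep_problem(["1", "2", "3", "4"], ["7", "5", "5", "5", "5"], "7")` keeps the problem, while a problem the weaker model solves even once is dropped.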

---

## ✅ Distillation & Verification

### 🧪 Response Synthesis

We distill solutions using **AM-Thinking-v1** as the teacher, generating **k=5** candidate reasoning traces (step-by-step solution + final answer) for each selected problem.

### 🔍 Response Verification

We verify generated responses with **Compass-Verifier-7B**, which takes (problem **x**, generated response **y_gen**, ground-truth answer **y_gt**) and outputs a binary correctness decision (**correct** / **incorrect**). We keep only the (problem, response) pairs judged **correct** and discard the rest, so the released dataset contains **verified solutions only**.
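
A minimal sketch of this filtering step is shown below. The verifier is passed in as a plain callable with the signature `(x, y_gen, y_gt) -> bool`; this interface, the `Candidate` class, and the `toy_verifier` stand-in are hypothetical and do not reflect how Compass-Verifier-7B is actually invoked.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    problem: str   # x
    response: str  # y_gen: reasoning trace ending in a final answer
    y_gt: str      # ground-truth answer


# Hypothetical verifier interface: True means "correct".
Verifier = Callable[[str, str, str], bool]


def filter_verified(candidates: Iterable[Candidate], verifier: Verifier) -> List[Candidate]:
    """Keep only the (problem, response) pairs the verifier judges correct."""
    return [c for c in candidates if verifier(c.problem, c.response, c.y_gt)]


def toy_verifier(x: str, y_gen: str, y_gt: str) -> bool:
    """Stand-in for a model-based verifier: checks whether the response ends with y_gt."""
    return y_gen.strip().endswith(y_gt.strip())


if __name__ == "__main__":
    pool = [
        Candidate("What is f(3) for f(x)=x^2+1?", "f(3) = 9 + 1 = 10", "10"),
        Candidate("What is f(3) for f(x)=x^2+1?", "f(3) = 6 + 1 = 7", "10"),
    ]
    kept = filter_verified(pool, toy_verifier)
    print(len(kept))
```

Because up to five candidate traces are generated per problem, a filter of this shape can retain several verified responses for the same problem or none at all.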

---

## 📚 Training Data Source Composition

ODA-Math-460k is a mixture of multiple high-quality math datasets to avoid domination by a single style/annotation protocol. Top contributors:

| Source | Count | Percentage |
|---|---:|---:|
| ScaleQuest-Math | 87,755 | 19.09% |
| NuminaMath-CoT | 75,971 | 16.53% |
| OpenMathInstruct-2 | 65,688 | 14.29% |
| MegaScience (math) | 54,904 | 11.94% |
| OpenMathReasoning | 49,463 | 10.76% |
| AM-Thinking-Distilled | 38,375 | 8.35% |
| MiroMind-M1-SFT-719K | 23,417 | 5.09% |
| SCP-116K | 16,066 | 3.50% |
| DeepMath-309K | 11,956 | 2.60% |
| math-gpt-4o-200k | 8,355 | 1.82% |
| OpenR1-Math-220k | 7,999 | 1.74% |
| MathFusionQA | 6,510 | 1.42% |

---

## 🔬 Content Characteristics

### 📘 Subject Distribution

<img src="math_oda_subject_distribution_pie.png" alt="Subject Distribution" width="600" />

ODA-Math-460k maintains a **more balanced** subject composition than several peers:
- Algebra remains substantial (**~44.8%**),
- Geometry accounts for roughly **20–22%**,
- Calculus, Discrete Math & Probability, and Number Theory each make up **~11%**.

This mitigates subject bias and reduces performance drops on underrepresented topics.

### 📉 Difficulty Distribution

In addition to model-based pass rates, we estimate difficulty with an LLM-as-Judge on a **1-10 scale**, mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings).

| Level | Equivalent Competition Tier | Description |
| :--- | :--- | :--- |
| **1** | **Elementary / Middle School** | MOEMS, AMC 8 (Early Qs). Standard word problems. |
| **2** | **Junior High** | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
| **3** | **High School Beginner** | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
| **4** | **High School Intermediate** | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
| **5** | **Advanced High School** | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
| **6** | **Pre-Olympiad** | AIME (Hard), USAJMO. Introductory Olympiad level. |
| **7** | **Olympiad (Entry)** | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
| **8** | **Olympiad (Medium)** | IMO (Medium/Hard). High-level competition problems. |
| **9** | **Olympiad (Expert)** | IMO (Hard). Expert-level constructions/proofs. |
| **10** | **Historically Hard** | Outliers. Exceedingly tedious or difficult even for Olympians. |

<img src="math_oda_difficulty_distribution.png" alt="Difficulty Distribution" width="600" />

ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:

- Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
- Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
- Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.

---

## 📈 Performance

ODA-Math-460k is evaluated as an SFT corpus for **Qwen3-8B-Base**.

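Fine-tuning on ODA-Math-460k improves every benchmark over the base checkpoint and raises the average from 32.0 to 68.8. The largest gains are on **competition-style** benchmarks, for example AIME'25 (10.8 to 63.3) and HMMT'25 (0.0 to 45.4).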

<div style="overflow-x: auto; font-family: sans-serif; margin-bottom: 20px;">
    <table style="width: 100%; border-collapse: collapse; text-align: center; font-size: 14px; min-width: 900px; color: inherit;">
        <caption style="padding: 10px; font-weight: bold;">Performance Comparison. Best scores in <b>bold</b>, second-best <u>underlined</u>.</caption>
        <thead>
            <tr style="border-top: 2px solid currentColor; border-bottom: 1px solid currentColor;">
                <th style="text-align: left; padding: 8px;">Dataset</th>
                <th>Size</th>
                <th>GSM8K</th>
                <th>Math500</th>
                <th>Omni-Math</th>
                <th>Olympiad</th>
                <th>AIME'24</th>
                <th>AIME'25</th>
                <th>CMIMC'25</th>
                <th>HMMT'25</th>
                <th>BRUMO'25</th>
                <th style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>AVG</b></th>
            </tr>
        </thead>
        <tbody>
            <tr style="border-top: 1px solid currentColor; background-color: rgba(128, 128, 128, 0.08); font-weight: bold;">
                <td colspan="12" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen3-8B-Base</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;">Qwen3-8B-Base</td>
                <td>-</td><td>92.0</td><td>79.6</td><td>30.6</td><td>47.2</td><td>6.7</td><td>10.8</td><td>4.7</td><td>0.0</td><td>16.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">32.0</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/GAIR/LIMO">LIMO</a></td>
                <td>817</td><td>83.9</td><td>69.0</td><td>21.8</td><td>31.3</td><td>12.5</td><td>8.8</td><td>2.2</td><td>1.7</td><td>13.8</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">27.2</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/MegaScience/MegaScience">MegaScience (math)</a></td>
                <td>414k</td><td>93.4</td><td>84.8</td><td>35.8</td><td>57.6</td><td>25.4</td><td>17.9</td><td>11.3</td><td>12.1</td><td>33.8</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">41.3</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT">Fast-Math-R1-SFT</a></td>
                <td>8k</td><td>92.8</td><td>86.6</td><td>39.6</td><td>61.0</td><td>28.8</td><td>25.8</td><td>14.1</td><td>13.3</td><td>34.2</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">44.0</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/qihoo360/Light-R1-SFTData">Light-R1-SFT</a></td>
                <td>79k</td><td>93.8</td><td>92.6</td><td>48.5</td><td>69.7</td><td>54.6</td><td>31.3</td><td>22.8</td><td>25.0</td><td>48.8</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.1</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified">SYNTHETIC-2 (math)</a></td>
                <td>50k</td><td>93.9</td><td>93.8</td><td>58.8</td><td>71.5</td><td>58.8</td><td>45.8</td><td>28.4</td><td>32.9</td><td>54.2</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">59.8</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K">MiroMind-M1-SFT</a></td>
                <td>719k</td><td><u>94.8</u></td><td><b>96.8</b></td><td>54.5</td><td><u>77.0</u></td><td>62.9</td><td>47.5</td><td>25.6</td><td>27.5</td><td>60.4</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">60.8</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/alibaba-pai/OmniThought-0528">OmniThought-0528</a></td>
                <td>365k</td><td>94.2</td><td>95.4</td><td>59.0</td><td>74.9</td><td><b>67.9</b></td><td>45.4</td><td>31.3</td><td>35.8</td><td>52.5</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">61.8</td>
            </tr>
            <tr>
                <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled">AM-Thinking (math)</a></td>
                <td>558k</td><td><b>95.2</b></td><td>95.6</td><td><u>64.5</u></td><td><b>77.5</b></td><td>65.8</td><td><u>54.6</u></td><td><u>36.3</u></td><td><u>41.3</u></td><td><u>62.5</u></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>65.9</u></td>
            </tr>
            <tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold; border-bottom: 2px solid currentColor;">
                <td style="text-align: left; padding: 8px;">ODA-Math</td>
                <td>460k</td><td>94.3</td><td><u>96.0</u></td><td><b>66.9</b></td><td>76.3</td><td><b>67.9</b></td><td><b>63.3</b></td><td><b>41.6</b></td><td><b>45.4</b></td><td><b>67.5</b></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>68.8</b></td>
            </tr>
        </tbody>
    </table>
</div>
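
The AVG column appears to be the unweighted mean of the nine benchmark scores, rounded to one decimal place. The short check below reproduces it for three rows copied from the table; the dictionary layout and function name are illustrative.

```python
# Scores in column order: GSM8K, Math500, Omni-Math, Olympiad,
# AIME'24, AIME'25, CMIMC'25, HMMT'25, BRUMO'25 (copied from the table).
ROWS = {
    "Qwen3-8B-Base": [92.0, 79.6, 30.6, 47.2, 6.7, 10.8, 4.7, 0.0, 16.7],
    "AM-Thinking (math)": [95.2, 95.6, 64.5, 77.5, 65.8, 54.6, 36.3, 41.3, 62.5],
    "ODA-Math": [94.3, 96.0, 66.9, 76.3, 67.9, 63.3, 41.6, 45.4, 67.5],
}


def average(scores):
    """Unweighted mean over the benchmarks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


if __name__ == "__main__":
    for name, scores in ROWS.items():
        print(f"{name}: {average(scores)}")
```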

---

## 🌐 About OpenDataArena

[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

**Key Features:**
- 🏆 **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
- 📊 **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, etc.
- 🧰 **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.

If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.

---

## 🚀 Usage

Model repo: `OpenDataArena/Qwen3-8B-ODA-Math-460k`. Below is a minimal runnable example for loading and inference:

```python
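from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen3-8B-ODA-Math-460k"

# Load the tokenizer and model; device_map="auto" places weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

# Format a single-turn conversation with the model's chat template.
messages = [
    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response with temperature and nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))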
```

---

## 📚 Citation

```bibtex
@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}
```