---
base_model: Qwen/Qwen3-8B-Base
library_name: transformers
pipeline_tag: text-generation
datasets:
- OpenDataArena/ODA-Math-460k
tags:
- qwen3
- sft
- opendataarena
- oda-math
- math
- reasoning
license: cc-by-nc-4.0
language:
- en
metrics:
- accuracy
---

# Qwen3-8B-ODA-Math-460k

<img src="performance.png" alt="Leaderboard Performance" width="1200" />

Qwen3-8B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on **Qwen3-8B-Base** and trained on **[ODA-Math-460k](https://huggingface.co/datasets/OpenDataArena/ODA-Math-460k)**.

ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the *[OpenDataArena](https://opendataarena.github.io)* leaderboard) and refined through **deduplication**, **benchmark decontamination**, **LLM-based filtering**, and **verifier-backed response distillation**.
It targets a "**learnable but challenging**" difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.

---

## 🧠 Model Summary

- **Base Model**: `Qwen/Qwen3-8B-Base`
- **Training Data**: `OpenDataArena/ODA-Math-460k`
- **Domain Coverage**: Mathematics (strictly filtered)
- **Scale (selected training set)**: ~**460K** problems (after the selection and verification pipeline)
- **Goal**: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.

---

## ⚙️ Training Data Curation Pipeline

ODA-Math-460k is constructed from an aggregated question pool that is then progressively filtered and selected.

### 1️⃣ Data Collection

We prioritize source datasets based on their empirical impact on downstream model performance. Using the *OpenDataArena* leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the **Qwen** and **Llama** model families. These sources form the initial pool for ODA-Math.

### 2️⃣ Deduplication & Decontamination

We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.

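
As a rough illustration of the two steps above, the sketch below deduplicates a small question pool and then drops items that overlap with a benchmark. The card does not say how overlap is measured, so the light text normalization, the word n-gram rule, and the names `exact_dedup`, `decontaminate`, and `n` are assumptions made for this example only, not the pipeline's actual implementation.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, light normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(questions: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized question."""
    seen: set[str] = set()
    unique = []
    for q in questions:
        key = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(questions: list[str], benchmark: list[str], n: int = 8) -> list[str]:
    """Drop questions sharing any word n-gram with a benchmark item (assumed rule).

    Questions shorter than n words have no n-grams and are always kept.
    """
    banned: set[tuple[str, ...]] = set()
    for item in benchmark:
        banned |= ngrams(item, n)
    return [q for q in questions if not (ngrams(q, n) & banned)]


pool = [
    "What is 2 + 2?",
    "what is  2 + 2?",  # duplicate after normalization
    "Find the sum of the first ten positive odd integers.",
    "Compute the area of a circle of radius 3.",
]
benchmark = ["Find the sum of the first ten positive odd integers."]

clean = decontaminate(exact_dedup(pool), benchmark, n=8)
print(clean)
```

Hashing the normalized text keeps memory use flat for a pool of millions of questions, since only fixed-size digests are stored rather than the questions themselves.
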
### 3️⃣ Question Filtering (Quality & Suitability)

A multi-stage filtering pipeline refines domain specificity and usability by applying:

- an LLM-based **domain classifier**, to remove out-of-domain items such as coding or general instruction tasks;
- an LLM-based **validity validator**, to remove ill-formed questions with missing premises or undefined notation;
- **problem-type filtering** (via the *Big Math* toolkit), to exclude proof questions and guessing-prone formats like multiple-choice and true/false.

What remains is predominantly **free-form** problems with objectively verifiable answers.

### Filtration Statistics

| Pipeline Stage | Count | Percentage |
|---|---:|---:|
| Raw Collection | 11.4M | 100% |
| Dedup & Decontamination | 4.3M | 37.7% |
| Question Filtering | 3.3M | 28.9% |
| Stage-1 Filtering | 815.3K | 7.2% |
| Stage-2 Filtering | 459.6K | 4.0% |

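
The percentages in the table are each stage's count relative to the raw collection. The short check below recomputes them from the (rounded) counts shown above and also prints the retention relative to the previous stage, which the table does not list. Variable names are illustrative.

```python
# Counts copied from the table above (rounded as published).
stages = [
    ("Raw Collection", 11_400_000),
    ("Dedup & Decontamination", 4_300_000),
    ("Question Filtering", 3_300_000),
    ("Stage-1 Filtering", 815_300),
    ("Stage-2 Filtering", 459_600),
]

raw = stages[0][1]
retention = {}  # stage name -> percentage of the raw collection
previous = raw
for name, count in stages:
    retention[name] = round(100 * count / raw, 1)
    step = round(100 * count / previous, 1)  # share kept from the previous stage
    print(f"{name:<26}{retention[name]:>6.1f}% of raw  {step:>6.1f}% of previous")
    previous = count
```

Because the published counts are rounded, the recomputed values are approximate, but they agree with the table to one decimal place.
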
---

## 🎯 Data Selection

Given the large curated pool, ODA-Math-460k retains problems that are **hard for small models** but **solvable for stronger reasoning models**.

### Stage-1: Lower-Bound Filtering

Stage-1 removes trivial problems using **Qwen3-8B** in *non-thinking* mode: for each problem we sample **k=4** responses, compute **Pass@4** by matching each predicted final answer to **y_gt**, and keep the problem **only if** **Pass@4(x) = 0** (i.e., none of the four attempts is correct).

### Stage-2: Upper-Bound Filtering

Stage-2 removes unsolvable or ambiguous problems using **Qwen3-30B-A3B** in *thinking* mode: we generate **k=5** reasoning traces per problem, compute **Pass@5**, and keep the problem **only if** **Pass@5(x) > 0** (i.e., at least one attempt solves it).

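
The two stages together define a difficulty band, which can be sketched as a pair of Pass@k checks. In this minimal example the model outputs are hard-coded lists of final answers, and answers are compared by a crude normalized string match; the real pipeline's answer matching is not described in this card, so `normalize_answer`, `pass_at_k`, and `in_difficulty_band` are hypothetical helpers.

```python
from typing import Sequence


def normalize_answer(answer: str) -> str:
    """Crude normalization (assumption): trim, lowercase, drop spaces and a final period."""
    return answer.strip().lower().replace(" ", "").rstrip(".")


def pass_at_k(predictions: Sequence[str], y_gt: str) -> int:
    """Return 1 if any of the k predicted final answers matches y_gt, else 0."""
    target = normalize_answer(y_gt)
    return int(any(normalize_answer(p) == target for p in predictions))


def in_difficulty_band(weak_preds: Sequence[str], strong_preds: Sequence[str], y_gt: str) -> bool:
    """Keep a problem if the weaker model never solves it and the stronger model solves it at least once."""
    return pass_at_k(weak_preds, y_gt) == 0 and pass_at_k(strong_preds, y_gt) > 0


# Hard-coded stand-ins for k=4 weak-model answers and k=5 strong-model answers.
problems = [
    {"id": "p1", "y_gt": "4", "weak": ["4", "4", "5", "4"], "strong": ["4"] * 5},
    {"id": "p2", "y_gt": "10", "weak": ["9", "11", "12", "8"], "strong": ["10", "9", "10", "10", "7"]},
    {"id": "p3", "y_gt": "42", "weak": ["1", "2", "3", "4"], "strong": ["5", "6", "7", "8", "9"]},
]

kept = [p["id"] for p in problems if in_difficulty_band(p["weak"], p["strong"], p["y_gt"])]
print(kept)
```

Here `p1` is dropped as too easy (the weaker model solves it), `p3` is dropped as unsolved by the stronger model, and only `p2` falls inside the band.
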
---

## ✅ Distillation & Verification

### 🧪 Response Synthesis

We distill solutions using **AM-Thinking-v1** as the teacher, generating **k=5** candidate reasoning traces (step-by-step solution + final answer) for each selected problem.

### Response Verification

We verify generated responses with **Compass-Verifier-7B**, which takes (problem **x**, generated response **y_gen**, ground-truth answer **y_gt**) and outputs a binary correctness decision (**correct** / **incorrect**). We keep only the (problem, response) pairs judged **correct** and discard the rest, so the released dataset contains **verified solutions only**.

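
The keep-only-correct rule can be sketched as a filter over (x, y_gen, y_gt) triples. The verifier is passed in as a plain callable; `toy_verifier` below is a placeholder that only checks whether the response ends with the ground-truth answer, and it is not how Compass-Verifier-7B is actually invoked. All names in the block are hypothetical.

```python
from typing import Callable, Iterable, NamedTuple


class Candidate(NamedTuple):
    problem: str   # x
    response: str  # y_gen
    y_gt: str      # ground-truth final answer


# A verifier maps (x, y_gen, y_gt) to a binary correctness decision.
Verifier = Callable[[str, str, str], bool]


def keep_verified(candidates: Iterable[Candidate], verifier: Verifier) -> list[Candidate]:
    """Return only the candidates the verifier judges correct."""
    return [c for c in candidates if verifier(c.problem, c.response, c.y_gt)]


def toy_verifier(x: str, y_gen: str, y_gt: str) -> bool:
    """Placeholder for a model-based verifier: accept if the response ends with y_gt."""
    return y_gen.strip().endswith(y_gt)


candidates = [
    Candidate("What is 3 * 4?", "3 * 4 = 12. Final answer: 12", "12"),
    Candidate("What is 3 * 4?", "3 + 4 = 7. Final answer: 7", "12"),
]

verified = keep_verified(candidates, toy_verifier)
print(len(verified), "of", len(candidates), "responses kept")
```

Keeping the verifier behind a callable makes it straightforward to swap the toy rule for a model-backed judge without changing the filtering logic.
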
---

## Training Data Source Composition

ODA-Math-460k is a mixture of multiple high-quality math datasets, which avoids domination by a single style or annotation protocol. Top contributors:

| Source | Count | Percentage |
|---|---:|---:|
| ScaleQuest-Math | 87,755 | 19.09% |
| NuminaMath-CoT | 75,971 | 16.53% |
| OpenMathInstruct-2 | 65,688 | 14.29% |
| MegaScience (math) | 54,904 | 11.94% |
| OpenMathReasoning | 49,463 | 10.76% |
| AM-Thinking-Distilled | 38,375 | 8.35% |
| MiroMind-M1-SFT-719K | 23,417 | 5.09% |
| SCP-116K | 16,066 | 3.50% |
| DeepMath-309K | 11,956 | 2.60% |
| math-gpt-4o-200k | 8,355 | 1.82% |
| OpenR1-Math-220k | 7,999 | 1.74% |
| MathFusionQA | 6,510 | 1.42% |

---

## 🔬 Content Characteristics

### Subject Distribution

<img src="math_oda_subject_distribution_pie.png" alt="Subject Distribution" width="600" />

ODA-Math-460k maintains a **more balanced** subject composition than several peers:
- Algebra remains substantial (**~44.8%**).
- Geometry accounts for roughly **20–22%**.
- Calculus, Discrete Math & Probability, and Number Theory are each around **11%**.

This mitigates subject bias and reduces performance drops on underrepresented topics.

### Difficulty Distribution

In addition to the model-based pass rate, we estimate difficulty with an LLM-as-Judge on a **1-10 scale**, mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings).

| Level | Equivalent Competition Tier | Description |
| :--- | :--- | :--- |
| **1** | **Elementary / Middle School** | MOEMS, AMC 8 (Early Qs). Standard word problems. |
| **2** | **Junior High** | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
| **3** | **High School Beginner** | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
| **4** | **High School Intermediate** | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
| **5** | **Advanced High School** | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
| **6** | **Pre-Olympiad** | AIME (Hard), USAJMO. Introductory Olympiad level. |
| **7** | **Olympiad (Entry)** | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
| **8** | **Olympiad (Medium)** | IMO (Medium/Hard). High-level competition problems. |
| **9** | **Olympiad (Expert)** | IMO (Hard). Expert-level constructions/proofs. |
| **10** | **Historically Hard** | Outliers. Exceedingly tedious or difficult even for Olympians. |

<img src="math_oda_difficulty_distribution.png" alt="Difficulty Distribution" width="600" />

ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:

- Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
- Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
- Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.

---

## Performance

ODA-Math-460k is evaluated as an SFT corpus for **Qwen3-8B-Base**.

Results show consistent gains over the base checkpoint (AVG 68.8 versus 32.0), with particularly strong improvements on **competition-style** benchmarks such as AIME'25 (10.8 to 63.3) and HMMT'25 (0.0 to 45.4).

<div style="overflow-x: auto; font-family: sans-serif; margin-bottom: 20px;">
<table style="width: 100%; border-collapse: collapse; text-align: center; font-size: 14px; min-width: 900px; color: inherit;">
<caption style="padding: 10px; font-weight: bold;">Performance Comparison. Best scores in <b>bold</b>, second-best <u>underlined</u>.</caption>
<thead>
<tr style="border-top: 2px solid currentColor; border-bottom: 1px solid currentColor;">
<th style="text-align: left; padding: 8px;">Dataset</th>
<th>Size</th>
<th>GSM8K</th>
<th>Math500</th>
<th>Omni-Math</th>
<th>Olympiad</th>
<th>AIME'24</th>
<th>AIME'25</th>
<th>CMIMC'25</th>
<th>HMMT'25</th>
<th>BRUMO'25</th>
<th style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>AVG</b></th>
</tr>
</thead>
<tbody>
<tr style="border-top: 1px solid currentColor; background-color: rgba(128, 128, 128, 0.08); font-weight: bold;">
<td colspan="12" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen3-8B-Base</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;">Qwen3-8B-Base</td>
<td>-</td><td>92.0</td><td>79.6</td><td>30.6</td><td>47.2</td><td>6.7</td><td>10.8</td><td>4.7</td><td>0.0</td><td>16.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">32.0</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/GAIR/LIMO">LIMO</a></td>
<td>817</td><td>83.9</td><td>69.0</td><td>21.8</td><td>31.3</td><td>12.5</td><td>8.8</td><td>2.2</td><td>1.7</td><td>13.8</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">27.2</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/MegaScience/MegaScience">MegaScience (math)</a></td>
<td>414k</td><td>93.4</td><td>84.8</td><td>35.8</td><td>57.6</td><td>25.4</td><td>17.9</td><td>11.3</td><td>12.1</td><td>33.8</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">41.3</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT">Fast-Math-R1-SFT</a></td>
<td>8k</td><td>92.8</td><td>86.6</td><td>39.6</td><td>61.0</td><td>28.8</td><td>25.8</td><td>14.1</td><td>13.3</td><td>34.2</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">44.0</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/qihoo360/Light-R1-SFTData">Light-R1-SFT</a></td>
<td>79k</td><td>93.8</td><td>92.6</td><td>48.5</td><td>69.7</td><td>54.6</td><td>31.3</td><td>22.8</td><td>25.0</td><td>48.8</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.1</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified">SYNTHETIC-2 (math)</a></td>
<td>50k</td><td>93.9</td><td>93.8</td><td>58.8</td><td>71.5</td><td>58.8</td><td>45.8</td><td>28.4</td><td>32.9</td><td>54.2</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">59.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K">MiroMind-M1-SFT</a></td>
<td>719k</td><td><u>94.8</u></td><td><b>96.8</b></td><td>54.5</td><td><u>77.0</u></td><td>62.9</td><td>47.5</td><td>25.6</td><td>27.5</td><td>60.4</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">60.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/alibaba-pai/OmniThought-0528">OmniThought-0528</a></td>
<td>365k</td><td>94.2</td><td>95.4</td><td>59.0</td><td>74.9</td><td><b>67.9</b></td><td>45.4</td><td>31.3</td><td>35.8</td><td>52.5</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">61.8</td>
</tr>
<tr>
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled">AM-Thinking (math)</a></td>
<td>558k</td><td><b>95.2</b></td><td>95.6</td><td><u>64.5</u></td><td><b>77.5</b></td><td>65.8</td><td><u>54.6</u></td><td><u>36.3</u></td><td><u>41.3</u></td><td><u>62.5</u></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>65.9</u></td>
</tr>
<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold; border-bottom: 2px solid currentColor;">
<td style="text-align: left; padding: 8px;">ODA-Math</td>
<td>460k</td><td>94.3</td><td><u>96.0</u></td><td><b>66.9</b></td><td>76.3</td><td><b>67.9</b></td><td><b>63.3</b></td><td><b>41.6</b></td><td><b>45.4</b></td><td><b>67.5</b></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>68.8</b></td>
</tr>
</tbody>
</table>
</div>

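
The card does not state how the AVG column is computed, but it appears to be the unweighted mean of the nine benchmark scores. The check below, using values copied from three rows of the table, is consistent with that reading; the dictionary names are illustrative.

```python
# Per-benchmark scores in table order: GSM8K, Math500, Omni-Math, Olympiad,
# AIME'24, AIME'25, CMIMC'25, HMMT'25, BRUMO'25.
scores = {
    "Qwen3-8B-Base": [92.0, 79.6, 30.6, 47.2, 6.7, 10.8, 4.7, 0.0, 16.7],
    "AM-Thinking (math)": [95.2, 95.6, 64.5, 77.5, 65.8, 54.6, 36.3, 41.3, 62.5],
    "ODA-Math": [94.3, 96.0, 66.9, 76.3, 67.9, 63.3, 41.6, 45.4, 67.5],
}

# AVG values as reported in the table.
reported = {"Qwen3-8B-Base": 32.0, "AM-Thinking (math)": 65.9, "ODA-Math": 68.8}

# Unweighted mean over the nine benchmarks, rounded to one decimal place.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

for name, avg in averages.items():
    status = "matches" if avg == reported[name] else "differs from"
    print(f"{name}: computed {avg}, which {status} the reported {reported[name]}")
```

Since the mean is unweighted, small benchmarks such as AIME'24 carry the same weight in AVG as large ones such as GSM8K.
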
---

## About OpenDataArena

[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

**Key Features:**
- **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
- **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, and more.
- 🧰 **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.

If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.

---

## Usage

Model repo: `OpenDataArena/Qwen3-8B-ODA-Math-460k`. Below is a minimal runnable example for loading the model and running inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen3-8B-ODA-Math-460k"

# Load the tokenizer and model. device_map="auto" requires the `accelerate` package
# and places the weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
]
# Render the conversation with the chat template and append the assistant prefix.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response; raise max_new_tokens for longer reasoning traces.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# outputs[0] contains the prompt tokens followed by the generated tokens.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Citation

```bibtex
@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}
```