Update README.md
README.md
CHANGED
|
@@ -1,60 +1,298 @@
|
|
| 1 |
---
|
|
|
|
| 2 |
library_name: transformers
|
| 3 |
-
|
| 4 |
-
|
|
|
|
| 5 |
tags:
|
| 6 |
-
-
|
| 7 |
-
-
|
| 8 |
-
-
|
| 9 |
-
|
| 10 |
-
-
|
| 11 |
-
|
| 12 |
---
|
| 13 |
|
| 14 |
-
|
| 15 |
-
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
|
|
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
|
|
|
|
|
|
|
|
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
-
- learning_rate: 5e-05
|
| 39 |
-
- train_batch_size: 2
|
| 40 |
-
- eval_batch_size: 8
|
| 41 |
-
- seed: 42
|
| 42 |
-
- distributed_type: multi-GPU
|
| 43 |
-
- num_devices: 24
|
| 44 |
-
- total_train_batch_size: 48
|
| 45 |
-
- total_eval_batch_size: 192
|
| 46 |
-
- optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
|
| 47 |
-
- lr_scheduler_type: cosine
|
| 48 |
-
- lr_scheduler_warmup_ratio: 0.03
|
| 49 |
-
- num_epochs: 3.0
|
| 50 |
|
| 51 |
-
|
| 52 |
|
|
|
|
| 53 |
|
| 54 |
|
| 55 |
-
### Framework versions
|
| 56 |
|
| 57 |
-
- Transformers 4.49.0
|
| 58 |
-
- Pytorch 2.8.0+cu128
|
| 59 |
-
- Datasets 3.2.0
|
| 60 |
-
- Tokenizers 0.21.0
|
|
|
|
| 1 |
---
|
| 2 |
+
base_model: Qwen/Qwen2.5-7B-Base
|
| 3 |
library_name: transformers
|
| 4 |
+
pipeline_tag: text-generation
|
| 5 |
+
datasets:
|
| 6 |
+
- OpenDataArena/ODA-Math-460k
|
| 7 |
tags:
|
| 8 |
+
- qwen2.5
|
| 9 |
+
- sft
|
| 10 |
+
- opendataarena
|
| 11 |
+
- oda-math
|
| 12 |
+
- math
|
| 13 |
+
- reasoning
|
| 14 |
+
license: cc-by-nc-4.0
|
| 15 |
+
language:
|
| 16 |
+
- en
|
| 17 |
+
metrics:
|
| 18 |
+
- accuracy
|
| 19 |
---
|
| 20 |
|
| 21 |
+
# Qwen2.5-7B-ODA-Math-460k
|
| 22 |
+
<img src="performance.png" alt="Leaderboard Performance" width="1200" />
|
| 23 |
|
| 24 |
+
Qwen2.5-7B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on top of **Qwen2.5-7B-Base**, trained with **[ODA-Math-460k](https://huggingface.co/datasets/OpenDataArena/ODA-Math-460k)**.
|
| 25 |
|
| 26 |
+
ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the *[OpenDataArena](https://opendataarena.github.io)* leaderboard) and refined through **deduplication**, **benchmark decontamination**, **LLM-based filtering**, and **verifier-backed response distillation**.
|
| 27 |
+
It targets a “**learnable but challenging**” difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.
|
| 28 |
|
| 29 |
+
---
|
| 30 |
+
|
| 31 |
+
## 🧠 Model Summary
|
| 32 |
+
|
| 33 |
+
- **Base Model**: `Qwen/Qwen2.5-7B-Base`
|
| 34 |
+
- **Training Data**: `OpenDataArena/ODA-Math-460k`
|
| 35 |
+
- **Domain Coverage**: Mathematics (strictly filtered)
|
| 36 |
+
- **Scale (selected training set)**: ~**460K** problems (after the selection and verification pipeline)
|
| 37 |
+
- **Goal**: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.
|
| 38 |
+
|
| 39 |
+
---
|
| 40 |
+
|
| 41 |
+
## ⚙️ Training Data Curation Pipeline
|
| 42 |
+
|
| 43 |
+
ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected.
|
| 44 |
+
|
| 45 |
+
### 1️⃣ Data Collection
|
| 46 |
+
|
| 47 |
+
We prioritize source datasets based on their empirical impact on downstream model performance. Using the *OpenDataArena* leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the **Qwen** and **Llama** model families. These sources form the initial pool for ODA-Math.
|
| 48 |
+
|
| 49 |
+
### 2️⃣ Deduplication & Decontamination
|
| 50 |
+
|
| 51 |
+
We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
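
The two steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the whitespace/case normalization and the word n-gram overlap test are assumptions, since the card does not state how overlaps with benchmarks are detected, and the function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(questions):
    """Keep the first occurrence of each normalized question, preserving order."""
    seen = set()
    unique = []
    for q in questions:
        key = normalize(q)
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


def ngrams(text: str, n: int):
    """Set of word n-grams; empty if the text has fewer than n words."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(questions, benchmark_questions, n: int = 8):
    """Drop questions that share at least one word n-gram with any benchmark item."""
    bench = set()
    for b in benchmark_questions:
        bench |= ngrams(b, n)
    return [q for q in questions if not (ngrams(q, n) & bench)]


pool = [
    "What is 2 + 2?",
    "what is  2 + 2?",
    "Find the area of a circle with radius 3.",
]
benchmark = ["Find the area of a circle with radius 3."]
clean = decontaminate(exact_dedup(pool), benchmark, n=4)
print(clean)
```

Note that questions shorter than `n` words produce no n-grams and are therefore never flagged in this sketch; a real pipeline would likely need a separate rule for short items.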
|
| 52 |
+
|
| 53 |
+
### 3️⃣ Question Filtering (Quality & Suitability)
|
| 54 |
+
|
| 55 |
+
A multi-stage filtering pipeline refines domain specificity and usability. It applies an LLM-based **domain classifier** (to remove out-of-domain items such as coding/general instruction tasks), an LLM-based **validity validator** (to remove ill-formed questions with missing premises or undefined notation), and **problem-type filtering** (via the *Big Math* toolkit) to exclude proof questions and guessing-prone formats such as multiple-choice and true/false. What remains is predominantly **free-form** problems with objectively verifiable answers.
|
| 56 |
+
|
| 57 |
+
### Filtration Statistics
|
| 58 |
+
|
| 59 |
+
| Pipeline Stage | Count | Percentage |
|
| 60 |
+
|---|---:|---:|
|
| 61 |
+
| Raw Collection | 11.4M | 100% |
|
| 62 |
+
| Dedup & Decontamination | 4.3M | 37.7% |
|
| 63 |
+
| Question Filtering | 3.3M | 28.9% |
|
| 64 |
+
| Stage-1 Filtering | 815.3K | 7.2% |
|
| 65 |
+
| Stage-2 Filtering | 459.6K | 4.0% |
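
As a quick check of the table, the percentages can be recomputed from the counts. The sketch below treats the rounded counts in the table (for example 11.4M) as exact, which is an assumption; the step-to-step retention figures are derived here and do not appear in the original table.

```python
# Counts copied from the table above (rounded values, treated as exact).
counts = {
    "Raw Collection": 11_400_000,
    "Dedup & Decontamination": 4_300_000,
    "Question Filtering": 3_300_000,
    "Stage-1 Filtering": 815_300,
    "Stage-2 Filtering": 459_600,
}

raw = counts["Raw Collection"]

# Share of the raw pool that survives up to each stage.
cumulative = {stage: round(100 * n / raw, 1) for stage, n in counts.items()}

# Share of the previous stage that survives each step.
stages = list(counts)
stepwise = {
    stages[i]: round(100 * counts[stages[i]] / counts[stages[i - 1]], 1)
    for i in range(1, len(stages))
}

for stage in stages:
    print(f"{stage}: {cumulative[stage]}% of raw, "
          f"{stepwise.get(stage, 100.0)}% of previous stage")
```

Read this way, the two selection stages are the most aggressive steps: Stage-1 keeps roughly a quarter of the filtered questions, and Stage-2 keeps a little over half of those.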
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
## 🎯 Data Selection
|
| 70 |
+
|
| 71 |
+
Given the large curated pool, ODA-Math-460k retains problems that are **hard for small models** but **solvable for stronger reasoning models**.
|
| 72 |
+
|
| 73 |
+
### Stage-1: Lower-Bound Filtering
|
| 74 |
+
|
| 75 |
+
Stage-1 removes trivial problems using **Qwen3-8B** in *non-thinking* mode: for each problem we sample **k=4** responses, compute **Pass@4** by matching each predicted final answer to **y_gt**, and keep the problem **only if** **Pass@4(x) = 0** (i.e., none of four attempts is correct).
|
| 76 |
+
|
| 77 |
+
### Stage-2: Upper-Bound Filtering
|
| 78 |
+
|
| 79 |
+
Stage-2 removes unsolvable or ambiguous problems using **Qwen3-30B-A3B** in *thinking* mode: we generate **k=5** reasoning traces per problem, compute **Pass@5**, and keep the problem **only if** **Pass@5(x) > 0** (i.e., at least one attempt solves it).
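
The two-stage rule can be summarized in a few lines. This is a minimal sketch under stated assumptions: sampled final answers are assumed to be precomputed for both models, answers are compared by exact string match (the card does not specify the matching procedure), and `pass_at_k` and `select_learnable_band` are hypothetical names.

```python
def pass_at_k(predictions, y_gt):
    """Return 1 if any sampled final answer matches the ground truth, else 0."""
    return int(any(p.strip() == y_gt.strip() for p in predictions))


def select_learnable_band(problems, weak_answers, strong_answers):
    """Keep problems the weak model never solves and the strong model solves at least once.

    problems: list of dicts with keys "id" and "y_gt"
    weak_answers: id -> k=4 final answers from the weaker model (non-thinking mode)
    strong_answers: id -> k=5 final answers from the stronger model (thinking mode)
    """
    kept = []
    for prob in problems:
        pid, y_gt = prob["id"], prob["y_gt"]
        if pass_at_k(weak_answers[pid], y_gt) != 0:
            continue  # Stage-1: solved by the weak model, so too easy
        if pass_at_k(strong_answers[pid], y_gt) == 0:
            continue  # Stage-2: never solved, so unsolvable or ambiguous
        kept.append(prob)
    return kept


problems = [
    {"id": "a", "y_gt": "4"},
    {"id": "b", "y_gt": "7"},
    {"id": "c", "y_gt": "9"},
]
weak = {"a": ["4", "5", "4", "3"], "b": ["1", "2", "3", "4"], "c": ["0", "0", "0", "0"]}
strong = {"a": ["4"] * 5, "b": ["7", "6", "7", "7", "5"], "c": ["1", "2", "3", "4", "5"]}

kept_ids = [p["id"] for p in select_learnable_band(problems, weak, strong)]
print(kept_ids)
```

In the toy data, problem `a` is dropped as too easy, `c` as never solved, and only `b` falls inside the target band.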
|
| 80 |
+
|
| 81 |
+
---
|
| 82 |
+
|
| 83 |
+
## ✅ Distillation & Verification
|
| 84 |
+
|
| 85 |
+
### 🧪 Response Synthesis
|
| 86 |
+
|
| 87 |
+
We distill solutions using **AM-Thinking-v1** as the teacher, generating **k=5** candidate reasoning traces (step-by-step solution + final answer) for each selected problem.
|
| 88 |
+
|
| 89 |
+
### Response Verification
|
| 90 |
+
|
| 91 |
+
We verify generated responses with **Compass-Verifier-7B**, which takes (problem **x**, generated response **y_gen**, ground-truth answer **y_gt**) and outputs a binary correctness decision (**correct** / **incorrect**). We keep only the (problem, response) pairs judged **correct** and discard the rest, so the released dataset contains **verified solutions only**.
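
The keep-only-verified step can be sketched as a filter over candidate responses. The verifier here is a toy stand-in passed as a plain callable; it is not the Compass-Verifier-7B interface, which the card does not describe, and `keep_verified` is a hypothetical name.

```python
def keep_verified(samples, verifier):
    """Return the (problem, response) pairs that the verifier judges correct.

    samples: iterable of (x, y_gt, candidates), where candidates are teacher responses
    verifier: callable (x, y_gen, y_gt) -> bool, standing in for the real verifier model
    """
    kept = []
    for x, y_gt, candidates in samples:
        for y_gen in candidates:
            if verifier(x, y_gen, y_gt):
                kept.append((x, y_gen))
    return kept


def toy_verifier(x, y_gen, y_gt):
    """Toy rule for illustration only: the response must end with the reference answer."""
    return y_gen.strip().endswith(y_gt)


samples = [
    ("What is 2+2?", "4", ["2+2=4, so the answer is 4", "The answer is 5"]),
    ("What is 3*3?", "9", ["I think it is 6", "It might be 8"]),
]
verified = keep_verified(samples, toy_verifier)
print(verified)
```

A problem whose candidates all fail verification, like the second one above, contributes no pairs to the result.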
|
| 92 |
+
|
| 93 |
+
---
|
| 94 |
+
|
| 95 |
+
## Training Data Source Composition
|
| 96 |
+
|
| 97 |
+
ODA-Math-460k is a mixture of multiple high-quality math datasets to avoid domination by a single style/annotation protocol. Top contributors:
|
| 98 |
+
|
| 99 |
+
| Source | Count | Percentage |
|
| 100 |
+
|---|---:|---:|
|
| 101 |
+
| ScaleQuest-Math | 87,755 | 19.09% |
|
| 102 |
+
| NuminaMath-CoT | 75,971 | 16.53% |
|
| 103 |
+
| OpenMathInstruct-2 | 65,688 | 14.29% |
|
| 104 |
+
| MegaScience (math) | 54,904 | 11.94% |
|
| 105 |
+
| OpenMathReasoning | 49,463 | 10.76% |
|
| 106 |
+
| AM-Thinking-Distilled | 38,375 | 8.35% |
|
| 107 |
+
| MiroMind-M1-SFT-719K | 23,417 | 5.09% |
|
| 108 |
+
| SCP-116K | 16,066 | 3.50% |
|
| 109 |
+
| DeepMath-309K | 11,956 | 2.60% |
|
| 110 |
+
| math-gpt-4o-200k | 8,355 | 1.82% |
|
| 111 |
+
| OpenR1-Math-220k | 7,999 | 1.74% |
|
| 112 |
+
| MathFusionQA | 6,510 | 1.42% |
|
| 113 |
+
|
| 114 |
+
---
|
| 115 |
+
|
| 116 |
+
## Content Characteristics
|
| 117 |
+
|
| 118 |
+
### Subject Distribution
|
| 119 |
|
| 120 |
+
<img src="math_oda_subject_distribution_pie.png" alt="Subject Distribution" width="600" />
|
| 121 |
|
| 122 |
+
ODA-Math-460k maintains a **more balanced** subject composition than several peers:
|
| 123 |
+
- Algebra remains substantial (**~44.8%**),
|
| 124 |
+
- Geometry at roughly **20–22%**,
|
| 125 |
+
- Calculus, Discrete Math & Probability, and Number Theory each at **~11%**.
|
| 126 |
|
| 127 |
+
This mitigates subject bias and reduces performance drops on underrepresented topics.
|
| 128 |
|
| 129 |
+
### Difficulty Distribution
|
| 130 |
|
| 131 |
+
In addition to model-based pass rates, we estimate difficulty with an LLM-as-Judge on a **1-10 scale**, mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings).
|
| 132 |
|
| 133 |
+
| Level | Equivalent Competition Tier | Description |
|
| 134 |
+
| :--- | :--- | :--- |
|
| 135 |
+
| **1** | **Elementary / Middle School** | MOEMS, AMC 8 (Early Qs). Standard word problems. |
|
| 136 |
+
| **2** | **Junior High** | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
|
| 137 |
+
| **3** | **High School Beginner** | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
|
| 138 |
+
| **4** | **High School Intermediate** | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
|
| 139 |
+
| **5** | **Advanced High School** | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
|
| 140 |
+
| **6** | **Pre-Olympiad** | AIME (Hard), USAJMO. Introductory Olympiad level. |
|
| 141 |
+
| **7** | **Olympiad (Entry)** | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
|
| 142 |
+
| **8** | **Olympiad (Medium)** | IMO (Medium/Hard). High-level competition problems. |
|
| 143 |
+
| **9** | **Olympiad (Expert)** | IMO (Hard). Expert-level constructions/proofs. |
|
| 144 |
+
| **10** | **Historically Hard** | Outliers. Exceedingly tedious or difficult even for Olympians. |
|
| 145 |
|
| 146 |
+
<img src="math_oda_difficulty_distribution.png" alt="Difficulty Distribution" width="600" />
|
| 147 |
|
| 148 |
+
ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:
|
| 149 |
|
| 150 |
+
- Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
|
| 151 |
+
- Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
|
| 152 |
+
- Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.
|
| 153 |
+
|
| 154 |
+
---
|
| 155 |
+
|
| 156 |
+
## Performance
|
| 157 |
+
|
| 158 |
+
ODA-Math-460k is evaluated as an SFT corpus for **Qwen2.5-7B-Base**.
|
| 159 |
+
|
| 160 |
+
Results show consistent gains over base checkpoints, with particularly strong improvements on **competition-style** benchmarks.
|
| 161 |
+
|
| 162 |
+
<div style="overflow-x: auto; font-family: sans-serif; margin-bottom: 20px;">
|
| 163 |
+
<table style="width: 100%; border-collapse: collapse; text-align: center; font-size: 14px; min-width: 900px; color: inherit;">
|
| 164 |
+
<caption style="padding: 10px; font-weight: bold;">Performance Comparison. Best scores in <b>bold</b>, second-best <u>underlined</u>.</caption>
|
| 165 |
+
<thead>
|
| 166 |
+
<tr style="border-top: 2px solid currentColor; border-bottom: 1px solid currentColor;">
|
| 167 |
+
<th style="text-align: left; padding: 8px;">Dataset</th>
|
| 168 |
+
<th>Size</th>
|
| 169 |
+
<th>GSM8K</th>
|
| 170 |
+
<th>Math500</th>
|
| 171 |
+
<th>Omni-Math</th>
|
| 172 |
+
<th>Olympiad</th>
|
| 173 |
+
<th>AIME'24</th>
|
| 174 |
+
<th>AIME'25</th>
|
| 175 |
+
<th>CMIMC'25</th>
|
| 176 |
+
<th>HMMT'25</th>
|
| 177 |
+
<th>BRUMO'25</th>
|
| 178 |
+
<th style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>AVG</b></th>
|
| 179 |
+
</tr>
|
| 180 |
+
</thead>
|
| 181 |
+
<tbody>
|
| 182 |
+
<tr style="border-top: 1px solid currentColor; background-color: rgba(128, 128, 128, 0.08); font-weight: bold;">
|
| 183 |
+
<td colspan="12" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen2.5-7B-Base</td>
|
| 184 |
+
</tr>
|
| 185 |
+
<tr>
|
| 186 |
+
<td style="text-align: left; padding: 8px;">Qwen2.5-7B-Base</td>
|
| 187 |
+
<td>-</td><td>80.0</td><td>50.2</td><td>26.0</td><td>35.9</td><td>6.7</td><td>6.7</td><td>10.0</td><td>0.0</td><td>20.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">26.2</td>
|
| 188 |
+
</tr>
|
| 189 |
+
<tr>
|
| 190 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/GAIR/LIMO">LIMO</a></td>
|
| 191 |
+
<td>817</td><td>92.1</td><td>66.8</td><td>21.6</td><td>34.9</td><td>4.6</td><td>1.7</td><td>0.0</td><td>0.0</td><td>5.4</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">25.2</td>
|
| 192 |
+
</tr>
|
| 193 |
+
<tr>
|
| 194 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/nvidia/OpenMathInstruct-2">OpenMathInstruct-2</a></td>
|
| 195 |
+
<td>1M</td><td>91.6</td><td>65.9</td><td>22.5</td><td>30.7</td><td>6.7</td><td>5.0</td><td>5.0</td><td>0.0</td><td>13.6</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">26.8</td>
|
| 196 |
+
</tr>
|
| 197 |
+
<tr>
|
| 198 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/MegaScience/MegaScience">MegaScience (math)</a></td>
|
| 199 |
+
<td>414k</td><td>90.1</td><td>77.8</td><td>28.7</td><td>44.5</td><td>16.7</td><td>15.0</td><td>8.1</td><td>0.0</td><td>26.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">34.2</td>
|
| 200 |
+
</tr>
|
| 201 |
+
<tr>
|
| 202 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT">Fast-Math-R1-SFT</a></td>
|
| 203 |
+
<td>8k</td><td>90.6</td><td>80.0</td><td>35.8</td><td>50.3</td><td>23.3</td><td>26.7</td><td>7.5</td><td>8.3</td><td>31.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">39.4</td>
|
| 204 |
+
</tr>
|
| 205 |
+
<tr>
|
| 206 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/zwhe99/DeepMath-103K">DeepMath-103K</a></td>
|
| 207 |
+
<td>103k</td><td>92.1</td><td>92.0</td><td>45.4</td><td>60.2</td><td>34.2</td><td>31.7</td><td>10.0</td><td>11.7</td><td>15.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">43.6</td>
|
| 208 |
+
</tr>
|
| 209 |
+
<tr>
|
| 210 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/qihoo360/Light-R1-SFTData">Light-R1-SFT</a></td>
|
| 211 |
+
<td>79k</td><td>92.0</td><td>88.0</td><td>43.3</td><td>60.2</td><td>38.3</td><td>26.7</td><td>22.5</td><td>13.3</td><td>38.3</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">47.0</td>
|
| 212 |
+
</tr>
|
| 213 |
+
<tr>
|
| 214 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified">SYNTHETIC-2 (math)</a></td>
|
| 215 |
+
<td>50k</td><td>92.1</td><td>90.0</td><td>54.5</td><td>67.4</td><td>45.0</td><td>35.0</td><td>19.7</td><td>20.0</td><td>36.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">51.2</td>
|
| 216 |
+
</tr>
|
| 217 |
+
<tr>
|
| 218 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K">MiroMind-M1-SFT</a></td>
|
| 219 |
+
<td>719k</td><td><u>93.9</u></td><td>91.6</td><td>48.1</td><td>66.3</td><td>55.0</td><td>30.0</td><td>27.5</td><td>18.3</td><td>50.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">53.4</td>
|
| 220 |
+
</tr>
|
| 221 |
+
<tr>
|
| 222 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/alibaba-pai/OmniThought-0528">OmniThought-0528</a></td>
|
| 223 |
+
<td>365k</td><td>93.2</td><td>89.8</td><td>54.3</td><td>68.1</td><td>50.4</td><td>40.0</td><td>25.0</td><td>28.3</td><td>45.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.9</td>
|
| 224 |
+
</tr>
|
| 225 |
+
<tr>
|
| 226 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M">OpenThoughts3</a></td>
|
| 227 |
+
<td>1.2M</td><td>91.7</td><td>93.8</td><td>44.8</td><td>68.8</td><td><u>60.0</u></td><td>45.0</td><td>27.5</td><td>31.7</td><td>50.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">57.0</td>
|
| 228 |
+
</tr>
|
| 229 |
+
<tr>
|
| 230 |
+
<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled">AM-Thinking (math)</a></td>
|
| 231 |
+
<td>558k</td><td>92.9</td><td><b>96.2</b></td><td><u>60.6</u></td><td><b>74.2</b></td><td><b>63.3</b></td><td><u>50.0</u></td><td><u>27.8</u></td><td><u>36.7</u></td><td><b>63.3</b></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>62.8</u></td>
|
| 232 |
+
</tr>
|
| 233 |
+
<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold; border-bottom: 2px solid currentColor;">
|
| 234 |
+
<td style="text-align: left; padding: 8px;">ODA-Math</td>
|
| 235 |
+
<td>460k</td><td><b>94.3</b></td><td><u>95.4</u></td><td><b>62.6</b></td><td><u>70.9</u></td><td>56.7</td><td><b>56.7</b></td><td><b>35.0</b></td><td><b>45.0</b></td><td><u>60.0</u></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>64.1</b></td>
|
| 236 |
+
</tr>
|
| 237 |
+
</tbody>
|
| 238 |
+
</table>
|
| 239 |
+
</div>
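
The AVG column appears to be the unweighted mean of the nine benchmark scores in each row. The sketch below checks this reading on three rows copied from the table; the averaging rule is an inference from the numbers, not something the card states.

```python
# Scores in column order: GSM8K, Math500, Omni-Math, Olympiad,
# AIME'24, AIME'25, CMIMC'25, HMMT'25, BRUMO'25.
rows = {
    "Qwen2.5-7B-Base": [80.0, 50.2, 26.0, 35.9, 6.7, 6.7, 10.0, 0.0, 20.0],
    "AM-Thinking (math)": [92.9, 96.2, 60.6, 74.2, 63.3, 50.0, 27.8, 36.7, 63.3],
    "ODA-Math": [94.3, 95.4, 62.6, 70.9, 56.7, 56.7, 35.0, 45.0, 60.0],
}

# Unweighted mean per row, rounded to one decimal as in the table.
averages = {name: round(sum(scores) / len(scores), 1) for name, scores in rows.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```

Because the mean is unweighted, the small competition sets count as much as GSM8K or Math500, which is worth keeping in mind when comparing AVG values.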
|
| 240 |
+
|
| 241 |
+
---
|
| 242 |
+
|
| 243 |
+
## About OpenDataArena
|
| 244 |
+
|
| 245 |
+
[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.
|
| 246 |
+
|
| 247 |
+
**Key Features:**
|
| 248 |
+
- **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
|
| 249 |
+
- **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, and more.
|
| 250 |
+
- **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool)
|
| 251 |
+
offers an open-source pipeline for dataset curation and scoring.
|
| 252 |
+
|
| 253 |
+
If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.
|
| 254 |
+
|
| 255 |
+
---
|
| 256 |
+
|
| 257 |
+
## Usage
|
| 258 |
+
|
| 259 |
+
Model repo: `OpenDataArena/Qwen2.5-7B-ODA-Math-460k`. Below is a minimal runnable example for loading and inference:
|
| 260 |
+
|
| 261 |
+
```python
|
| 262 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 263 |
+
|
| 264 |
+
MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Math-460k"
|
| 265 |
+
|
| 266 |
+
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
|
| 267 |
+
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)
|
| 268 |
+
|
| 269 |
+
messages = [
|
| 270 |
+
    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
|
| 271 |
+
]
|
| 272 |
+
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
| 273 |
+
inputs = tokenizer([text], return_tensors="pt").to(model.device)
|
| 274 |
+
|
| 275 |
+
outputs = model.generate(
|
| 276 |
+
    **inputs,
|
| 277 |
+
    max_new_tokens=512,
|
| 278 |
+
    do_sample=True,
|
| 279 |
+
    temperature=0.7,
|
| 280 |
+
    top_p=0.9,
|
| 281 |
+
)
|
| 282 |
+
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
| 283 |
+
```
|
| 284 |
+
|
| 285 |
+
---
|
| 286 |
|
| 287 |
+
## Citation
|
| 288 |
|
| 289 |
+
```bibtex
|
| 290 |
+
@article{cai2025opendataarena,
|
| 291 |
+
title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
|
| 292 |
+
author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
|
| 293 |
+
journal={arXiv preprint arXiv:2512.14051},
|
| 294 |
+
year={2025}
|
| 295 |
+
}
|
| 296 |
+
```
|
| 297 |
|
|
|
|
| 298 |
|