Update README.md
README.md
CHANGED
|
@@ -239,5 +239,5 @@ We include a brief word on methodology here - and in particular, how we think ab
|
|
| 239 |
Benchmark datasets
|
| 240 |
We evaluate the model on three of the most popular reasoning benchmarks, on which the strongest reasoning models compete. Specifically:
|
| 241 |
+ Math-500: This benchmark consists of 500 challenging math problems designed to test the model's ability to perform complex mathematical reasoning and problem-solving.
|
| 242 |
-
+ AIME 2024/AIME 2025: The American Invitational Mathematics Examination (AIME) is a highly regarded math competition that features a series of difficult problems aimed at assessing advanced mathematical skills and logical reasoning. We evaluate the models on the problems from both
|
| 243 |
+ GPQA Diamond: The Graduate-Level Google-Proof Q&A (GPQA) Diamond benchmark consists of 198 multiple-choice questions in biology, physics, and chemistry, written by domain experts to test graduate-level scientific reasoning.
|
|
|
|
| 239 |
Benchmark datasets
|
| 240 |
We evaluate the model on three of the most popular reasoning benchmarks, on which the strongest reasoning models compete. Specifically:
|
| 241 |
+ Math-500: This benchmark consists of 500 challenging math problems designed to test the model's ability to perform complex mathematical reasoning and problem-solving.
|
| 242 |
+
+ AIME 2024/AIME 2025: The American Invitational Mathematics Examination (AIME) is a highly regarded math competition that features a series of difficult problems aimed at assessing advanced mathematical skills and logical reasoning. We evaluate the models on the problems from both the 2024 and 2025 examinations.
|
| 243 |
+ GPQA Diamond: The Graduate-Level Google-Proof Q&A (GPQA) Diamond benchmark consists of 198 multiple-choice questions in biology, physics, and chemistry, written by domain experts to test graduate-level scientific reasoning.
|