Need guidance in reproducing AIME25 score

#21
by se-ok - opened

The AIME25 score is reported as 84.00, and I have been trying to reproduce that result.

I am using the recommended Docker image on 8×H100 GPUs, with exactly the recommended vLLM options.

A request example is:

{'model': '', 'temperature': 0.8, 'max_tokens': 120000, 'top_p': 0.95, 'messages': [{'role': 'user', 'content': "Solve the following math problem efficiently and clearly.  The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{ANSWER}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."}], 'skip_special_tokens': False, 'chat_template_kwargs': {'default_system_prompt': False}}

where the prompt follows a Llama3-like style. Another is:

{'model': '', 'temperature': 0.8, 'max_tokens': 120000, 'top_p': 0.95, 'messages': [{'role': 'user', 'content': 'Solve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\n\nRemember to put your answer inside \\boxed{}.'}], 'skip_special_tokens': False, 'chat_template_kwargs': {'default_system_prompt': False}}

where the prompt is from ArtificialAnalysis.ai.

In both cases I tried setting default_system_prompt on and off, giving 4 trials in total; in each trial, 8 answers were generated per question.
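For concreteness, the ArtificialAnalysis-style requests above can be generated with a small helper like the following. This is a sketch of my setup, not an official harness; the function names are mine, and the hyperparameters simply mirror the example request shown above.

```python
def aa_prompt(question: str) -> str:
    """Wrap a question in the ArtificialAnalysis-style prompt quoted above."""
    return (
        "Solve the following math problem step by step. "
        "Put your answer inside \\boxed{}.\n\n"
        + question
        + "\n\nRemember to put your answer inside \\boxed{}."
    )

def build_request(question: str, default_system_prompt: bool) -> dict:
    """One chat-completions payload per (question, system-prompt flag) pair,
    matching the sampling parameters used in my trials."""
    return {
        "model": "",
        "temperature": 0.8,
        "max_tokens": 120000,
        "top_p": 0.95,
        "messages": [{"role": "user", "content": aa_prompt(question)}],
        "skip_special_tokens": False,
        "chat_template_kwargs": {"default_system_prompt": default_system_prompt},
    }

req = build_request(
    "Find the sum of all integer bases $b>9$ for which $17_{b}$ "
    "is a divisor of $97_{b}$.", False)
print(req["chat_template_kwargs"])  # {'default_system_prompt': False}
```

Each of the 4 trials is then just this payload sent 8 times per question to the vLLM OpenAI-compatible endpoint.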

The pass@1 scores are between 72 and 74, significantly lower than reported, so I wonder what else I should adjust to get the model's full capability.

The predicted answers are extracted from \boxed{}, and Hugging Face's math_verify was used to decide correctness.
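To make the extraction step concrete, here is a simplified stand-in for the parsing I used (the actual equivalence checking was done by math_verify; this sketch only shows how the last \boxed{...} span, including nested braces, is pulled out of a completion):

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in a completion,
    handling nested braces (e.g. \\boxed{\\dfrac{559}{4}}).
    Returns None if no \\boxed{ is present."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out)

print(extract_boxed("the final answer is $\\boxed{70}$."))  # 70
print(extract_boxed("so $\\boxed{\\dfrac{559}{4}}$"))       # \dfrac{559}{4}
```

The extracted string is then compared against the ground truth with math_verify rather than plain string equality, so forms like n=104 still count as correct.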

Here is the full truth/pred/correctness table from default_system_prompt=False, prompt from artificial analysis.

truth pred correct
70 70 True
70 70 True
70 70 True
70 70 True
70 70 True
70 70 True
70 70 True
70 70 True
588 588 True
588 588 True
588 588 True
588 588 True
588 588 True
588 588 True
588 588 True
588 588 True
16 16 True
16 16 True
16 16 True
16 16 True
16 16 True
16 16 True
16 16 True
16 16 True
117 117 True
117 117 True
117 117 True
117 117 True
117 117 True
117 117 True
117 117 True
117 117 True
279 279 True
279 279 True
279 279 True
279 279 True
279 279 True
279 279 True
279 279 True
279 279 True
504 504 True
504 504 True
504 504 True
504 504 True
504 504 True
504 504 True
504 504 True
504 504 True
821 271 False
821 271 False
821 821 True
821 821 True
821 821 True
821 821 True
821 701 False
821 821 True
77 77 True
77 143 False
77 77 True
77 77 True
77 77 True
77 77 True
77 77 True
77 77 True
62 62 True
62 119 False
62 87 False
62 62 True
62 62 True
62 60 False
62 62 True
62 73 False
81 81 True
81 973520 False
81 81 True
81 62 False
81 81 True
81 95 False
81 57 False
81 81 True
259 259 True
259 259 True
259 4152 False
259 22 False
259 259 True
259 259 True
259 259 True
259 42 False
510 510 True
510 510 True
510 761308 False
510 510 True
510 510 True
510 510 True
510 510 True
510 303 False
204 \dfrac{559}{4} False
204 128 False
204 \dfrac{593}{6} False
204 \displaystyle \frac{787}{3} False
204 \dfrac{115}{3} False
204 529 False
204 \displaystyle \frac{187}{3}+300\cdot\frac{5}{12}= \frac{561}{4} False
204 \displaystyle \frac{399}{6} False
60 30 False
60 78 False
60 106 False
60 62 False
60 429 False
60 194 False
60 235 False
60 3521 False
735 273 False
735 197 False
735 461 False
735 729 False
735 147 False
735 999 False
735 499 False
735 479 False
468 468 True
468 468 True
468 468 True
468 468 True
468 468 True
468 468 True
468 468 True
468 468 True
49 49 True
49 49 True
49 49 True
49 49 True
49 49 True
49 49 True
49 49 True
49 49 True
82 82 True
82 82 True
82 100 False
82 82 True
82 82 True
82 82 True
82 82 True
82 82 True
106 106 True
106 106 True
106 106 True
106 106 True
106 106 True
106 106 True
106 106 True
106 106 True
336 378 False
336 336 True
336 336 True
336 168 False
336 768 False
336 552^{\circ} False
336 336 True
336 672 False
293 293 True
293 293 True
293 293 True
293 293 True
293 293 True
293 293 True
293 293 True
293 293 True
237 237 True
237 237 True
237 237 True
237 237 True
237 237 True
237 237 True
237 237 True
237 237 True
610 610 True
610 610 True
610 610 True
610 200 False
610 805 False
610 610 True
610 610 True
610 610 True
149 149 True
149 149 True
149 149 True
149 149 True
149 143 False
149 149 True
149 149 True
149 149 True
907 907 True
907 907 True
907 907 True
907 907 True
907 907 True
907 2907 False
907 907 True
907 907 True
113 113 True
113 113 True
113 113 True
113 113 True
113 113 True
113 113 True
113 113 True
113 113 True
19 19 True
19 19 True
19 19 True
19 19 True
19 19 True
19 19 True
19 19 True
19 19 True
248 736 False
248 933 False
248 162 False
248 177 False
248 3 False
248 501 False
248 627 False
248 6 False
104 104 True
104 104 True
104 104 True
104 n=104 True
104 104 True
104 104 True
104 n=104 True
104 104 True
240 \frac{182579}{1000} False
240 240 True
240 4146 False
240 3162 False
240 472 False
240 8+32+200=240 True
240 564 False
240 264 False
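With 8 samples per question, pass@1 here is the per-question accuracy averaged over questions (which, with equal sample counts, equals the overall fraction of correct samples). A minimal sketch over (question_id, correct) rows like the table above:

```python
from collections import defaultdict

def pass_at_1(rows):
    """rows: iterable of (question_id, correct) pairs.
    Returns pass@1 as a percentage: per-question accuracy
    averaged over questions."""
    per_q = defaultdict(list)
    for qid, ok in rows:
        per_q[qid].append(ok)
    accs = [sum(v) / len(v) for v in per_q.values()]
    return 100.0 * sum(accs) / len(accs)

# Toy example: q0 solved 8/8, q1 solved 4/8 -> pass@1 = 75.0
rows = [(0, True)] * 8 + [(1, True)] * 4 + [(1, False)] * 4
print(pass_at_1(rows))  # 75.0
```

Running this over the table above is how the 72–74 figures were obtained.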

If it helps, in my trials the maximum generation length was under 40k tokens.

hi @se-ok , thanks for the report. We're preparing guidance for you, and yes, the more details you can share about what you've tried, the more helpful it is for us.

@se-ok this is the summary:

  1. make sure you use the latest config we provide
  2. make sure you set reasoning_effort=high
  3. the provided docker image is configured for efficient serving; update these environment variables for the benchmark:
SOLAR_REASONING_BUDGET_HIGH_MAX=131072
SOLAR_REASONING_BUDGET_HIGH_RATIO=100

e.g.

docker run --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -e SOLAR_REASONING_BUDGET_HIGH_MAX=131072 \
    -e SOLAR_REASONING_BUDGET_HIGH_RATIO=100 \
    upstage/vllm-solar-open:latest \
    upstage/Solar-Open-100B \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser solar_open \
    --reasoning-parser solar_open \
    --logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor \
    --logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor \
    --tensor-parallel-size 8

or, on vLLM,

SOLAR_REASONING_BUDGET_HIGH_MAX=131072 SOLAR_REASONING_BUDGET_HIGH_RATIO=100 vllm serve upstage/Solar-Open-100B \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser solar_open \
    --reasoning-parser solar_open \
    --logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor \
    --logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor \
    --tensor-parallel-size 8

@keunwooupstage With those environment variables I have successfully reproduced the reported AIME25 score: Pass@1 85.00 over 8 runs.
From the provided settings I now understand that the custom logits processor was cutting off the thinking part at 32k tokens by default, which was an interesting tweak.

Thank you for helping me out, and congratulations on your achievement!

upstage org

Very happy that you could reproduce the results. Thank you for your work.

upstage org

@keunwooupstage should we update it to 85?

closing this issue now.
(and we keep the original score in the benchmark table)

keunwooupstage changed discussion status to closed
