Need guidance in reproducing AIME25 score

#21
by se-ok - opened

The AIME25 score is reported as 84.00, and I have been trying to reproduce that result.

I am using the recommended Docker image on 8×H100 GPUs, with exactly the recommended vLLM options.

A request example is:

{'model': '', 'temperature': 0.8, 'max_tokens': 120000, 'top_p': 0.95, 'messages': [{'role': 'user', 'content': "Solve the following math problem efficiently and clearly.  The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{ANSWER}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."}], 'skip_special_tokens': False, 'chat_template_kwargs': {'default_system_prompt': False}}

where the prompt follows a Llama3-like style. Another is:

{'model': '', 'temperature': 0.8, 'max_tokens': 120000, 'top_p': 0.95, 'messages': [{'role': 'user', 'content': 'Solve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\n\nRemember to put your answer inside \\boxed{}.'}], 'skip_special_tokens': False, 'chat_template_kwargs': {'default_system_prompt': False}}

where the prompt is from ArtificialAnalysis.ai.

In both cases I tried setting default_system_prompt on and off, giving 4 trials in total; in each trial, 8 answers were generated per question.
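For concreteness, the ArtificialAnalysis-style requests above can be generated with a small helper like the following. This is a sketch of my setup, not an official harness; the function names are mine, and the hyperparameters simply mirror the example request shown above.

```python
def aa_prompt(question: str) -> str:
    """Wrap a question in the ArtificialAnalysis-style prompt quoted above."""
    return (
        "Solve the following math problem step by step. "
        "Put your answer inside \\boxed{}.\n\n"
        + question
        + "\n\nRemember to put your answer inside \\boxed{}."
    )

def build_request(question: str, default_system_prompt: bool) -> dict:
    """One chat-completions payload per (question, system-prompt flag) pair,
    matching the sampling parameters used in my trials."""
    return {
        "model": "",
        "temperature": 0.8,
        "max_tokens": 120000,
        "top_p": 0.95,
        "messages": [{"role": "user", "content": aa_prompt(question)}],
        "skip_special_tokens": False,
        "chat_template_kwargs": {"default_system_prompt": default_system_prompt},
    }

req = build_request(
    "Find the sum of all integer bases $b>9$ for which $17_{b}$ "
    "is a divisor of $97_{b}$.", False)
print(req["chat_template_kwargs"])  # {'default_system_prompt': False}
```

Each of the 4 trials is then just this payload sent 8 times per question to the vLLM OpenAI-compatible endpoint.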

The pass@1 scores are between 72 and 74, significantly lower than reported, so I wonder what else I should adjust to get the model's full capability.

The predicted answers are extracted from \boxed{}, and Hugging Face's math_verify was used to decide correctness.
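To make the extraction step concrete, here is a simplified stand-in for the parsing I used (the actual equivalence checking was done by math_verify; this sketch only shows how the last \boxed{...} span, including nested braces, is pulled out of a completion):

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in a completion,
    handling nested braces (e.g. \\boxed{\\dfrac{559}{4}}).
    Returns None if no \\boxed{ is present."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        c = text[i]
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(c)
        i += 1
    return "".join(out)

print(extract_boxed("the final answer is $\\boxed{70}$."))  # 70
print(extract_boxed("so $\\boxed{\\dfrac{559}{4}}$"))       # \dfrac{559}{4}
```

The extracted string is then compared against the ground truth with math_verify rather than plain string equality, so forms like n=104 still count as correct.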

Here is the full truth/pred/correctness table from default_system_prompt=False, prompt from artificial analysis.

truth pred correct
70 70 True
70 70 True
70 70 True
70 70 True
70 70 True
70 70 True
70 70 True
70 70 True
588 588 True
588 588 True
588 588 True
588 588 True
588 588 True
588 588 True
588 588 True
588 588 True
16 16 True
16 16 True
16 16 True
16 16 True
16 16 True
16 16 True
16 16 True
16 16 True
117 117 True
117 117 True
117 117 True
117 117 True
117 117 True
117 117 True
117 117 True
117 117 True
279 279 True
279 279 True
279 279 True
279 279 True
279 279 True
279 279 True
279 279 True
279 279 True
504 504 True
504 504 True
504 504 True
504 504 True
504 504 True
504 504 True
504 504 True
504 504 True
821 271 False
821 271 False
821 821 True
821 821 True
821 821 True
821 821 True
821 701 False
821 821 True
77 77 True
77 143 False
77 77 True
77 77 True
77 77 True
77 77 True
77 77 True
77 77 True
62 62 True
62 119 False
62 87 False
62 62 True
62 62 True
62 60 False
62 62 True
62 73 False
81 81 True
81 973520 False
81 81 True
81 62 False
81 81 True
81 95 False
81 57 False
81 81 True
259 259 True
259 259 True
259 4152 False
259 22 False
259 259 True
259 259 True
259 259 True
259 42 False
510 510 True
510 510 True
510 761308 False
510 510 True
510 510 True
510 510 True
510 510 True
510 303 False
204 \dfrac{559}{4} False
204 128 False
204 \dfrac{593}{6} False
204 \displaystyle \frac{787}{3} False
204 \dfrac{115}{3} False
204 529 False
204 \displaystyle \frac{187}{3}+300\cdot\frac{5}{12}= \frac{561}{4} False
204 \displaystyle \frac{399}{6} False
60 30 False
60 78 False
60 106 False
60 62 False
60 429 False
60 194 False
60 235 False
60 3521 False
735 273 False
735 197 False
735 461 False
735 729 False
735 147 False
735 999 False
735 499 False
735 479 False
468 468 True
468 468 True
468 468 True
468 468 True
468 468 True
468 468 True
468 468 True
468 468 True
49 49 True
49 49 True
49 49 True
49 49 True
49 49 True
49 49 True
49 49 True
49 49 True
82 82 True
82 82 True
82 100 False
82 82 True
82 82 True
82 82 True
82 82 True
82 82 True
106 106 True
106 106 True
106 106 True
106 106 True
106 106 True
106 106 True
106 106 True
106 106 True
336 378 False
336 336 True
336 336 True
336 168 False
336 768 False
336 552^{\circ} False
336 336 True
336 672 False
293 293 True
293 293 True
293 293 True
293 293 True
293 293 True
293 293 True
293 293 True
293 293 True
237 237 True
237 237 True
237 237 True
237 237 True
237 237 True
237 237 True
237 237 True
237 237 True
610 610 True
610 610 True
610 610 True
610 200 False
610 805 False
610 610 True
610 610 True
610 610 True
149 149 True
149 149 True
149 149 True
149 149 True
149 143 False
149 149 True
149 149 True
149 149 True
907 907 True
907 907 True
907 907 True
907 907 True
907 907 True
907 2907 False
907 907 True
907 907 True
113 113 True
113 113 True
113 113 True
113 113 True
113 113 True
113 113 True
113 113 True
113 113 True
19 19 True
19 19 True
19 19 True
19 19 True
19 19 True
19 19 True
19 19 True
19 19 True
248 736 False
248 933 False
248 162 False
248 177 False
248 3 False
248 501 False
248 627 False
248 6 False
104 104 True
104 104 True
104 104 True
104 n=104 True
104 104 True
104 104 True
104 n=104 True
104 104 True
240 \frac{182579}{1000} False
240 240 True
240 4146 False
240 3162 False
240 472 False
240 8+32+200=240 True
240 564 False
240 264 False
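With 8 samples per question, pass@1 here is the per-question accuracy averaged over questions (which, with equal sample counts, equals the overall fraction of correct samples). A minimal sketch over (question_id, correct) rows like the table above:

```python
from collections import defaultdict

def pass_at_1(rows):
    """rows: iterable of (question_id, correct) pairs.
    Returns pass@1 as a percentage: per-question accuracy
    averaged over questions."""
    per_q = defaultdict(list)
    for qid, ok in rows:
        per_q[qid].append(ok)
    accs = [sum(v) / len(v) for v in per_q.values()]
    return 100.0 * sum(accs) / len(accs)

# Toy example: q0 solved 8/8, q1 solved 4/8 -> pass@1 = 75.0
rows = [(0, True)] * 8 + [(1, True)] * 4 + [(1, False)] * 4
print(pass_at_1(rows))  # 75.0
```

Running this over the table above is how the 72–74 figures were obtained.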

If it helps, in my trials the maximum generation length was under 40k tokens.

hi @se-ok , thanks for the report. We're preparing guidance for you, and yes, the more details you can share about what you've tried, the more helpful it is for us.

@se-ok this is the summary:

  1. make sure you use the latest config we provide
  2. make sure you set reasoning_effort=high
  3. the provided docker image is configured for efficient serving; update these environment variables for the benchmark:
SOLAR_REASONING_BUDGET_HIGH_MAX=131072
SOLAR_REASONING_BUDGET_HIGH_RATIO=100

e.g.

docker run --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -e SOLAR_REASONING_BUDGET_HIGH_MAX=131072 \
    -e SOLAR_REASONING_BUDGET_HIGH_RATIO=100 \
    upstage/vllm-solar-open:latest \
    upstage/Solar-Open-100B \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser solar_open \
    --reasoning-parser solar_open \
    --logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor \
    --logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor \
    --tensor-parallel-size 8

or, on vLLM,

SOLAR_REASONING_BUDGET_HIGH_MAX=131072 SOLAR_REASONING_BUDGET_HIGH_RATIO=100 vllm serve upstage/Solar-Open-100B \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser solar_open \
    --reasoning-parser solar_open \
    --logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor \
    --logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor \
    --tensor-parallel-size 8

@keunwooupstage With those environment variables I have successfully reproduced the reported AIME25 score: Pass@1 85.00 over 8 runs.
From the provided settings I now understand that the custom logits processor was cutting off the thinking part at 32k tokens by default, which was an interesting tweak.

Thank you for helping me out, and congratulations on your achievement!

upstage org

Very happy that you could reproduce the results. Thank you for your work.

upstage org

@keunwooupstage should we update it to 85?

closing this issue now.
(and we keep the original score in the benchmark table)

keunwooupstage changed discussion status to closed
