3-stage adaptive evaluation comparison with Qwen3.6-27B?

#1
by SkyMind - opened

It would be useful to compare the 3-stage adaptive evaluation results with those for Qwen3.6-27B under the same protocol.

FINAL_Bench org

Thank you for the suggestion. We'd like to note two points.
First, Darwin V7 transparently reports both the standard greedy result (74.7%) and the 2-Pass result (86.9%) side by side, so the community can assess each stage independently.
Second, a fair comparison requires equal disclosure from all sides. To our knowledge, Qwen3.6-27B's reported GPQA Diamond score (87.8%) does not come with detailed evaluation conditions such as temperature, sampling strategy, or number of attempts. We believe standardized and transparent evaluation protocols would benefit everyone, and we are happy to participate in any such effort.

I believe Qwen is using the evaluation protocol described in the Qwen3 technical report.

https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf

All of their tabulated results are consistent with that protocol, with one caveat: the results table on https://huggingface.co/Qwen/Qwen3.5-122B-A10B appears to label Qwen3-235B-A22B-Thinking-2507 results as Qwen3-235B-A22B. If that reading is correct, the Qwen3.5 and Qwen3.6 results can be chained back to those in the paper through comparisons with other Qwen model scores. (This assumes they evaluate all of their own models reported in a given table consistently, which appears to hold.)

Qwen3_Technical_Report.evalNotes


https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf

4.6 Post-training Evaluation

For GPQA-Diamond, we sample 10 times for each query and report the averaged accuracy.

For all Qwen3 models in the thinking mode, we utilize a sampling temperature of 0.6, a top-p value
of 0.95, and a top-k value of 20.
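
To make the quoted protocol concrete, here is a minimal sketch of the "sample 10 times and report the averaged accuracy" metric. The function name `averaged_accuracy` and the toy grading data are hypothetical and only illustrate the arithmetic; this is not Qwen's evaluation code.

```python
from statistics import mean

# Sampling settings quoted from the report for thinking mode.
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}
NUM_SAMPLES = 10  # samples drawn per GPQA-Diamond query


def averaged_accuracy(is_correct):
    """Return accuracy in percent, averaged over samples and then over queries.

    is_correct holds one list per query, with one boolean per sampled answer.
    """
    per_query = [mean(1.0 if c else 0.0 for c in samples) for samples in is_correct]
    return 100.0 * mean(per_query)


# Hypothetical grading results for two queries, NUM_SAMPLES samples each.
toy = [
    [True] * 7 + [False] * 3,  # 7 of 10 samples correct
    [True] * 9 + [False] * 1,  # 9 of 10 samples correct
]
print(round(averaged_accuracy(toy), 1))  # 80.0
```

A single greedy run corresponds to one sample per query, so a score obtained that way is not the same quantity as this 10-sample average.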


GPQA-Diamond, Thinking
Qwen3-235B-A22B    71.1
Qwen3-30B-A3B      65.8
Qwen3-32B          68.4
Qwen3-14B          64.0

Chain of GPQA-Diamond scores:

https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
Qwen3-30B-A3B-Thinking-2507    73.4
Qwen3-30B-A3B-Thinking         65.8



https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
Qwen3-235B-A22B                  71.1
Qwen3-235B-A22B-Thinking-2507    81.1    [IFEval 87.8, MMLU-Pro 84.4]


https://huggingface.co/Qwen/Qwen3.5-122B-A10B
Qwen3-235B-A22B      81.1    [possible typo for Thinking-2507? This row also lists IFEval 87.8, MMLU-Pro 84.4]
Qwen3.5-122B-A10B    86.6
Qwen3.5-27B          85.5
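
The mislabel check above can be sketched as a small score-fingerprint comparison. The scores are the ones copied into these notes; the helper `matching_models` and the variable names are hypothetical, and a match only suggests a labelling error rather than proving one.

```python
# Scores copied from the notes above: (GPQA-Diamond, IFEval, MMLU-Pro).
REFERENCE = {
    "Qwen3-235B-A22B-Thinking-2507": (81.1, 87.8, 84.4),
}

# GPQA-Diamond score for the original thinking-mode model, from the Qwen3 report.
GPQA_ONLY = {"Qwen3-235B-A22B": 71.1}

# Row labelled "Qwen3-235B-A22B" on the Qwen3.5-122B-A10B model card.
row_label = "Qwen3-235B-A22B"
row_scores = (81.1, 87.8, 84.4)


def matching_models(scores, reference):
    """Return the reference models whose score tuple equals `scores` exactly."""
    return [name for name, ref in reference.items() if ref == scores]


matches = matching_models(row_scores, REFERENCE)
label_disagrees = GPQA_ONLY[row_label] != row_scores[0]

print(matches)          # ['Qwen3-235B-A22B-Thinking-2507']
print(label_disagrees)  # True
```

Exact equality is enough here because the scores are copied as printed to one decimal place; comparing three benchmarks at once makes a coincidental match less likely than comparing GPQA-Diamond alone.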
