3-stage adaptive evaluation comparison with Qwen3.6-27B?
It would be useful to compare the 3-stage adaptive evaluation results with those for Qwen3.6-27B under the same protocol.
Thank you for the suggestion. We'd like to note two points.
First, Darwin V7 transparently reports both the standard greedy result (74.7%) and the 2-Pass result (86.9%) side by side, so the community can assess each stage independently.
Second, a fair comparison requires equal disclosure from all sides. To our knowledge, Qwen3.6-27B's reported GPQA Diamond score (87.8%) does not come with detailed evaluation conditions such as temperature, sampling strategy, or number of attempts. We believe standardized and transparent evaluation protocols would benefit everyone, and we are happy to participate in any such effort.
I believe the Qwen team is using the protocol described in the Qwen3 technical report:
https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf
With one caveat, all of their tabulated results are consistent with that. The caveat is that the results table on https://huggingface.co/Qwen/Qwen3.5-122B-A10B appears to list Qwen3-235B-A22B-Thinking-2507 results under the label Qwen3-235B-A22B. If that reading is correct, the Qwen3.5 and Qwen3.6 results can be chained back to those in the paper through the scores of other Qwen models that appear in more than one table. (This assumes they evaluate all of their own models reported in a given table consistently, which appears to hold.)
Qwen3_Technical_Report.evalNotes
https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf
4.6 Post-training Evaluation
For GPQA-Diamond, we sample 10 times for each query and report the averaged accuracy.
For all Qwen3 models in the thinking mode, we utilize a sampling temperature of 0.6, a top-p value of 0.95, and a top-k value of 20.
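To make the quoted protocol concrete, here is a minimal sketch of how "sample 10 times for each query and report the averaged accuracy" could be computed. The function name, the data layout (one list of correct/incorrect flags per query), and the toy data are my own assumptions for illustration; this is not Qwen's evaluation code, and the sampling settings are only recorded as constants rather than used to call a model.

```python
from statistics import mean

# Settings quoted from section 4.6 of the Qwen3 report (thinking mode).
# They are listed here for reference only; this sketch does not run a model.
QWEN3_THINKING_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}
SAMPLES_PER_QUERY = 10


def averaged_accuracy(per_query_correct):
    """Average, over queries, of the fraction of correct samples per query.

    per_query_correct is a hypothetical layout: one list of booleans per
    query, one boolean per sampled answer.
    """
    if not per_query_correct:
        raise ValueError("no queries given")
    fractions = []
    for flags in per_query_correct:
        if not flags:
            raise ValueError("a query has no samples")
        fractions.append(sum(flags) / len(flags))
    return mean(fractions)


# Toy data: three queries, ten samples each (made up for illustration).
toy = [
    [True] * SAMPLES_PER_QUERY,        # always correct
    [True] * 5 + [False] * 5,          # correct half the time
    [False] * SAMPLES_PER_QUERY,       # never correct
]
toy_score = averaged_accuracy(toy)
```

Under this reading, a query that is right on 5 of 10 samples contributes 0.5, so the reported number is an expected single-sample accuracy at those sampling settings, not a single greedy pass and not a best-of-10 score.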
GPQA-Diamond, Thinking
Qwen3-235B-A22B 71.1
Qwen3-30B-A3B 65.8
Qwen3-32B 68.4
Qwen3-14B 64.0
Chain of GPQA-Diamond scores:
https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
Qwen3-30B-A3B-Thinking-2507 73.4
Qwen3-30B-A3B-Thinking 65.8
https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
Qwen3-235B-A22B 71.1
Qwen3-235B-A22B-Thinking-2507 81.1 [IFEval 87.8, MMLU-Pro 84.4]
https://huggingface.co/Qwen/Qwen3.5-122B-A10B
Qwen3-235B-A22B 81.1 [likely a mislabel for Thinking-2507: the IFEval 87.8 and MMLU-Pro 84.4 in this row also match that model's card]
Qwen3.5-122B-A10B 86.6
Qwen3.5-27B 85.5
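The cross-check behind this chain can be sketched in a few lines: collect the GPQA-Diamond scores listed above by source, then flag any model label that carries different scores in different tables. The source names and the function name are my own labels for illustration, and the scores are only the ones quoted in these notes.

```python
from collections import defaultdict

# GPQA-Diamond scores exactly as quoted in the notes above, keyed by source.
TABLES = {
    "Qwen3 Technical Report": {
        "Qwen3-235B-A22B": 71.1,
        "Qwen3-30B-A3B": 65.8,
        "Qwen3-32B": 68.4,
        "Qwen3-14B": 64.0,
    },
    "Qwen3-30B-A3B-Thinking-2507 card": {
        "Qwen3-30B-A3B-Thinking-2507": 73.4,
        "Qwen3-30B-A3B-Thinking": 65.8,
    },
    "Qwen3-235B-A22B-Thinking-2507 card": {
        "Qwen3-235B-A22B": 71.1,
        "Qwen3-235B-A22B-Thinking-2507": 81.1,
    },
    "Qwen3.5-122B-A10B card": {
        "Qwen3-235B-A22B": 81.1,
        "Qwen3.5-122B-A10B": 86.6,
        "Qwen3.5-27B": 85.5,
    },
}


def conflicting_labels(tables):
    """Return {label: {source: score}} for labels whose score differs by source."""
    seen = defaultdict(dict)
    for source, rows in tables.items():
        for label, score in rows.items():
            seen[label][source] = score
    return {
        label: by_source
        for label, by_source in seen.items()
        if len(set(by_source.values())) > 1
    }


conflicts = conflicting_labels(TABLES)
for label, by_source in conflicts.items():
    print(label, by_source)
```

With the numbers above, the only label flagged is Qwen3-235B-A22B (71.1 in two tables, 81.1 on the Qwen3.5-122B-A10B card), which is the row I suspect is mislabeled. Every other repeated score lines up, which is what makes the chain back to the paper plausible.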