Why Do So Many Chinese Companies Cheat On Tests?

#1
by phil111 - opened

For example, this model has a reported SimpleQA score of 27, but its real score is ~6. And Qwen3.5 35b, from which this model was derived, has a SimpleQA score of ~8, not 20.

There's a huge difference between a SimpleQA score of 10 and a score of 20. To date, the smallest model that has legitimately achieved a SimpleQA score of 20 is Llama 3.1 70b, while Qwen3.5's top English competitor, Gemma 4, scored 9.

For Alibaba to try to claim that Qwen3.5 scored vastly higher in broad English knowledge than the top English competitor in the same size category (20 vs 9) is nothing short of insane. Why cheat so egregiously that everyone in the industry knows you cheated?

Note: The English SimpleQA is a non-multiple-choice test of broad English knowledge, so unlike many other tests, you can't fine-tune your way to a higher score. You either need more parameters or need to train for a very long time on a very large and broad English corpus.

There may be some misunderstanding here. The SimpleQA test used here is SimpleQA Verified provided by Google. You can reproduce the evaluation results using OpenCompass (the dataset configuration file is opencompass/configs/datasets/SimpleQA/simpleqa_verified_rawprompt_gen.py).

Thank you for pointing this out. We have updated the benchmark table and adopted more rigorous wording.

@mzr1996 Thanks for clarifying. But as Google explains, SimpleQA Verified is the same SimpleQA released by OpenAI, only improved (e.g. redundancy and other issues were removed). It's still a hard, non-multiple-choice test of broad English knowledge, evaluated by GPT-4.1, with scores that correlate with the original test, and Google's Gemma 4 scored ~9.

There's simply no way the comparably sized Qwen3.5 scored 20, or this model scored 27, unless the models were contaminated by the test. The only legitimate way to climb 11 and 18 points, respectively, on a non-multiple-choice test of broad English knowledge is to train a much larger model on a huge amount of broad English knowledge, which was not done for these models (I verified this myself by asking sister questions). Without a doubt, these models do not have more broad English knowledge than Gemma 4, let alone the vastly greater broad English knowledge needed to achieve the astonishingly high scores of 20 and 27, respectively.
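
For anyone checking the arithmetic, the 11- and 18-point gaps follow directly from the figures quoted in this thread. A minimal sketch, assuming the reported scores (20 and 27) and the Gemma 4 score (~9) are as stated above; the variable names are purely illustrative and none of these numbers are independently verified here:

```python
# Scores as quoted in this thread (assumed, not independently verified).
reported = {"Qwen3.5 35b": 20, "this model": 27}
gemma4_score = 9  # Gemma 4 score cited in the thread

# Gap between each reported score and the Gemma 4 score.
gaps = {name: score - gemma4_score for name, score in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: +{gap} points over Gemma 4")
```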
