Leaderboard - 90%+
Congrats to those who have gotten recent high scores, but I have to say I'm a little skeptical. Getting 90% across L1, L2, and L3 would be pretty difficult, as at least 10 percent of the questions are arguably impossible. And looking at the code referenced in the paper, I don't see the kind of tools that would be required to do well. I have to wonder if there was test leakage or something like that.
Can you share how on earth I can submit my submission? It keeps saying my account is not authorized to submit.
As for your opinion, I agree. Not for any other reason, simply because when I work with a coding agent on building this agent, the AI keeps suggesting things that are definitely overfitting and almost straight-up cheating. Different LLMs are all the same: they keep trying that unless I explicitly ask them, isn't that like cheating? (And they mostly agree.) And people who want a really high score on this were probably doing a lot of that overfitting...
It might be that your data is in the wrong format, or maybe you are using the validation set instead of the test set?
The questions are all answerable by humans, so my earlier thought was that maybe they were doing some kind of RL and unintentionally leaking answers. 90% is pretty suspect and doesn't align with controlled environments like HAL/GAIA, even with frontier models.
I mean, when I submitted to the agents course student leaderboard it accepted my format :( but not here