Spaces: gaia-benchmark/leaderboard

Leaderboard - 90%+

#81

by sheebz - opened Feb 3

Congrats to those who have gotten recent high scores, but I have to say I'm a little skeptical. Getting 90% on L1, L2, and L3 would be pretty difficult, as at least 10 percent of the questions are arguably impossible. Then, looking at the code referenced in the paper, I don't see the kind of tools that would be required to do well. I have to wonder if there was possibly test leakage or something like that.

yc1838

8 days ago

Can you share how on earth I can submit my submission? It keeps saying my account is not authorized to submit.
As for your opinion, I agree. Not for any other reason, simply because when I work with coding agents to build this agent, those AIs keep suggesting things that are definitely overfitting, and almost outright cheating. Different LLMs all do the same thing: they keep trying that unless I explicitly ask them, isn't that like cheating? (And they mostly agree.) And people who want to get a really high score on this were probably doing a lot of that overfitting...

sheebz

about 21 hours ago

It might be that your data is in the wrong format, or maybe you are using the validation set instead of the test set?

sheebz

about 21 hours ago

The questions are all answerable by a human, so my earlier thought was that maybe they were doing some kind of RL and unintentionally leaking answers. 90% is pretty suspect and doesn't align with results from controlled environments like HAL/GAIA, even with frontier models.

yc1838

about 20 hours ago

I mean, when I submitted on the Agents Course student leaderboard, it accepted my format :( but not here.
