BrowseComp is Fake Data

#16
by eddy1111111 - opened

BrowseComp 68.0 62.0 - - 51.4 60.6

The open-source version for long-context tasks exhibits significant degradation compared to the fifth-generation model.

I ran an end-to-end reproduction with a ReAct-based framework and no context-management mechanisms, and obtained 59.8.
Considering the differences in agent scaffold and toolsets, this gap looks quite reasonable to me.
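
For readers unfamiliar with the setup, here is a minimal sketch of what a ReAct-style loop without context management looks like. It is an illustration only, not the DeepTraceHub code: `react_loop`, `scripted_model`, and `fake_search` are hypothetical names, and the model and search tool are stubs. The point is that every thought, action, and observation is appended to a single transcript and the whole transcript is sent back to the model on each step, with no truncation or summarization.

```python
from typing import Callable, Dict, List, Optional, Tuple

def react_loop(
    question: str,
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[str]]:
    """Run a bare ReAct loop; the full transcript is re-sent every step."""
    transcript: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        # No context management: the prompt is the entire history so far.
        reply = model("\n".join(transcript))
        transcript.append(reply)
        if reply.startswith("Final Answer:"):
            return reply[len("Final Answer:"):].strip(), transcript
        if reply.startswith("Action:"):
            # Assumed action format: "Action: <tool_name> <argument text>"
            name, _, arg = reply[len("Action:"):].strip().partition(" ")
            tool = tools.get(name)
            observation = tool(arg) if tool else f"unknown tool {name!r}"
            transcript.append(f"Observation: {observation}")
    # Step budget exhausted without a final answer.
    return None, transcript

# Stub model and tool, standing in for a real LLM and a real search API.
def scripted_model(prompt: str) -> str:
    if "Observation:" not in prompt:
        return "Action: search capital of France"
    return "Final Answer: Paris"

def fake_search(query: str) -> str:
    return f"stub result for '{query}': Paris"

answer, transcript = react_loop(
    "What is the capital of France?", scripted_model, {"search": fake_search}
)
```

Because the transcript only grows, long browsing trajectories can eventually exceed the model's context window, which is one way scaffold differences can show up in the final score.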

My scaffold is open here:
https://github.com/RedSearchAgent/DeepTraceHub

We will check it.
