BrowseComp is Fake Data
#16
by eddy1111111 - opened
BrowseComp 68.0 62.0 - - 51.4 60.6
On long-context tasks, the open-source version degrades significantly compared to the fifth-generation model.
I ran an end-to-end reproduction with a ReAct-based framework and no context-management mechanisms, and obtained 59.8.
Considering the differences in agent scaffold and toolsets, this gap looks quite reasonable to me.
My scaffold is open-sourced here:
https://github.com/RedSearchAgent/DeepTraceHub
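For anyone unfamiliar with the setup, a ReAct loop with no context management can be sketched roughly as below. This is only an illustration and not the code in the linked repository: `react_loop`, the `Action:` / `Final Answer:` text protocol, and the stubbed LLM and search tool are all assumptions made for the example. The point it shows is that every model reply and tool observation is appended to the history, and the full history is sent back on each step with no truncation or summarization.

```python
from typing import Callable, Dict, List, Optional, Tuple


def react_loop(
    llm: Callable[[str], str],               # hypothetical: maps a prompt to the model's reply
    tools: Dict[str, Callable[[str], str]],  # hypothetical: tool name -> tool function
    question: str,
    max_steps: int = 8,
) -> Tuple[Optional[str], List[str]]:
    """Minimal ReAct-style loop with no context management (illustrative only)."""
    history: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        # The whole history is sent every time: nothing is trimmed or summarized.
        reply = llm("\n".join(history)).strip()
        history.append(reply)
        if reply.startswith("Final Answer:"):
            return reply[len("Final Answer:"):].strip(), history
        if reply.startswith("Action:"):
            # Assumed format: "Action: <tool_name> <tool_input>"
            name, _, arg = reply[len("Action:"):].strip().partition(" ")
            tool = tools.get(name)
            observation = tool(arg) if tool else f"unknown tool {name!r}"
            history.append(f"Observation: {observation}")
    # Step budget exhausted without a final answer.
    return None, history


if __name__ == "__main__":
    # Scripted stand-in for a real model, so the sketch runs offline.
    scripted = iter(["Action: search capital of France", "Final Answer: Paris"])
    answer, trace = react_loop(
        llm=lambda prompt: next(scripted),
        tools={"search": lambda q: "Paris is the capital of France."},
        question="What is the capital of France?",
    )
    print(answer)
```

Because the history grows with every step, long browsing trajectories can run into the model's context limit, which is one reason scores from different scaffolds are not directly comparable.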
We will check it.