Can the BrowseComp results be reproduced?
First, my appreciation and thanks to the Moonshot AI team for their series of works 👍
Second, I'd like to ask: since the release of kimi-k2-thinking, and now with kimi-k2.5, has anyone reproduced the official reported BrowseComp results? For dpsk/miromind/zai/tongyi I have reproduced the reported results exactly.
Third, here is the series of reproduction attempts I have made, including but not limited to:
Toolsets
1. Toolsets aligned with the tool format of the officially released kimi-k2-thinking trajectories
- search (serper)
- open (jina, or jina with an LLM pattern)
- python
2. GPT-OSS toolsets
- search (serper)
- open
- find
- python
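The two toolsets above can be sketched as OpenAI-style function schemas. This is only an illustration: the exact names, descriptions, and parameters are my assumptions and must be copied from the officially released trajectories when aligning formats.

```python
# Hypothetical tool schemas for the K2-thinking-style toolset.
# Names/parameters are placeholders; align them with the released trajectories.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Web search (e.g., via the Serper API); returns titles, URLs, snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

OPEN_TOOL = {
    "type": "function",
    "function": {
        "name": "open",
        "description": "Fetch a page as plain text (e.g., via Jina Reader).",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}

PYTHON_TOOL = {
    "type": "function",
    "function": {
        "name": "python",
        "description": "Execute Python code and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}

TOOLSET = [SEARCH_TOOL, OPEN_TOOL, PYTHON_TOOL]
```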
Strategy
1. End-to-end test (w/o context management)
Setting:
- 256k context / temperature 1.0 / top-p 0.95
- Fully aligned with the K2 released trajectory format & observation template
- Aligned with the GPT-OSS format
Results:
- K2-thinking BrowseComp ≈ 45
- K2.5 BrowseComp ≈ 45
2. With context management
Setting:
- keep the most recent 3 / 5 / 10 tool observations
Results:
- K2-thinking BrowseComp ≈ 53
- K2.5: not yet tested with context management
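The keep-recent-N setting above can be sketched as a simple message filter over an OpenAI-style chat history (a minimal sketch; the actual context manager and placeholder text may differ):

```python
def keep_recent_observations(messages, n=5, placeholder="[observation truncated]"):
    """Replace all but the last n tool observations with a short placeholder.

    `messages` is an OpenAI-style chat history; only entries with
    role == "tool" count as observations. System/user/assistant turns
    are kept verbatim so the reasoning chain stays intact.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_truncate = set(tool_indices[:-n]) if n > 0 else set(tool_indices)
    return [
        {**m, "content": placeholder} if i in to_truncate else m
        for i, m in enumerate(messages)
    ]
```

Calling this before every model invocation keeps the context bounded while still letting the model see which tool calls it already made.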
Then, what I'd like to know is:
Can the BrowseComp End2End results be reproduced?
I have a few hypotheses:
- K2-Thinking & K2.5 generalize poorly across BrowseComp formats and must be strictly aligned with the training format
- Benefits of in-house tool servers
- The 60.6 on BrowseComp was obtained with a keep-recent-x-turns tool-observation strategy, or with a strategy like kimi-k2-thinking's "clean all tool observations after reaching max context length" (the most likely option)
- ...
Finally, once again, my appreciation and thanks to the Moonshot AI team for their series of works 👍
Hey @Aldrich-x , are you able or willing to share which search agent framework you are using to try and reproduce these results? Is there a GitHub repo?
An extremely simple ReAct framework—just align with the observation format used by each provider.
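For context, the core of such a framework fits in a few lines. Below is a minimal sketch assuming an OpenAI-compatible tool-calling message format; `call_model` and the tool callables are placeholders you would wire to your own endpoint and tool servers, and the plain-string observation is only a stand-in for each provider's exact observation template.

```python
import json

def react_loop(call_model, tools, messages, max_steps=50):
    """Minimal ReAct driver: call the model, execute requested tools,
    append observations, repeat until the model answers without tools.

    call_model(messages) -> assistant message dict (OpenAI chat format).
    tools: mapping from tool name to a Python callable taking **kwargs.
    """
    for _ in range(max_steps):
        assistant = call_model(messages)
        messages.append(assistant)
        tool_calls = assistant.get("tool_calls") or []
        if not tool_calls:  # no tool call -> final answer
            return assistant.get("content", "")
        for call in tool_calls:
            fn = call["function"]
            observation = tools[fn["name"]](**json.loads(fn["arguments"]))
            # The observation template must match the provider's training
            # format exactly; a bare string is only a placeholder here.
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(observation),
            })
    return None  # step budget exhausted
```

Everything provider-specific (trajectory format, observation template) lives in how the tool messages are rendered, which is why aligning that format matters so much.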
Hi, thank you very much for your interest in our work. The k2t and k2.5 scores you reported are far below the normal range, which suggests there may be some huge mismatches. I recommend first aligning with our w/o ctx mgm results to verify whether there is a bug in the settings — for reference, k2.5 should be 60.6. Then you can try adding context management methods.
We disclosed the system prompt used for testing in the technical report. You can try aligning with it to encourage the model to perform deep search. https://arxiv.org/pdf/2602.02276 Page 25.
You can also try reducing the search chunk size, e.g., controlling 2k per search call result, so the model has more context available for reasoning even without context management techniques.
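Reducing the search chunk size as suggested could look like the following. This is a sketch under a crude assumption: token counts are approximated at ~4 characters per token, whereas a real setup would use the model's actual tokenizer.

```python
def truncate_result(text, max_tokens=2000, chars_per_token=4):
    """Truncate a search/open result to roughly `max_tokens` tokens.

    The chars-per-token ratio is only a rough heuristic; count tokens
    with the model's tokenizer in a real deployment.
    """
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    return text[:limit] + "\n...[truncated; open the URL for the full page]"
```

The trailing marker nudges the model to issue a follow-up call (e.g., a browse/visit tool) when it decides the page deserves a careful read.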
@dawnmsg , thank you for your great insights.
I just want to check my understanding of two things you mentioned in different discussions. In this comment you suggested “reducing the search chunk size … so the model has more context available for reasoning even without context management techniques.”
But in an earlier K2-Thinking discussion, you said “We don’t return a summary of the target pages — instead, we return the original text to the model and let the model browse the relevant information itself.”
I was wondering if you could clarify how these two points fit together. Specifically, when you reduce the search chunk size (for example to ~2k tokens per search call result), does that mean each call returns only a partial slice of the source text, and the model may not receive the full page text in-context unless it issues additional calls? Or does the system still return the full original text, just segmented differently?
Thanks again for any clarification you are able to provide on how to interpret this properly🙏
Hi, thank you for your reply. Yes, we still don't return summaries of the target pages; we return the web_url and the first n-thousand tokens of the result, truncated (e.g., 2k in this case). Since we have the browser_use tools, if K2.5 determines the information is useful and needs a careful read, it will call browser_visit to load the full page.
I hope these details help you reproduce our reported score. I’ll be happy to answer any questions if you run into any challenges.
@dawnmsg , thank you again for your help and continued great insights.
I’m currently struggling to reproduce the BrowseComp results with context management. In the K2.5 tech report it says that for BrowseComp “we adopt the same discard-all strategy proposed by DeepSeek, where all history is truncated once token thresholds are exceeded.”
I also looked at the DeepSeek-V3.2 paper, and as I understand it, discard-all there is described as “when the token usage exceeds 80% of the context window length, discard-all resets the context by discarding all previous tool call history”, where it seems a bit more specific about tool call history rather than all history?
Could you shed a bit more light on how discard-all is implemented in the K2.5 context manager? For example:
- In the case where "all history is truncated" once the 80% context threshold is triggered, up to which point is it truncated to? Is it only the system prompt and initial user message/question that are always retained? Are any other messages retained?
- And if the entire prior context (except prompt and question) is discarded, I’m having a hard time understanding how the model avoids simply re-trying the same search trajectories that previously failed, since it no longer has the context of what was already attempted?
Thanks again so much for any clarification you are able to provide on how to interpret this properly🙏
"Discard-all, which resets the context by discarding all previous tool call history (similar to the new context tool (Anthropic, 2025a))." Based on our understanding of the DeepSeek V3.2 paper and the paper it cites (the new context tool, Anthropic, 2025a), discard-all resets the context by discarding the reasoning associated with each tool call as well as the tool call results. It provides a new context with the initial question. In our implementation, we set an upper tool-call step threshold for discard-all to avoid endless retries in this case. We set it to 500.
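Put together, the behavior described in this thread could be sketched as follows. Assumptions labeled here, not confirmed in detail by the source: the 80% threshold comes from the DeepSeek-V3.2 description, the counter-never-resets semantics and the 500-call cap come from this thread, and token usage is approximated by character count via `count_tokens`.

```python
import json

def run_with_discard_all(call_model, tools, system, question,
                         context_window=262144, threshold=0.8,
                         max_tool_calls=500, count_tokens=len):
    """Discard-all context management, as discussed in this thread.

    When token usage exceeds `threshold` of the context window, the
    history is reset to just the system prompt and the initial question.
    The tool-call counter accumulates across resets; the run stops for
    good once it reaches `max_tool_calls`.
    """
    base = [{"role": "system", "content": system},
            {"role": "user", "content": question}]
    messages = list(base)
    total_tool_calls = 0  # never reset, even when the context is discarded
    while total_tool_calls < max_tool_calls:
        assistant = call_model(messages)
        messages.append(assistant)
        tool_calls = assistant.get("tool_calls") or []
        if not tool_calls:
            return assistant.get("content", "")
        for call in tool_calls:
            fn = call["function"]
            obs = tools[fn["name"]](**json.loads(fn["arguments"]))
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": str(obs)})
            total_tool_calls += 1
        used = sum(count_tokens(m.get("content") or "") for m in messages)
        if used > threshold * context_window:
            messages = list(base)  # discard reasoning + tool history
    return None  # accumulated tool-call budget exhausted
```

With `max_tool_calls=500` this matches the cap described above: each reset hands the model a fresh context, but the accumulated counter guarantees termination even if the model keeps retrying similar trajectories.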
@dawnmsg , thanks again for your insights which are really helpful towards community reproduction.
Just to confirm I’m interpreting this correctly: when you say "we set an upper tool-call step threshold for discard-all … to 500", you mean there’s a hard cap of 500 tool-invocation steps and that the tool-invocation step counter keeps accumulating across discard-all resets (it doesn’t reset when discard-all triggers), and the run is stopped once the total number of tool calls (the counter) reaches 500 to prevent endless looping. Is that the correct interpretation?
Thank you once again for any clarification you are able to provide on how to interpret this properly🙏
Yes, the tool call count will be accumulated.