Context Management Reproducibility?
Hi Moonshot team, thank you so much for continually open-sourcing such impressive models and sharing your research!
Just a question regarding reproducibility for the dev and research community: Footnote 2 in your blog post describes an HLE context management strategy where "only the latest round of tool messages is retained," while Footnote 3 references a "discard-all" strategy for BrowseComp.
Are the context management strategies you used for BrowseComp/HLE evaluation open-source, or do you plan to open-source them? If not, could you please elaborate on the HLE context management strategy or perhaps provide pseudo-code for how that specific retention logic works? Also, is this distinct from the BrowseComp "discard-all" strategy you alluded to?
Thanks again for your amazing work! 🙏
@courage17340 @teowu @bigeagle @bigmoyan @dawnmsg
UPDATE: I mocked up some of the possible diagram interpretations for the HLE context management you described. Please let me know if one of the diagrams accurately represents the context management strategy that was used 🙏 or if all my interpretations are wrong:
1st interpretation: (Retaining the chain of thought, but pruning bulky outputs) In this strategy, the full history of the Assistant's "Think" and "Tool Call" steps is preserved to maintain the reasoning chain, but the intermediate "Tool Results" (which often consume the most tokens) are omitted, except for the very latest round.
2nd interpretation: (Strict "Latest Round" window) In this strategy, all intermediate history, including both the Assistant's reasoning steps and the User's tool results, is discarded. The context is strictly limited to the original User Question and the single most recent round of Think & Tool Call/Result.
3rd interpretation: (Orphaned Result retention) In this strategy, the latest "Tool Result" is retained, but the specific "Tool Call" that generated it (and all prior history) is omitted. This effectively provides the model with the latest observation state without the immediate history of the action that caused it.
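To make the three interpretations concrete, here is a rough Python sketch of each pruning rule as I imagine it. The message schema (dicts with a `role` of `"user"` / `"assistant"` / `"tool"`) and all function names are my own assumptions for illustration, not anything from your blog post or evaluation code:

```python
# Assumed schema: a conversation is a list of dicts, where "assistant"
# messages carry Think + Tool Call steps and "tool" messages carry results.

def prune_keep_cot(messages):
    """1st interpretation: preserve all assistant Think/Tool Call steps,
    but drop every tool result except the latest one."""
    last_tool = max(
        (i for i, m in enumerate(messages) if m["role"] == "tool"),
        default=None,
    )
    return [
        m for i, m in enumerate(messages)
        if m["role"] != "tool" or i == last_tool
    ]

def prune_latest_round(messages):
    """2nd interpretation: keep only the original question plus the
    single most recent (assistant Think/Tool Call, tool result) round."""
    question = [m for m in messages if m["role"] == "user"][:1]
    return question + messages[-2:]

def prune_orphaned_result(messages):
    """3rd interpretation: keep the question and the latest tool result,
    dropping the tool call that produced it and all prior history."""
    question = [m for m in messages if m["role"] == "user"][:1]
    tools = [m for m in messages if m["role"] == "tool"]
    return question + tools[-1:]
```

For example, on a history of `[user, assistant, tool, assistant, tool]`, the first rule yields `[user, assistant, assistant, tool]`, the second `[user, assistant, tool]`, and the third `[user, tool]`.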
Thank you for pointing that out. The predefined threshold for HLE evaluation is 96k; once the task's context length exceeds this limit, Hide-Tool-Result Context Management is enabled.
Thanks for your reply. What about the token threshold for BrowseComp evaluation with context management? That point also doesn't seem to be clearly specified.
In the BrowseComp setting, we fully reproduce the discard-all logic proposed in the DeepSeek V3.2 report, so the threshold for triggering context management is 80% of the context limit.
@dawnmsg @Longhui98 , thank you for the helpful clarification about the specific token thresholds for HLE and BrowseComp evaluation.
Just to confirm the intended interpretation of Hide-Tool-Result: as in my 1st diagram interpretation above, the chain of thought (all Think and Tool Call steps) is fully preserved, but once the context length exceeds the threshold, only the most recent tool result is retained and older tool results are omitted to save space. Does that accurately reflect how the Hide-Tool-Result context management strategy works in practice?
Thanks again! 😊



