OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models Paper • 2604.10866 • Published about 1 month ago • 65
DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training Paper • 2602.05890 • Published Feb 5 • 1
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents Paper • 2602.12984 • Published Feb 13 • 7