DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training Paper • 2602.05890 • Published Feb 5 • 1
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents Paper • 2602.12984 • Published Feb 13 • 5
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation Paper • 2506.04078 • Published Jun 4, 2025 • 1
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm Paper • 2511.04570 • Published Nov 6, 2025 • 242
Can Deep Research Agents Find and Organize? Evaluating the Synthesis Gap with Expert Taxonomies Paper • 2601.12369 • Published Jan 18 • 4
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models Paper • 2601.14004 • Published Jan 20 • 47
Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control Paper • 2601.03973 • Published Jan 7 • 2