MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks Paper • 2601.14652 • Published Jan 21 • 4
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark Paper • 2501.01290 • Published Jan 2, 2025 • 1
Nemotron-Terminal Collection We are releasing Nemotron-Terminal models and training datasets. • 5 items • Updated 1 day ago • 31
Endless Terminals: Scaling RL Environments for Terminal Agents Paper • 2601.16443 • Published Jan 23 • 18
TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents Paper • 2602.07274 • Published Feb 6 • 207
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers Paper • 2602.18292 • Published 20 days ago • 10
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 176
Article Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective Jan 27 • 64
Enterprise Agents and Benchmarks Collection Enterprise agent ecosystem featuring AssetOpsBench (industrial) and ITBench (SRE, FinOps, CISO), CUGA to accelerate AI Automation • 10 items • Updated 26 days ago • 14
Toward Efficient Agents: Memory, Tool learning, and Planning Paper • 2601.14192 • Published Jan 20 • 56
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering Paper • 2507.11527 • Published Jul 15, 2025 • 35
Article The Agent Era Is Here: A Comprehensive Survey of Large Language Model Agents Apr 8, 2025 • 3