ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces Paper • 2604.05172 • Published 4 days ago • 20
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks Paper • 2602.12670 • Published Feb 13 • 59
UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing Paper • 2402.03396 • Published Feb 4, 2024
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks Paper • 2602.12670 • Published Feb 13 • 59