Article Exploring Environments Hub: Your Language Model needs better (open) environments to learn anakin87 • Sep 4, 2025 • 31
Article Context Engineering & Reuse Pattern Under the Hood of Claude Code kobe0938 • Dec 22, 2025 • 7
Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills Paper • 2604.05333 • Published Apr 7 • 23
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces Paper • 2604.05172 • Published Apr 6 • 24
RubricBench: Aligning Model-Generated Rubrics with Human Standards Paper • 2603.01562 • Published Mar 2 • 64
MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents Paper • 2603.09827 • Published Mar 10 • 30
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published Jan 17 • 37
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? Paper • 2510.02209 • Published Oct 2, 2025 • 57
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks Paper • 2602.12670 • Published Feb 13 • 62
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong Paper • 2501.09775 • Published Jan 16, 2025 • 32
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs Paper • 2503.02003 • Published Mar 3, 2025 • 47