Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections Paper • 2603.12180 • Published 2 days ago • 42
Measuring what Matters: Construct Validity in Large Language Model Benchmarks Paper • 2511.04703 • Published Nov 3, 2025 • 8
Clinical knowledge in LLMs does not translate to human interactions Paper • 2504.18919 • Published Apr 26, 2025 • 26
LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic Templatisation and Orthographic Obfuscation Paper • 2503.02972 • Published Mar 4, 2025 • 25