Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why Paper • 2606.19602 • Published 4 days ago • 4
CLUE: A Clinical Language Understanding Evaluation for LLMs Paper • 2404.04067 • Published Apr 5, 2024
WideSearch: Benchmarking Agentic Broad Info-Seeking Paper • 2508.07999 • Published Aug 11, 2025 • 113
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts Paper • 2508.09848 • Published Aug 13, 2025 • 71
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds Paper • 2508.12782 • Published Aug 18, 2025 • 25
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers Paper • 2508.14704 • Published Aug 20, 2025 • 43
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction Paper • 2508.11987 • Published Aug 16, 2025 • 73
ibm-granite/granite-speech-3.2-8b Automatic Speech Recognition • 8B • Updated Apr 16, 2025 • 2.19k • 88