When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation • Paper • 2510.07238 • Published Oct 8, 2025 • 15
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses • Paper • 2510.00232 • Published Sep 30, 2025 • 16
Representation & Optimization • Collection • Understanding about representation sheds light on optimization • 120 items • Updated about 5 hours ago • 7
Who's Your Judge? On the Detectability of LLM-Generated Judgments • Paper • 2509.25154 • Published Sep 29, 2025 • 30
Mem-α: Learning Memory Construction via Reinforcement Learning • Paper • 2509.25911 • Published Sep 30, 2025 • 15
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning • Paper • 2509.22576 • Published Sep 26, 2025 • 135
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning • Paper • 2509.04744 • Published Sep 5, 2025 • 12
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities • Paper • 2507.06261 • Published Jul 7, 2025 • 66
MIRIX: Multi-Agent Memory System for LLM-Based Agents • Paper • 2507.07957 • Published Jul 10, 2025 • 80
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions • Paper • 2507.05257 • Published Jul 7, 2025 • 14