When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation Paper • 2510.07238 • Published Oct 8 • 14
BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses Paper • 2510.00232 • Published Sep 30 • 15
Representation & Optimization Collection Understanding about representation sheds light on optimization • 98 items • Updated Nov 3 • 5
Who's Your Judge? On the Detectability of LLM-Generated Judgments Paper • 2509.25154 • Published Sep 29 • 29
Mem-α: Learning Memory Construction via Reinforcement Learning Paper • 2509.25911 • Published Sep 30 • 14
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning Paper • 2509.22576 • Published Sep 26 • 134
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning Paper • 2509.04744 • Published Sep 5 • 11
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Paper • 2507.06261 • Published Jul 7 • 64
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions Paper • 2507.05257 • Published Jul 7 • 14