Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games Paper • 2606.19338 • Published 12 days ago • 48
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context Paper • 2605.13831 • Published May 13 • 89 • 3
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation Paper • 2606.12594 • Published 19 days ago • 17
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks Paper • 2605.31433 • Published about 1 month ago • 28
STALE: Can LLM Agents Know When Their Memories Are No Longer Valid? Paper • 2605.06527 • Published May 7 • 47
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation Paper • 2605.10912 • Published May 11 • 46
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models Paper • 2605.14906 • Published May 14 • 79
MMProLong Collection A 7B LVLM with 128K context window and 512K generalization through long-context continued pre-training • 1 item • Updated May 15
CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration? Paper • 2510.24505 • Published Oct 28, 2025 • 5
Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions Paper • 2505.00675 • Published May 1, 2025 • 3
AbsPyramid: Benchmarking the Abstraction Ability of Language Models with a Unified Entailment Graph Paper • 2311.09174 • Published Nov 15, 2023
AbsInstruct: Eliciting Abstraction Ability from LLMs through Explanation Tuning with Plausibility Estimation Paper • 2402.10646 • Published Feb 16, 2024
Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers Paper • 2509.03059 • Published Sep 3, 2025 • 25
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents Paper • 2510.07172 • Published Oct 8, 2025 • 28