Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking Paper • 2601.02669 • Published 26 days ago • 3
DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs Paper • 2601.03559 • Published 25 days ago • 13
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use Paper • 2504.07981 • Published Apr 4, 2025 • 3
AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness Paper • 2507.01702 • Published Jul 2, 2025 • 4
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models Paper • 2502.17924 • Published Feb 25, 2025
AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries? Paper • 2508.09631 • Published Aug 13, 2025
EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty Paper • 2510.00732 • Published Oct 1, 2025 • 6
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers Paper • 2508.14704 • Published Aug 20, 2025 • 43
Mercury: Ultra-Fast Language Models Based on Diffusion Paper • 2506.17298 • Published Jun 17, 2025 • 7
ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges Paper • 2411.18932 • Published Nov 28, 2024 • 1