FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Paper • 2509.17177 • Published Sep 21, 2025