SPHINX: A Synthetic Environment for Visual Perception and Reasoning Paper • 2511.20814 • Published Nov 25, 2025 • 2
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning Paper • 2510.27044 • Published Oct 30, 2025 • 6
AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence Paper • 2511.01144 • Published Nov 3, 2025 • 4
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence Paper • 2406.07599 • Published Jun 11, 2024 • 1