SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge Paper • 2505.21605 • Published May 27, 2025
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data Paper • 2510.09781 • Published Oct 10, 2025 • 27
PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory Paper • 2512.06688 • Published Dec 7, 2025 • 2
Emergent Social Intelligence Risks in Generative Multi-Agent Systems Paper • 2603.27771 • Published Mar 29 • 52
Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty? Paper • 2605.12684 • Published May 12 • 11
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? Paper • 2606.05080 • Published 22 days ago • 30
Steering Multimodal Large Language Models Decoding for Context-Aware Safety Paper • 2509.19212 • Published Sep 23, 2025
BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? Paper • 2510.18003 • Published Oct 20, 2025
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments Paper • 2510.01179 • Published Oct 1, 2025 • 29