Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs Paper • 2509.18058 • Published Sep 22, 2025 • 12
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility Paper • 2504.07086 • Published Apr 9, 2025 • 21
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders Paper • 2409.14507 • Published Sep 22, 2024 • 1
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models Paper • 2404.01318 • Published Mar 28, 2024
A Modern Look at the Relationship between Sharpness and Generalization Paper • 2302.07011 • Published Feb 14, 2023