Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations Paper • 2511.05613 • Published Nov 6, 2025
Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma Paper • 2511.19504 • Published Nov 23, 2025 • 2
Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity Paper • 2512.00552 • Published Nov 29, 2025
When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning Paper • 2603.03475 • Published 10 days ago
I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift Paper • 2603.01297 • Published 12 days ago
Dial E for Ethical Enforcement: institutional VETO power as a governance primitive Paper • 2603.00617 • Published 13 days ago
SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement Paper • 2603.06333 • Published 7 days ago • 1
The Reasoning Trap -- Logical Reasoning as a Mechanistic Pathway to Situational Awareness Paper • 2603.09200 • Published 3 days ago • 5
[lecture artifacts] Aligning Open Language Models Collection Artifacts referenced in the talk timeline. Slides: https://docs.google.com/presentation/d/1quMyI4BAx4rvcDfk8jjv063bmHg4RxZd9mhQloXpMn0/edit?usp=sharin • 63 items • Updated Apr 17, 2024 • 58