CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models Paper • 2408.01605 • Published Aug 2, 2024 • 2
CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models Paper • 2404.13161 • Published Apr 19, 2024
How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition Paper • 2603.15714 • Published Mar 16
LlamaFirewall: An open source guardrail system for building secure AI agents Paper • 2505.03574 • Published May 6, 2025
CocoaBench: Evaluating Unified Digital Agents in the Wild Paper • 2604.11201 • Published 26 days ago • 36
TextQuests: How Good are LLMs at Text-Based Video Games? Paper • 2507.23701 • Published Jul 31, 2025 • 3
Training Language Models to Generate Quality Code with Program Analysis Feedback Paper • 2505.22704 • Published May 28, 2025 • 14
Model with Circuit Breakers Collection SoTA models with circuit breakers inserted. Top safety performance without losing capabilities. • 3 items • Updated Oct 25, 2024 • 5
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal Paper • 2402.04249 • Published Feb 6, 2024 • 7
Representation Engineering: A Top-Down Approach to AI Transparency Paper • 2310.01405 • Published Oct 2, 2023 • 7