AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security Paper • 2601.18491 • Published 3 days ago • 62
AgentDoG Collection A Diagnostic Guardrail Framework for AI Agent Safety and Security • 11 items • Updated about 15 hours ago • 80
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs Paper • 2507.11097 • Published Jul 15, 2025 • 64
LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions Paper • 2510.08211 • Published Oct 9, 2025 • 22
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents Paper • 2509.26354 • Published Sep 30, 2025 • 18
Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning Paper • 2506.02867 • Published Jun 3, 2025 • 2
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback Paper • 2501.12895 • Published Jan 22, 2025 • 61