stefanwiest's Collections
Agent Safety & Alignment

updated 12 days ago

AI agent security: prompt injection defense, jailbreak detection, guardrails, constitutional AI, zero trust architecture.
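As a minimal illustration of the "prompt injection defense" theme, the sketch below shows a heuristic pre-filter that flags common injection phrasings before text reaches an agent. The patterns and function names are my own assumptions for illustration, not from any model or library in this collection; in practice such heuristics are layered under model-based detection (e.g. a guard model).

```python
import re

# Illustrative heuristic only (patterns are assumptions, not an established
# ruleset): flag common prompt-injection phrasings in untrusted input.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"disregard (your|the) (system prompt|instructions)",
    r"you are now (in )?developer mode",
    r"reveal (your|the) (system prompt|hidden instructions)",
]

def looks_like_injection(text: str) -> bool:
    """Return True if the text matches a known injection phrasing."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

A filter like this catches only verbatim phrasings; the jailbreak paper below documents why attackers routinely evade such surface checks, which motivates model-based guardrails.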

  • meta-llama/Llama-Guard-3-8B

    Text Generation • 8B • Updated Oct 11, 2024 • 32.1k • 260

    Note: Content moderation model


  • Jailbroken: How Does LLM Safety Training Fail?

    Paper • 2307.02483 • Published Jul 5, 2023 • 14

  • Constitutional AI: Harmlessness from AI Feedback

    Paper • 2212.08073 • Published Dec 15, 2022 • 4

  • Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Paper • 2312.06674 • Published Dec 7, 2023 • 8

  • Language Models (Mostly) Know What They Know

    Paper • 2207.05221 • Published Jul 11, 2022 • 1

  • DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

    Paper • 2511.22570 • Published Nov 27, 2025 • 87