stefanwiest's Collections
Agent Safety & Alignment

updated 12 days ago

AI agent security: prompt injection defense, jailbreak detection, guardrails, constitutional AI, zero trust architecture.
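As a minimal illustration of the "prompt injection defense" theme, the sketch below shows a heuristic pre-filter that flags common injection phrasings before text reaches an agent. The patterns and function names are my own assumptions for illustration, not from any model or library in this collection; in practice such heuristics are layered under model-based detection (e.g. a guard model).

```python
import re

# Illustrative heuristic only (patterns are assumptions, not an established
# ruleset): flag common prompt-injection phrasings in untrusted input.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) instructions",
    r"disregard (your|the) (system prompt|instructions)",
    r"you are now (in )?developer mode",
    r"reveal (your|the) (system prompt|hidden instructions)",
]

def looks_like_injection(text: str) -> bool:
    """Return True if the text matches a known injection phrasing."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

A filter like this catches only verbatim phrasings; the jailbreak paper below documents why attackers routinely evade such surface checks, which motivates model-based guardrails.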

  • meta-llama/Llama-Guard-3-8B

    Text Generation • 8B • Updated Oct 11, 2024 • 32.1k • 260

    Note: Content moderation model


  • Jailbroken: How Does LLM Safety Training Fail?

    Paper • 2307.02483 • Published Jul 5, 2023 • 14

  • Constitutional AI: Harmlessness from AI Feedback

    Paper • 2212.08073 • Published Dec 15, 2022 • 4

  • Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Paper • 2312.06674 • Published Dec 7, 2023 • 8

  • Language Models (Mostly) Know What They Know

    Paper • 2207.05221 • Published Jul 11, 2022 • 1

  • DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning

    Paper • 2511.22570 • Published Nov 27, 2025 • 87