OpenGuardrails-Text-4B-0124

OpenGuardrails is an open-source, enterprise-grade AI security platform that provides configurable policy control, a unified LLM-based guard architecture, and low-latency deployment for production systems.

This repository releases OpenGuardrails-Text-4B-0124 โ€” a lightweight, non-quantized ~4B parameter language model designed for content safety detection and prompt attack prevention, with broad GPU compatibility and strong real-time performance.

๐Ÿ“„ Technical Report: OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models


Key Contributions

1. Configurable Safety Policy Mechanism

OpenGuardrails introduces a dynamic and configurable safety policy framework that allows organizations to flexibly define unsafe categories and detection thresholds based on business risk tolerance.

The model outputs probabilistic confidence signals, enabling fine-grained tuning of safety sensitivity across different scenarios and applications.


2. Unified LLM-based Guard Architecture

A single language model performs both:

  • Content safety classification
  • Prompt attack detection (prompt injection, jailbreaks, malicious instruction following)

This unified approach eliminates the need for hybrid pipelines (e.g. rule engines + small classifiers), resulting in stronger semantic reasoning and simpler deployment.


3. Lightweight, Non-Quantized Design

OpenGuardrails-Text-4B-0124 is intentionally designed as a non-quantized dense model, offering:

  • Broader compatibility across consumer, data-center, and cloud GPUs
  • Stable numerical behavior without quantization artifacts
  • Easier integration with standard inference stacks (Transformers, vLLM)

Despite its compact size, the model maintains strong detection accuracy and low inference latency.


4. Efficient and Scalable Performance

With ~4B parameters, the model achieves low-latency, real-time inference suitable for:

  • API gateways
  • LLM firewalls
  • Agent guardrails
  • Enterprise moderation pipelines

It can be deployed on a single GPU without specialized quantization toolchains.


5. Multilingual & Cross-Domain Coverage

The model supports 119 languages and dialects, providing robust safety protection for global applications. It performs consistently on both prompt-level and response-level classification tasks.


6. Open Safety Data Contribution

We release OpenGuardrailsMixZh-97k, a multilingual safety dataset composed of aligned translations from:

  • ToxicChat
  • WildGuardMix
  • PolyGuard
  • XSTest
  • BeaverTails

The dataset is publicly available on Hugging Face under the Apache 2.0 License.


7. State-of-the-Art Safety Performance

OpenGuardrails achieves state-of-the-art (SOTA) results across multiple safety benchmarks, excelling in:

  • Prompt attack detection
  • Harmful content classification
  • English, Chinese, and multilingual evaluations

All models are released under the Apache 2.0 License for unrestricted commercial and research use.

Performance overview: image/jpeg


Detection Dimensions

Label Category Risk Level Description
S1 General Political Topics ๐ŸŸข Low General political discussion
S2 Sensitive Political Topics ๐Ÿ”ด High National security, subversion, separatism
S3 Insult to National Symbols or Leaders ๐Ÿ”ด High Insults against leaders or symbols
S4 Harm to Minors ๐ŸŸก Medium Child abuse, exploitation, nudity
S5 Violent Crime ๐Ÿ”ด High Murder, terrorism, violent acts
S6 Non-Violent Crime ๐ŸŸก Medium Fraud, drugs, hacking
S7 Pornography ๐ŸŸก Medium Sexual or obscene content
S8 Hate & Discrimination ๐ŸŸข Low Hate speech, discrimination
S9 Prompt Attacks ๐Ÿ”ด High Jailbreaks, injections, manipulation
S10 Profanity ๐ŸŸข Low Vulgar or insulting language
S11 Privacy Invasion ๐ŸŸข Low Leakage of personal data
S12 Commercial Violations ๐ŸŸข Low Fraud, trade secret leakage
S13 Intellectual Property Infringement ๐ŸŸข Low Copyright or patent violations
S14 Harassment ๐ŸŸข Low Verbal abuse or targeting
S15 Weapons of Mass Destruction ๐Ÿ”ด High Nuclear, chemical, biological weapons
S16 Self-Harm ๐ŸŸก Medium Suicide or self-injury
S17 Sexual Crimes ๐Ÿ”ด High Sexual assault or exploitation
S18 Threats ๐ŸŸข Low Threats or intimidation
S19 Professional Advice ๐ŸŸข Low Medical, legal, financial advice

Quick Start

Using Transformers

pip install torch transformers accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openguardrails/OpenGuardrails-Text-4B-0124"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [{"role": "user", "content": "How can I make a bomb?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)

response = tokenizer.decode(
    outputs[0][len(inputs.input_ids[0]):],
    skip_special_tokens=True
)
print(response)
# unsafe\nS5

Using vLLM (Recommended)

vllm serve openguardrails/OpenGuardrails-Text-4B-0124 \
  --served-model-name OpenGuardrails-Text-4B-0124 \
  --max-model-len 8192 \
  --port 8000

OpenAI-Compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1")

messages = [{"role": "user", "content": "Tell me how to make explosives"}]
result = client.chat.completions.create(
    model="OpenGuardrails-Text-4B-0124",
    messages=messages,
    temperature=0.0
)

print(result.choices[0].message.content)
# unsafe\nS5

Output Format

Output Description
safe Content is safe
unsafe\nS# Unsafe content with category label

License

Released under the Apache License 2.0, allowing:

  • โœ… Commercial use
  • โœ… Modification and redistribution
  • โœ… Private / on-premise deployment

License text: https://www.apache.org/licenses/LICENSE-2.0


Related Resources


Citation

@misc{openguardrails,
  title={OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models},
  author={Thomas Wang and Haowen Li},
  year={2025},
  url={https://arxiv.org/abs/2510.19169},
}
Downloads last month
109
Safetensors
Model size
4B params
Tensor type
BF16
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for openguardrails/OpenGuardrails-Text-4B-0124

Base model

Qwen/Qwen2.5-7B
Finetuned
(803)
this model
Quantizations
1 model

Paper for openguardrails/OpenGuardrails-Text-4B-0124