---
library_name: transformers
license: cc-by-nc-4.0
language:
- en
datasets:
- GeneralAnalysis/GA_Guardrail_Benchmark
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- Moderation
- Safety
- Filter
---

# GA Guard Family

Website · GA Blog · GA Bench · API Access


Introducing the GA Guard series, a family of open-weight moderation models built to help developers and organizations keep language models safe, compliant, and aligned with real-world use.

**GA Guard** is designed to detect violations across the following seven categories:

- **Illicit Activities** – instructions or content related to crimes, weapons, or illegal substances.
- **Hate & Abuse** – harassment, slurs, dehumanization, or abusive language.
- **PII & IP** – exposure or solicitation of sensitive personal information, secrets, or intellectual property.
- **Prompt Security** – jailbreaks, prompt injection, secret exfiltration, or obfuscation attempts.
- **Sexual Content** – sexually explicit or adult material.
- **Misinformation** – demonstrably false or deceptive claims presented as fact.
- **Violence & Self-Harm** – content that encourages violence, self-harm, or suicide.

The model outputs a **structured token** for each category.

> [!NOTE]
> **Important:** This model outputs **special tokens**. Do **not** use `pipeline("text-generation")`, since it strips them by default. Always decode with `skip_special_tokens=False` to preserve the outputs.

## Model Details

GA Guard Core features:

- Type: Causal Language Model
- Training: Full finetune
- Number of Parameters: 4.0B
- Number of Non-Embedding Parameters: 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 262,144 tokens

## Inference Examples

### Transformers Library

```python
# Load the model directly.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GeneralAnalysis/GA_Guard_Core")
model = AutoModelForCausalLM.from_pretrained("GeneralAnalysis/GA_Guard_Core")

messages = [
    {"role": "user", "content": "Who are you?"},
]

# The chat template automatically adds the guardrail system prompt
# and prefixes user messages with "text:".
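# --- Optional post-processing sketch. Assumption: the verdict special tokens
# contain the category names as substrings; the exact token strings are not
# shown in this card, so inspect tokenizer.get_added_vocab() for the real
# formats before relying on this. ---
CATEGORIES = [
    "Illicit Activities", "Hate & Abuse", "PII & IP", "Prompt Security",
    "Sexual Content", "Misinformation", "Violence & Self-Harm",
]

def parse_verdict(decoded: str) -> dict:
    # Map the decoded guard output to per-category flags plus an overall verdict.
    flags = {cat: cat in decoded for cat in CATEGORIES}
    return {"safe": not any(flags.values()), "flags": flags}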
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)

# Decode only the newly generated tokens; skip_special_tokens=False
# keeps the verdict tokens in the output.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))

# Sample output:
# ...
```

## Benchmarks

We evaluated the GA Guard models on public moderation suites (OpenAI Moderation, WildGuard Benchmark, and HarmBench), our adversarial GA Jailbreak Bench, and the new GA Long-Context Bench. Across all three evaluations, our models consistently outperform major cloud guardrails and even surpass GPT-5 (when prompted to act as a guardrail).



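The tables that follow report accuracy, F1, and false-positive rate (FPR) over binary safe/unsafe verdicts. For reference, a short sketch of the standard definitions (this is not our evaluation harness, just the textbook formulas):

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, F1, and FPR from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"accuracy": accuracy, "f1": f1, "fpr": fpr}

# A guard that flags 90 of 100 unsafe prompts and wrongly flags 10 of 100 safe ones:
m = binary_metrics(tp=90, fp=10, tn=90, fn=10)
# -> accuracy 0.90, f1 0.90, fpr 0.10
```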
### Public Benchmarks

On public moderation suites, GA Guard Thinking reports 0.906 average F1, GA Guard 0.899, and GA Guard Lite 0.875, all higher than GPT-5 (0.864) and GPT-5-mini (0.852), with cloud guardrails in the 0.62–0.74 range.

| Guard | OpenAI Moderation (Acc/F1/FPR) | WildGuard (Acc/F1/FPR) | HarmBench Behaviors (Acc/F1/FPR) | Avg Time (s) |
|---|---|---|---|---|
| GA Guard | 0.916 / 0.873 / 0.111 | 0.856 / 0.844 / 0.172 | 0.963 / 0.981 / N/A | 0.029 |
| GA Guard Thinking | 0.917 / 0.876 / 0.112 | 0.862 / 0.858 / 0.134 | 0.967 / 0.983 / N/A | 0.650 |
| GA Guard Lite | 0.896 / 0.844 / 0.109 | 0.835 / 0.819 / 0.176 | 0.929 / 0.963 / N/A | 0.016 |
| AWS Bedrock Guardrail | 0.818 / 0.754 / 0.216 | 0.642 / 0.649 / 0.449 | 0.662 / 0.797 / N/A | 0.375 |
| Azure AI Content Safety | 0.879 / 0.807 / 0.091 | 0.667 / 0.463 / 0.071 | 0.438 / 0.609 / N/A | 0.389 |
| Vertex AI Model Armor | 0.779 / 0.690 / 0.225 | 0.711 / 0.590 / 0.105 | 0.896 / 0.945 / N/A | 0.873 |
| GPT-5 | 0.838 / 0.775 / 0.188 | 0.849 / 0.830 / 0.145 | 0.975 / 0.987 / N/A | 11.275 |
| GPT-5-mini | 0.794 / 0.731 / 0.255 | 0.855 / 0.839 / 0.151 | 0.975 / 0.987 / N/A | 5.604 |
| Llama Guard 4 12B | 0.826 / 0.737 / 0.156 | 0.799 / 0.734 / 0.071 | 0.925 / 0.961 / N/A | 0.459 |
| Llama Prompt Guard 2 86M | 0.686 / 0.015 / 0.009 | 0.617 / 0.412 / 0.143 | 0.200 / 0.333 / N/A | 0.114 |
| Nvidia Llama 3.1 Nemoguard 8B | 0.852 / 0.793 / 0.174 | 0.849 / 0.818 / 0.096 | 0.875 / 0.875 / N/A | 0.358 |
| VirtueGuard Text Lite | 0.507 / 0.548 / 0.699 | 0.656 / 0.682 / 0.491 | 0.875 / 0.933 / N/A | 0.651 |
| Lakera Guard | 0.752 / 0.697 / 0.323 | 0.630 / 0.662 / 0.527 | 0.946 / 0.972 / N/A | 0.377 |
| Protect AI (prompt-injection-v2) | 0.670 / 0.014 / 0.032 | 0.559 / 0.382 / 0.248 | N/A | 0.115 |

### [GA Long-Context Bench](https://huggingface.co/datasets/GeneralAnalysis/GA_Long_context_Jailbreak_Benchmark)

On GA
Long-Context Bench (up to 256k tokens), GA Guard Thinking scores 0.893 F1, GA Guard 0.891, and GA Guard Lite 0.885. Cloud baselines collapse: Vertex AI Model Armor drops to 0.560 F1, AWS Bedrock misclassifies nearly all inputs with a 1.000 false-positive rate, and Azure records just 0.046 F1.

| Guard | Accuracy | F1 Score | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinformation | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.887 | 0.891 | 0.147 | 0.983 | 0.972 | 0.966 | 0.976 | 0.875 | 0.966 | 0.988 |
| GA Guard Thinking | 0.889 | 0.893 | 0.151 | 0.967 | 0.951 | 0.940 | 0.961 | 0.828 | 0.920 | 0.962 |
| GA Guard Lite | 0.881 | 0.885 | 0.148 | 0.979 | 0.969 | 0.972 | 0.976 | 0.846 | 0.973 | 0.985 |
| AWS Bedrock Guardrail | 0.532 | 0.695 | 1.000 | 0.149 | 0.211 | 0.131 | 0.367 | 0.175 | 0.092 | 0.157 |
| Azure AI Content Safety | 0.480 | 0.046 | 0.001 | 0.028 | 0.041 | 0.016 | 0.073 | 0.049 | 0.000 | 0.081 |
| Vertex AI Model Armor | 0.635 | 0.560 | 0.138 | 0.187 | 0.312 | 0.109 | 0.473 | 0.194 | 0.085 | 0.241 |
| GPT-5 | 0.764 | 0.799 | 0.372 | 0.219 | 0.297 | 0.189 | 0.404 | 0.243 | 0.137 | 0.229 |
| GPT-5-mini | 0.697 | 0.772 | 0.607 | 0.184 | 0.253 | 0.157 | 0.412 | 0.215 | 0.112 | 0.190 |
| Llama Guard 4 12B | 0.569 | 0.602 | 0.516 | 0.164 | 0.228 | 0.132 | 0.334 | 0.188 | 0.097 | 0.195 |
| Llama Prompt Guard 2 86M | 0.505 | 0.314 | 0.162 | N/A | N/A | N/A | N/A | 0.093 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.601 | 0.360 | 0.021 | 0.243 | 0.288 | 0.097 | 0.192 | 0.116 | 0.305 | 0.321 |
| VirtueGuard Text Lite | 0.490 | 0.147 | 0.047 | 0.082 | 0.203 | 0.118 | 0.069 | 0.074 | 0.058 | 0.132 |
| Lakera Guard | 0.520 | 0.684 | 0.999 | 0.151 | 0.200 | 0.132 | 0.361 | 0.160 | 0.093 | 0.159 |
| Protect AI (prompt-injection-v2) | 0.496 | 0.102 | 0.001 | N/A | N/A | N/A | N/A | 0.032 | N/A | N/A |

### [GA Jailbreak Bench](https://huggingface.co/datasets/GeneralAnalysis/GA_Jailbreak_Benchmark)

On GA Jailbreak Bench, which measures resilience against adversarial attacks, GA Guard Thinking achieves 0.933 F1, GA Guard 0.930, and GA Guard Lite 0.898. GPT-5 reaches 0.893, while cloud guardrails fall significantly lower.

| Guard | Accuracy | F1 Score | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinformation | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.931 | 0.930 | 0.038 | 0.946 | 0.939 | 0.886 | 0.967 | 0.880 | 0.954 | 0.928 |
| GA Guard Thinking | 0.939 | 0.933 | 0.029 | 0.965 | 0.925 | 0.894 | 0.962 | 0.885 | 0.942 | 0.946 |
| GA Guard Lite | 0.902 | 0.898 | 0.065 | 0.908 | 0.900 | 0.856 | 0.936 | 0.850 | 0.934 | 0.904 |
| AWS Bedrock Guardrail | 0.606 | 0.607 | 0.396 | 0.741 | 0.456 | 0.535 | 0.576 | 0.649 | 0.721 | 0.518 |
| Azure AI Content Safety | 0.542 | 0.193 | 0.026 | 0.236 | 0.093 | 0.155 | 0.068 | 0.416 | 0.186 | 0.130 |
| Vertex AI Model Armor | 0.550 | 0.190 | 0.008 | 0.077 | 0.190 | 0.582 | 0.076 | 0.000 | 0.000 | 0.241 |
| GPT-5 | 0.900 | 0.893 | 0.035 | 0.928 | 0.942 | 0.856 | 0.799 | 0.819 | 0.953 | 0.939 |
| GPT-5-mini | 0.891 | 0.883 | 0.050 | 0.917 | 0.942 | 0.845 | 0.850 | 0.822 | 0.882 | 0.924 |
| Llama Guard 4 12B | 0.822 | 0.796 | 0.053 | 0.768 | 0.774 | 0.587 | 0.809 | 0.833 | 0.927 | 0.827 |
| Llama Prompt Guard 2 86M | 0.490 | 0.196 | 0.069 | N/A | N/A | N/A | N/A | 0.196 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.668 | 0.529 | 0.038 | 0.637 | 0.555 | 0.513 | 0.524 | 0.049 | 0.679 | 0.575 |
| VirtueGuard Text Lite | 0.513 | 0.664 | 0.933 | 0.659 | 0.689 | 0.657 | 0.646 | 0.659 | 0.675 | 0.662 |
| Lakera Guard | 0.525 | 0.648 | 0.825 | 0.678 | 0.645 | 0.709 | 0.643 | 0.631 | 0.663 | 0.548 |
| Protect AI (prompt-injection-v2) | 0.528 | 0.475 | 0.198 | N/A | N/A | N/A | N/A | 0.475 | N/A | N/A |

## Licensing

This model is a fine-tune of [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507), which is licensed under the **Apache License 2.0** by Alibaba Cloud. The upstream license text is included in this repository as `LICENSE.Apache`, and attribution is provided in the `NOTICE` file.

**GA Guard Core** in this repository is provided under the **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)** license for non-commercial use.

- Free for research, experimentation, and non-commercial internal use
- No commercial or production deployment without a separate commercial license

For **commercial / production use**, please contact **info@generalanalysis.com** to obtain a paid license and support agreement.

## Citation

```bibtex
@misc{generalanalysis2025gaguardcore,
  title        = {GA Guard Core},
  author       = {Rez Havaei and Rex Liu and General Analysis},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/GeneralAnalysis/GA_Guard_Core}},
  note         = {Open-weight moderation model for seven safety categories},
}
```