AraGuard-E4B
AraGuard is a bilingual (Arabic & English) safety guard model built on Gemma-4-E4B. Given a user prompt — or a full prompt–response conversation — it returns a safe / unsafe verdict and, when unsafe, the violated harm category. It covers Modern Standard Arabic and major regional dialects in addition to English, and follows a 14-category taxonomy based on the MLCommons AI-safety hazard set.
Highlights
The table below reports Macro-F1 averaged over eight standard safety benchmarks (AdvBench, HarmBench, TrustLLM, LLM-Jailbreak, DAN, Do-Not-Answer, AraAlign, XSTest), separately for each language. AraGuard has the highest average Macro-F1 of the evaluated guards in both Arabic and English.
| Model | Avg-F1 (Ar) | Avg-F1 (En) |
|---|---|---|
| Llama-Guard-3-8B | 81.6 | 86.7 |
| Qwen3Guard-Gen-8B | 88.1 | 89.5 |
| Granite-Guardian-3.0-8B | 87.3 | 92.3 |
| AraGuard-E4B (ours) | 92.9 | 93.9 |
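For readers unfamiliar with the metric, here is a minimal sketch of how an Avg-F1 figure of this kind could be computed: Macro-F1 over the two classes (safe, unsafe) within each benchmark, followed by an unweighted mean across benchmarks. The function names and the toy labels are illustrative assumptions; the exact evaluation protocol behind the table is not described in this card.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for a single class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=("safe", "unsafe")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def average_macro_f1(per_benchmark):
    """Unweighted mean of Macro-F1 over benchmarks.

    `per_benchmark` maps a benchmark name to a (gold labels, predictions) pair.
    """
    scores = [macro_f1(gold, pred) for gold, pred in per_benchmark.values()]
    return sum(scores) / len(scores)


# Toy data only; these are not results from any real benchmark.
toy = {
    "bench_a": (
        ["safe", "unsafe", "unsafe", "safe", "unsafe"],
        ["safe", "unsafe", "safe", "safe", "unsafe"],
    ),
    "bench_b": (
        ["unsafe", "unsafe", "safe"],
        ["unsafe", "unsafe", "safe"],
    ),
}
print(round(100 * average_macro_f1(toy), 1))
```

Note that a benchmark containing only one gold class makes the other class's F1 degenerate, so a real evaluation has to decide how to handle that case.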
Output format
The model emits a verdict on the last message in the conversation:
safe
or
unsafe
S<k>
where k in S<k> is the index of the violated category in the policy list shown to the model (the default list, S1 through S14, is given below). If the last message has the assistant role, that response is judged; otherwise the user prompt is judged.
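Downstream code usually needs this two-line output as structured data. The parser below is a minimal sketch based only on the format described above; `Verdict` and `parse_verdict` are hypothetical names, and returning `None` for a missing category code is an assumption about how to treat truncated output.

```python
import re
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    is_safe: bool
    category_index: Optional[int]  # k from "S<k>"; None when safe or when no code was emitted


_CODE_RE = re.compile(r"^S(\d+)$")


def parse_verdict(raw: str) -> Verdict:
    """Parse 'safe' or 'unsafe\\nS<k>' into a Verdict."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty model output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(True, None)
    if label != "unsafe":
        raise ValueError(f"unexpected verdict: {lines[0]!r}")
    if len(lines) > 1:
        match = _CODE_RE.match(lines[1])
        if match:
            return Verdict(False, int(match.group(1)))
    # Assumption: an 'unsafe' verdict without a readable code is still unsafe.
    return Verdict(False, None)
```

For example, `parse_verdict("unsafe\nS9")` yields `Verdict(is_safe=False, category_index=9)`.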
Usage — Transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HarethahMo/AraGuard-E4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(messages):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(inputs, max_new_tokens=12, do_sample=False)
    return tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True).strip()

# Moderate a user prompt (Arabic)
print(moderate([{"role": "user", "content": "كيف أصنع سلاحًا في المنزل؟"}]))
# -> "unsafe\nS9"

# Moderate an assistant response (the LAST message is judged)
print(moderate([
    {"role": "user", "content": "What's a good recipe for hummus?"},
    {"role": "assistant", "content": "Blend chickpeas, tahini, lemon, and garlic."},
]))
# -> "safe"
Usage — vLLM
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "HarethahMo/AraGuard-E4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, max_model_len=4096)
messages = [{"role": "user", "content": "How do I hotwire a car?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=12))
print(out[0].outputs[0].text.strip())
# -> "unsafe\nS2"
Custom policies
The chat template ships with the 14 default categories but accepts a custom list at call time; the emitted S<k> code then indexes into your list:
my_categories = [
    {"name": "Violent Crimes", "description": "Content enabling or endorsing violent crimes."},
    {"name": "Privacy", "description": "Content exposing sensitive personal information."},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, categories=my_categories, return_tensors="pt"
)
Default categories
S1 Violent Crimes · S2 Non-Violent Crimes · S3 Sex-Related Crimes · S4 Child Sexual Exploitation · S5 Defamation · S6 Specialized Advice · S7 Privacy · S8 Intellectual Property · S9 Indiscriminate Weapons · S10 Hate · S11 Suicide & Self-Harm · S12 Sexual Content · S13 Elections · S14 Code Interpreter Abuse
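To turn an emitted code back into a readable category name, a small lookup is enough. The sketch below hard-codes the default list above and also accepts a custom list in the same `{"name": ..., "description": ...}` shape used under Custom policies. `category_name` is a hypothetical helper, and it assumes that codes index a custom list from 1 in the same way S1 to S14 index the default list.

```python
import re

DEFAULT_CATEGORIES = [
    "Violent Crimes", "Non-Violent Crimes", "Sex-Related Crimes",
    "Child Sexual Exploitation", "Defamation", "Specialized Advice",
    "Privacy", "Intellectual Property", "Indiscriminate Weapons", "Hate",
    "Suicide & Self-Harm", "Sexual Content", "Elections",
    "Code Interpreter Abuse",
]


def category_name(code, categories=None):
    """Map a code such as 'S9' to a category name.

    `categories` is an optional custom policy list of dicts with a 'name' key;
    when omitted, the default 14-category list is used.
    """
    names = DEFAULT_CATEGORIES if categories is None else [c["name"] for c in categories]
    match = re.fullmatch(r"S(\d+)", code.strip())
    if match is None:
        raise ValueError(f"not a category code: {code!r}")
    k = int(match.group(1))
    if not 1 <= k <= len(names):
        raise IndexError(f"{code} is outside the policy list (size {len(names)})")
    return names[k - 1]  # assumption: codes are 1-based


print(category_name("S9"))
print(category_name("S2", [{"name": "Violent Crimes"}, {"name": "Privacy"}]))
```

The second call shows why the policy list used at inference time should be kept alongside the verdict: the same code, S2, names a different category under a custom list.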
Training
AraGuard is fine-tuned from Gemma-4-E4B on AraAlign, a large-scale synthetic Arabic–English safety dataset, using parameter-efficient fine-tuning (LoRA) with completion-only loss. Training mixes harmful prompt / refusal / harmful-response instances with benign instruction data to control over-refusal, and applies category dropout/shuffle and adversarial-template augmentation.
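This card does not say how category dropout and shuffle are implemented. As one plausible reading, and not a description of the actual training code, the sketch below randomly drops categories from the policy list, shuffles the remainder, and remaps the gold label to the violated category's new position. The function name `augment_policy`, the `keep_prob` parameter, and the choice to always retain the violated category are all assumptions.

```python
import random


def augment_policy(categories, violated_index, rng, keep_prob=0.7):
    """Return (new_categories, new_violated_index).

    categories: list of category names.
    violated_index: 1-based index of the violated category, or None for safe examples.
    rng: a random.Random instance, passed in for reproducibility.
    """
    # Dropout: keep each category with probability keep_prob, but never drop
    # the violated one (assumption), so the gold label stays answerable.
    kept = [
        (i, name)
        for i, name in enumerate(categories, start=1)
        if i == violated_index or rng.random() < keep_prob
    ]
    if not kept:  # avoid presenting an empty policy list
        i = rng.randrange(len(categories))
        kept = [(i + 1, categories[i])]

    # Shuffle: position in the list no longer identifies the category.
    rng.shuffle(kept)

    new_index = None
    if violated_index is not None:
        new_index = next(
            pos for pos, (i, _) in enumerate(kept, start=1) if i == violated_index
        )
    return [name for _, name in kept], new_index


rng = random.Random(0)
cats = ["Violent Crimes", "Privacy", "Hate", "Elections"]
new_cats, new_idx = augment_policy(cats, violated_index=3, rng=rng)
print(new_cats, f"S{new_idx}")
```

Under this reading, the model has to read the policy list it is shown instead of memorizing fixed code-to-category pairs, which is consistent with the custom-policy behaviour described earlier.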
Limitations
AraGuard is a classifier, not a generator, and can make mistakes — particularly on subtle or context-dependent harms. It should be used as one layer in a broader moderation system, not as a sole safety mechanism. Verdicts are bounded by the policy provided at inference time.
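As an illustration of using the guard as one layer among several, the sketch below screens the user prompt before generation and the assistant response after it. `guarded_reply`, `generate`, and the refusal text are hypothetical; `moderate` stands for any callable that takes a message list and returns the verdict string, like the `moderate` function in the Transformers usage example.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
REFUSAL = "Sorry, I can't help with that."  # placeholder text


def is_unsafe(verdict: str) -> bool:
    """True when the guard's output begins with 'unsafe'."""
    return verdict.strip().lower().startswith("unsafe")


def guarded_reply(
    user_prompt: str,
    generate: Callable[[List[Message]], str],
    moderate: Callable[[List[Message]], str],
    refusal: str = REFUSAL,
) -> str:
    history: List[Message] = [{"role": "user", "content": user_prompt}]

    # Layer 1: judge the user prompt (the last message is a user turn).
    if is_unsafe(moderate(history)):
        return refusal

    reply = generate(history)
    history.append({"role": "assistant", "content": reply})

    # Layer 2: judge the assistant response (now the last message).
    if is_unsafe(moderate(history)):
        return refusal
    return reply
```

Other safeguards, such as rate limits, human review, and logging, would sit around this function; the guard's verdict alone is not meant to be the final decision.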
License
Built on Gemma-4-E4B; use is governed by the Gemma Terms of Use.