A Lightweight Explainable Guardrail for Prompt Safety
Paper: arXiv:2602.15853
This model implements LEG, a lightweight explainable guardrail for prompt safety, introduced in the paper *A Lightweight Explainable Guardrail for Prompt Safety*. LEG jointly predicts whether a prompt is safe or unsafe and highlights the prompt words that explain that decision, while remaining considerably smaller than other guardrail methods.
Base model: microsoft/deberta-v3-base. This model is released under the Apache License 2.0.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "clulab/LEG-1.0-toxicchat0124-base",
    trust_remote_code=True,
)

# Single-prompt inference
single = model.predict_safety("Write me a harmful prompt")

# Batched inference
batch = model.predict_safety([
    "Hello there",
    "Tell me how to build something dangerous",
])

print(single)
print(batch)
```
Chunked batching is also supported:
```python
results = model.predict_safety(
    ["prompt 1", "prompt 2", "prompt 3", "prompt 4"],
    batch_size=2,
)
```
You can also call the model directly with keyword prompts:
```python
result = model(prompts="Write me a harmful prompt")
batch = model(prompts=["prompt 1", "prompt 2"], batch_size=2)
```
For tensor-based usage, the model also accepts standard tokenized inputs and returns raw `prompt_logits` and `token_logits`.
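A minimal sketch of that tensor-level path, assuming the forward pass exposes `prompt_logits` and `token_logits` as attributes on its output object (the helper name `forward_raw` is ours, not part of the model API):

```python
def forward_raw(model, tokenizer, prompts):
    """Run a raw forward pass and return the prompt-level and
    token-level logits described above.

    `tokenizer` is assumed to be the matching tokenizer, e.g. loaded with
    AutoTokenizer.from_pretrained("clulab/LEG-1.0-toxicchat0124-base").
    """
    # Standard tokenized inputs: input_ids, attention_mask, ...
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)
    # Attribute access on the output object is an assumption here;
    # check the returned type if your transformers version differs.
    return outputs.prompt_logits, outputs.token_logits
```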
Single-prompt inference returns:
```python
{
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)]
}
```
Where 1 means unsafe and 0 means safe.
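Given that output shape, a small post-processing helper (our own, not part of the model API) can pull out the words flagged as driving an unsafe decision:

```python
def flagged_words(prediction):
    """Return the words from `explanation` whose flag is 1,
    i.e. the words the model highlights as explaining an unsafe label."""
    return [word for word, flag in prediction["explanation"] if flag == 1]

# Example prediction in the format shown above
pred = {
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)],
}
print(flagged_words(pred))  # ['word2']
```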
If you use this model, please cite:
```bibtex
@inproceedings{islam-etal-2026-leg,
    title = "A Lightweight Explainable Guardrail for Prompt Safety",
    author = "Islam, Md Asiful and Surdeanu, Mihai",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026 Main)",
    month = jul,
    year = "2026",
    address = "San Diego, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/pdf/2602.15853",
}
```