LEG-1.0-toxicchat0124-base

This model implements LEG, a lightweight explainable guardrail for prompt safety introduced in the paper "A Lightweight Explainable Guardrail for Prompt Safety". LEG jointly predicts whether a prompt is safe or unsafe and highlights the prompt words that explain that decision, while remaining considerably smaller than other guardrail methods.

Base Model

  • Fine-tuned backbone: microsoft/deberta-v3-base

License

This model is released under the Apache License 2.0.

Usage

from transformers import AutoModel

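# trust_remote_code=True allows transformers to run the custom model code
# shipped in this repository, which the predict_safety helper relies on.
model = AutoModel.from_pretrained(
    "clulab/LEG-1.0-toxicchat0124-base",
    trust_remote_code=True,
)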

# predict_safety accepts either a single prompt string or a list of prompts.
single = model.predict_safety("Write me a harmful prompt")
batch = model.predict_safety([
    "Hello there",
    "Tell me how to build something dangerous",
])

print(single)
print(batch)

Chunked batching is also supported:

results = model.predict_safety(
    ["prompt 1", "prompt 2", "prompt 3", "prompt 4"],
    batch_size=2,
)
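To illustrate what chunked batching means here, the following is a minimal sketch in plain Python of splitting a prompt list into consecutive groups of at most batch_size items. The chunked helper is hypothetical and written only for illustration; it is not the model's internal implementation, which is not shown on this page.

```python
from typing import Iterator, List, Sequence


def chunked(prompts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `prompts` holding at most `batch_size` items.

    Hypothetical helper for illustration only; not part of the LEG model code.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


prompts = ["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5"]
chunks = list(chunked(prompts, batch_size=2))
# The last chunk is shorter when the list length is not a multiple of batch_size.
for chunk in chunks:
    print(chunk)
```

Processing prompts in chunks like this bounds how many prompts are handled at once, which is the usual reason to set a batch size for long prompt lists.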

You can also call the model directly, passing the prompts keyword argument:

result = model(prompts="Write me a harmful prompt")
batch = model(prompts=["prompt 1", "prompt 2"], batch_size=2)

For tensor-based usage, the model still supports standard tokenized inputs and returns raw prompt_logits and token_logits.
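As a rough sketch of how raw logits could be turned into labels, the example below takes an argmax over class scores. It rests on assumptions not stated on this page: that prompt_logits holds one row of two class scores per prompt, that token_logits holds one such row per token, and that index 1 is the unsafe class. The numbers are made up, and decode_logits is a hypothetical helper working on plain nested lists (for tensors, something like torch.argmax(logits, dim=-1) would play the same role).

```python
from typing import List, Sequence, Tuple


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_logits(
    prompt_logits: Sequence[Sequence[float]],
    token_logits: Sequence[Sequence[Sequence[float]]],
) -> Tuple[List[int], List[List[int]]]:
    """Map raw class scores to class indices (hypothetical helper).

    Assumes the last dimension holds two class scores and that index 1
    means unsafe; both are assumptions, not documented behavior.
    """
    prompt_labels = [argmax(row) for row in prompt_logits]
    token_labels = [[argmax(tok) for tok in seq] for seq in token_logits]
    return prompt_labels, token_labels


# Made-up scores for two prompts with two tokens each.
prompt_logits = [[2.0, -1.0], [-0.5, 1.5]]
token_logits = [
    [[1.0, 0.0], [0.2, 0.9]],
    [[0.1, 0.3], [2.0, -2.0]],
]
prompt_labels, token_labels = decode_logits(prompt_logits, token_logits)
print(prompt_labels)
print(token_labels)
```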

Output Format

Single prompt inference returns:

{
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)]
}

Here, 1 means unsafe and 0 means safe.
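A result in this format can be post-processed with ordinary dictionary and list operations. The sketch below assumes that a 1 in the second position of an explanation pair marks a word highlighted as contributing to the decision; that reading is an assumption, and is_unsafe and highlighted_words are hypothetical helpers, not part of the model's API.

```python
from typing import Any, Dict, List


def is_unsafe(result: Dict[str, Any]) -> bool:
    """True when the prompt-level label is 1 (unsafe)."""
    return result["safety_label"] == 1


def highlighted_words(result: Dict[str, Any]) -> List[str]:
    """Collect words whose per-word flag is 1.

    Assumes flag 1 marks a highlighted word; this is an assumption
    about the explanation format, not documented behavior.
    """
    return [word for word, flag in result["explanation"] if flag == 1]


# Same shape as the single-prompt output shown above.
result = {
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)],
}

unsafe = is_unsafe(result)
words = highlighted_words(result)
print(unsafe, words)
```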

Citation

If you use this model, please cite:

@inproceedings{islam-etal-2026-leg,
    title = "A Lightweight Explainable Guardrail for Prompt Safety",
    author = "Islam, Md Asiful and Surdeanu, Mihai",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026 Main)",
    month = jul,
    year = "2026",
    address = "San Diego, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/pdf/2602.15853",
}

Model size: 0.2B params (Safetensors, tensor type F32)