A Lightweight Explainable Guardrail for Prompt Safety
Paper: arXiv:2602.15853
This model implements LEG, a lightweight explainable guardrail for prompt safety, introduced in the paper *A Lightweight Explainable Guardrail for Prompt Safety*. LEG jointly predicts whether a prompt is safe or unsafe and highlights the prompt words that explain that decision, while remaining considerably smaller than other guardrail methods.
Base model: microsoft/deberta-v3-base. This model is released under the Apache License 2.0.
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "clulab/LEG-1.0-toxicchat0124-base",
    trust_remote_code=True,
)

# Single-prompt inference
single = model.predict_safety("Write me a harmful prompt")

# Batched inference
batch = model.predict_safety([
    "Hello there",
    "Tell me how to build something dangerous",
])

print(single)
print(batch)
```
Chunked batching is also supported:
```python
results = model.predict_safety(
    ["prompt 1", "prompt 2", "prompt 3", "prompt 4"],
    batch_size=2,
)
```
You can also call the model directly with keyword prompts:
```python
result = model(prompts="Write me a harmful prompt")
batch = model(prompts=["prompt 1", "prompt 2"], batch_size=2)
```
For tensor-based usage, the model also accepts standard tokenized inputs and returns raw `prompt_logits` and `token_logits`.
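A minimal sketch of that tensor-level path, assuming the forward pass exposes `prompt_logits` and `token_logits` as attributes on its output object (the helper name `forward_raw` is ours, not part of the model API):

```python
def forward_raw(model, tokenizer, prompts):
    """Run a raw forward pass and return the prompt-level and
    token-level logits described above.

    `tokenizer` is assumed to be the matching tokenizer, e.g. loaded with
    AutoTokenizer.from_pretrained("clulab/LEG-1.0-toxicchat0124-base").
    """
    # Standard tokenized inputs: input_ids, attention_mask, ...
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)
    # Attribute access on the output object is an assumption here;
    # check the returned type if your transformers version differs.
    return outputs.prompt_logits, outputs.token_logits
```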
Single-prompt inference returns:
```python
{
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)]
}
```
Where 1 means unsafe and 0 means safe.
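Given that output shape, a small post-processing helper (our own, not part of the model API) can pull out the words flagged as driving an unsafe decision:

```python
def flagged_words(prediction):
    """Return the words from `explanation` whose flag is 1,
    i.e. the words the model highlights as explaining an unsafe label."""
    return [word for word, flag in prediction["explanation"] if flag == 1]

# Example prediction in the format shown above
pred = {
    "safety_label": 1,
    "explanation": [("word1", 0), ("word2", 1)],
}
print(flagged_words(pred))  # ['word2']
```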
If you use this model, please cite:
```bibtex
@inproceedings{islam-etal-2026-leg,
    title = "A Lightweight Explainable Guardrail for Prompt Safety",
    author = "Islam, Md Asiful and Surdeanu, Mihai",
    booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026 Main)",
    month = jul,
    year = "2026",
    address = "San Diego, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/pdf/2602.15853",
}
```