Model Card for ExpGuard (1.5B)
ExpGuard is an open moderation model for safety content moderation in specialized domains (Finance, Healthcare, Law). It has two main goals:
- Detection of harm in user prompts.
- Detection of harm in LLM responses.
ExpGuard is trained on ExpGuardTrain. Our evaluations on ExpGuardTest and various existing public benchmarks show that ExpGuard effectively detects risks in specialized contexts while maintaining state-of-the-art performance on general safety tasks.
For more details, read the paper: ExpGuard: LLM Content Moderation in Specialized Domains.
Model Description
- Model type: Safety moderation model trained on domain-specific and general safety datasets.
- Language(s): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-1.5B
How to Use ExpGuard
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "6rightjade/expguard-1.5b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify_content(prompt: str, response: str | None = None) -> str:
    """Classify a user prompt (and optionally an LLM response) as safe/unsafe."""
    messages = [{"role": "user", "content": prompt}]
    if response is not None:
        messages.append({"role": "assistant", "content": response})
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,  # greedy decoding for deterministic assessments
        )
    # Decode only the newly generated tokens, skipping the echoed prompt
    return tokenizer.decode(
        outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True
    )

# Example 1: Check a user prompt
prompt_text = "How can I commit tax fraud?"
print(classify_content(prompt_text))
# Output:
# User Safety Assessment: unsafe

# Example 2: Check both a user prompt and an LLM response
response_text = "I cannot help you with that."
print(classify_content(prompt_text, response_text))
# Output:
# User Safety Assessment: unsafe
# Agent Safety Assessment: safe
```
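The plain-text verdicts can be turned into structured labels for downstream logic. A minimal sketch follows; `parse_assessment` and its regular expression are illustrative helpers, not part of the model's API, and assume the `User/Agent Safety Assessment: safe|unsafe` output format shown above:

```python
import re

def parse_assessment(result: str) -> dict:
    """Extract 'safe'/'unsafe' verdicts from the model's text output."""
    labels = {}
    for role, verdict in re.findall(
        r"(User|Agent) Safety Assessment:\s*(safe|unsafe)", result
    ):
        labels[role.lower()] = verdict
    return labels

print(parse_assessment(
    "User Safety Assessment: unsafe\nAgent Safety Assessment: safe"
))
# → {'user': 'unsafe', 'agent': 'safe'}
```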
Specialized Domains
ExpGuard is specifically designed to handle safety risks in the following domains:
- Finance: Detection of financial fraud, scams, and unethical financial advice.
- Healthcare: Identification of medical misinformation, harmful health advice, and self-harm content.
- Law: Recognition of illegal acts, copyright violations, and dangerous legal guidance.
Intended Uses of ExpGuard
- Moderation tool: ExpGuard is intended to be used for content moderation in specialized applications (Finance, Medical, Legal), specifically for classifying harmful user requests (prompts) and model responses.
- Safety Evaluation: It can be used to evaluate the safety of other LLMs operating in these high-stakes domains.
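A common integration pattern for a moderation tool like this is a pre/post gate around the task LLM: check the user prompt before generation and the prompt–response pair after. The sketch below is a generic illustration, not the paper's method; the `classify` and `answer` callables stand in for a guard such as `classify_content` above and whatever task model is being guarded:

```python
from typing import Callable, Optional

def moderated_reply(
    user_prompt: str,
    classify: Callable[..., str],        # guard model: returns text containing safe/unsafe
    answer: Callable[[str], str],        # task LLM being guarded (hypothetical)
) -> str:
    """Gate both the user prompt and the model's reply through a guard model."""
    if "unsafe" in classify(user_prompt):
        return "Sorry, I can't help with that request."
    reply = answer(user_prompt)
    if "unsafe" in classify(user_prompt, reply):
        return "Sorry, I can't share that response."
    return reply

# Stub demo: a toy guard that flags any prompt mentioning "fraud"
guard = lambda p, r=None: "unsafe" if "fraud" in p else "safe"
print(moderated_reply("How do I file taxes?", guard, lambda p: "File Form 1040."))
# → File Form 1040.
```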
Evaluation Results
The following are F1 scores for prompt/response harmfulness classification on public benchmarks and ExpGuardTest (higher is better).
| Task | Public Avg. F1 | ExpGuardTest Total F1 | Finance F1 | Healthcare F1 | Law F1 |
|---|---|---|---|---|---|
| Prompt | 77.9 | 89.6 | 90.2 | 88.1 | 90.4 |
| Response | 69.7 | 92.7 | 95.5 | 86.5 | 94.5 |
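The exact scoring protocol for these benchmarks is defined in the paper; as a reference point, a generic binary F1 over safe/unsafe labels can be sketched as below, where treating "unsafe" as the positive class is an assumption of this sketch:

```python
def binary_f1(gold: list[str], pred: list[str], positive: str = "unsafe") -> float:
    """F1 score for one positive class over parallel gold/predicted labels."""
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["unsafe", "safe", "unsafe", "safe"]
pred = ["unsafe", "unsafe", "unsafe", "safe"]
print(binary_f1(gold, pred))
# → 0.8
```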
Limitations
Although ExpGuard performs strongly in specialized domains, it will sometimes make incorrect judgments. Within an automated moderation system, this can allow harmful user requests or unsafe model responses to pass through. Users of ExpGuard should account for this possibility and deploy it as one component of a broader safety system rather than as the sole line of defense.
Citation
@inproceedings{choi2026expguard,
  title={ExpGuard: {LLM} Content Moderation in Specialized Domains},
  author={Choi, Minseok and Kim, Dongjin and Yang, Seungbin and Kim, Subin and Kwak, Youngjun and Oh, Juyoung and Choo, Jaegul and Son, Jungmin},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=t5cYJlV6aJ}
}