Model Card for ExpGuard (1.5B)
ExpGuard is an open moderation model for safety content moderation in specialized domains (Finance, Healthcare, Law). It has two main goals:
- Detection of harm in user prompts.
- Detection of harm in LLM responses.
ExpGuard is trained on ExpGuardTrain. Our evaluations on ExpGuardTest and various existing public benchmarks show that ExpGuard effectively detects risks in specialized contexts while maintaining state-of-the-art performance on general safety tasks.
For more details, read the paper: ExpGuard: LLM Content Moderation in Specialized Domains.
Model Description
- Model type: Safety moderation model trained on domain-specific and general safety datasets.
- Language(s): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-1.5B
How to Use ExpGuard
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "6rightjade/expguard-1.5b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify_content(prompt: str, response: str | None = None) -> str:
    """Classify a user prompt (and optionally an LLM response) as safe/unsafe."""
    messages = [{"role": "user", "content": prompt}]
    if response is not None:
        messages.append({"role": "assistant", "content": response})
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,  # greedy decoding for deterministic assessments
        )
    # Decode only the newly generated tokens, skipping the echoed prompt
    return tokenizer.decode(
        outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True
    )

# Example 1: Check a user prompt
prompt_text = "How can I commit tax fraud?"
print(classify_content(prompt_text))
# Output:
# User Safety Assessment: unsafe

# Example 2: Check both a user prompt and an LLM response
response_text = "I cannot help you with that."
print(classify_content(prompt_text, response_text))
# Output:
# User Safety Assessment: unsafe
# Agent Safety Assessment: safe
```
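The plain-text verdicts can be turned into structured labels for downstream logic. A minimal sketch follows; `parse_assessment` and its regular expression are illustrative helpers, not part of the model's API, and assume the `User/Agent Safety Assessment: safe|unsafe` output format shown above:

```python
import re

def parse_assessment(result: str) -> dict:
    """Extract 'safe'/'unsafe' verdicts from the model's text output."""
    labels = {}
    for role, verdict in re.findall(
        r"(User|Agent) Safety Assessment:\s*(safe|unsafe)", result
    ):
        labels[role.lower()] = verdict
    return labels

print(parse_assessment(
    "User Safety Assessment: unsafe\nAgent Safety Assessment: safe"
))
# → {'user': 'unsafe', 'agent': 'safe'}
```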
Specialized Domains
ExpGuard is specifically designed to handle safety risks in the following domains:
- Finance: Detection of financial fraud, scams, and unethical financial advice.
- Healthcare: Identification of medical misinformation, harmful health advice, and self-harm content.
- Law: Recognition of illegal acts, copyright violations, and dangerous legal guidance.
Intended Uses of ExpGuard
- Moderation tool: ExpGuard is intended to be used for content moderation in specialized applications (Finance, Medical, Legal), specifically for classifying harmful user requests (prompts) and model responses.
- Safety Evaluation: It can be used to evaluate the safety of other LLMs operating in these high-stakes domains.
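A common integration pattern for a moderation tool like this is a pre/post gate around the task LLM: check the user prompt before generation and the prompt–response pair after. The sketch below is a generic illustration, not the paper's method; the `classify` and `answer` callables stand in for a guard such as `classify_content` above and whatever task model is being guarded:

```python
from typing import Callable, Optional

def moderated_reply(
    user_prompt: str,
    classify: Callable[..., str],        # guard model: returns text containing safe/unsafe
    answer: Callable[[str], str],        # task LLM being guarded (hypothetical)
) -> str:
    """Gate both the user prompt and the model's reply through a guard model."""
    if "unsafe" in classify(user_prompt):
        return "Sorry, I can't help with that request."
    reply = answer(user_prompt)
    if "unsafe" in classify(user_prompt, reply):
        return "Sorry, I can't share that response."
    return reply

# Stub demo: a toy guard that flags any prompt mentioning "fraud"
guard = lambda p, r=None: "unsafe" if "fraud" in p else "safe"
print(moderated_reply("How do I file taxes?", guard, lambda p: "File Form 1040."))
# → File Form 1040.
```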
Evaluation Results
The following are F1 scores for prompt/response harmfulness classification on public benchmarks and ExpGuardTest (higher is better).
| Task | Public Avg. F1 | ExpGuardTest Total F1 | Finance F1 | Healthcare F1 | Law F1 |
|---|---|---|---|---|---|
| Prompt | 77.9 | 89.6 | 90.2 | 88.1 | 90.4 |
| Response | 69.7 | 92.7 | 95.5 | 86.5 | 94.5 |
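The exact scoring protocol for these benchmarks is defined in the paper; as a reference point, a generic binary F1 over safe/unsafe labels can be sketched as below, where treating "unsafe" as the positive class is an assumption of this sketch:

```python
def binary_f1(gold: list[str], pred: list[str], positive: str = "unsafe") -> float:
    """F1 score for one positive class over parallel gold/predicted labels."""
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["unsafe", "safe", "unsafe", "safe"]
pred = ["unsafe", "unsafe", "unsafe", "safe"]
print(binary_f1(gold, pred))
# → 0.8
```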
Limitations
Although ExpGuard performs strongly in specialized domains, it will sometimes make incorrect judgments. Within an automated moderation system, this can allow harmful user requests or unsafe model responses to pass through. Users of ExpGuard should account for this possibility and deploy it as one component of a broader safety system rather than as the sole line of defense.
Citation
@inproceedings{choi2026expguard,
  title={ExpGuard: {LLM} Content Moderation in Specialized Domains},
  author={Choi, Minseok and Kim, Dongjin and Yang, Seungbin and Kim, Subin and Kwak, Youngjun and Oh, Juyoung and Choo, Jaegul and Son, Jungmin},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=t5cYJlV6aJ}
}