expguard-7b / README.md

6rightjade

Update README.md

17ba6e3 25 days ago

preview code

raw

history blame contribute delete

5.79 kB

metadata

license: apache-2.0
datasets:
  - 6rightjade/expguardmix
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
language:
  - en
tags:
  - classifier
  - safety
  - moderation
  - llm
  - expguard
  - domain-specific
extra_gated_prompt: >-
  Access to this guardrail model is granted upon agreement to the terms below. 
  This model is a research artifact designed for content safety and moderation. 
  By accessing this model, you agree to use it responsibly and acknowledge its
  technical limitations.
extra_gated_fields:
  Name: text
  Organization or Affiliation: text
  Country: text
  Email: text
  Intended use of the model: text
  I acknowledge that this model is probabilistic and may produce false positives (flagging safe content) or false negatives (missing harmful content): checkbox
  I understand that this model should not be the sole determinant for critical decisions affecting user rights without human oversight: checkbox
  I agree not to use this model for malicious purposes, including reverse-engineering it to develop adversarial attacks or to bypass safety filters: checkbox
  I certify that the information I have provided is true and accurate: checkbox

Model Card for ExpGuard (7B)

ExpGuard is an open moderation model that specializes in safety content moderation within specialized domains (Finance, Healthcare, Law). It achieves two main goals:

Detection of harm in user prompts.
Detection of harm in LLM responses.

ExpGuard is trained on ExpGuardTrain. Our evaluations on ExpGuardTest and various existing public benchmarks show that ExpGuard effectively detects risks in specialized contexts, while maintaining the state-of-the-art performance in general safety.

For more details, read the paper: ExpGuard: LLM Content Moderation in Specialized Domains.

Model Description

Model type: Safety moderation model trained on domain-specific and general safety datasets.
Language(s): English
License: Apache 2.0
Finetuned from model: Qwen/Qwen2.5-7B

How to Use ExpGuard

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "6rightjade/expguard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def classify_content(prompt: str, response: str = None):
    messages = [{"role": "user", "content": prompt}]
    if response:
        messages.append({"role": "assistant", "content": response})
    
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    with torch.inference_mode():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=128,
            temperature=0.0,
            top_p=1.0,
            do_sample=False
        )
        
    result = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
    return result

# Example 1: Check a user prompt
prompt_text = "How can I commit tax fraud?"
print(classify_content(prompt_text))

### Example 1 Output:
# User Safety Assessment: unsafe

# Example 2: Check both a user prompt and an LLM response
response_text = "I cannot help you with that."
print(classify_content(prompt_text, response_text))

### Example 2 Output:
# User Safety Assessment: unsafe
# Agent Safety Assessment: safe

Specialized Domains

ExpGuard is specifically designed to handle safety risks in the following domains:

Finance: Detection of financial fraud, scams, and unethical financial advice.
Healthcare: Identification of medical misinformation, harmful health advice, and self-harm content.
Law: Recognition of illegal acts, copyright violations, and dangerous legal guidance.

Intended Uses of ExpGuard

Moderation tool: ExpGuard is intended to be used for content moderation in specialized applications (Finance, Medical, Legal), specifically for classifying harmful user requests (prompts) and model responses.
Safety Evaluation: It can be used to evaluate the safety of other LLMs operating in these high-stakes domains.

Evaluation Results

The following are F1 scores for prompt/response harmfulness classification on public benchmarks and ExpGuardTest (higher is better).

Task	Public Avg. F1	ExpGuardTest Total F1	Finance F1	Healthcare F1	Law F1
Prompt	85.7	93.3	94.1	91.3	94.8
Response	78.5	92.7	96.7	86.0	92.4

Limitations

Though it shows strong performance in specialized domains, ExpGuard will sometimes make incorrect judgments. When used within an automated moderation system, this can potentially allow unsafe model content or harmful requests from users to pass through. Users of ExpGuard should be aware of this potential for inaccuracies and should use it as part of a broader safety system.

Citation

@inproceedings{
  choi2026expguard,
  title={ExpGuard: {LLM} Content Moderation in Specialized Domains},
  author={Choi, Minseok and Kim, Dongjin and Yang, Seungbin and Kim, Subin and Kwak, Youngjun and Oh, Juyoung and Choo, Jaegul and Son, Jungmin},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=t5cYJlV6aJ}
}