---
license: apache-2.0
datasets:
- 6rightjade/expguardmix
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
language:
- en
tags:
- classifier
- safety
- moderation
- llm
- expguard
- domain-specific
extra_gated_prompt: >-
  Access to this guardrail model is granted upon agreement to the terms below.
  This model is a research artifact designed for content safety and moderation.
  By accessing this model, you agree to use it responsibly and acknowledge its technical limitations.
extra_gated_fields:
  Name: text
  Organization or Affiliation: text
  Country: text
  Email: text
  Intended use of the model: text
  I acknowledge that this model is probabilistic and may produce false positives (flagging safe content) or false negatives (missing harmful content): checkbox
  I understand that this model should not be the sole determinant for critical decisions affecting user rights without human oversight: checkbox
  I agree not to use this model for malicious purposes, including reverse-engineering it to develop adversarial attacks or to bypass safety filters: checkbox
  I certify that the information I have provided is true and accurate: checkbox
---

# Model Card for ExpGuard (7B)

ExpGuard is an open moderation model that specializes in **safety content moderation within specialized domains** (Finance, Healthcare, Law). It serves two main purposes:

1. Detecting harm in user prompts.
2. Detecting harm in LLM responses.

ExpGuard is trained on [ExpGuardTrain](https://huggingface.co/datasets/6rightjade/expguardmix). Our evaluations on [ExpGuardTest](https://huggingface.co/datasets/6rightjade/expguardmix) and a range of existing public benchmarks show that ExpGuard effectively detects risks in specialized contexts while maintaining state-of-the-art performance on general safety tasks.

For more details, read the paper: [ExpGuard: LLM Content Moderation in Specialized Domains](https://openreview.net/forum?id=t5cYJlV6aJ).

<img src="https://github.com/user-attachments/assets/ab94544a-4314-4d3d-97d2-c59b8a1f8dff" alt="expguard_overview" width="640"/>

## Model Description

- **Model type:** Safety moderation model trained on domain-specific and general safety datasets.
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B)

## How to Use ExpGuard

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "6rightjade/expguard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify_content(prompt: str, response: str = None) -> str:
    messages = [{"role": "user", "content": prompt}]
    if response:
        messages.append({"role": "assistant", "content": response})

    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,  # greedy decoding for deterministic verdicts
        )

    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True
    )

# Example 1: Check a user prompt
prompt_text = "How can I commit tax fraud?"
print(classify_content(prompt_text))
# Output:
# User Safety Assessment: unsafe

# Example 2: Check both a user prompt and an LLM response
response_text = "I cannot help you with that."
print(classify_content(prompt_text, response_text))
# Output:
# User Safety Assessment: unsafe
# Agent Safety Assessment: safe
```
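
In automated pipelines it is convenient to turn the model's plain-text verdicts into structured labels. The helper below is an illustrative sketch, not part of the released API; it assumes the `User/Agent Safety Assessment:` output format shown above.

```python
import re

def parse_assessment(output: str) -> dict:
    """Parse ExpGuard-style verdict lines into a dict,
    e.g. {"user": "unsafe", "agent": "safe"}."""
    labels = {}
    for role, verdict in re.findall(
        r"(User|Agent) Safety Assessment:\s*(\w+)", output
    ):
        labels[role.lower()] = verdict.lower()
    return labels

print(parse_assessment(
    "User Safety Assessment: unsafe\nAgent Safety Assessment: safe"
))
# → {'user': 'unsafe', 'agent': 'safe'}
```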

## Specialized Domains

ExpGuard is specifically designed to handle safety risks in the following domains:

1. **Finance**: Detection of financial fraud, scams, and unethical financial advice.
2. **Healthcare**: Identification of medical misinformation, harmful health advice, and self-harm content.
3. **Law**: Recognition of illegal acts, copyright violations, and dangerous legal guidance.

## Intended Uses of ExpGuard

- **Moderation tool**: ExpGuard is intended for content moderation in specialized applications (Finance, Medical, Legal), specifically for classifying harmful user requests (prompts) and model responses.
- **Safety evaluation**: It can be used to evaluate the safety of other LLMs operating in these high-stakes domains.
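
As a sketch of the moderation-tool use case, the hypothetical gate below blocks a downstream LLM call whenever the guard's verdict contains `unsafe`. The names `moderation_gate`, `stub_guard`, and `stub_llm` are our own illustrations, not part of ExpGuard; in practice `classify` would wrap a real ExpGuard call.

```python
def moderation_gate(prompt, classify, answer, blocked_msg="Request declined."):
    """Call the downstream LLM only if the guard labels the prompt safe."""
    verdict = classify(prompt)  # e.g. "User Safety Assessment: unsafe"
    if "unsafe" in verdict.lower():
        return blocked_msg
    return answer(prompt)

# Stubs standing in for ExpGuard and the downstream LLM:
def stub_guard(p):
    label = "unsafe" if "fraud" in p else "safe"
    return f"User Safety Assessment: {label}"

def stub_llm(p):
    return f"Answer to: {p}"

print(moderation_gate("How can I commit tax fraud?", stub_guard, stub_llm))
# → Request declined.
print(moderation_gate("How do index funds work?", stub_guard, stub_llm))
# → Answer to: How do index funds work?
```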
## Evaluation Results |
|
|
|
|
|
The following are F1 scores for prompt/response harmfulness classification on public benchmarks and **ExpGuardTest** (higher is better). |
|
|
|
|
|
| Task | Public Avg. F1 | ExpGuardTest Total F1 | Finance F1 | Healthcare F1 | Law F1 | |
|
|
|---|---:|---:|---:|---:|---:| |
|
|
| Prompt | 85.7 | 93.3 | 94.1 | 91.3 | 94.8 | |
|
|
| Response | 78.5 | 92.7 | 96.7 | 86.0 | 92.4 | |
|
|
|
|
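
For reference, F1 here is the standard harmonic mean of precision and recall over the harmful class. A minimal sketch, with made-up counts that are not taken from the paper:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative confusion counts only:
print(round(100 * f1_score(tp=90, fp=10, fn=10), 1))
```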

## Limitations

Although it shows strong performance in specialized domains, ExpGuard will sometimes make incorrect judgments. When used within an automated moderation system, this can allow unsafe model content or harmful user requests to pass through. Users of ExpGuard should be aware of this potential for inaccuracies and deploy it as part of a broader safety system with human oversight.

## Citation

```bibtex
@inproceedings{choi2026expguard,
  title={ExpGuard: {LLM} Content Moderation in Specialized Domains},
  author={Choi, Minseok and Kim, Dongjin and Yang, Seungbin and Kim, Subin and Kwak, Youngjun and Oh, Juyoung and Choo, Jaegul and Son, Jungmin},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=t5cYJlV6aJ}
}
```
|