---
license: apache-2.0
datasets:
- jl3676/SafetyAnalystData
language:
- en
tags:
- safety
- moderation
- llm
- lm
- harmfulness
---

# Model Card for HarmReporter

HarmReporter is an open language model that generates a structured "harm tree" for a given prompt. The harm tree consists of the following features:
1) stakeholders (individuals, groups, communities, and entities) that may be impacted by the prompt scenario,
2) categories of harmful *actions* that may impact each stakeholder,
3) categories of harmful *effects* each harmful action may cause on the stakeholder, and
4) the *likelihood*, *severity*, and *immediacy* of each harmful effect.
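As a purely illustrative sketch of how these four features nest (the field names and value scales here are hypothetical — the model's actual output schema is defined by the SafetyAnalystData format), a harm tree can be pictured as nested records:

```python
# Hypothetical, simplified harm-tree structure. Field names and the
# low/medium/high scales are illustrative, not the model's exact schema.
harm_tree = {
    "prompt": "How can I rob the bank?",
    "stakeholders": [                      # feature (1)
        {
            "name": "bank employees",
            "harmful_actions": [           # feature (2)
                {
                    "category": "physical violence",
                    "harmful_effects": [   # feature (3)
                        {
                            "category": "physical injury",
                            "likelihood": "medium",   # feature (4)
                            "severity": "high",
                            "immediacy": "immediate",
                        }
                    ],
                }
            ],
        }
    ],
}

# Each leaf effect carries the three extent attributes from feature (4)
effect = harm_tree["stakeholders"][0]["harmful_actions"][0]["harmful_effects"][0]
print(sorted(k for k in effect if k != "category"))
```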

HarmReporter is an 8B model trained on [SafetyAnalystData](https://huggingface.co/datasets/jl3676/SafetyAnalystData). Its output can be combined with the output of [BenefitReporter](https://huggingface.co/jl3676/BenefitReporter) to generate a comprehensive harm-benefit tree for a given prompt. Collectively, HarmReporter and BenefitReporter are named SafetyReporter.

Our comprehensive evaluations on five public prompt harmfulness classification benchmarks show that SafetyReporter outperforms the strongest open-source baselines (including WildGuard, Llama-Guard-3, and ShieldGemma) on the average F1 score across benchmarks, surpassing WildGuard by 3.7%, Llama-Guard-3 by 10.8%, and ShieldGemma by over 30%.

For more details, read the paper: [SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation](https://arxiv.org/abs/2410.16665v1).

## Model Description

- **Model type:** Harmful-feature generation model trained on synthetic datasets from a mixture of SOTA LLMs.
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

## How to Use HarmReporter

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "jl3676/HarmReporter"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# apply_chat_template expects a list of chat messages, not bare strings
messages = [{"role": "user", "content": "How can I rob the bank?"}]
input_tokenized = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
# Harm trees can be very long, hence the large generation budget
output = model.generate(input_tokenized, max_new_tokens=19000)
print(tokenizer.decode(output[0][input_tokenized.shape[-1]:], skip_special_tokens=True))
```

## Intended Uses of HarmReporter

- Harmfulness analysis: HarmReporter can analyze the harmfulness of a given prompt under the hypothetical scenario that an AI language model provides a helpful answer to it. The structured harm tree it generates identifies potential stakeholders along with the harmful actions and effects that may impact them.
- Moderation tool: HarmReporter's output (a harm tree) can be combined with the output of [BenefitReporter](https://huggingface.co/jl3676/BenefitReporter) into a comprehensive harm-benefit tree for a given prompt. These features can be aggregated using our [aggregation algorithm](https://github.com/jl3676/SafetyAnalyst) into a harmfulness score, which can serve as a moderation signal to identify potentially harmful prompts.
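The actual aggregation algorithm lives in the linked SafetyAnalyst repository; as a loose, hypothetical sketch of the idea only, one could map each leaf effect's likelihood and severity to numeric weights and reduce them to a scalar:

```python
# Hypothetical sketch: the real aggregation algorithm is defined in the
# SafetyAnalyst repository and combines features differently (e.g. it also
# weighs benefits from BenefitReporter). Names and weights are illustrative.
LEVEL = {"low": 0.25, "medium": 0.5, "high": 1.0}

def harmfulness_score(harm_tree: dict) -> float:
    """Reduce all leaf harmful effects to a single scalar in [0, 1]."""
    scores = [
        LEVEL[effect["likelihood"]] * LEVEL[effect["severity"]]
        for stakeholder in harm_tree["stakeholders"]
        for action in stakeholder["harmful_actions"]
        for effect in action["harmful_effects"]
    ]
    return max(scores) if scores else 0.0

tree = {
    "stakeholders": [
        {"harmful_actions": [
            {"harmful_effects": [
                {"likelihood": "medium", "severity": "high"},
                {"likelihood": "low", "severity": "low"},
            ]}
        ]}
    ]
}
print(harmfulness_score(tree))  # 0.5
```

A thresholded score like this is how a harm tree could back a binary allow/flag moderation decision.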

## Limitations

Though it achieves state-of-the-art performance on prompt safety classification, HarmReporter will sometimes generate inaccurate features, and the aggregated harmfulness score may not always lead to correct judgments. Users of HarmReporter should be aware of this potential for inaccuracies.

## Citation

```
@misc{li2024safetyanalystinterpretabletransparentsteerable,
  title={SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation},
  author={Jing-Jing Li and Valentina Pyatkin and Max Kleiman-Weiner and Liwei Jiang and Nouha Dziri and Anne G. E. Collins and Jana Schaich Borg and Maarten Sap and Yejin Choi and Sydney Levine},
  year={2024},
  eprint={2410.16665},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.16665},
}
```