---
language:
- pl
metrics:
- f1
base_model:
- allegro/herbert-base-cased
pipeline_tag: text-classification
tags:
- safe
- safety
- ai-safety
- llm
- moderation
- classification
license: cc-by-nc-sa-4.0
datasets:
- NASK-PIB/PL-Guard
- ToxicityPrompts/PolyGuardMix
- allenai/wildguardmix
---
# HerBERT-Guard for Polish: LLM Safety Classifier
## Model Overview
HerBERT-Guard is a Polish-language safety classifier built upon the [HerBERT](https://huggingface.co/allegro/herbert-base-cased) model, a BERT-based architecture pretrained on large-scale Polish corpora.
It has been fine-tuned to detect safety-relevant content in Polish texts, using a manually annotated dataset designed for evaluating safety in large language models (LLMs) and Polish translations of the [PolyGuard](https://huggingface.co/datasets/ToxicityPrompts/PolyGuardMix) and [WildGuard](https://huggingface.co/datasets/allenai/wildguardmix) datasets.
The model supports classification into a taxonomy of safety categories, inspired by Llama Guard.
More detailed information is available in the [publication](https://arxiv.org/abs/2506.16322).
## Usage
You can use the model in a standard Hugging Face transformers pipeline for text classification:
```python
from transformers import pipeline
model_name = "NASK-PIB/HerBERT-PL-Guard"
classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)
# Example Polish input ("How can I make a bomb at home?")
text = "Jak mogę zrobić bombę w domu?"
result = classifier(text)
print(result)
```
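The pipeline returns a list of dictionaries with `label` and `score` fields. A minimal post-processing sketch for flagging unsafe inputs (assuming that output format; the `is_unsafe` helper and the 0.5 threshold are illustrative, not part of the model's API):

```python
def is_unsafe(results, threshold=0.5):
    """Flag a text as unsafe if the top-scoring label is not 'safe'
    and its score exceeds the (illustrative) threshold."""
    top = max(results, key=lambda r: r["score"])
    return top["label"] != "safe" and top["score"] >= threshold

# Hypothetical classifier output for an unsafe prompt:
example = [{"label": "S9", "score": 0.97}]
print(is_unsafe(example))  # True
```

In a moderation setting you would typically tune the threshold on held-out data rather than use a fixed value.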
### Safety Categories
The model outputs **one of 15 categories**:
- `"safe"` — content is not considered safety-relevant,
- or one of the following **14 unsafe categories**, based on the Llama Guard taxonomy:
1. **S1: Violent Crimes**
2. **S2: Non-Violent Crimes**
3. **S3: Sex-Related Crimes**
4. **S4: Child Sexual Exploitation**
5. **S5: Defamation**
6. **S6: Specialized Advice**
7. **S7: Privacy**
8. **S8: Intellectual Property**
9. **S9: Indiscriminate Weapons**
10. **S10: Hate**
11. **S11: Suicide & Self-Harm**
12. **S12: Sexual Content**
13. **S13: Elections**
14. **S14: Code Interpreter Abuse**
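Since the classifier emits the short codes, it can be convenient to map them back to the human-readable names above for reporting; a small lookup sketch (the `describe` helper is ours, the names are taken from the taxonomy list):

```python
# Llama Guard-style taxonomy codes as listed above.
CATEGORY_NAMES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def describe(label: str) -> str:
    """Return a human-readable name for a predicted label."""
    if label == "safe":
        return "safe"
    return CATEGORY_NAMES.get(label, "unknown")

print(describe("S11"))  # Suicide & Self-Harm
```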
## License
The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.
The model was trained on the following datasets:
- PL-Guard – the training portion of this dataset is internal and not publicly released
- PolyGuardMix – licensed under CC BY 4.0
- WildGuardMix – licensed under ODC-BY 1.0
The model is based on the pretrained allegro/herbert-base-cased model, which is distributed under the CC BY 4.0 license.
Please ensure compliance with all dataset and model licenses when using or modifying this model.
## 📚 Citation
If you use this model or the associated dataset, please cite the following paper:
```bibtex
@inproceedings{plguard2025,
author = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech},
title = {{PL-Guard: Benchmarking Language Model Safety for Polish}},
booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing},
year = {2025},
address = {Vienna, Austria},
publisher = {Association for Computational Linguistics}
}
```