---
language:
- pl
metrics:
- f1
base_model:
- allegro/herbert-base-cased
pipeline_tag: text-classification
tags:
- safe
- safety
- ai-safety
- llm
- moderation
- classification
license: cc-by-nc-sa-4.0
datasets:
- NASK-PIB/PL-Guard
- ToxicityPrompts/PolyGuardMix
- allenai/wildguardmix
---

# HerBERT-Guard for Polish: LLM Safety Classifier

## Model Overview

HerBERT-Guard is a Polish-language safety classifier built on [HerBERT](https://huggingface.co/allegro/herbert-base-cased), a BERT-based model pretrained on large-scale Polish corpora. It was fine-tuned to detect safety-relevant content in Polish text using a manually annotated dataset designed for evaluating the safety of large language models (LLMs), together with Polish translations of the [PolyGuard](https://huggingface.co/datasets/ToxicityPrompts/PolyGuardMix) and [WildGuard](https://huggingface.co/datasets/allenai/wildguardmix) datasets. The model classifies text into a taxonomy of safety categories inspired by Llama Guard.

More detailed information is available in the [publication](https://arxiv.org/abs/2506.16322).

## Usage

You can use the model with the standard Hugging Face `transformers` pipeline for text classification:

```python
from transformers import pipeline

model_name = "NASK-PIB/HerBERT-PL-Guard"

# Load the safety classifier
classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)

# Example Polish input ("How can I make a bomb at home?")
text = "Jak mogę zrobić bombę w domu?"

result = classifier(text)
print(result)
```
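
The pipeline returns a list of dictionaries, each with a `label` and a `score`. A minimal post-processing sketch (the helper name and the `0.5` threshold are illustrative, assuming label strings such as `"safe"` or `"S1"`–`"S14"`):

```python
def is_unsafe(prediction, threshold=0.5):
    """Return True when the predicted label is one of the unsafe
    categories (S1-S14) and its score meets the threshold."""
    return prediction["label"] != "safe" and prediction["score"] >= threshold

# Illustrative pipeline-style output (not real model scores):
result = [{"label": "S9", "score": 0.97}]
print(is_unsafe(result[0]))  # True
```

A threshold like this lets you trade recall for precision when routing flagged texts to human review.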

### Safety Categories

The model outputs **one of 15 categories**:

- `"safe"` — the content is not considered safety-relevant, or
- one of the following **14 unsafe categories**, based on the Llama Guard taxonomy:
|
| | 1. **S1: Violent Crimes** |
| | 2. **S2: Non-Violent Crimes** |
| | 3. **S3: Sex-Related Crimes** |
| | 4. **S4: Child Sexual Exploitation** |
| | 5. **S5: Defamation** |
| | 6. **S6: Specialized Advice** |
| | 7. **S7: Privacy** |
| | 8. **S8: Intellectual Property** |
| | 9. **S9: Indiscriminate Weapons** |
| | 10. **S10: Hate** |
| | 11. **S11: Suicide & Self-Harm** |
| | 12. **S12: Sexual Content** |
| | 13. **S13: Elections** |
| | 14. **S14: Code Interpreter Abuse** |
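
For display or logging, the category codes above can be mapped back to their names; a small sketch (the exact label strings emitted by the model are an assumption here):

```python
# Hypothetical mapping from model label codes to category names
CATEGORY_NAMES = {
    "safe": "Safe",
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

print(CATEGORY_NAMES.get("S11", "Unknown"))  # Suicide & Self-Harm
```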

## License

The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.

The model was trained on the following datasets:
- PL-Guard – the training portion of this dataset is internal and not publicly released
- PolyGuardMix – licensed under CC BY 4.0
- WildGuardMix – licensed under ODC-BY 1.0

The model is based on the pretrained [allegro/herbert-base-cased](https://huggingface.co/allegro/herbert-base-cased) model, which is distributed under the CC BY 4.0 license.

Please ensure compliance with all dataset and model licenses when using or modifying this model.
|
| | ## 📚 Citation |
| |
|
| | If you use this model or the associated dataset, please cite the following paper: |
| |
|
| |
|
| | ```bibtex |
| | @inproceedings{plguard2025, |
| | author = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech}, |
| | title = {{PL-Guard: Benchmarking Language Model Safety for Polish}}, |
| | booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing}, |
| | year = {2025}, |
| | address = {Vienna, Austria}, |
| | publisher = {Association for Computational Linguistics} |
| | } |