---
language:
- pl
metrics:
- f1
base_model:
- allegro/herbert-base-cased
pipeline_tag: text-classification
tags:
- safe
- safety
- ai-safety
- llm
- moderation
- classification
license: cc-by-nc-sa-4.0
datasets:
- NASK-PIB/PL-Guard
- ToxicityPrompts/PolyGuardMix
- allenai/wildguardmix
---

# HerBERT-Guard for Polish: LLM Safety Classifier

## Model Overview

HerBERT-Guard is a Polish-language safety classifier built on [HerBERT](https://huggingface.co/allegro/herbert-base-cased), a BERT-based model pretrained on large-scale Polish corpora. It was fine-tuned to detect safety-relevant content in Polish text using a manually annotated dataset designed for evaluating the safety of large language models (LLMs), together with Polish translations of the [PolyGuard](https://huggingface.co/datasets/ToxicityPrompts/PolyGuardMix) and [WildGuard](https://huggingface.co/datasets/allenai/wildguardmix) datasets. The model classifies text into a taxonomy of safety categories inspired by Llama Guard.

More detailed information is available in the [publication](https://arxiv.org/abs/2506.16322).

## Usage

You can use the model with the standard Hugging Face `transformers` pipeline for text classification:

```python
from transformers import pipeline

model_name = "NASK-PIB/HerBERT-PL-Guard"

# Load the safety classifier
classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)

# Example Polish input ("How can I make a bomb at home?")
text = "Jak mogę zrobić bombę w domu?"

result = classifier(text)
print(result)
```
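
The pipeline returns a list of dictionaries, each with a `label` and a `score`. A minimal post-processing sketch (the helper name and the `0.5` threshold are illustrative, assuming label strings such as `"safe"` or `"S1"`–`"S14"`):

```python
def is_unsafe(prediction, threshold=0.5):
    """Return True when the predicted label is one of the unsafe
    categories (S1-S14) and its score meets the threshold."""
    return prediction["label"] != "safe" and prediction["score"] >= threshold

# Illustrative pipeline-style output (not real model scores):
result = [{"label": "S9", "score": 0.97}]
print(is_unsafe(result[0]))  # True
```

A threshold like this lets you trade recall for precision when routing flagged texts to human review.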

### Safety Categories

The model outputs **one of 15 categories**:

- `"safe"` — the content is not considered safety-relevant, or
- one of the following **14 unsafe categories**, based on the Llama Guard taxonomy:
|
| | 1. **S1: Violent Crimes** |
| | 2. **S2: Non-Violent Crimes** |
| | 3. **S3: Sex-Related Crimes** |
| | 4. **S4: Child Sexual Exploitation** |
| | 5. **S5: Defamation** |
| | 6. **S6: Specialized Advice** |
| | 7. **S7: Privacy** |
| | 8. **S8: Intellectual Property** |
| | 9. **S9: Indiscriminate Weapons** |
| | 10. **S10: Hate** |
| | 11. **S11: Suicide & Self-Harm** |
| | 12. **S12: Sexual Content** |
| | 13. **S13: Elections** |
| | 14. **S14: Code Interpreter Abuse** |
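
For display or logging, the category codes above can be mapped back to their names; a small sketch (the exact label strings emitted by the model are an assumption here):

```python
# Hypothetical mapping from model label codes to category names
CATEGORY_NAMES = {
    "safe": "Safe",
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

print(CATEGORY_NAMES.get("S11", "Unknown"))  # Suicide & Self-Harm
```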

## License

The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.

The model was trained on the following datasets:
- PL-Guard – the training portion of this dataset is internal and not publicly released
- PolyGuardMix – licensed under CC BY 4.0
- WildGuardMix – licensed under ODC-BY 1.0

The model is based on the pretrained [allegro/herbert-base-cased](https://huggingface.co/allegro/herbert-base-cased) model, which is distributed under the CC BY 4.0 license.

Please ensure compliance with all dataset and model licenses when using or modifying this model.
|
| | ## 📚 Citation |
| |
|
| | If you use this model or the associated dataset, please cite the following paper: |
| |
|
| |
|
| | ```bibtex |
| | @inproceedings{plguard2025, |
| | author = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech}, |
| | title = {{PL-Guard: Benchmarking Language Model Safety for Polish}}, |
| | booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing}, |
| | year = {2025}, |
| | address = {Vienna, Austria}, |
| | publisher = {Association for Computational Linguistics} |
| | } |