metadata
language: pl
license: apache-2.0
tags:
- ner
- anonymization
- rodo
- gdpr
- polish
- herbert
- medical
- privacy
HerBERT NER – RODO / GDPR Anonymization (PL)
Fine-tuned HerBERT model for Named Entity Recognition (NER) focused on automatic anonymization of RODO / GDPR-sensitive data in Polish-language texts, with emphasis on medical and administrative documents.
The model is designed to detect personal data, sensitive attributes, identifiers, and contact information to support downstream anonymization pipelines.
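As a rough sketch of such a downstream step, the snippet below replaces detected spans with label placeholders. It assumes the entities arrive as dictionaries with `entity_group`, `start`, and `end` keys, which is the shape produced by the 🤗 Transformers token-classification pipeline with `aggregation_strategy="simple"`. The helper names `anonymize` and `span`, the sample sentence, and the hand-built entity list are illustrative assumptions, not part of the model.

```python
def anonymize(text, entities):
    """Replace every detected span with a [LABEL] placeholder."""
    result = text
    # Process spans right to left so earlier offsets stay valid
    # after each replacement changes the string length.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        result = result[:ent["start"]] + placeholder + result[ent["end"]:]
    return result


def span(text, label, value):
    """Hypothetical helper: build an entity dict for the first occurrence of value."""
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value)}


# Hand-built stand-in for model output (an assumption, not real predictions).
text = "Pacjent Jan Kowalski, PESEL 44051401359, tel. 600 100 200."
entities = [
    span(text, "NAME", "Jan"),
    span(text, "SURNAME", "Kowalski"),
    span(text, "PESEL", "44051401359"),
    span(text, "PHONE", "600 100 200"),
]

print(anonymize(text, entities))
```

Placeholders keep the entity type visible, which tends to preserve more of the text's usefulness for later analytics than blanking the span entirely.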
✨ Use cases
- Anonymization of medical documentation
- GDPR / RODO compliance
- Text preprocessing before analytics or ML
- Research on privacy-preserving NLP for Polish
🧠 Model details
- Base model: HerBERT (Polish BERT)
- Task: Token classification (NER)
- Labeling scheme: BIO
- Training: Supervised fine-tuning
- Framework: 🤗 Transformers + PyTorch
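To illustrate the BIO scheme listed above: `B-` marks the first token of an entity, `I-` a continuation of the same entity, and `O` a token outside any entity. The sketch below groups a tagged token sequence into entities. The function name `bio_to_spans`, the example tokens, and the choice to discard stray `I-` tags are assumptions made for illustration, not the model's actual post-processing.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) pairs."""
    spans = []
    label, words = None, []

    def flush():
        # Close the currently open entity, if there is one.
        if label is not None:
            spans.append((label, " ".join(words)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and tag[2:] == label:
            words.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (discarded in this sketch).
            flush()
            label, words = None, []
    flush()
    return spans


tokens = ["ul.", "Długa", "5", ",", "Kraków"]
tags = ["B-ADDRESS", "I-ADDRESS", "I-ADDRESS", "O", "B-CITY"]
print(bio_to_spans(tokens, tags))
```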
🏷️ Supported entity classes
Personal identifiers
- NAME – first names
- SURNAME – last names
- AGE – age
- DATE_OF_BIRTH – date of birth
- DATE – other dates identifying events
- SEX – sex / gender (explicit)
- RELIGION – religion
- POLITICAL_VIEW – political views
- ETHNICITY – ethnicity / nationality
- SEXUAL_ORIENTATION – sexual orientation
- HEALTH – health-related information
- RELATIVE – family relations revealing identity
Contact & location
- CITY – city (general context)
- ADDRESS – full address
- EMAIL – email addresses
- PHONE – phone numbers
Identifiers
- PESEL – Polish national ID number
- DOCUMENT_NUMBER – ID / passport / license numbers
Professional & education
- COMPANY – employer
- SCHOOL_NAME – school name
- JOB_TITLE – job position
Financial
- BANK_ACCOUNT – bank account numbers
- CREDIT_CARD_NUMBER – payment cards
Digital identifiers
- USERNAME – usernames / logins
- SECRET – passwords, API keys
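Under the BIO scheme, each entity class expands into a `B-` and an `I-` tag, plus a single `O` tag. The sketch below builds such a label map from the classes listed in this section. The ordering and the integer ids are assumptions made for illustration; the authoritative mapping is whatever the released model stores in `id2label` in its configuration.

```python
# Entity classes as listed in this model card.
ENTITY_CLASSES = [
    "NAME", "SURNAME", "AGE", "DATE_OF_BIRTH", "DATE", "SEX", "RELIGION",
    "POLITICAL_VIEW", "ETHNICITY", "SEXUAL_ORIENTATION", "HEALTH", "RELATIVE",
    "CITY", "ADDRESS", "EMAIL", "PHONE",
    "PESEL", "DOCUMENT_NUMBER",
    "COMPANY", "SCHOOL_NAME", "JOB_TITLE",
    "BANK_ACCOUNT", "CREDIT_CARD_NUMBER",
    "USERNAME", "SECRET",
]

# "O" first, then a B-/I- pair per class (ordering is an assumption).
LABELS = ["O"] + [f"{prefix}-{name}" for name in ENTITY_CLASSES for prefix in ("B", "I")]

label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(ENTITY_CLASSES), "classes,", len(LABELS), "BIO labels")
```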
📊 Evaluation
- Validation F1-score: ~0.80
- Recall-oriented configuration (privacy-first)
- Evaluation performed on a held-out validation set
Note: For anonymization tasks, recall is prioritized to minimize data leakage.
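To make the recall emphasis concrete, the sketch below computes entity-level precision, recall, and an F-beta score in which `beta > 1` weights recall more heavily than precision. The function name `precision_recall_fbeta`, the exact-match criterion on `(start, end, label)` triples, and the toy spans are illustrative assumptions; they are not the evaluation protocol or data behind the reported score.

```python
def precision_recall_fbeta(gold, predicted, beta=1.0):
    """Entity-level scores over sets of (start, end, label) triples."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Toy example: every gold entity is found, plus one spurious prediction.
gold = {(8, 11, "NAME"), (12, 20, "SURNAME"), (28, 39, "PESEL"), (46, 57, "PHONE")}
predicted = gold | {(0, 7, "JOB_TITLE")}

p, r, f1 = precision_recall_fbeta(gold, predicted)
_, _, f2 = precision_recall_fbeta(gold, predicted, beta=2.0)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} F2={f2:.2f}")
```

In this toy case the spurious prediction lowers precision while no sensitive span is missed, so the recall-weighted F2 comes out higher than F1. For anonymization, over-redacting a harmless token is usually the cheaper error.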