---
language: pl
license: apache-2.0
tags:
- ner
- anonymization
- rodo
- gdpr
- polish
- herbert
- medical
- privacy
---
# HerBERT NER – RODO / GDPR Anonymization (PL)
Fine-tuned **HerBERT** model for **Named Entity Recognition (NER)** focused on
automatic anonymization of **RODO / GDPR-sensitive data** in **Polish-language texts**,
with emphasis on medical and administrative documents.
The model is designed to detect personal data, sensitive attributes,
identifiers, and contact information to support downstream anonymization pipelines.
---
## ✨ Use cases
- Anonymization of medical documentation
- GDPR / RODO compliance
- Text preprocessing before analytics or ML
- Research on privacy-preserving NLP for Polish
---
## 🧠 Model details
- Base model: **HerBERT (Polish BERT)**
- Task: Token classification (NER)
- Labeling scheme: **BIO**
- Training: Supervised fine-tuning
- Framework: 🤗 Transformers + PyTorch
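
To illustrate what the BIO labeling scheme implies for downstream code, here is a minimal sketch of grouping per-token tags into entity spans. The function name `bio_to_spans` and the example sentence are hypothetical and not part of this repository; the lenient handling of a stray `I-` tag is an assumption, not documented model behavior.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) token spans; end is exclusive."""
    spans = []
    label, start = None, 0
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the currently open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            label = None
        # B- opens a new span; a stray I- is leniently treated as a start too.
        if tag.startswith(("B-", "I-")):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hand-written example tags (illustrative only, not real model output).
tokens = ["Pacjent", "Jan", "Kowalski", ",", "lat", "45"]
tags = ["O", "B-NAME", "B-SURNAME", "O", "O", "B-AGE"]
spans = bio_to_spans(tags)
print(spans)
```

In practice the 🤗 Transformers token-classification pipeline can perform this grouping itself, so a helper like this is mainly useful when working with raw per-token predictions.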
---
## 🏷️ Supported entity classes
### Personal identifiers
- `NAME` – first names
- `SURNAME` – last names
- `AGE` – age
- `DATE_OF_BIRTH` – date of birth
- `DATE` – other dates identifying events
- `SEX` – sex / gender (explicit)
- `RELIGION` – religion
- `POLITICAL_VIEW` – political views
- `ETHNICITY` – ethnicity / nationality
- `SEXUAL_ORIENTATION` – sexual orientation
- `HEALTH` – health-related information
- `RELATIVE` – family relations revealing identity
### Contact & location
- `CITY` – city (general context)
- `ADDRESS` – full address
- `EMAIL` – email addresses
- `PHONE` – phone numbers
### Identifiers
- `PESEL` – Polish national identification number (11 digits)
- `DOCUMENT_NUMBER` – ID / passport / license numbers
### Professional & education
- `COMPANY` – employer
- `SCHOOL_NAME` – school name
- `JOB_TITLE` – job position
### Financial
- `BANK_ACCOUNT` – bank account numbers
- `CREDIT_CARD_NUMBER` – payment cards
### Digital identifiers
- `USERNAME` – usernames / logins
- `SECRET` – passwords, API keys
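
The entity classes above can drive a simple placeholder-based anonymization step. The sketch below assumes predictions arrive as dictionaries with `entity_group`, `start`, `end`, and `score` keys, which is the shape the 🤗 Transformers token-classification pipeline returns when `aggregation_strategy="simple"` is set. The `redact` and `mock_entity` helpers, the `[LABEL]` placeholder format, and the sample sentence are hypothetical; the entities here are hand-written stand-ins, not real model output.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict], min_score: float = 0.0) -> str:
    """Replace each detected character span with a [LABEL] placeholder."""
    # Work right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent.get("score", 1.0) < min_score:
            continue  # skip low-confidence predictions
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"] :]
    return text


def mock_entity(text: str, value: str, label: str) -> Dict:
    """Build a fake prediction for the first occurrence of `value` (demo only)."""
    start = text.index(value)
    return {"entity_group": label, "start": start, "end": start + len(value), "score": 0.99}


# With the real model, `entities` would presumably come from something like:
#   from transformers import pipeline
#   ner = pipeline("token-classification", model="<model-id>",
#                  aggregation_strategy="simple")
#   entities = ner(text)
text = "Pacjent Jan Kowalski, PESEL 44051401359, tel. 600 700 800."
entities = [
    mock_entity(text, "Jan", "NAME"),
    mock_entity(text, "Kowalski", "SURNAME"),
    mock_entity(text, "44051401359", "PESEL"),
    mock_entity(text, "600 700 800", "PHONE"),
]
redacted = redact(text, entities)
print(redacted)
```

This sketch assumes non-overlapping spans. A lower `min_score` redacts more aggressively, which trades some over-redaction for a lower risk of leaking personal data.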
---
## 📊 Evaluation
- Validation F1-score: **~0.80**
- Recall-oriented configuration (privacy-first)
- Evaluation performed on a held-out validation set
> Note: For anonymization tasks, recall is prioritized over precision, because a missed entity leaks personal data while a false positive only over-redacts.
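
As a rough illustration of how precision, recall, and F1 relate in this setting, the sketch below scores predicted spans against gold spans using exact matching. This card does not state how the reported F1 was computed (token-level or entity-level, micro or macro), so the exact-match entity-level scoring, the `prf` helper, and the sample spans are all assumptions made for illustration.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end)


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of entity spans."""
    tp = len(gold & pred)  # spans predicted with the right label and boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical spans: every gold entity is found, plus one spurious DATE.
gold = {("NAME", 1, 2), ("SURNAME", 2, 3), ("AGE", 5, 6), ("CITY", 8, 9)}
pred = gold | {("DATE", 10, 11)}
precision, recall, f1 = prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the spurious span lowers precision but recall stays at 1.0, meaning nothing sensitive is missed. That is the trade-off a privacy-first setup accepts.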
---