---
language: pl
license: apache-2.0
tags:
- ner
- anonymization
- rodo
- gdpr
- polish
- herbert
- medical
- privacy
---

# HerBERT NER – RODO / GDPR Anonymization (PL)

Fine-tuned **HerBERT** model for **Named Entity Recognition (NER)**, focused on
automatic anonymization of **RODO / GDPR-sensitive data** in **Polish-language texts**,
with emphasis on medical and administrative documents.

The model is designed to detect personal data, sensitive attributes,
identifiers, and contact information to support downstream anonymization pipelines.

---

## ✨ Use case

- Anonymization of medical documentation
- GDPR / RODO compliance
- Text preprocessing before analytics or ML
- Research on privacy-preserving NLP for Polish
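
### Example: redacting detected entities

A minimal sketch of a downstream anonymization step, assuming the model is served through the 🤗 Transformers `pipeline` API with `aggregation_strategy="simple"`, which returns dicts containing `entity_group`, `start`, and `end`. The helpers `redact` and `build_ner` are hypothetical names introduced here, and the `model_id` argument must be this repository's identifier, which this card does not state. The sample entities are written by hand to show the expected shape; they are not real model output.

```python
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]  # pipeline-style dict: entity_group, start, end, ...


def redact(text: str, entities: List[Entity]) -> str:
    """Replace each detected span with a placeholder such as [SURNAME]."""
    # Work right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


def build_ner(model_id: str) -> Callable[[str], List[Entity]]:
    """Load a token-classification pipeline (needs transformers and torch)."""
    from transformers import pipeline  # imported lazily; only needed here

    # aggregation_strategy="simple" merges B-/I- tokens into whole entities
    # and reports character offsets as `start` / `end`.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


# Hand-written entities standing in for model output (illustrative only).
sample = "Pacjent Jan Kowalski, tel. 600 100 200."
sample_entities = [
    {"entity_group": "NAME", "start": 8, "end": 11},
    {"entity_group": "SURNAME", "start": 12, "end": 20},
    {"entity_group": "PHONE", "start": 27, "end": 38},
]
print(redact(sample, sample_entities))  # Pacjent [NAME] [SURNAME], tel. [PHONE].
```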

---

## 🧠 Model details

- Base model: **HerBERT (Polish BERT)**
- Task: Token classification (NER)
- Labeling scheme: **BIO**
- Training: Supervised fine-tuning
- Framework: 🤗 Transformers + PyTorch
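
### Example: decoding BIO tags

The BIO scheme marks the first token of an entity with `B-<CLASS>`, following tokens with `I-<CLASS>`, and everything else with `O`. The sketch below groups such tags into spans. It assumes one tag per word for readability; the model itself predicts tags per subword token, so a real pipeline would first align subwords to words. The function name `bio_to_spans` and the example sentence are illustrative, not part of this model's code.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) token spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        if prefix == "B" or (prefix == "I" and label is None):
            # A stray I- tag without a matching B- is treated as a new span.
            label, start = name, i
    return spans


tokens = ["Pacjentka", "Anna", "Nowak", ",", "lat", "54"]
tags = ["O", "B-NAME", "B-SURNAME", "O", "O", "B-AGE"]
print([(label, tokens[s:e]) for label, s, e in bio_to_spans(tags)])
# [('NAME', ['Anna']), ('SURNAME', ['Nowak']), ('AGE', ['54'])]
```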

---

## 🏷️ Supported entity classes

### Personal identifiers
- `NAME` – first names
- `SURNAME` – last names
- `AGE` – age
- `DATE_OF_BIRTH` – date of birth
- `DATE` – other dates identifying events
- `SEX` – sex / gender (explicit)
- `RELIGION` – religion
- `POLITICAL_VIEW` – political views
- `ETHNICITY` – ethnicity / nationality
- `SEXUAL_ORIENTATION` – sexual orientation
- `HEALTH` – health-related information
- `RELATIVE` – family relations revealing identity

### Contact & location
- `CITY` – city (general context)
- `ADDRESS` – full address
- `EMAIL` – email addresses
- `PHONE` – phone numbers

### Identifiers
- `PESEL` – Polish national ID number
- `DOCUMENT_NUMBER` – ID / passport / license numbers

### Professional & education
- `COMPANY` – employer
- `SCHOOL_NAME` – school name
- `JOB_TITLE` – job position

### Financial
- `BANK_ACCOUNT` – bank account numbers
- `CREDIT_CARD_NUMBER` – payment cards

### Digital identifiers
- `USERNAME` – usernames / logins
- `SECRET` – passwords, API keys
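
### Example: BIO label set for these classes

Under the BIO scheme, each entity class listed in this card expands into a `B-` and an `I-` tag, plus a single `O` tag. The sketch below builds such a label set. The ordering and the resulting integer ids are assumptions made for illustration; the `id2label` mapping stored in the checkpoint's configuration should be treated as authoritative.

```python
# Entity classes as listed in this card; the order here is an assumption.
ENTITY_CLASSES = [
    "NAME", "SURNAME", "AGE", "DATE_OF_BIRTH", "DATE", "SEX", "RELIGION",
    "POLITICAL_VIEW", "ETHNICITY", "SEXUAL_ORIENTATION", "HEALTH", "RELATIVE",
    "CITY", "ADDRESS", "EMAIL", "PHONE",
    "PESEL", "DOCUMENT_NUMBER",
    "COMPANY", "SCHOOL_NAME", "JOB_TITLE",
    "BANK_ACCOUNT", "CREDIT_CARD_NUMBER",
    "USERNAME", "SECRET",
]

# BIO expands every class into a B- (begin) and an I- (inside) tag, plus "O".
labels = ["O"] + [f"{p}-{c}" for c in ENTITY_CLASSES for p in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(ENTITY_CLASSES), len(labels))  # 25 51
```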

---

## 📊 Evaluation

- Validation F1-score: **~0.80**
- Recall-oriented configuration (privacy-first)
- Evaluation performed on a held-out validation set

> Note: For anonymization tasks, recall is prioritized to minimize data leakage.
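
### Example: precision, recall, and F-beta over entity spans

This card does not state whether the reported F1 is computed per token or per entity. The sketch below assumes exact-match, entity-level scoring over `(label, start, end)` spans, which is one common convention. It also exposes `beta`, because an F-beta score with `beta > 1` weights recall more heavily than precision, in line with the note above. The function name `prf` and the spans are illustrative only and are not this model's evaluation data.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (label, start, end)


def prf(gold: Set[Span], pred: Set[Span], beta: float = 1.0):
    """Exact-match precision, recall and F-beta over entity spans."""
    tp = len(gold & pred)  # spans that match in label and both offsets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = beta**2 * precision + recall
    f = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Made-up spans: one gold entity is missed and two predictions are spurious.
gold = {("NAME", 8, 11), ("SURNAME", 12, 20), ("PHONE", 27, 38)}
pred = {("NAME", 8, 11), ("SURNAME", 12, 20), ("CITY", 40, 46), ("DATE", 50, 60)}

p, r, f1 = prf(gold, pred)
_, _, f2 = prf(gold, pred, beta=2.0)  # beta=2 favours recall
print(round(p, 2), round(r, 2), round(f1, 2), round(f2, 2))
```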

---