---
language: pl
license: apache-2.0
tags:
- ner
- anonymization
- rodo
- gdpr
- polish
- herbert
- medical
- privacy
---

# HerBERT NER – RODO / GDPR Anonymization (PL)

Fine-tuned **HerBERT** model for **Named Entity Recognition (NER)**, aimed at automatic anonymization of **RODO / GDPR-sensitive data** in **Polish-language texts**, with an emphasis on medical and administrative documents.

The model detects personal data, sensitive attributes, identifiers, and contact information so that downstream anonymization pipelines can mask or remove them.

---

## ✨ Use case

- Anonymization of medical documentation
- GDPR / RODO compliance
- Text preprocessing before analytics or ML
- Research on privacy-preserving NLP for Polish

---

## 🧠 Model details

- Base model: **HerBERT (Polish BERT)**
- Task: Token classification (NER)
- Labeling scheme: **BIO**
- Training: Supervised fine-tuning
- Framework: 🤗 Transformers + PyTorch

---

## 🏷️ Supported entity classes

### Personal identifiers

- `NAME` – first names
- `SURNAME` – last names
- `AGE` – age
- `DATE_OF_BIRTH` – date of birth
- `DATE` – other dates identifying events
- `SEX` – sex / gender (explicit)
- `RELIGION` – religion
- `POLITICAL_VIEW` – political views
- `ETHNICITY` – ethnicity / nationality
- `SEXUAL_ORIENTATION` – sexual orientation
- `HEALTH` – health-related information
- `RELATIVE` – family relations revealing identity

### Contact & location

- `CITY` – city (general context)
- `ADDRESS` – full address
- `EMAIL` – email addresses
- `PHONE` – phone numbers

### Identifiers

- `PESEL` – Polish national ID number
- `DOCUMENT_NUMBER` – ID / passport / license numbers

### Professional & education

- `COMPANY` – employer
- `SCHOOL_NAME` – school name
- `JOB_TITLE` – job position

### Financial

- `BANK_ACCOUNT` – bank account numbers
- `CREDIT_CARD_NUMBER` – payment card numbers

### Digital identifiers

- `USERNAME` – usernames / logins
- `SECRET` – passwords, API keys

---

## 📊 Evaluation

- Validation F1-score: **~0.80**
- Recall-oriented configuration (privacy-first)
- Evaluation performed on a held-out validation set

> Note: For anonymization tasks, recall is prioritized to minimize data leakage.

---
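
## 🔧 Sketch: masking detected spans

The model only tags entities; replacing them is left to the downstream pipeline. The snippet below is a minimal sketch of that masking step, not the author's implementation. It assumes predictions arrive as dictionaries with `entity_group`, `score`, `start`, and `end` keys (the shape produced by the 🤗 Transformers token-classification pipeline when an aggregation strategy is set) and that spans do not overlap. The names `mask_entities`, `span`, and `min_score`, as well as the sample sentence and its predictions, are hypothetical and hand-written for illustration.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict], min_score: float = 0.0) -> str:
    """Replace every detected span in `text` with a [LABEL] placeholder.

    Assumes `start`/`end` are character offsets and spans do not overlap.
    """
    # Keep only predictions at or above the confidence threshold.
    kept = [e for e in entities if e.get("score", 1.0) >= min_score]
    # Replace right-to-left so earlier character offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


def span(text: str, surface: str, label: str, score: float = 0.99) -> Dict:
    """Hypothetical helper: build a pipeline-style dict for the first match of `surface`."""
    start = text.index(surface)
    return {
        "entity_group": label,
        "word": surface,
        "score": score,
        "start": start,
        "end": start + len(surface),
    }


# Invented sample sentence and hand-written predictions (not real model output).
sample = "Pacjent Jan Kowalski, PESEL 44051401359, tel. 600 700 800."
predictions = [
    span(sample, "Jan", "NAME"),
    span(sample, "Kowalski", "SURNAME"),
    span(sample, "44051401359", "PESEL"),
    span(sample, "600 700 800", "PHONE"),
]

masked = mask_entities(sample, predictions)
print(masked)  # Pacjent [NAME] [SURNAME], PESEL [PESEL], tel. [PHONE].
```

Keeping `min_score` at a low value favors recall: uncertain spans are masked rather than leaked, at the cost of occasionally masking harmless text.

---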