PLAnonimizer / README.md

OskarBartoszyk

language: pl
license: apache-2.0
tags:
  - ner
  - anonymization
  - rodo
  - gdpr
  - polish
  - herbert
  - medical
  - privacy

HerBERT NER – RODO / GDPR Anonymization (PL)

A fine-tuned HerBERT model for Named Entity Recognition (NER), aimed at automatic anonymization of RODO / GDPR-sensitive data in Polish-language text, with an emphasis on medical and administrative documents.

The model is designed to detect personal data, sensitive attributes, identifiers, and contact information to support downstream anonymization pipelines.
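As a rough illustration of such a downstream step, the sketch below replaces detected spans with bracketed placeholders. It assumes entities arrive in the shape produced by the 🤗 Transformers token-classification pipeline with `aggregation_strategy="simple"` (dicts with `entity_group`, `start`, `end`). The helper names `mask_entities` and `load_ner`, the `[LABEL]` placeholder format, and the model id `OskarBartoszyk/PLAnonimizer` (inferred from the repository path, not verified) are assumptions, not part of this model card.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a placeholder such as [NAME]."""
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


def load_ner(model_id: str = "OskarBartoszyk/PLAnonimizer"):
    """Build a token-classification pipeline.

    The default model_id is an assumption based on the repository path.
    """
    # Imported lazily so the masking helper works without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge word pieces into whole entities
    )


# Hand-written spans in the pipeline's output shape, so this runs offline.
sample = "Pacjent Jan Kowalski, PESEL 44051401359."
spans = [
    {"entity_group": "NAME", "start": 8, "end": 11},
    {"entity_group": "SURNAME", "start": 12, "end": 20},
    {"entity_group": "PESEL", "start": 28, "end": 39},
]
masked = mask_entities(sample, spans)
print(masked)
```

With the model available, `mask_entities(text, load_ner()(text))` would be the equivalent call. Overlapping spans are not handled here; a production pipeline would need to merge or rank them first.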

✨ Use case

Anonymization of medical documentation
GDPR / RODO compliance
Text preprocessing before analytics or ML
Research on privacy-preserving NLP for Polish

🧠 Model details

Base model: HerBERT (Polish BERT)
Task: Token classification (NER)
Labeling scheme: BIO
Training: Supervised fine-tuning
Framework: 🤗 Transformers + PyTorch
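To make the BIO scheme concrete: each token is tagged `B-<CLASS>` (beginning of an entity), `I-<CLASS>` (inside one), or `O` (outside). The following is a minimal sketch of grouping such tags into token-level spans. The function name `bio_to_spans` and the lenient handling of a stray `I-` tag are illustrative choices, not the model's own post-processing.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) token spans; end is exclusive."""
    spans = []
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        if prefix == "B" or (prefix == "I" and label is None):
            # A stray I- tag with no preceding B- is treated as a span start.
            label, start = name, i
    return spans


tokens = ["Pacjent", "Jan", "Nowak", "Kowalski", "PESEL", "44051401359"]
tags = ["O", "B-NAME", "B-SURNAME", "I-SURNAME", "O", "B-PESEL"]
for label, start, end in bio_to_spans(tags):
    print(label, tokens[start:end])
```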

🏷️ Supported entity classes

Personal identifiers

NAME – first names
SURNAME – last names
AGE – age
DATE_OF_BIRTH – date of birth
DATE – other dates identifying events
SEX – sex / gender (explicit)
RELIGION – religion
POLITICAL_VIEW – political views
ETHNICITY – ethnicity / nationality
SEXUAL_ORIENTATION – sexual orientation
HEALTH – health-related information
RELATIVE – family relations revealing identity

Contact & location

CITY – city (general context)
ADDRESS – full address
EMAIL – email addresses
PHONE – phone numbers

Identifiers

PESEL – Polish national ID number
DOCUMENT_NUMBER – ID / passport / license numbers

Professional & education

COMPANY – employer
SCHOOL_NAME – school name
JOB_TITLE – job position

Financial

BANK_ACCOUNT – bank account numbers
CREDIT_CARD_NUMBER – payment cards

Digital identifiers

USERNAME – usernames / logins
SECRET – passwords, API keys
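Combined with the BIO scheme, the 25 classes above imply 51 token labels: one `O` label plus a `B-` and an `I-` variant per class. The sketch below builds such a label map. The ordering is an assumption for illustration; the authoritative mapping is the `id2label` entry in the model's own configuration, which may order labels differently.

```python
ENTITY_CLASSES = [
    # Personal identifiers
    "NAME", "SURNAME", "AGE", "DATE_OF_BIRTH", "DATE", "SEX", "RELIGION",
    "POLITICAL_VIEW", "ETHNICITY", "SEXUAL_ORIENTATION", "HEALTH", "RELATIVE",
    # Contact & location
    "CITY", "ADDRESS", "EMAIL", "PHONE",
    # Identifiers
    "PESEL", "DOCUMENT_NUMBER",
    # Professional & education
    "COMPANY", "SCHOOL_NAME", "JOB_TITLE",
    # Financial
    "BANK_ACCOUNT", "CREDIT_CARD_NUMBER",
    # Digital identifiers
    "USERNAME", "SECRET",
]

# "O" first, then B-/I- pairs per class (ordering assumed, not taken from the model).
labels = ["O"] + [f"{prefix}-{cls}" for cls in ENTITY_CLASSES for prefix in ("B", "I")]
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(ENTITY_CLASSES), "classes ->", len(labels), "BIO labels")
```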

📊 Evaluation

Validation F1-score: ~0.80
Recall-oriented configuration (privacy-first)
Evaluation performed on a held-out validation set

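Note: For anonymization tasks, recall is prioritized to minimize data leakage: a missed entity leaves personal data in the text, whereas a false positive only redacts more than necessary.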
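One way to express a preference for recall in a single number is the F-beta score, which weights recall more heavily than precision when beta is greater than 1. The sketch below computes precision, recall, and F-beta from entity-level counts. The function name and the counts are invented for illustration and are not this model's evaluation results.

```python
from typing import Tuple


def precision_recall_fbeta(
    tp: int, fp: int, fn: int, beta: float = 1.0
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-beta) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical counts: many entities found, at the cost of some over-redaction.
p, r, f1 = precision_recall_fbeta(tp=90, fp=30, fn=10, beta=1.0)
_, _, f2 = precision_recall_fbeta(tp=90, fp=30, fn=10, beta=2.0)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Because recall exceeds precision in this made-up case, F2 comes out higher than F1, which is the behaviour a recall-first anonymizer would want its headline metric to reward.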