PLAnonimizer / README.md
metadata
language: pl
license: apache-2.0
tags:
  - ner
  - anonymization
  - rodo
  - gdpr
  - polish
  - herbert
  - medical
  - privacy

HerBERT NER – RODO / GDPR Anonymization (PL)

Fine-tuned HerBERT model for Named Entity Recognition (NER) focused on automatic anonymization of RODO / GDPR-sensitive data in Polish-language texts, with emphasis on medical and administrative documents.

The model is designed to detect personal data, sensitive attributes, identifiers, and contact information to support downstream anonymization pipelines.
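
As a rough illustration of the downstream step, the sketch below replaces detected spans with `[LABEL]` placeholders. The `redact` helper, the `(start, end, label)` span format, and the sample sentence are assumptions made for this example and are not part of the model. It also assumes the spans do not overlap.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label) with character offsets, end exclusive.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace every span in `text` with a [LABEL] placeholder."""
    out = text
    # Process right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


text = "Pacjent Jan Kowalski, tel. 600 100 200."
spans = [(8, 11, "NAME"), (12, 20, "SURNAME"), (27, 38, "PHONE")]
print(redact(text, spans))  # Pacjent [NAME] [SURNAME], tel. [PHONE].
```

Placeholders that keep the label preserve some analytic value in the text. A pipeline with stricter requirements could substitute a single generic mask instead.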


✨ Use cases

  • Anonymization of medical documentation
  • GDPR / RODO compliance
  • Text preprocessing before analytics or ML
  • Research on privacy-preserving NLP for Polish

🧠 Model details

  • Base model: HerBERT (Polish BERT)
  • Task: Token classification (NER)
  • Labeling scheme: BIO
  • Training: Supervised fine-tuning
  • Framework: 🤗 Transformers + PyTorch
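
A minimal loading sketch, assuming the checkpoint works with the standard 🤗 Transformers token-classification pipeline. `MODEL_ID` is a placeholder rather than the real repository id, and `load_ner`, `to_spans`, and `min_score` are hypothetical names introduced here. With `aggregation_strategy="simple"` the pipeline merges BIO-tagged sub-tokens into whole entities and reports character offsets.

```python
from typing import Any, Dict, List, Tuple

# Assumption: placeholder id; substitute the model's actual Hub repository id.
MODEL_ID = "your-namespace/your-herbert-ner-model"


def load_ner(model_id: str = MODEL_ID):
    """Build a token-classification pipeline that groups BIO pieces into entities."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def to_spans(
    entities: List[Dict[str, Any]], min_score: float = 0.0
) -> List[Tuple[int, int, str]]:
    """Convert aggregated pipeline output into (start, end, label) tuples."""
    return [
        (e["start"], e["end"], e["entity_group"])
        for e in entities
        if e["score"] >= min_score
    ]
```

Typical usage would be `ner = load_ner()` followed by `to_spans(ner("Pacjent Jan Kowalski ..."))`. Leaving `min_score` low keeps uncertain detections, which suits a setting where missed entities are costlier than over-masking.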

🏷️ Supported entity classes

Personal identifiers

  • NAME – first names
  • SURNAME – last names
  • AGE – age
  • DATE_OF_BIRTH – date of birth
  • DATE – other dates identifying events
  • SEX – sex / gender (explicit)
  • RELIGION – religion
  • POLITICAL_VIEW – political views
  • ETHNICITY – ethnicity / nationality
  • SEXUAL_ORIENTATION – sexual orientation
  • HEALTH – health-related information
  • RELATIVE – family relations revealing identity

Contact & location

  • CITY – city (general context)
  • ADDRESS – full address
  • EMAIL – email addresses
  • PHONE – phone numbers

Identifiers

  • PESEL – Polish national ID number
  • DOCUMENT_NUMBER – ID / passport / license numbers

Professional & education

  • COMPANY – employer
  • SCHOOL_NAME – school name
  • JOB_TITLE – job position

Financial

  • BANK_ACCOUNT – bank account numbers
  • CREDIT_CARD_NUMBER – payment cards

Digital identifiers

  • USERNAME – usernames / logins
  • SECRET – passwords, API keys
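
To make the BIO scheme concrete, the sketch below expands the 25 classes listed above into `O`, `B-*`, and `I-*` tags and groups a tag sequence into token-level spans. The label ordering is an assumption for illustration; the authoritative mapping is the `id2label` entry in the model's own config. `build_label_maps` and `bio_to_spans` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

ENTITY_CLASSES = [
    "NAME", "SURNAME", "AGE", "DATE_OF_BIRTH", "DATE", "SEX", "RELIGION",
    "POLITICAL_VIEW", "ETHNICITY", "SEXUAL_ORIENTATION", "HEALTH", "RELATIVE",
    "CITY", "ADDRESS", "EMAIL", "PHONE", "PESEL", "DOCUMENT_NUMBER",
    "COMPANY", "SCHOOL_NAME", "JOB_TITLE", "BANK_ACCOUNT",
    "CREDIT_CARD_NUMBER", "USERNAME", "SECRET",
]


def build_label_maps(classes: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Expand entity classes into BIO tags (ordering here is an assumption)."""
    labels = ["O"]
    for name in classes:
        labels += [f"B-{name}", f"I-{name}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO tags into (start_token, end_token_exclusive, label) spans."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.append((start, i, current))
            start, current = None, None
        # A stray I- tag with no open span is treated leniently as a new entity.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, label
    return spans


label2id, id2label = build_label_maps(ENTITY_CLASSES)
print(len(label2id))  # 51
print(bio_to_spans(["O", "B-NAME", "B-SURNAME", "O", "B-PHONE", "I-PHONE"]))
```

Treating an orphaned `I-` tag as the start of an entity errs on the side of masking more text, which is usually the safer failure mode for anonymization.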

📊 Evaluation

  • Validation F1-score: ~0.80
  • Recall-oriented configuration (privacy-first)
  • Evaluation performed on a held-out validation set

Note: For anonymization tasks, recall is prioritized to minimize data leakage.
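
The trade-off can be illustrated with a small sketch of entity-level precision, recall, and F1 under exact span matching. This is an assumed scoring convention for illustration only; the model card does not state how the reported F1 was computed, and the spans below are made-up examples.

```python
from typing import Set, Tuple

# Hypothetical span format: (start, end, label).
Span = Tuple[int, int, str]


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 using exact span + label matches."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = {(8, 11, "NAME"), (12, 20, "SURNAME"), (27, 38, "PHONE")}
# The prediction finds every gold entity and over-masks one extra span.
pred = gold | {(0, 7, "JOB_TITLE")}

p, r, f = prf(gold, pred)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.75 1.00 0.86
```

In this example the extra span lowers precision, but recall stays at 1.0, so no personal data is left unmasked.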