HerBERT NER – RODO / GDPR Anonymization (PL)

Fine-tuned HerBERT model for Named Entity Recognition (NER) focused on automatic anonymization of RODO / GDPR-sensitive data in Polish-language texts, with emphasis on medical and administrative documents.

The model is designed to detect personal data, sensitive attributes, identifiers, and contact information to support downstream anonymization pipelines.


✨ Use case

  • Anonymization of medical documentation
  • GDPR / RODO compliance
  • Text preprocessing before analytics or ML
  • Research on privacy-preserving NLP for Polish

🧠 Model details

  • Base model: HerBERT (Polish BERT)
  • Task: Token classification (NER)
  • Labeling scheme: BIO
  • Training: Supervised fine-tuning
  • Framework: 🤗 Transformers + PyTorch
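
In the BIO scheme listed above, the first token of an entity is tagged B-<CLASS>, each following token of the same entity is tagged I-<CLASS>, and every other token is tagged O. The sketch below shows one way to group such tags into entity spans. The function name `bio_to_spans` and the sample sentence are illustrative assumptions, not part of the model's code, and treating a stray I- tag as the start of a new entity is a lenient choice made here rather than documented behavior.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) token spans; end is exclusive."""
    spans = []
    label, start = None, 0
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            label = None
        # B- opens a new span; a stray I- is treated like B- (assumption).
        if tag.startswith(("B-", "I-")):
            label, start = tag[2:], i
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example sentence with hand-written tags.
tokens = ["Pacjent", "Jan", "Kowalski", "z", "Nowego", "Sącza"]
tags = ["O", "B-NAME", "B-SURNAME", "O", "B-CITY", "I-CITY"]

for label, start, end in bio_to_spans(tags):
    print(label, tokens[start:end])
```

Note that the model itself operates on subword tokens, so real predictions usually need to be aligned back to words or character offsets before a step like this.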

🏷️ Supported entity classes

Personal identifiers

  • NAME – first names
  • SURNAME – last names
  • AGE – age
  • DATE_OF_BIRTH – date of birth
  • DATE – other dates identifying events
  • SEX – sex / gender (explicit)
  • RELIGION – religion
  • POLITICAL_VIEW – political views
  • ETHNICITY – ethnicity / nationality
  • SEXUAL_ORIENTATION – sexual orientation
  • HEALTH – health-related information
  • RELATIVE – family relations revealing identity

Contact & location

  • CITY – city (general context)
  • ADDRESS – full address
  • EMAIL – email addresses
  • PHONE – phone numbers

Identifiers

  • PESEL – Polish national ID number
  • DOCUMENT_NUMBER – ID / passport / license numbers

Professional & education

  • COMPANY – employer
  • SCHOOL_NAME – school name
  • JOB_TITLE – job position

Financial

  • BANK_ACCOUNT – bank account numbers
  • CREDIT_CARD_NUMBER – payment cards

Digital identifiers

  • USERNAME – usernames / logins
  • SECRET – passwords, API keys
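
A downstream anonymization step can use the entity classes above as placeholders. The following is a minimal sketch, assuming the detected entities arrive as dictionaries with `entity_group`, `start`, and `end` keys (character offsets) and that spans do not overlap. The `anonymize` helper and the sample text are hypothetical and use made-up data.

```python
from typing import Dict, List


def anonymize(text: str, entities: List[Dict]) -> str:
    """Replace each detected character span with a [LABEL] placeholder."""
    # Process spans right to left so earlier offsets stay valid
    # after each replacement changes the string length.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"] :]
    return text


# Made-up sample data; offsets are character positions in `text`.
text = "Jan Kowalski, PESEL 44051401359, tel. 600 700 800"
entities = [
    {"entity_group": "NAME", "start": 0, "end": 3},
    {"entity_group": "SURNAME", "start": 4, "end": 12},
    {"entity_group": "PESEL", "start": 20, "end": 31},
    {"entity_group": "PHONE", "start": 38, "end": 49},
]

print(anonymize(text, entities))
```

Dictionaries of this shape are what the Transformers token-classification pipeline returns when it is created with `aggregation_strategy="simple"`, so the output of such a pipeline loaded with this model could be passed to a helper like this one.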

📊 Evaluation

  • Validation F1-score: ~0.80
  • Recall-oriented configuration (privacy-first)
  • Evaluation performed on a held-out validation set

Note: For anonymization tasks, recall is prioritized to minimize data leakage.
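
To make the trade-off concrete, the sketch below computes precision, recall, and an F-beta score from entity counts. With beta = 1 this is the usual F1; with beta = 2 recall is weighted more heavily, which is one common way to score a leakage-averse system. The counts are invented for illustration and are not this model's results, and the use of F2 is an assumption here, not a metric reported for the model.

```python
from typing import Tuple


def precision_recall_fbeta(
    tp: int, fp: int, fn: int, beta: float = 1.0
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-beta) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta**2
    # beta > 1 weights recall more heavily than precision.
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical counts: 80 entities found correctly,
# 40 spurious detections, 10 entities missed (potential leaks).
p, r, f1 = precision_recall_fbeta(tp=80, fp=40, fn=10, beta=1.0)
_, _, f2 = precision_recall_fbeta(tp=80, fp=40, fn=10, beta=2.0)

print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

In this made-up case the system over-redacts (lower precision) but misses few entities (higher recall), so F2 comes out higher than F1.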

