# PLVeil: Deep Anonymization NER Model for Polish
PLVeil is a state-of-the-art Named Entity Recognition (NER) model designed for the deep anonymization of Polish-language text documents. It is built on the HerBERT Transformer architecture and fine-tuned to identify and mask sensitive personal data in compliance with GDPR.

The model is optimized for processing highly structured, linguistically rich, and heavily inflected Polish medical, legal, and office documents.
## Model Description
- Architecture: Transformer (HerBERT Polish language model basis)
- Task: Token Classification / Named Entity Recognition (NER)
- Language: Polish (`pl`)
- Primary Use Case: Automated removal/masking of sensitive entities (pseudonymization and anonymization routines)
## Recognized Entities
PLVeil detects a wide range of sensitive entity types, including:
- Persons (First names, last names, multi-part names)
- Locations (Cities, street addresses)
- Organizations/Companies
- Contact Info (Email addresses, Phone numbers)
- Dates & Age (Birthdates, precise timing)
- Family relations
- National identifiers
## Performance and Metrics
During experimental evaluation and fine-tuning, the best trade-off between precision and recall was reached at 8 training epochs. Stopping at the 8th epoch maximized the model's accuracy while avoiding overfitting.
- Overall F1-Score: Over 98%
## Engineering Advancements
PLVeil incorporates specific architectural optimizations for handling real-world Polish documents:
**Extended Context Window (Sliding Window Mechanics)** — The model processes texts with an increased `max_length` of 256 and a predefined `stride` (overlap) of 64. This eliminates the long-standing issue of sequences or entities (such as multi-part last names or city names) being cut at the border between context chunks in multi-page documents.

**Advanced Token Aggregation Strategy** — PLVeil uses the `simple` aggregation strategy, which strictly follows the BIO boundary tags (`B-` and `I-`) predicted by the network. This resolves an issue commonly observed in Polish NLP models where punctuation marks (such as commas or colons) directly adjacent to an entity were "swallowed" and wrongly masked.

**Intelligent Distant Entity Merging** — By raising the adjacency distance setting from +1 to +2, the detection engine can concatenate split but related tokens (e.g., a first and last name separated by a double space, or by a comma and a space) into a single, cohesive entity span.
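The distance-based merging described above can be sketched roughly as follows. The entity dictionaries mimic the output shape of the `transformers` NER pipeline (`entity_group`, `start`, `end`, `word`); apart from the +2 gap threshold taken from this card, the helper name and details are illustrative assumptions, not the model's actual implementation.

```python
# Hypothetical sketch of distance-based entity merging (not PLVeil's internal code).
# Same-type entities whose character spans are at most `max_gap` apart are fused.

def merge_adjacent_entities(entities, max_gap=2):
    """Merge same-type entities separated by at most `max_gap` characters."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        if (merged
                and ent["entity_group"] == merged[-1]["entity_group"]
                and ent["start"] - merged[-1]["end"] <= max_gap):
            # Extend the previous entity instead of starting a new one.
            merged[-1]["end"] = ent["end"]
            merged[-1]["word"] += " " + ent["word"]
        else:
            merged.append(dict(ent))
    return merged

# "Janusz" and "Kowalski" split by a double space (gap of 2 characters):
split = [
    {"entity_group": "PER", "start": 25, "end": 31, "word": "Janusz"},
    {"entity_group": "PER", "start": 33, "end": 41, "word": "Kowalski"},
]
print(merge_adjacent_entities(split))  # one merged PER entity: "Janusz Kowalski"
```

With a +1 threshold the two tokens above would remain separate entities; the +2 setting bridges the double space.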
## Intended Use
PLVeil is intended to power data-privacy pipelines. Its goal is to give developers, researchers, and enterprises a solid baseline for converting risky, personalized datasets into safe, pseudonymized text that can be stored securely or analyzed without exposing individuals.
## How to use
You can instantiate inference using the `transformers` pipeline for token classification:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("OskarBartoszyk/PLVeil")
model = AutoModelForTokenClassification.from_pretrained("OskarBartoszyk/PLVeil")

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

example = "Prezydentem jest dzisiaj Janusz Kowalski, zameldowany w Gnieźnie."
ner_results = nlp(example)
print(ner_results)
```
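To turn the pipeline output into anonymized text, the detected spans can be replaced with placeholders. A minimal sketch, assuming the entities carry `start`/`end` character offsets as returned by the pipeline; the `[PER]`/`[LOC]` placeholder format and the `mask_text` helper are illustrative assumptions, not part of the model.

```python
# Hypothetical masking step on top of the NER pipeline output.

def mask_text(text, entities):
    """Replace each detected entity span with a [LABEL] placeholder."""
    masked = text
    # Replace from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        masked = masked[:ent["start"]] + f"[{ent['entity_group']}]" + masked[ent["end"]:]
    return masked

text = "Prezydentem jest dzisiaj Janusz Kowalski, zameldowany w Gnieźnie."
entities = [
    {"entity_group": "PER", "start": 25, "end": 40},
    {"entity_group": "LOC", "start": 56, "end": 64},
]
print(mask_text(text, entities))
# → Prezydentem jest dzisiaj [PER], zameldowany w [LOC].
```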