# PLVeil: Deep Anonymization NER Model for Polish
PLVeil is a state-of-the-art Named Entity Recognition (NER) model designed for the deep anonymization of Polish-language text documents. It is built on the HerBERT Transformer architecture and fine-tuned to identify and mask sensitive personal data in compliance with GDPR.

The model is optimized for processing highly structured, linguistically rich, and heavily inflected Polish medical, legal, and office documents.
## Model Description
- Architecture: Transformer (HerBERT Polish language model basis)
- Task: Token Classification / Named Entity Recognition (NER)
- Language: Polish (`pl`)
- Primary Use Case: Automated removal/masking of sensitive entities (pseudonymization and anonymization routines)
## Recognized Entities
PLVeil detects a wide range of sensitive entity types, including:
- Persons (First names, last names, multi-part names)
- Locations (Cities, street addresses)
- Organizations/Companies
- Contact Info (Email addresses, Phone numbers)
- Dates & Age (Birthdates, precise timing)
- Family relations
- National identifiers
## Performance and Metrics
During experimental evaluation and fine-tuning, the best trade-off between precision and recall was reached at 8 training epochs. Stopping at the 8th epoch maximized the model's accuracy while avoiding overfitting.
- Overall F1-Score: Over 98%
## Engineering Advancements
PLVeil incorporates specific architectural optimizations for handling real-world Polish documents:
**Extended Context Window (Sliding Window Mechanics)** — The model processes texts with an increased `max_length` of 256 and a predefined `stride` (overlap) of 64. This eliminates the long-standing issue of sequences or entities (such as multi-part last names or city names) being cut at the border between context chunks in multi-page documents.

**Advanced Token Aggregation Strategy** — PLVeil uses the `simple` aggregation strategy, which strictly follows the BIO boundary tags (`B-` and `I-`) predicted by the network. This resolves an issue commonly observed in Polish NLP models where punctuation marks (such as commas or colons) directly adjacent to an entity were "swallowed" and wrongly masked.

**Intelligent Distant Entity Merging** — By raising the adjacency distance setting from +1 to +2, the detection engine can concatenate split but related tokens (e.g., a first and last name separated by a double space, or by a comma and a space) into a single, cohesive entity span.
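The distance-based merging described above can be sketched roughly as follows. The entity dictionaries mimic the output shape of the `transformers` NER pipeline (`entity_group`, `start`, `end`, `word`); apart from the +2 gap threshold taken from this card, the helper name and details are illustrative assumptions, not the model's actual implementation.

```python
# Hypothetical sketch of distance-based entity merging (not PLVeil's internal code).
# Same-type entities whose character spans are at most `max_gap` apart are fused.

def merge_adjacent_entities(entities, max_gap=2):
    """Merge same-type entities separated by at most `max_gap` characters."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        if (merged
                and ent["entity_group"] == merged[-1]["entity_group"]
                and ent["start"] - merged[-1]["end"] <= max_gap):
            # Extend the previous entity instead of starting a new one.
            merged[-1]["end"] = ent["end"]
            merged[-1]["word"] += " " + ent["word"]
        else:
            merged.append(dict(ent))
    return merged

# "Janusz" and "Kowalski" split by a double space (gap of 2 characters):
split = [
    {"entity_group": "PER", "start": 25, "end": 31, "word": "Janusz"},
    {"entity_group": "PER", "start": 33, "end": 41, "word": "Kowalski"},
]
print(merge_adjacent_entities(split))  # one merged PER entity: "Janusz Kowalski"
```

With a +1 threshold the two tokens above would remain separate entities; the +2 setting bridges the double space.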
## Intended Use
PLVeil is intended to power data-privacy pipelines. Its goal is to give developers, researchers, and enterprises a solid baseline for converting risky, personalized datasets into safe, pseudonymized text that can be stored securely or analyzed without exposing individuals.
## How to use
You can instantiate inference using the `transformers` pipeline for token classification:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("OskarBartoszyk/PLVeil")
model = AutoModelForTokenClassification.from_pretrained("OskarBartoszyk/PLVeil")

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

example = "Prezydentem jest dzisiaj Janusz Kowalski, zameldowany w Gnieźnie."
ner_results = nlp(example)
print(ner_results)
```
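To turn the pipeline output into anonymized text, the detected spans can be replaced with placeholders. A minimal sketch, assuming the entities carry `start`/`end` character offsets as returned by the pipeline; the `[PER]`/`[LOC]` placeholder format and the `mask_text` helper are illustrative assumptions, not part of the model.

```python
# Hypothetical masking step on top of the NER pipeline output.

def mask_text(text, entities):
    """Replace each detected entity span with a [LABEL] placeholder."""
    masked = text
    # Replace from the end of the string so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        masked = masked[:ent["start"]] + f"[{ent['entity_group']}]" + masked[ent["end"]:]
    return masked

text = "Prezydentem jest dzisiaj Janusz Kowalski, zameldowany w Gnieźnie."
entities = [
    {"entity_group": "PER", "start": 25, "end": 40},
    {"entity_group": "LOC", "start": 56, "end": 64},
]
print(mask_text(text, entities))
# → Prezydentem jest dzisiaj [PER], zameldowany w [LOC].
```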