Polish PII Classification Model

This model is a fine-tuned version of polish-roberta-8k for Named Entity Recognition (NER), focused on identifying Personally Identifiable Information (PII) in Polish text.

The model identifies entities such as names, locations, organizations, and contact numbers, making it ideal for anonymization pipelines and data privacy tasks.

Model Details

  • Architecture: RobertaForTokenClassification
  • Base Model: polish-roberta-8k
  • Language: Polish
  • Task: Token Classification (NER)
  • Context Length: Up to 8192 tokens (inherited from base model)
  • Parameters: ~0.4B (safetensors, F32)

Label Mapping

The model classifies tokens into the following 9 categories:

ID  Label         Description
0   O             Outside any entity
1   CONTACT/NUM   Phone numbers, emails, etc.
2   EVENT         Named events
3   FACILITY      Buildings, airports, bridges, etc.
4   LOCATION      Cities, countries, regions
5   ORGANIZATION  Companies, institutions, groups
6   OTHER         Miscellaneous PII
7   PERSON        People's names
8   PRODUCT       Brand names, products
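The mapping above can be expressed directly in Python. A minimal sketch (the `decode_predictions` helper is illustrative, not part of the released code):

```python
# Label map mirroring the table above (keys are the model's output class ids).
ID2LABEL = {
    0: "O",
    1: "CONTACT/NUM",
    2: "EVENT",
    3: "FACILITY",
    4: "LOCATION",
    5: "ORGANIZATION",
    6: "OTHER",
    7: "PERSON",
    8: "PRODUCT",
}

def decode_predictions(class_ids):
    """Map a sequence of predicted class ids to their string labels."""
    return [ID2LABEL[i] for i in class_ids]
```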

Training Data

The model was trained on a generalized version of the KPWr-NER dataset (clarin-pl/kpwr-ner).

Fine-grained labels from the original dataset were mapped to the general categories listed above using a custom mapping process to improve the model's robustness and utility for general anonymization.
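The generalization step might look like the sketch below. The prefix table and the `generalize` helper are hypothetical, shown only to illustrate the idea; the actual mapping used for training lives in the repository. KPWr-NER uses fine-grained BIO tags such as `B-nam_liv_person` or `I-nam_loc_gpe_city`, whose prefixes indicate the broad entity type:

```python
# Hypothetical fine-to-coarse mapping; the real table is in the
# anonymizer-model repository. KPWr fine labels share prefixes such as
# nam_liv_* (persons), nam_loc_* (locations), nam_org_* (organizations).
COARSE_BY_PREFIX = {
    "nam_liv": "PERSON",
    "nam_loc": "LOCATION",
    "nam_org": "ORGANIZATION",
    "nam_fac": "FACILITY",
    "nam_eve": "EVENT",
    "nam_pro": "PRODUCT",
    "nam_num": "CONTACT/NUM",
}

def generalize(tag: str) -> str:
    """Map a BIO-tagged fine label (e.g. 'B-nam_liv_person') to a coarse one."""
    if tag == "O":
        return "O"
    bio, _, fine = tag.partition("-")  # e.g. ('B', '-', 'nam_liv_person')
    for prefix, coarse in COARSE_BY_PREFIX.items():
        if fine.startswith(prefix):
            return f"{bio}-{coarse}"
    return f"{bio}-OTHER"  # anything unmatched falls into the OTHER bucket
```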

Training Hyperparameters

Hyperparameter            Value
Epochs                    5
Batch Size (Train/Eval)   16 / 16
Learning Rate             5e-5
Weight Decay              0.01
Max Sequence Length       128
Seed                      42
Optimization Metric       f1_macro
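Shaped as the keyword arguments a Hugging Face `TrainingArguments` object would accept, the configuration above looks roughly like this (a sketch; the actual training script is in the repository):

```python
# Training configuration mirroring the table above, keyed by the
# corresponding transformers TrainingArguments parameter names.
training_config = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "seed": 42,
    "metric_for_best_model": "f1_macro",
}
# Max sequence length (128) is applied at tokenization time, not here.
max_seq_length = 128
```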

Evaluation Results

The model achieved the following results on the evaluation set:

Metric              Value
F1 Macro            0.8430
F1 Weighted         0.9784
Precision Macro     0.8480
Precision Weighted  0.9786
Recall Macro        0.8461
Recall Weighted     0.9787
Eval Loss           0.1099
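The gap between macro and weighted scores reflects class imbalance: the O class dominates token-level NER data, so weighted averages are pulled up by the majority class while macro treats all nine classes equally. A self-contained sketch of the two averaging modes on toy counts (the numbers are illustrative, not the model's):

```python
def f1(tp, fp, fn):
    """Per-class F1 from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy counts: a dominant class scored well, a rare class scored poorly.
counts = {"O": (980, 5, 5), "PERSON": (8, 4, 4)}  # class -> (tp, fp, fn)
support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}
scores = {c: f1(*v) for c, v in counts.items()}

macro = sum(scores.values()) / len(scores)
total = sum(support.values())
weighted = sum(scores[c] * support[c] / total for c in scores)
# macro averages classes equally; weighted is dominated by the large O class,
# so weighted >> macro under imbalance, as in the table above.
```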

Usage

1. Standard Transformers Pipeline

You can use the model directly via the Hugging Face transformers library:

from transformers import pipeline

# Load the model and tokenizer into a token-classification pipeline
model_id = "radlab/pii-pl-v1.0"
pii_pipeline = pipeline("ner", model=model_id, tokenizer=model_id, aggregation_strategy="simple")

text = "Jan Kowalski z firmy PKO BP mieszka w Warszawie i dowiedza ciocię Jadzię w Warszawie razem z synem Dominikiem."
results = pii_pipeline(text)

for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")

Expected output:

Entity: Jan Kowalski | Label: PERSON | Score: 0.9981
Entity: PKO BP | Label: ORGANIZATION | Score: 0.9959
Entity: Warszawie | Label: LOCATION | Score: 0.9974
Entity: Jadzię | Label: PERSON | Score: 0.9955
Entity: Warszawie | Label: LOCATION | Score: 0.9974
Entity: Dominikiem | Label: PERSON | Score: 0.9942
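Besides `word`, `entity_group`, and `score`, each pipeline result carries `start`/`end` character offsets, which are enough for a simple redaction pass. A sketch, with a hard-coded result list standing in for a live model call so it runs offline:

```python
# Redact detected entities using the character offsets ('start'/'end')
# returned by the transformers NER pipeline.
text = "Jan Kowalski z firmy PKO BP mieszka w Warszawie."
results = [
    {"entity_group": "PERSON", "start": 0, "end": 12},
    {"entity_group": "ORGANIZATION", "start": 21, "end": 27},
    {"entity_group": "LOCATION", "start": 38, "end": 47},
]

def redact(text, entities):
    """Replace each entity span with its label, right to left so earlier offsets stay valid."""
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[:ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out
```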

2. Using the pii_classification Inference Wrapper

For advanced features like punctuation cleaning, sub-token merging, and full anonymization, use the AnonPredictor from the official repository.

Installation:

git clone https://github.com/radlab-dev-group/anonymizer-model.git
cd anonymizer-model
pip install . --break-system-packages

Inference Code:

from pii_classification.inference.inference import AnonPredictor

# Initialize predictor
predictor = AnonPredictor(model_path="radlab/pii-pl-v1.0")

text = "Jan Kowalski z firmy PKO BP mieszka w Warszawie i dowiedza ciocię Jadzię w Warszawie razem z synem Dominikiem."

# 1. Get detailed predictions (merged entities)
predictions = predictor.predict(
    text, clean_punct=True, merge_entities=True, handle_gaps=True
)
print(predictions)

# 2. Full anonymization
anonymized = predictor.predict_and_anonymize(text)
print(f"Anonymized Text: {anonymized['text']}")
print(f"Mappings: {anonymized['mappings']}")

Expected output:

[{'word': 'Jan Kowalski', 'label': 'PERSON'}, {'word': ' ', 'label': 'O'}, {'word': 'z', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'firmy', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'PKO BP', 'label': 'ORGANIZATION'}, {'word': ' ', 'label': 'O'}, {'word': 'mieszka', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'w', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'Warszawie', 'label': 'LOCATION'}, {'word': ' ', 'label': 'O'}, {'word': 'i', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'dowiedza', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'ciocię', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'Jadzię', 'label': 'PERSON'}, {'word': ' ', 'label': 'O'}, {'word': 'w', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'Warszawie', 'label': 'LOCATION'}, {'word': ' ', 'label': 'O'}, {'word': 'razem', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'z', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'synem', 'label': 'O'}, {'word': ' ', 'label': 'O'}, {'word': 'Dominikiem', 'label': 'PERSON'}, {'word': '.', 'label': 'O'}]
Anonymized Text: {PERSON_1776877218745866_1} z firmy {ORGANIZATION_1776877218745866_2} mieszka w {LOCATION_1776877218745866_3} i dowiedza ciocię {PERSON_1776877218745866_4} w {LOCATION_1776877218745866_5} razem z synem {PERSON_1776877218745866_6}.
Mappings: {'PERSON_1776877218745866_1': 'Jan Kowalski', 'ORGANIZATION_1776877218745866_2': 'PKO BP', 'LOCATION_1776877218745866_3': 'Warszawie', 'PERSON_1776877218745866_4': 'Jadzię', 'LOCATION_1776877218745866_5': 'Warszawie', 'PERSON_1776877218745866_6': 'Dominikiem'}
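Because the mappings dictionary pairs each placeholder key with the original surface form, the anonymization is reversible. A sketch, assuming placeholders appear as `{KEY}` exactly as in the output above (the short keys here are illustrative):

```python
# Restore the original text from an anonymized string and its mappings.
def deanonymize(anonymized_text, mappings):
    """Substitute each '{KEY}' placeholder with its original surface form."""
    out = anonymized_text
    for key, original in mappings.items():
        out = out.replace("{" + key + "}", original)
    return out

anonymized = "{PERSON_1} z firmy {ORGANIZATION_2}."
mappings = {"PERSON_1": "Jan Kowalski", "ORGANIZATION_2": "PKO BP"}
restored = deanonymize(anonymized, mappings)
```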

Repository and Code

The full training pipeline, data conversion scripts, and inference API are available here: 👉 https://github.com/radlab-dev-group/anonymizer-model
