eu-pii-anonimization

**Polish PII and Sensitive Data Detection Model**

bardsai/eu-pii-anonimization is a production-oriented token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity data in Polish-language text. It is designed for privacy-preserving NLP workflows, document redaction, secure data processing, and compliance-driven preprocessing pipelines.

Built on top of XLM-RoBERTa-base, this model identifies a broad set of sensitive entities across personal, financial, identity, health, geolocation, authentication, and special-category personal data. The model is intended for organizations that need reliable Polish-first detection of sensitive content before indexing, analytics, model training, or downstream AI processing.

This is version v0.1 of this model; we will continue to refine and improve it.

Key Highlights

  • Language support: Polish
  • Task: Token classification
  • Base model: XLM-RoBERTa-base
  • Global F1 score: 95%
  • Entity schema: 35 sensitive-data classes

Intended Use

The model is designed for automated detection and labeling of sensitive spans in Polish text, including both classic PII and higher-risk regulated categories.

Typical use cases include:

  • PII redaction in documents, tickets, emails, and chat logs
  • Dataset sanitization before model training, annotation, or analytics
  • Compliance workflows supporting GDPR-oriented processing controls
  • Enterprise data governance and sensitive-content discovery
  • Document intake pipelines for finance, healthcare, legal, HR, and public-sector use cases
  • Pre-ingestion filtering for search, retrieval, and RAG systems
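For the redaction use case above, a common pattern is to turn the model's predictions into character spans and replace them with type placeholders. The sketch below is illustrative: the spans are hard-coded here, but in practice they would come from the model's token-level output with character offsets; the `redact` helper is a hypothetical name, not part of the model's API.

```python
# Minimal redaction sketch: replace detected spans with type placeholders.
# Spans are hard-coded for illustration; in a real pipeline they would be
# derived from the model's token-level predictions and character offsets.

def redact(text, spans):
    """Replace each (label, start, end) span with a [LABEL] placeholder.

    Spans are applied right-to-left so earlier character offsets stay valid.
    """
    for label, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "Jan Kowalski, tel 123 456 789"
spans = [("PERSON_NAME", 0, 12), ("PHONE_NUMBER", 18, 29)]
print(redact(text, spans))  # [PERSON_NAME], tel [PHONE_NUMBER]
```

Replacing spans from right to left avoids recomputing offsets after each substitution.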

Supported Language

  • Polish (pl)

The model is optimized for Polish-language content, including business, administrative, operational, and user-generated text. Real-world performance may vary depending on domain vocabulary, formatting quality, OCR noise, abbreviations, and annotation conventions.

Detected Entity Types

The model detects the following 35 classes:

Personal Identity and Profile

  • PERSON_NAME
  • PERSON_IDENTIFIER
  • DATE_OF_BIRTH
  • PERSON_ATTRIBUTE
  • PERSON_ALIAS
  • PERSON_ROLE_OR_TITLE

Organizations

  • ORGANIZATION_NAME
  • ORGANIZATION_IDENTIFIER

Contact and Location

  • EMAIL_ADDRESS
  • PHONE_NUMBER
  • CONTACT_HANDLE
  • POSTAL_ADDRESS
  • LOCATION
  • GEO_LOCATION

Technical and Digital Identifiers

  • IP_ADDRESS
  • DEVICE_IDENTIFIER
  • COOKIE_IDENTIFIER
  • ACCOUNT_IDENTIFIER
  • AUTH_SECRET

Financial and Commercial Data

  • BANK_ACCOUNT_IDENTIFIER
  • PAYMENT_CARD
  • PAYMENT_CARD_SECURITY
  • FINANCIAL_AMOUNT
  • INCOME_COMPENSATION

Documents and Official References

  • DOCUMENT_REFERENCE
  • VEHICLE_IDENTIFIER

Health and Biometric Data

  • HEALTH_DATA
  • GENETIC_DATA
  • BIOMETRIC_DATA

Special-Category and Highly Sensitive Personal Data

  • RELIGION_OR_BELIEF
  • POLITICAL_OPINION
  • SEXUAL_ORIENTATION
  • TRADE_UNION_MEMBERSHIP
  • ETHNIC_ORIGIN
  • CRIMINAL_OFFENCE_DATA

Model Architecture

  • Architecture family: Transformer
  • Base checkpoint: FacebookAI/xlm-roberta-base
  • Fine-tuning task: Token classification / named-entity-style span labeling
  • Primary focus: Detection of PII and regulated sensitive entities in Polish

As a token classification model, it predicts entity labels at the token level and supports span reconstruction through standard BIO-style or token-aligned post-processing, depending on the deployment pipeline.
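The BIO-style post-processing mentioned above can be sketched as follows. This is a minimal illustration, assuming the model emits standard BIO tags (e.g. `B-PERSON_NAME`, `I-PERSON_NAME`) and that per-token character offsets are available (for example from a fast tokenizer's `return_offsets_mapping`); the offsets and labels below are hard-coded for demonstration.

```python
# Minimal sketch of BIO-style span reconstruction from token-level labels.
# Assumes BIO tags such as "B-PERSON_NAME" / "I-PERSON_NAME" and character
# offsets per token (special tokens have empty (n, n) offsets and are skipped).

def merge_bio_spans(text, offsets, labels):
    """Group consecutive B-/I- tokens of the same type into character spans."""
    spans = []
    current = None  # open span as [entity_type, start, end]
    for (start, end), label in zip(offsets, labels):
        if start == end:  # special tokens such as <s> and </s>
            continue
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = [label[2:], start, end]
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[2] = end  # extend the open span
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, text[s:e]) for etype, s, e in spans]

text = "Jan Kowalski, tel 123 456 789"
offsets = [(0, 0), (0, 3), (4, 12), (12, 13), (14, 17),
           (18, 21), (22, 25), (26, 29), (29, 29)]
labels = ["O", "B-PERSON_NAME", "I-PERSON_NAME", "O", "O",
          "B-PHONE_NUMBER", "I-PHONE_NUMBER", "I-PHONE_NUMBER", "O"]
print(merge_bio_spans(text, offsets, labels))
# [('PERSON_NAME', 'Jan Kowalski'), ('PHONE_NUMBER', '123 456 789')]
```

In a deployment pipeline, the same merging can also be delegated to the Transformers `pipeline("token-classification", ...)` with an `aggregation_strategy`, but the manual version above makes the span logic explicit.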

Quick Start

Example usage with Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/eu-pii-anonimization"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Jan Kowalski, PESEL 85010112345, tel 123 456 789"

inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map predicted label ids back to label names
input_ids = inputs["input_ids"][0]
tokens = tokenizer.convert_ids_to_tokens(input_ids)
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print every token predicted as part of a sensitive entity
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)

Limitations

While this model is optimized for robust Polish PII detection, the following limitations should be considered:

  • The model may be sensitive to annotation-policy differences between organizations.
  • Ambiguous phrases may require context-aware post-processing.
  • OCR artifacts, broken formatting, or token fragmentation can reduce span accuracy.
  • Some classes, especially sensitive contextual categories, may require human validation in compliance-heavy environments.
  • The model should not be treated as a standalone legal or compliance decision-maker.

About bards.ai

At bards.ai, we focus on providing machine learning expertise and skills to our partners, particularly in the areas of NLP, machine vision, and time series analysis. Our team is located in Wroclaw, Poland. Please visit our website for more information: bards.ai.

Let us know if you use our model :). Also, if you need any help, feel free to contact us at info@bards.ai.
