eu-pii-anonimization

**Polish PII and Sensitive Data Detection Model**

bardsai/eu-pii-anonimization is a production-oriented token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity data in Polish-language text. It is designed for privacy-preserving NLP workflows, document redaction, secure data processing, and compliance-driven preprocessing pipelines.

Built on top of XLM-RoBERTa-base, this model identifies a broad set of sensitive entities across personal, financial, identity, health, geolocation, authentication, and special-category personal data. The model is intended for organizations that need reliable Polish-first detection of sensitive content before indexing, analytics, model training, or downstream AI processing.

This is version v0.1 of this model; we will continue to refine and improve it.

Key Highlights

  • Language support: Polish
  • Task: Token classification
  • Base model: XLM-RoBERTa-base
  • Global F1 score: 95%
  • Entity schema: 35 sensitive-data classes

Intended Use

The model is designed for automated detection and labeling of sensitive spans in Polish text, including both classic PII and higher-risk regulated categories.

Typical use cases include:

  • PII redaction in documents, tickets, emails, and chat logs
  • Dataset sanitization before model training, annotation, or analytics
  • Compliance workflows supporting GDPR-oriented processing controls
  • Enterprise data governance and sensitive-content discovery
  • Document intake pipelines for finance, healthcare, legal, HR, and public-sector use cases
  • Pre-ingestion filtering for search, retrieval, and RAG systems
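For the redaction use case above, a common pattern is to turn the model's predictions into character spans and replace them with type placeholders. The sketch below is illustrative: the spans are hard-coded here, but in practice they would come from the model's token-level output with character offsets; the `redact` helper is a hypothetical name, not part of the model's API.

```python
# Minimal redaction sketch: replace detected spans with type placeholders.
# Spans are hard-coded for illustration; in a real pipeline they would be
# derived from the model's token-level predictions and character offsets.

def redact(text, spans):
    """Replace each (label, start, end) span with a [LABEL] placeholder.

    Spans are applied right-to-left so earlier character offsets stay valid.
    """
    for label, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "Jan Kowalski, tel 123 456 789"
spans = [("PERSON_NAME", 0, 12), ("PHONE_NUMBER", 18, 29)]
print(redact(text, spans))  # [PERSON_NAME], tel [PHONE_NUMBER]
```

Replacing spans from right to left avoids recomputing offsets after each substitution.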

Supported Language

  • Polish (pl)

The model is optimized for Polish-language content, including business, administrative, operational, and user-generated text. Real-world performance may vary depending on domain vocabulary, formatting quality, OCR noise, abbreviations, and annotation conventions.

Detected Entity Types

The model detects the following 35 classes:

Personal Identity and Profile

  • PERSON_NAME
  • PERSON_IDENTIFIER
  • DATE_OF_BIRTH
  • PERSON_ATTRIBUTE
  • PERSON_ALIAS
  • PERSON_ROLE_OR_TITLE

Organizations

  • ORGANIZATION_NAME
  • ORGANIZATION_IDENTIFIER

Contact and Location

  • EMAIL_ADDRESS
  • PHONE_NUMBER
  • CONTACT_HANDLE
  • POSTAL_ADDRESS
  • LOCATION
  • GEO_LOCATION

Technical and Digital Identifiers

  • IP_ADDRESS
  • DEVICE_IDENTIFIER
  • COOKIE_IDENTIFIER
  • ACCOUNT_IDENTIFIER
  • AUTH_SECRET

Financial and Commercial Data

  • BANK_ACCOUNT_IDENTIFIER
  • PAYMENT_CARD
  • PAYMENT_CARD_SECURITY
  • FINANCIAL_AMOUNT
  • INCOME_COMPENSATION

Documents and Official References

  • DOCUMENT_REFERENCE
  • VEHICLE_IDENTIFIER

Health and Biometric Data

  • HEALTH_DATA
  • GENETIC_DATA
  • BIOMETRIC_DATA

Special-Category and Highly Sensitive Personal Data

  • RELIGION_OR_BELIEF
  • POLITICAL_OPINION
  • SEXUAL_ORIENTATION
  • TRADE_UNION_MEMBERSHIP
  • ETHNIC_ORIGIN
  • CRIMINAL_OFFENCE_DATA

Model Architecture

  • Architecture family: Transformer
  • Base checkpoint: FacebookAI/xlm-roberta-base
  • Fine-tuning task: Token classification / named-entity-style span labeling
  • Primary focus: Detection of PII and regulated sensitive entities in Polish

As a token classification model, it predicts entity labels at the token level and supports span reconstruction through standard BIO-style or token-aligned post-processing, depending on the deployment pipeline.
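The BIO-style post-processing mentioned above can be sketched as follows. This is a minimal illustration, assuming the model emits standard BIO tags (e.g. `B-PERSON_NAME`, `I-PERSON_NAME`) and that per-token character offsets are available (for example from a fast tokenizer's `return_offsets_mapping`); the offsets and labels below are hard-coded for demonstration.

```python
# Minimal sketch of BIO-style span reconstruction from token-level labels.
# Assumes BIO tags such as "B-PERSON_NAME" / "I-PERSON_NAME" and character
# offsets per token (special tokens have empty (n, n) offsets and are skipped).

def merge_bio_spans(text, offsets, labels):
    """Group consecutive B-/I- tokens of the same type into character spans."""
    spans = []
    current = None  # open span as [entity_type, start, end]
    for (start, end), label in zip(offsets, labels):
        if start == end:  # special tokens such as <s> and </s>
            continue
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = [label[2:], start, end]
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[2] = end  # extend the open span
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, text[s:e]) for etype, s, e in spans]

text = "Jan Kowalski, tel 123 456 789"
offsets = [(0, 0), (0, 3), (4, 12), (12, 13), (14, 17),
           (18, 21), (22, 25), (26, 29), (29, 29)]
labels = ["O", "B-PERSON_NAME", "I-PERSON_NAME", "O", "O",
          "B-PHONE_NUMBER", "I-PHONE_NUMBER", "I-PHONE_NUMBER", "O"]
print(merge_bio_spans(text, offsets, labels))
# [('PERSON_NAME', 'Jan Kowalski'), ('PHONE_NUMBER', '123 456 789')]
```

In a deployment pipeline, the same merging can also be delegated to the Transformers `pipeline("token-classification", ...)` with an `aggregation_strategy`, but the manual version above makes the span logic explicit.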

Quick Start

Example usage with Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/eu-pii-anonimization"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Jan Kowalski, PESEL 85010112345, tel 123 456 789"

inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map predicted label ids back to label names
input_ids = inputs["input_ids"][0]
tokens = tokenizer.convert_ids_to_tokens(input_ids)
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print every token predicted as part of a sensitive entity
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)

Limitations

While this model is optimized for robust Polish PII detection, the following limitations should be considered:

  • The model may be sensitive to annotation-policy differences between organizations.
  • Ambiguous phrases may require context-aware post-processing.
  • OCR artifacts, broken formatting, or token fragmentation can reduce span accuracy.
  • Some classes, especially sensitive contextual categories, may require human validation in compliance-heavy environments.
  • The model should not be treated as a standalone legal or compliance decision-maker.

About bards.ai

At bards.ai, we focus on providing machine learning expertise and skills to our partners, particularly in the areas of NLP, machine vision, and time series analysis. Our team is located in Wroclaw, Poland. Please visit our website for more information: bards.ai.

Let us know if you use our model :). Also, if you need any help, feel free to contact us at info@bards.ai.
