EU PII Anonymization Multilingual Detector
A multilingual PII detector built for EU compliance reality: 24 official languages, GDPR special categories, AI Act high-risk data — in one model.
Why this exists
GDPR and the AI Act require you to detect and redact personal data across every language your users write in. Most open-source PII models were trained on English with a few translations bolted on, and they cover the basics — names, emails, phone numbers — while missing exactly the categories regulators care about: biometric data, genetic data, health information, political opinions, ethnic origin.
bardsai/eu-pii-anonimization-multilang is trained end-to-end on real multilingual data (not English-translated), covers 36 entity classes mapped to GDPR Article 9 special categories and AI Act high-risk identifiers, and ships with quantized ONNX weights so you can run it in production pipelines without GPU infrastructure.
What's different
- Native multilingual training. Real text in EU languages. Performance on Polish, German, French, Italian, and Spanish is comparable to the English baseline.
- GDPR special categories covered. Health, biometric, genetic, and other Article 9 entities that most OSS PII models skip entirely.
- Production-ready. ONNX export and INT8 quantized weights included. Runs on CPU at latencies that work inside RAG ingestion or real-time redaction pipelines.
Who this is for
Compliance and privacy engineers at EU companies who need to:
- Redact PII from documents, support tickets, emails, and chat logs before storage or analysis
- Sanitize datasets before training, sharing, or moving across jurisdictions
- Filter inputs to RAG pipelines and search indexes so personal data doesn't leak into prompts or logs
- Build audit trails for what was redacted, when, and why
Quick start
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
model_name = "bardsai/eu-pii-anonimization-multilang"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
text = "John Smith, passport AB123456, phone +48 123 456 789"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)
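If you need whole entity spans rather than per-token tags, the transformers token-classification pipeline can merge B-/I- predictions into aggregated entities, which makes simple redaction straightforward. A minimal sketch (same example text as above; exact spans and scores depend on the model):

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="bardsai/eu-pii-anonimization-multilang",
    aggregation_strategy="simple",
)

text = "John Smith, passport AB123456, phone +48 123 456 789"
entities = ner(text)

# Replace detected spans with their labels, right to left so that
# character offsets stay valid while the string is being edited.
redacted = text
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    redacted = redacted[:ent["start"]] + f"[{ent['entity_group']}]" + redacted[ent["end"]:]

print(redacted)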
For production, use the quantized ONNX weights in onnx/model_quantized.onnx — same outputs, ~4x smaller, CPU-friendly latency.
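A minimal sketch of running the quantized export directly with onnxruntime (the file path matches the Files list below; the input feed is built from whatever inputs the exported graph declares, so exact input names are not assumed here):

import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoConfig, AutoTokenizer

model_name = "bardsai/eu-pii-anonimization-multilang"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

# Download the INT8 export and open a CPU-only session.
onnx_path = hf_hub_download(model_name, "onnx/model_quantized.onnx")
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

text = "John Smith, passport AB123456, phone +48 123 456 789"
encoded = tokenizer(text, return_tensors="np", truncation=True)

# Feed only the inputs the graph actually declares (typically
# input_ids and attention_mask for this kind of export).
feed = {i.name: encoded[i.name] for i in session.get_inputs()}
logits = session.run(None, feed)[0]

labels = [config.id2label[int(i)] for i in logits.argmax(axis=-1)[0]]
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)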
Entity coverage
36 classes across eight families, with B-/I- BIO labeling:
- Personal identity — names, dates of birth, national ID numbers
- Contact and location — addresses, emails, phone numbers, geolocation
- Official documents — passports, driver's licenses, tax IDs
- Financial — IBAN, credit card, account numbers
- Technical identifiers — IP addresses, MAC addresses, device IDs, usernames
- Organization data — employer, institutional affiliations
- Health, biometric, genetic (GDPR Art. 9) — medical conditions, biometric identifiers, genetic data
- Special-category (GDPR Art. 9) — racial/ethnic origin, political opinions, religious beliefs, sexual orientation, trade union membership
Full label list in config.json (id2label / label2id).
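A quick way to inspect the label set programmatically, collapsing the B-/I- prefixes into the underlying classes:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("bardsai/eu-pii-anonimization-multilang")

# Drop the "O" tag and the B-/I- prefixes to list the distinct entity classes.
classes = sorted({label.split("-", 1)[-1] for label in config.id2label.values() if label != "O"})
print(len(classes))
print(classes)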
Limitations
This is a model, not a compliance program. A few things to keep in mind:
- Performance varies by language, domain, and input quality. OCR noise, code-switching, and unusual formatting will degrade recall.
- Ambiguous mentions (a name that's also a place, an ID-shaped number that isn't an ID) need post-processing rules or human review.
- Detection ≠ legal sufficiency. Use this to support a redaction workflow, not to replace your DPO's judgment.
- Threshold tuning matters. The right operating point depends on whether you're optimizing for recall (compliance) or precision (data utility).
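As one example of threshold tuning, the aggregated pipeline reports a confidence score per entity, and filtering on it moves the operating point between recall and precision. A minimal sketch (the 0.5 threshold is purely illustrative and should be swept on labeled validation data from your domain):

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="bardsai/eu-pii-anonimization-multilang",
    aggregation_strategy="simple",
)

MIN_SCORE = 0.5  # illustrative only; lower favors recall, higher favors precision

entities = ner("John Smith, passport AB123456, phone +48 123 456 789")
for ent in entities:
    if ent["score"] >= MIN_SCORE:
        print(f"{ent['entity_group']:>15}  {ent['score']:.2f}  {ent['word']}")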
Files
- model.safetensors — model weights
- config.json — config and label mapping
- tokenizer.json, tokenizer_config.json — tokenizer assets
- onnx/model.onnx — ONNX export
- onnx/model_quantized.onnx — INT8 quantized for CPU production
- training_args.bin — training metadata
Citation
@misc{bards.ai_2026,
author = { bards.ai and Michał Swędrowski and Michał Pogoda-Rosikoń and Karol Samorański },
title = { eu-pii-anonimization-multilang (Revision 6de9f68) },
year = 2026,
url = { https://huggingface.co/bardsai/eu-pii-anonimization-multilang },
doi = { 10.57967/hf/8721 },
publisher = { Hugging Face }
}
About bards.ai
We build product ML for teams shipping AI to real users — RAG, agents, fine-tuned models, evals, and the unglamorous infrastructure that keeps them working. 16+ open models on Hugging Face, 10+ publications, production deployments at Comcast, Chili Piper, and Surfer SEO.