EU PII Anonymization Multilingual Detector

A multilingual PII detector built for EU compliance reality: 24 official languages, GDPR special categories, AI Act high-risk data — in one model.


Try it

🔗 Live demo in the browser

Why this exists

GDPR and the AI Act require you to detect and redact personal data across every language your users write in. Most open-source PII models were trained on English with a few translations bolted on, and they cover the basics — names, emails, phone numbers — while missing exactly the categories regulators care about: biometric data, genetic data, health information, political opinions, ethnic origin.

bardsai/eu-pii-anonimization-multilang is trained end-to-end on real multilingual data (not English-translated), covers 36 entity classes mapped to GDPR Article 9 special categories and AI Act high-risk identifiers, and ships with quantized ONNX weights so you can run it in production pipelines without GPU infrastructure.

What's different

  • Native multilingual training. Real text in EU languages. Performance on Polish, German, French, Italian, and Spanish is comparable to the English baseline.
  • GDPR special categories covered. Health, biometric, genetic, and other Article 9 entities that most OSS PII models skip entirely.
  • Production-ready. ONNX export and INT8 quantized weights included. Runs on CPU at latencies that work inside RAG ingestion or real-time redaction pipelines.

Who this is for

Compliance and privacy engineers at EU companies who need to:

  • Redact PII from documents, support tickets, emails, and chat logs before storage or analysis
  • Sanitize datasets before training, sharing, or moving across jurisdictions
  • Filter inputs to RAG pipelines and search indexes so personal data doesn't leak into prompts or logs
  • Build audit trails for what was redacted, when, and why
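Most of these workflows end in the same operation: replacing detected character spans with placeholder tags. A minimal sketch of that step, assuming you already have `(start, end, label)` spans from post-processing the model output (the `redact` helper and span format here are illustrative, not part of the model's API):

```python
def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a [LABEL] placeholder.

    Spans are applied right-to-left so earlier offsets stay valid
    after each replacement.
    """
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "John Smith, passport AB123456"
spans = [(0, 10, "NAME"), (21, 29, "PASSPORT")]
print(redact(text, spans))  # → [NAME], passport [PASSPORT]
```

Keeping the label in the placeholder (rather than a generic mask) preserves enough structure for downstream audit trails.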

Quick start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/eu-pii-anonimization-multilang"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "John Smith, passport AB123456, phone +48 123 456 789"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print only tokens tagged with a PII label ("O" = outside any entity)
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)
```
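The loop above prints one label per subword token; downstream you usually want contiguous entity spans instead. A minimal sketch merging B-/I- tags into spans (the helper name and example labels are illustrative — the real label strings come from `model.config.id2label`):

```python
def merge_bio(tokens: list[str], labels: list[str]) -> list[tuple[str, list[str]]]:
    """Group B-/I- token labels into (entity_type, tokens) spans."""
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            current = (label[2:], [token])
            spans.append(current)
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)
        else:  # "O", or an I- tag with no matching open span
            current = None
    return spans

tokens = ["John", "Smith", ",", "AB", "##123456"]
labels = ["B-NAME", "I-NAME", "O", "B-PASSPORT", "I-PASSPORT"]
print(merge_bio(tokens, labels))
# → [('NAME', ['John', 'Smith']), ('PASSPORT', ['AB', '##123456'])]
```

From there, `tokenizer(text, return_offsets_mapping=True)` gives you the character offsets needed to map spans back onto the original text for redaction.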

For production, use the quantized ONNX weights in onnx/model_quantized.onnx — same outputs, ~4x smaller, CPU-friendly latency.

Entity coverage

36 classes across eight families, with B-/I- BIO labeling:

  • Personal identity — names, dates of birth, national ID numbers
  • Contact and location — addresses, emails, phone numbers, geolocation
  • Official documents — passports, driver's licenses, tax IDs
  • Financial — IBAN, credit card, account numbers
  • Technical identifiers — IP addresses, MAC addresses, device IDs, usernames
  • Organization data — employer, institutional affiliations
  • Health, biometric, genetic (GDPR Art. 9) — medical conditions, biometric identifiers, genetic data
  • Special-category (GDPR Art. 9) — racial/ethnic origin, political opinions, religious beliefs, sexual orientation, trade union membership

Full label list in config.json (id2label / label2id).
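If you need the set of base entity types rather than the raw BIO tags, you can derive it from `id2label`. A minimal sketch with a hypothetical two-entity mapping (the real mapping in config.json covers all 36 classes):

```python
# Hypothetical subset of config.json's id2label for illustration
id2label = {0: "O", 1: "B-IBAN", 2: "I-IBAN", 3: "B-HEALTH", 4: "I-HEALTH"}

# Strip the B-/I- prefix and deduplicate to get the base entity types
entity_types = sorted({
    label.split("-", 1)[1]
    for label in id2label.values()
    if label != "O"
})
print(entity_types)  # → ['HEALTH', 'IBAN']
```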

Limitations

This is a model, not a compliance program. A few things to keep in mind:

  • Performance varies by language, domain, and input quality. OCR noise, code-switching, and unusual formatting will degrade recall.
  • Ambiguous mentions (a name that's also a place, an ID-shaped number that isn't an ID) need post-processing rules or human review.
  • Detection ≠ legal sufficiency. Use this to support a redaction workflow, not to replace your DPO's judgment.
  • Threshold tuning matters. The right operating point depends on whether you're optimizing for recall (compliance) or precision (data utility).
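One simple way to pick an operating point is to threshold the softmax confidence of the argmax class and fall back to "O" below the cutoff. A minimal sketch on dummy logits (the shapes, label map, and 0.6 cutoff are illustrative, not tuned values):

```python
import numpy as np

def predict_with_threshold(logits: np.ndarray, id2label: dict[int, str],
                           threshold: float = 0.5) -> list[str]:
    """Argmax per token, but emit 'O' when the top softmax prob is below threshold."""
    # Numerically stable softmax over the label dimension
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    labels = []
    for row in probs:
        top = int(row.argmax())
        labels.append(id2label[top] if row[top] >= threshold else "O")
    return labels

id2label = {0: "O", 1: "B-NAME"}
logits = np.array([[4.0, 0.0],   # confident 'O'
                   [0.1, 0.2],   # near-uniform → falls back to 'O'
                   [0.0, 3.0]])  # confident 'B-NAME'
print(predict_with_threshold(logits, id2label, threshold=0.6))
# → ['O', 'O', 'B-NAME']
```

Lowering the threshold trades precision for recall; for compliance-driven redaction you'll typically accept more false positives to minimize missed PII.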

Files

  • model.safetensors — model weights
  • config.json — config and label mapping
  • tokenizer.json, tokenizer_config.json — tokenizer assets
  • onnx/model.onnx — ONNX export
  • onnx/model_quantized.onnx — INT8 quantized for CPU production
  • training_args.bin — training metadata

Citation

```bibtex
@misc{bards.ai_2026,
    author       = { bards.ai and Michał Swędrowski and Michał Pogoda-Rosikoń and Karol Samorański },
    title        = { eu-pii-anonimization-multilang (Revision 6de9f68) },
    year         = 2026,
    url          = { https://huggingface.co/bardsai/eu-pii-anonimization-multilang },
    doi          = { 10.57967/hf/8721 },
    publisher    = { Hugging Face }
}
```

About bards.ai

We build product ML for teams shipping AI to real users — RAG, agents, fine-tuned models, evals, and the unglamorous infrastructure that keeps them working. 16+ open models on Hugging Face, 10+ publications, production deployments at Comcast, Chili Piper, and Surfer SEO.

bards.ai

Model size: 0.3B parameters (F32, Safetensors)