eu-pii-anonimization-multilang
Multilingual PII and Sensitive Data Detection Model
bardsai/eu-pii-anonimization-multilang is a token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity entities in multilingual text.
Built on top of XLM-RoBERTa-base, this model is intended for privacy-preserving NLP workflows, data redaction, secure preprocessing, and compliance-focused pipelines across multiple European languages.
Key Highlights
- Language support: Multilingual (EU-focused)
- Task: Token classification
- Base model: XLM-RoBERTa-base
- Entity schema: 36 sensitive-data classes (B-/I- labeling)
Intended Use
Typical use cases:
- PII redaction in documents, tickets, emails, and chat logs
- Dataset sanitization before training, analytics, or sharing
- Compliance and governance pipelines for sensitive data handling
- Pre-ingestion filtering for search, retrieval, and RAG systems
Detected Entity Types
The model predicts the following sensitive entity families:
- Personal identity and profile data
- Organization and institutional identifiers
- Contact details and location data
- Technical and digital identifiers
- Financial and commercial information
- Official document references
- Health, biometric, and genetic data
- Special-category personal data
Labels are defined in config.json (id2label and label2id).
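To make the B-/I- scheme concrete, here is a minimal sketch of how per-token tags can be merged into entity spans. The helper name bio_to_spans and the label names PERSON and PHONE are hypothetical and chosen only for illustration; the actual label set is whatever id2label in config.json contains.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Merge B-/I- tags into (entity_type, start, end_exclusive) token spans."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        prefix, _, entity = label.partition("-")
        if prefix == "I" and entity == current_type:
            continue  # the open entity keeps going
        # Any other tag closes the span that was open, if there was one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        # "B-X" opens a span; a stray "I-X" is treated as an opening tag too.
        if prefix in ("B", "I") and entity:
            current_type, start = entity, i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Hypothetical label names, for illustration only.
tags = ["B-PERSON", "I-PERSON", "O", "B-PHONE", "I-PHONE", "I-PHONE"]
print(bio_to_spans(tags))
```

The indices refer to token positions, not characters. Treating a stray I- tag as the start of a span is one possible design choice; a stricter decoder could discard it instead.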
Quick Start
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/eu-pii-anonimization-multilang"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "John Smith, passport AB123456, phone +48 123 456 789"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each subword token.
predictions = torch.argmax(outputs.logits, dim=-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print every token that was not tagged "O" (outside any entity).
for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)
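The snippet above prints subword-level tags. For redaction you usually need character spans instead. The sketch below assumes detections are already available as dictionaries with entity_group, start, and end keys, which is the shape the transformers token-classification pipeline returns when an aggregation strategy is set. The redact helper, the hand-written spans, and the label names are hypothetical, not real model output.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected character span with a [TYPE] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


text = "John Smith, passport AB123456, phone +48 123 456 789"
# Hand-written spans with hypothetical label names, not real model output.
entities = [
    {"entity_group": "PERSON", "start": 0, "end": 10},
    {"entity_group": "PASSPORT", "start": 21, "end": 29},
    {"entity_group": "PHONE", "start": 37, "end": 52},
]
print(redact(text, entities))
```

This sketch assumes the spans do not overlap; overlapping detections would need to be merged before redaction.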
Repository Files
- model.safetensors: model weights
- config.json: model config and label mapping
- tokenizer.json, tokenizer_config.json: tokenizer assets
- training_args.bin: training metadata
- onnx/model.onnx: exported ONNX model
- onnx/model_quantized.onnx: INT8 quantized ONNX model
Limitations
- Performance can vary by language, domain, formatting quality, and OCR noise.
- Ambiguous phrases may require post-processing and human validation.
- The model should support compliance workflows, not replace legal decisions.
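One simple form of the post-processing mentioned above is to route low-confidence detections to a human reviewer. The following is a minimal sketch under stated assumptions: each detection carries a score field between 0 and 1, and the 0.85 threshold, the triage helper, and the example values are hypothetical and would need tuning on your own data.

```python
from typing import Dict, List, Tuple


def triage(entities: List[Dict], threshold: float = 0.85) -> Tuple[List[Dict], List[Dict]]:
    """Split detections into auto-accepted spans and spans queued for human review."""
    accepted = [e for e in entities if e["score"] >= threshold]
    review = [e for e in entities if e["score"] < threshold]
    return accepted, review


# Hypothetical detections, for illustration only.
detections = [
    {"entity_group": "PERSON", "score": 0.99, "start": 0, "end": 10},
    {"entity_group": "PHONE", "score": 0.62, "start": 37, "end": 52},
]
accepted, review = triage(detections)
print(len(accepted), "accepted;", len(review), "sent to review")
```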
About bards.ai
At bards.ai, we build practical ML systems for NLP, vision, and time series.
More info: https://bards.ai