eu-pii-anonimization-multilang

Multilingual PII and Sensitive Data Detection Model

bardsai/eu-pii-anonimization-multilang is a token classification model for detecting personally identifiable information (PII) and other regulated or high-sensitivity entities in multilingual text.

Built on top of XLM-RoBERTa-base, this model is intended for privacy-preserving NLP workflows, data redaction, secure preprocessing, and compliance-focused pipelines across multiple European languages.

Key Highlights

  • Language support: Multilingual (EU-focused)
  • Task: Token classification
  • Base model: XLM-RoBERTa-base
  • Entity schema: 36 sensitive-data classes (B-/I- labeling)

Intended Use

Typical use cases:

  • PII redaction in documents, tickets, emails, and chat logs
  • Dataset sanitization before training, analytics, or sharing
  • Compliance and governance pipelines for sensitive data handling
  • Pre-ingestion filtering for search, retrieval, and RAG systems

Detected Entity Types

The model predicts the following sensitive entity families:

  • Personal identity and profile data
  • Organization and institutional identifiers
  • Contact details and location data
  • Technical and digital identifiers
  • Financial and commercial information
  • Official document references
  • Health, biometric, and genetic data
  • Special-category personal data

Labels are defined in config.json (id2label and label2id).
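As a sketch of how such a B-/I- (BIO) scheme maps to label ids: the class names below are illustrative placeholders, and the authoritative list is the id2label mapping in this model's config.json.

```python
# Sketch: how a BIO (B-/I-) scheme over entity classes expands into label ids.
# The class names here are hypothetical; consult config.json for the real ones.
example_classes = ["PERSON", "PASSPORT", "PHONE"]  # illustrative subset

labels = ["O"]  # "O" marks tokens outside any sensitive entity
for cls in example_classes:
    labels.append(f"B-{cls}")  # first token of an entity span
    labels.append(f"I-{cls}")  # continuation tokens of the same span

id2label = dict(enumerate(labels))
label2id = {v: k for k, v in id2label.items()}
print(id2label)
```

Applied to the model's full 36 entity classes, this scheme yields 36 × 2 + 1 = 73 labels if an O class is included.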

Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "bardsai/eu-pii-anonimization-multilang"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "John Smith, passport AB123456, phone +48 123 456 789"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(label, token)
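The loop above prints one label per subword token. To turn those per-token tags into entity spans, B-/I- labels can be merged; the helper below is a minimal sketch (not shipped with the model), assuming XLM-RoBERTa-style SentencePiece tokens where a leading "▁" marks a new word.

```python
def merge_bio(tokens, labels):
    """Merge B-/I- token labels into (entity_type, text) spans.

    Assumes SentencePiece tokens where U+2581 ("▁") marks a word start,
    as produced by the XLM-RoBERTa tokenizer.
    """
    spans = []
    for token, label in zip(tokens, labels):
        piece = token.replace("\u2581", " ")
        if label.startswith("B-"):
            spans.append([label[2:], piece.lstrip()])  # start a new span
        elif label.startswith("I-") and spans and spans[-1][0] == label[2:]:
            spans[-1][1] += piece  # continue the current span
        # "O" labels and orphan I- tags are ignored
    return [(etype, text.strip()) for etype, text in spans]

tokens = ["▁John", "▁Smith", ",", "▁passport", "▁AB", "123456"]
labels = ["B-PERSON", "I-PERSON", "O", "O", "B-PASSPORT", "I-PASSPORT"]
print(merge_bio(tokens, labels))
# → [('PERSON', 'John Smith'), ('PASSPORT', 'AB123456')]
```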

Repository Files

  • model.safetensors - model weights
  • config.json - model config and label mapping
  • tokenizer.json, tokenizer_config.json - tokenizer assets
  • training_args.bin - training metadata
  • onnx/model.onnx - exported ONNX model
  • onnx/model_quantized.onnx - INT8 quantized ONNX model

Limitations

  • Performance can vary by language, domain, formatting quality, and OCR noise.
  • Ambiguous phrases may require post-processing and human validation.
  • The model is intended to support compliance workflows, not to replace legal review or decisions.
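The post-processing step mentioned above often ends in redaction. A minimal sketch, assuming detected spans carry character offsets and an entity_group field (the shape produced, for example, by a transformers token-classification pipeline with an aggregation strategy); the helper name is hypothetical.

```python
def redact(text, spans):
    """Replace detected character spans with [LABEL] placeholders.

    `spans` is a list of dicts with "start", "end", and "entity_group"
    keys. Spans are applied right-to-left so earlier offsets stay valid.
    """
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + f"[{span['entity_group']}]" + text[span["end"]:]
    return text

text = "John Smith, passport AB123456"
spans = [
    {"start": 0, "end": 10, "entity_group": "PERSON"},
    {"start": 21, "end": 29, "entity_group": "PASSPORT"},
]
print(redact(text, spans))
# → [PERSON], passport [PASSPORT]
```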

About bards.ai

At bards.ai, we build practical ML systems for NLP, vision, and time series.
More info: https://bards.ai

Model size: 0.3B parameters (F32, safetensors)