
redact: on-device multilingual PII redaction

Detects and redacts personal data (names, addresses, emails, phone numbers, cards, IBANs, national IDs and more) in text across all 24 official EU languages (Latin, Greek and Cyrillic scripts). It combines a BIOES token classifier with a portable, dependency-free deterministic layer for structured IDs. The deployable model is ~13.7 MB (int4 ONNX); with the tokenizer, the total on-device footprint is ~16 MB.

"Call Anna Kovács at anna@example.hu, IBAN GB29NWBK60161331926819" → "Call [GIVEN_NAME] [SURNAME] at [EMAIL], IBAN [BANK_ACCOUNT]"
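The last step of that example, turning detected spans into placeholders, might be sketched in Python roughly as follows. This is an illustration only: `mask` and `span_of` are hypothetical helpers, not part of any redact SDK, and the spans are written by hand rather than produced by the model.

```python
from typing import List, Tuple

# (start, end, label) with character offsets; end is exclusive.
Span = Tuple[int, int, str]


def mask(text: str, spans: List[Span]) -> str:
    """Replace each span with a [LABEL] placeholder (hypothetical helper)."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # overlapping span: keep the earlier one, skip this one
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def span_of(text: str, needle: str, label: str) -> Span:
    """Build a span by locating a literal substring (stand-in for model output)."""
    start = text.index(needle)
    return (start, start + len(needle), label)


text = "Call Anna Kovács at anna@example.hu, IBAN GB29NWBK60161331926819"
spans = [
    span_of(text, "Anna", "GIVEN_NAME"),
    span_of(text, "Kovács", "SURNAME"),
    span_of(text, "anna@example.hu", "EMAIL"),
    span_of(text, "GB29NWBK60161331926819", "BANK_ACCOUNT"),
]
print(mask(text, spans))
# Call [GIVEN_NAME] [SURNAME] at [EMAIL], IBAN [BANK_ACCOUNT]
```

Working on character offsets and rebuilding the string left to right keeps earlier replacements from shifting the offsets of later spans.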

Try it

  • Live demo (desert-ant-labs/redact-demo): paste text and watch PII get highlighted or masked, entirely in your browser.
  • iOS / macOS / tvOS / visionOS (redact-swift): the Swift SDK, installed via Swift Package Manager, with a built-in demo app. It bundles the compiled Core ML model listed under Files.

```swift
import Redact

// Create a redactor; the SDK bundles the compiled Core ML model.
let redact = Redact()
// The call is async and can throw, so run it from an async context.
let r = try await redact.redaction(of: "Email Anna Kovács at anna@example.hu.")
r.redactedText   // "Email [GIVEN_NAME_1] [SURNAME_1] at [EMAIL_1]."
```

Android and Web SDKs are on the way.

Files

| File | Format | Size | Contents |
| --- | --- | --- | --- |
| redact.onnx | ONNX (int4, opset 21) | ~13.7 MB | 4-bit-quantized model, batch=1, ready for on-device runtimes |
| redact.mlmodelc | Compiled Core ML (4-bit) | ~11.6 MB | Palettized model, ready to load on Apple platforms (used by redact-swift) |
| redact.pt | PyTorch checkpoint | ~90 MB | Full-precision weights + config (for retraining / other runtimes) |
| config.json | JSON | tiny | Transformer + label config |
| tokenizer.json, tokenizer_config.json | JSON | ~2.3 MB | EU-trimmed (31,475-piece) SentencePiece tokenizer (XLM-R lineage) |
| labels.json | JSON | tiny | BIOES id2label / label2id |
| redact_meta.json | JSON | tiny | Public labels, deterministic-owner labels, recommended thresholds, base-model info |

Taxonomy (20 public labels)

GIVEN_NAME, SURNAME, STREET_NAME, BUILDING_NUMBER, SECONDARY_ADDRESS, CITY, STATE, ZIP_CODE, EMAIL, PHONE, CREDIT_CARD, BANK_ACCOUNT, ROUTING_NUMBER, IP_ADDRESS, URL, GOVERNMENT_ID, PASSPORT, DRIVERS_LICENSE, TAX_ID, SSN.
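Because the classifier tags tokens in the BIOES scheme (Begin, Inside, End, Single, Outside), its per-token predictions have to be grouped into labeled spans. The sketch below shows one plausible way to do that in Python. It is not the project's actual decoder: `decode_bioes` is a hypothetical name, the `B-LABEL` tag spelling is assumed, and scoring a span by the mean of its token scores against the recommended `min_score` of 0.6 is an assumption about how the threshold is applied.

```python
from typing import List, Tuple


def decode_bioes(
    tags: List[str], scores: List[float], min_score: float = 0.6
) -> List[Tuple[int, int, str]]:
    """Group BIOES tags into (first_token, end_token_exclusive, label) spans."""
    spans = []
    start = None  # token index where the currently open entity began
    label = None  # label of the currently open entity
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "S":  # single-token entity
            spans.append((i, i + 1, name))
            start = None
        elif prefix == "B":  # open a multi-token entity
            start, label = i, name
        elif prefix == "I" and start is not None and name == label:
            pass  # entity continues
        elif prefix == "E" and start is not None and name == label:
            spans.append((start, i + 1, name))  # close the entity
            start = None
        else:
            start = None  # "O" or an ill-formed sequence: drop any open entity
    # Assumption: a span is kept when its mean token score reaches min_score.
    return [
        s for s in spans
        if sum(scores[s[0]:s[1]]) / (s[1] - s[0]) >= min_score
    ]


tags = ["O", "S-GIVEN_NAME", "S-SURNAME", "O",
        "B-STREET_NAME", "E-STREET_NAME", "S-CITY"]
scores = [0.99, 0.95, 0.90, 0.99, 0.80, 0.70, 0.40]
print(decode_bioes(tags, scores))
# [(1, 2, 'GIVEN_NAME'), (2, 3, 'SURNAME'), (4, 6, 'STREET_NAME')]
```

In this toy input the CITY prediction is discarded because its score of 0.40 falls below the threshold.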

Architecture

  • Encoder: Multilingual-MiniLM (XLM-R lineage) truncated to 6 layers with an EU-script-trimmed vocab (~23 M params), fine-tuned for BIOES tagging.
  • Deterministic layer: a pure-stdlib post-processor owns high-confidence structured labels (email, URL, IP, card, IBAN, SSN, routing, tax id, government id, passport) with validation (Luhn, mod-97 IBAN, checksums) and reconciles them with the model's contextual predictions.
  • Recommended runtime settings: min_score = 0.6, max_length = 256, stride = 64.
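Two of the validations named above, the Luhn check for card numbers and the mod-97 check for IBANs, are standard algorithms that need nothing beyond the standard library. The Python sketch below illustrates them; the function names `luhn_valid` and `iban_valid` are hypothetical, and the project's deterministic layer may add rules not shown here (for example per-country IBAN lengths or card-prefix checks).

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c in "0123456789"]  # ignore separators
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Mod-97 check: move the first 4 chars to the end, map A-Z to 10-35, test == 1."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(c, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))      # True
print(iban_valid("GB29NWBK60161331926819"))   # True
print(iban_valid("GB29NWBK60161331926818"))   # False
```

A checksum pass only shows that a string is well-formed, not that the account or card exists, which is presumably why the card describes this layer as being reconciled with the model's contextual predictions.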

Benchmark

On a fair, all-label, 24-language evaluation (external WikiANN and MultiNERD for names and places, plus a neutral, format-valid structured-PII set), the model scores a fair composite leak-safe recall of 88.8, a typed F1 of 85.4 and a strict F1 of 71.0. These are the best results among on-device models by a wide margin, with better labeling accuracy and precision than models more than 100× its size.

License

Desert Ant Labs Source-Available License. Free for most apps; a commercial license is required at scale. Full terms are in the license text. Licensing inquiries: licensing@desertant.ai.
