redact: on-device multilingual PII redaction

Detects and redacts personal data (names, addresses, emails, phone numbers, cards, IBANs, national IDs and more) in text across all 24 official EU languages (Latin, Greek and Cyrillic scripts). It combines a BIOES token classifier with a portable, dependency-free deterministic layer for structured IDs. The deployable model is ~13.7 MB (int4 ONNX); with the tokenizer, the total on-device footprint is ~16 MB.

"Call Anna Kovács at anna@example.hu, IBAN GB29NWBK60161331926819" → "Call [GIVEN_NAME] [SURNAME] at [EMAIL], IBAN [BANK_ACCOUNT]"
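The step from detected spans to bracketed placeholders can be sketched as follows. This is a minimal illustration, not the SDK's implementation: `PIISpan` and `mask` are hypothetical names, and spans are assumed to be character offsets into the input.

```swift
// Hypothetical types and names, for illustration only.
struct PIISpan {
    let label: String
    let start: Int   // character offset, inclusive
    let end: Int     // character offset, exclusive
}

/// Replaces each span with "[LABEL]" and copies the remaining text unchanged.
func mask(_ text: String, spans: [PIISpan]) -> String {
    let chars = Array(text)   // index by Character, so "á" counts as one unit
    var out = ""
    var cursor = 0
    for span in spans.sorted(by: { $0.start < $1.start }) {
        // Skip spans that overlap an earlier one, are empty, or run past the text.
        guard span.start >= cursor, span.end > span.start, span.end <= chars.count else {
            continue
        }
        out += String(chars[cursor..<span.start])
        out += "[\(span.label)]"
        cursor = span.end
    }
    out += String(chars[cursor...])
    return out
}

let masked = mask(
    "Call Anna Kovács at anna@example.hu",
    spans: [
        PIISpan(label: "GIVEN_NAME", start: 5, end: 9),
        PIISpan(label: "SURNAME", start: 10, end: 16),
        PIISpan(label: "EMAIL", start: 20, end: 35),
    ]
)
print(masked)   // Call [GIVEN_NAME] [SURNAME] at [EMAIL]
```

Working on an array of `Character` values keeps offsets stable for accented letters, at the cost of one extra copy of the input.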

Try it

Live demo: desert-ant-labs/redact-demo. Paste text and watch PII get highlighted or masked, fully in your browser.
iOS / macOS / tvOS / visionOS: redact-swift, the Swift SDK (Swift Package Manager) with a built-in demo app. It bundles the compiled Core ML model listed under Files.

```swift
import Redact

let redact = Redact()
let r = try await redact.redaction(of: "Email Anna Kovács at anna@example.hu.")
r.redactedText   // "Email [GIVEN_NAME_1] [SURNAME_1] at [EMAIL_1]."
```

Android and Web SDKs are on the way.

Files

| File | Format | Size | Contents |
| --- | --- | --- | --- |
| `redact.onnx` | ONNX (int4, opset 21) | ~13.7 MB | 4-bit-quantized model, batch=1, ready for on-device runtimes |
| `redact.mlmodelc` | Compiled Core ML (4-bit) | ~11.6 MB | Palettized model, ready to load on Apple platforms (used by `redact-swift`) |
| `redact.pt` | PyTorch checkpoint | ~90 MB | Full-precision weights + config (for retraining / other runtimes) |
| `config.json` | JSON | tiny | Transformer + label config |
| `tokenizer.json`, `tokenizer_config.json` | JSON | ~2.3 MB | EU-trimmed (31,475-piece) SentencePiece tokenizer (XLM-R lineage) |
| `labels.json` | JSON | tiny | BIOES `id2label` / `label2id` |
| `redact_meta.json` | JSON | tiny | Public labels, deterministic-owner labels, recommended thresholds, base-model info |

Taxonomy (20 public labels)

GIVEN_NAME, SURNAME, STREET_NAME, BUILDING_NUMBER, SECONDARY_ADDRESS, CITY, STATE, ZIP_CODE, EMAIL, PHONE, CREDIT_CARD, BANK_ACCOUNT, ROUTING_NUMBER, IP_ADDRESS, URL, GOVERNMENT_ID, PASSPORT, DRIVERS_LICENSE, TAX_ID, SSN.
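Because the classifier emits BIOES tags over these labels, per-token predictions have to be collapsed into labeled spans before redaction. The sketch below shows one way to do that. It assumes tag strings of the form `B-EMAIL`, `I-EMAIL`, `E-EMAIL`, `S-EMAIL` and `O`; the exact strings in `labels.json` may differ, and `TokenSpan` and `decodeBIOES` are hypothetical names.

```swift
// Hypothetical helper, not part of the published SDK.

/// A labeled span over token indices (start inclusive, end exclusive).
struct TokenSpan: Equatable {
    let label: String
    let start: Int
    let end: Int
}

/// Collapses a BIOES tag sequence into spans.
/// Assumed tag format: "B-X", "I-X", "E-X", "S-X", or "O".
/// Malformed runs (for example an "E-X" with no open "B-X") are dropped.
func decodeBIOES(_ tags: [String]) -> [TokenSpan] {
    var spans: [TokenSpan] = []
    var pending: (label: String, start: Int)? = nil

    for (i, tag) in tags.enumerated() {
        let parts = tag.split(separator: "-", maxSplits: 1).map { String($0) }
        guard parts.count == 2 else {
            pending = nil   // "O" or an unrecognized tag closes any open run
            continue
        }
        let prefix = parts[0]
        let label = parts[1]

        switch prefix {
        case "S":   // single-token entity
            spans.append(TokenSpan(label: label, start: i, end: i + 1))
            pending = nil
        case "B":   // begin a multi-token entity
            pending = (label: label, start: i)
        case "I":   // inside: only valid if it continues the open label
            if pending?.label != label { pending = nil }
        case "E":   // end: emit the span if it closes the open label
            if let p = pending, p.label == label {
                spans.append(TokenSpan(label: label, start: p.start, end: i + 1))
            }
            pending = nil
        default:
            pending = nil
        }
    }
    return spans
}

let tags = ["O", "S-GIVEN_NAME", "S-SURNAME", "O", "B-EMAIL", "I-EMAIL", "E-EMAIL"]
let decoded = decodeBIOES(tags)
// decoded: GIVEN_NAME [1, 2), SURNAME [2, 3), EMAIL [4, 7)
```

Dropping malformed runs is one conservative choice; a leak-averse decoder could instead emit partial spans so that uncertain tokens are still masked.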

Architecture

Encoder: Multilingual-MiniLM (XLM-R lineage) truncated to 6 layers with an EU-script-trimmed vocabulary (~23M parameters), fine-tuned for BIOES tagging.
Deterministic layer: a pure-stdlib post-processor owns high-confidence structured labels (email, URL, IP, card, IBAN, SSN, routing, tax ID, government ID, passport), validates them (Luhn, mod-97 IBAN, other checksums), and reconciles them with the model's contextual predictions.
Recommended runtime settings: min_score = 0.6, max_length = 256, stride = 64.
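Two of the validations named for the deterministic layer, Luhn and the IBAN mod-97 check, are public algorithms and can be sketched with the standard library alone. The code below illustrates those checks only; it is not the project's post-processor. The function names are hypothetical, and the 12 to 19 digit bound for card numbers is an assumed threshold.

```swift
// Hypothetical helpers; no imports needed beyond the Swift standard library.

/// Luhn check for payment card numbers.
/// Spaces and hyphens are ignored; any other non-digit fails the check.
func passesLuhn(_ candidate: String) -> Bool {
    var digits: [Int] = []
    for ch in candidate {
        if ch == " " || ch == "-" { continue }
        guard ch.isASCII, let d = ch.wholeNumberValue else { return false }
        digits.append(d)
    }
    // Assumed plausibility bound on card-number length.
    guard digits.count >= 12, digits.count <= 19 else { return false }

    var sum = 0
    // Walk from the rightmost digit, doubling every second one.
    for (offset, d) in digits.reversed().enumerated() {
        if offset % 2 == 1 {
            let doubled = d * 2
            sum += doubled > 9 ? doubled - 9 : doubled
        } else {
            sum += d
        }
    }
    return sum % 10 == 0
}

/// Mod-97 check for IBANs. Country-specific lengths and formats are not checked.
func passesIBANChecksum(_ candidate: String) -> Bool {
    let compact = candidate.uppercased().filter { $0 != " " }
    guard compact.count >= 15, compact.count <= 34 else { return false }
    // Move the country code and check digits (first four characters) to the end.
    let rearranged = String(compact.dropFirst(4)) + String(compact.prefix(4))

    var remainder = 0
    for ch in rearranged {
        guard let ascii = ch.asciiValue else { return false }
        let value: Int
        switch ascii {
        case 48...57: value = Int(ascii) - 48   // "0"..."9" -> 0...9
        case 65...90: value = Int(ascii) - 55   // "A"..."Z" -> 10...35
        default: return false
        }
        // Letters expand to two decimal digits, so shift by 100 instead of 10.
        remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97
    }
    return remainder == 1
}

print(passesLuhn("4111 1111 1111 1111"))                // true
print(passesIBANChecksum("GB29NWBK60161331926819"))     // true
```

Reducing the remainder digit by digit avoids big-integer arithmetic, which keeps the check portable to runtimes without arbitrary-precision numbers.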

Benchmark

On a fair, all-label, 24-language evaluation (external WikiANN and MultiNERD for names and places, plus a neutral, format-valid structured-PII set), redact scores 88.8 fair composite leak-safe recall, 85.4 typed F1, and 71.0 strict F1. That is the best result among on-device models by a wide margin, with better labeling accuracy and precision than models more than 100× its size.

License

Desert Ant Labs Source-Available License. Free for most apps; a commercial license is required at scale. See the license text for full terms. Licensing contact: licensing@desertant.ai.
