redact: on-device multilingual PII redaction
Detects and redacts personal data (names, addresses, emails, phone numbers, cards, IBANs, national IDs and more) in text across all 24 official EU languages (Latin, Greek and Cyrillic scripts). It combines a BIOES token classifier with a portable, dependency-free deterministic layer for structured IDs. The deployable model is ~13.7 MB (int4 ONNX); with the tokenizer, the total on-device footprint is ~16 MB.
"Call Anna Kovács at anna@example.hu, IBAN GB29NWBK60161331926819"→"Call [GIVEN_NAME] [SURNAME] at [EMAIL], IBAN [BANK_ACCOUNT]"
Try it
- Live demo: desert-ant-labs/redact-demo: paste text and watch PII get highlighted or masked, fully in your browser.
- iOS / macOS / tvOS / visionOS: redact-swift, the Swift SDK (Swift Package Manager) with a built-in demo app. It bundles the compiled Core ML model listed below.
import Redact
let redact = Redact()
let r = try await redact.redaction(of: "Email Anna Kovács at anna@example.hu.")
r.redactedText // "Email [GIVEN_NAME_1] [SURNAME_1] at [EMAIL_1]."
Android and Web SDKs are on the way.
Files
| File | Format | Size | Contents |
|---|---|---|---|
| `redact.onnx` | ONNX (int4, opset 21) | ~13.7 MB | 4-bit-quantized model, batch=1, ready for on-device runtimes |
| `redact.mlmodelc` | Compiled Core ML (4-bit) | ~11.6 MB | Palettized model, ready to load on Apple platforms (used by redact-swift) |
| `redact.pt` | PyTorch checkpoint | ~90 MB | Full-precision weights + config (for retraining / other runtimes) |
| `config.json` | JSON | tiny | Transformer + label config |
| `tokenizer.json`, `tokenizer_config.json` | JSON | ~2.3 MB | EU-trimmed (31,475-piece) SentencePiece tokenizer (XLM-R lineage) |
| `labels.json` | JSON | tiny | BIOES id2label / label2id |
| `redact_meta.json` | JSON | tiny | Public labels, deterministic-owner labels, recommended thresholds, base-model info |
Taxonomy (20 public labels)
GIVEN_NAME, SURNAME, STREET_NAME, BUILDING_NUMBER, SECONDARY_ADDRESS,
CITY, STATE, ZIP_CODE, EMAIL, PHONE, CREDIT_CARD, BANK_ACCOUNT,
ROUTING_NUMBER, IP_ADDRESS, URL, GOVERNMENT_ID, PASSPORT,
DRIVERS_LICENSE, TAX_ID, SSN.
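The model emits these labels in BIOES form (Begin, Inside, Outside, End, Single), which then has to be collapsed into spans before masking. The following is a minimal sketch of that step, not the shipped implementation. It assumes tags are written as `B-GIVEN_NAME`, `S-EMAIL` and so on (the exact strings in `labels.json` may differ), and the helper names `decode_bioes` and `mask` are hypothetical.

```python
def decode_bioes(tags):
    """Collapse BIOES tags into (start, end_exclusive, label) token spans.

    Assumes tags look like "B-CITY" or "O". Malformed sequences
    (for example a B- with no matching E-) are dropped.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, name))
            start, label = None, None
        elif prefix == "B":
            start, label = i, name
        elif prefix == "I" and start is not None and name == label:
            pass  # still inside the open span
        elif prefix == "E" and start is not None and name == label:
            spans.append((start, i + 1, name))
            start, label = None, None
        else:
            # "O" or an inconsistent tag: discard any open span
            start, label = None, None
    return spans


def mask(tokens, spans):
    """Replace every span with a single [LABEL] placeholder (token level)."""
    by_start = {s: (e, lab) for s, e, lab in spans}
    out, i = [], 0
    while i < len(tokens):
        if i in by_start:
            end, lab = by_start[i]
            out.append(f"[{lab}]")
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = ["Call", "Anna", "Kovács", "at", "anna@example.hu"]
tags = ["O", "S-GIVEN_NAME", "S-SURNAME", "O", "S-EMAIL"]
print(mask(tokens, decode_bioes(tags)))
# prints: Call [GIVEN_NAME] [SURNAME] at [EMAIL]
```

A real pipeline would work on subword offsets rather than whitespace tokens and would filter low-confidence spans first; both are omitted here to keep the decoding logic visible.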
Architecture
- Encoder: Multilingual-MiniLM (XLM-R lineage) truncated to 6 layers with an EU-script-trimmed vocab (~23 M params), fine-tuned for BIOES tagging.
- Deterministic layer: a pure-stdlib post-processor owns high-confidence structured labels (email, URL, IP, card, IBAN, SSN, routing, tax id, government id, passport) with validation (Luhn, mod-97 IBAN, checksums) and reconciles them with the model's contextual predictions.
- Recommended runtime: `min_score = 0.6`, `max_length = 256`, `stride = 64`.
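Two pieces of the architecture above lend themselves to a short illustration: the checksum validation used by the deterministic layer, and windowing long inputs with `max_length` and `stride`. The sketch below uses only the Python standard library and is not the project's code. The function names are hypothetical, the 12 to 19 digit card length is an assumption, and `stride` is interpreted as the overlap between consecutive windows (the Hugging Face tokenizer convention), which the source does not confirm.

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum for card numbers; spaces and dashes are ignored."""
    s = number.replace(" ", "").replace("-", "")
    # Assumption: accept 12 to 19 ASCII digits, the usual card-number range.
    if not (s.isascii() and s.isdigit() and 12 <= len(s) <= 19):
        return False
    total = 0
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Mod-97 check for IBANs. Country-specific lengths are not checked."""
    s = iban.replace(" ", "").upper()
    if not (s.isascii() and s.isalnum() and 15 <= len(s) <= 34):
        return False
    rearranged = s[4:] + s[:4]  # move country code and check digits to the end
    # Letters map to 10..35 (A=10, ..., Z=35); digits map to themselves.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def token_windows(n_tokens: int, max_length: int = 256, stride: int = 64):
    """Return (start, end) token windows, where consecutive windows share
    `stride` tokens (assumed meaning). Special tokens are ignored."""
    step = max_length - stride
    windows, start = [], 0
    while True:
        end = min(start + max_length, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            return windows
        start += step


print(luhn_valid("4111 1111 1111 1111"))     # True
print(iban_valid("GB29NWBK60161331926819"))  # True
print(token_windows(600))  # [(0, 256), (192, 448), (384, 600)]
```

Validating a candidate before assigning a structured label is what lets a deterministic layer treat its matches as high-confidence: a random 16-digit string passes the Luhn check only about one time in ten, and a random IBAN-shaped string passes mod-97 about one time in 97.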
Benchmark
On a fair, all-label, 24-language evaluation (external WikiANN + MultiNERD for names/places, plus a neutral format-valid structured-PII set), redact scores a fair composite leak-safe recall of 88.8, a typed F1 of 85.4 and a strict F1 of 71.0. That is the best result among on-device models by a wide margin, with better labeling accuracy and precision than models more than 100× its size.
License
Desert Ant Labs Source-Available License. Free for most apps; a commercial license is required at scale. Full terms are at the link. Licensing: licensing@desertant.ai.