---
license: apache-2.0
library_name: transformers
base_model: openai/privacy-filter
datasets:
- ai4privacy/pii-masking-200k
- ai4privacy/pii-masking-400k
- ai4privacy/open-pii-masking-500k-ai4privacy
pipeline_tag: token-classification
tags:
- token-classification
- pii
- ner
- privacy
- redaction
- multilingual
- openmed
- openai-privacy-filter
language:
- ar
- bn
- de
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pt
- te
- tr
- vi
- zh
---
# privacy-filter-multilingual

Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
for **fine-grained PII extraction** across **54 categories** in **16 languages**.

- **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) – 1.4B-parameter MoE (50M active per token), BIOES token-classification head
- **Task**: Token classification for PII detection (BIOES scheme)
- **Languages (16)**: Arabic, Bengali, Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese
- **Training data**: Multilingual mix from [AI4Privacy](https://huggingface.co/ai4privacy) – `pii-masking-200k`, `pii-masking-400k`, and `open-pii-masking-500k-ai4privacy`, language-balanced
- **Recipe**: `opf train` (OpenAI's official fine-tuning CLI) – full fine-tune, AdamW, balanced language sampling, 5 epochs, bf16
- **Labels**: 54 PII categories → 217 BIOES classes (1 `O` + 54 × B/I/E/S)
The base model ships with 8 coarse PII categories and English-only training. This
model trades that for a **6.75× more granular** label vocabulary (54 vs. 8
categories) spanning identity, contact, address, financial, vehicle, digital,
and crypto labels – all evaluated across 16 languages.
> **Family at a glance.** Same architecture, three runtimes:
> - **PyTorch (this repo)** – CPU + CUDA, anywhere `transformers` runs.
> - **MLX BF16** – [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx) – Apple Silicon, full precision.
> - **MLX 8-bit** – [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit) – Apple Silicon, smaller + faster.
## Quick start

### With [OpenMed](https://github.com/maziyarpanahi/openmed) – recommended

OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
decoding, span refinement, and a Faker-backed obfuscation engine. Same call
on every host – Apple Silicon picks up MLX automatically; everywhere else uses
this PyTorch checkpoint.

```bash
pip install -U "openmed[hf]"
```
```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify with any of the supported methods
masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-multilingual")
removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-multilingual")
hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-multilingual")

# Faker-backed locale-aware obfuscation, deterministic with consistent=True + seed
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-multilingual",
    consistent=True,
    seed=42,
)
print(fake.deidentified_text)
```
`OpenMed/privacy-filter-multilingual-mlx*` model names also work in the same
`extract_pii()` / `deidentify()` calls – on a non-Apple-Silicon host they
automatically fall back to **this PyTorch checkpoint** with a one-time warning,
so you can ship MLX names in code and still run on Linux/Windows.
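For example (a minimal illustration of that fallback; the input string is arbitrary):

```python
from openmed import extract_pii

# Loads the MLX build on Apple Silicon; elsewhere it transparently falls
# back to this PyTorch checkpoint (with a one-time warning).
result = extract_pii(
    "Reach Dr. Ana Ruiz at ana.ruiz@example.org or +34 600 123 456.",
    model_name="OpenMed/privacy-filter-multilingual-mlx",
)
```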
The OpenMed wrapper passes `trust_remote_code=True` for you, runs the model's
own BIOES Viterbi decoder, and skips OpenMed's regex smart-merging (the model
already produces clean spans).
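### With plain `transformers`

The stock `transformers` loading patterns also work; you get per-token BIOES
tags rather than grouped spans. `trust_remote_code=True` is included below on
the assumption (per the note above) that the checkpoint's custom modeling code
requires it:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="OpenMed/privacy-filter-multilingual",
    trust_remote_code=True,  # the checkpoint ships its own modeling code
)

# Or load the model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("OpenMed/privacy-filter-multilingual")
model = AutoModelForTokenClassification.from_pretrained(
    "OpenMed/privacy-filter-multilingual", trust_remote_code=True
)
```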
## Label space (54 categories)

| Category | Typical examples |
|---|---|
| **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
| **Contact** | `EMAIL`, `PHONE`, `URL` |
| **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
| **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
| **Government IDs** | `SSN` |
| **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
| **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
| **Vehicle** | `VIN`, `VRM` |
| **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
| **Auth** | `PASSWORD` |
The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54 categories
(4 × 54 + 1 = 217). The `id2label` mapping is shipped with the model.
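Both numbers are easy to sanity-check from the shipped config (a minimal
sketch; it assumes the `id2label` values follow the usual `B-`/`I-`/`E-`/`S-`
prefix convention):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "OpenMed/privacy-filter-multilingual", trust_remote_code=True
)

labels = list(config.id2label.values())
print(len(labels))  # 217 = 4 * 54 + 1

# Strip the BIOES prefix to recover the 54 underlying categories.
categories = {lbl.split("-", 1)[1] for lbl in labels if lbl != "O"}
print(len(categories))  # 54
```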
## Limitations & intended use

- **Multilingual but uneven.** Strongest on languages with rich PII training
  data (German, Spanish, French, Italian, Hindi, Telugu, English). CJK languages
  (Japanese, Korean, Chinese) and some morphologically marked low-resource
  languages remain the main bottleneck on the current training mix.
- **Synthetic training data.** The AI4Privacy datasets are template-synthesized;
  real clinical notes, legal documents, and web text may show different
  surface forms. For high-stakes deployments, collect a domain-specific eval
  set and re-calibrate thresholds (see the sketch after this list).
- **Not a substitute for legal compliance review.** Use alongside a governance
  layer (human review, deterministic regex pre-filters, etc.).
- **Not a clinical PHI model.** Healthcare-specific PHI and clinical entity
  training is planned as a separate branch.
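A minimal sketch of that re-calibration step, reusing the `extract_pii` API
from the quick start (the threshold values below are placeholders to tune on
your own eval set, not recommendations):

```python
from openmed import extract_pii

# Placeholder per-label thresholds -- tune these on a domain-specific eval set.
THRESHOLDS = {"EMAIL": 0.50, "PHONE": 0.60}
DEFAULT_THRESHOLD = 0.85

text = "Call 415-555-0123 or write to sarah.johnson@example.com."
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual")

kept = [
    ent
    for ent in result.entities
    if ent.confidence >= THRESHOLDS.get(ent.label, DEFAULT_THRESHOLD)
]
```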
## Training notes

**Head initialization**: `opf`'s default "copy-from-matching-base" head init.
Of the 217 new BIOES classes, the few with exact base-vocabulary matches
(`O`, `B/I/E/S-account_name`, etc.) were copied directly; the rest were copied
from semantically adjacent coarse rows and fine-tuned end-to-end.
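Schematically, the copy-from-matching-base idea looks like the sketch below.
This is an illustration only, not `opf`'s actual code; it assumes plain linear
classification heads, and `coarse_fallback` (mapping each fine label to the
base row it borrows) is a hypothetical helper:

```python
import torch

def init_head_from_base(new_head, base_head, new_id2label, base_label2id, coarse_fallback):
    """Seed a fine-grained classifier head from a coarse base head.

    Rows with an exact label match in the base vocabulary are copied directly;
    every other row starts from a semantically adjacent coarse row.
    """
    with torch.no_grad():
        for idx, label in new_id2label.items():
            source = base_label2id.get(label)
            if source is None:
                source = base_label2id[coarse_fallback[label]]
            new_head.weight[idx] = base_head.weight[source]
            new_head.bias[idx] = base_head.bias[source]
```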
**Router**: base model has 128 MoE experts per layer with top-4 routing.
Routers were kept trainable during full fine-tuning; no collapse was observed.
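For intuition, top-4 routing over 128 experts looks roughly like this generic
MoE sketch (toy batch and hidden sizes; not this model's actual implementation):

```python
import torch

d_model, n_experts, k = 512, 128, 4           # toy hidden size; real expert count / top-k
router = torch.nn.Linear(d_model, n_experts)  # one router per MoE layer

hidden = torch.randn(2, 16, d_model)          # (batch, tokens, d_model)
probs = router(hidden).softmax(dim=-1)        # (2, 16, 128) expert probabilities
weights, experts = torch.topk(probs, k=k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the top-4
# Each token's output is the weighted mix of its 4 selected experts' outputs.
```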
## Credits & Acknowledgements

This model wouldn't exist without two open-source releases – sincere thanks
to both teams:

- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
  (architecture, modeling code, and the `opf` training/eval CLI). Everything in
  this repo is a fine-tune on top of that release.
- **AI4Privacy** for releasing the multilingual PII masking datasets used as
  training data:
  [`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k),
  [`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k),
  and [`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).

Additional thanks to the **Hugging Face** team for the `transformers` /
`huggingface_hub` ecosystem this model ships through.
## License

Apache 2.0.
## Citation

If you use this model, please cite **this model**, the organization behind it
(**OpenMed**), and the upstream base model + datasets:

```bibtex
@misc{openmed_privacy_filter_multilingual_2026,
  author       = {OpenMed},
  title        = {{OpenMed/privacy-filter-multilingual}: multilingual fine-grained PII extraction across 16 languages and 54 categories},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed/privacy-filter-multilingual}}
}

@misc{openmed_2026,
  author       = {OpenMed},
  title        = {{OpenMed}: open models and resources for healthcare NLP},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed}}
}

@misc{openai_privacy_filter_2025,
  author       = {OpenAI},
  title        = {{openai/privacy-filter}},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
}

@misc{ai4privacy_pii_masking,
  author       = {AI4Privacy},
  title        = {{AI4Privacy PII Masking Datasets}},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ai4privacy}}
}
```