---
license: apache-2.0
library_name: transformers
base_model: openai/privacy-filter
datasets:
- ai4privacy/pii-masking-200k
- ai4privacy/pii-masking-400k
- ai4privacy/open-pii-masking-500k-ai4privacy
pipeline_tag: token-classification
tags:
- token-classification
- pii
- ner
- privacy
- redaction
- multilingual
- openmed
- openai-privacy-filter
language:
- ar
- bn
- de
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pt
- te
- tr
- vi
- zh
---
# privacy-filter-multilingual

Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
for **fine-grained PII extraction** across **54 categories** in **16 languages**.

- **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) – 1.4B-parameter MoE (50M active per token), BIOES token-classification head
- **Task**: Token classification for PII detection (BIOES scheme)
- **Languages (16)**: Arabic, Bengali, Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese
- **Training data**: Multilingual mix from [AI4Privacy](https://huggingface.co/ai4privacy) – `pii-masking-200k`, `pii-masking-400k`, and `open-pii-masking-500k-ai4privacy`, language-balanced
- **Recipe**: `opf train` (OpenAI's official fine-tuning CLI) – full fine-tune, AdamW, balanced language sampling, 5 epochs, bf16
- **Labels**: 54 PII categories → 217 BIOES classes (1 `O` + 54 × B/I/E/S)
The base model ships with 8 coarse PII categories and English-only training. This
model trades that for a **6.75× more granular** label vocabulary (54 vs. 8
categories) spanning identity, contact, address, financial, vehicle, digital,
and crypto labels – all evaluated across 16 languages.
> **Family at a glance.** Same architecture, three runtimes:
> - **PyTorch (this repo)** – CPU + CUDA, anywhere `transformers` runs.
> - **MLX BF16** – [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx) – Apple Silicon, full precision.
> - **MLX 8-bit** – [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit) – Apple Silicon, smaller + faster.
## Quick start

### With [OpenMed](https://github.com/maziyarpanahi/openmed) – recommended

OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
decoding, span refinement, and a Faker-backed obfuscation engine. Same call
on every host – Apple Silicon picks up MLX automatically; everywhere else uses
this PyTorch checkpoint.

```bash
pip install -U "openmed[hf]"
```
```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify with any of the supported methods
masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-multilingual")
removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-multilingual")
hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-multilingual")

# Faker-backed locale-aware obfuscation, deterministic with consistent=True + seed
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-multilingual",
    consistent=True,
    seed=42,
)
print(fake.deidentified_text)
```
`OpenMed/privacy-filter-multilingual-mlx*` model names also work in the same
`extract_pii()` / `deidentify()` calls – on a non-Apple-Silicon host they
automatically fall back to **this PyTorch checkpoint** with a one-time warning,
so you can ship MLX names in code and still run on Linux/Windows.
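For example (a minimal illustration of that fallback; the input string is arbitrary):

```python
from openmed import extract_pii

# Loads the MLX build on Apple Silicon; elsewhere it transparently falls
# back to this PyTorch checkpoint (with a one-time warning).
result = extract_pii(
    "Reach Dr. Ana Ruiz at ana.ruiz@example.org or +34 600 123 456.",
    model_name="OpenMed/privacy-filter-multilingual-mlx",
)
```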
The OpenMed wrapper passes `trust_remote_code=True` for you, runs the model's
own BIOES Viterbi decoder, and skips OpenMed's regex smart-merging (the model
already produces clean spans).
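### With plain `transformers`

The stock `transformers` loading patterns also work; you get per-token BIOES
tags rather than grouped spans. `trust_remote_code=True` is included below on
the assumption (per the note above) that the checkpoint's custom modeling code
requires it:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="OpenMed/privacy-filter-multilingual",
    trust_remote_code=True,  # the checkpoint ships its own modeling code
)

# Or load the model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("OpenMed/privacy-filter-multilingual")
model = AutoModelForTokenClassification.from_pretrained(
    "OpenMed/privacy-filter-multilingual", trust_remote_code=True
)
```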
## Label space (54 categories)

| Category | Typical examples |
|---|---|
| **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
| **Contact** | `EMAIL`, `PHONE`, `URL` |
| **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
| **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
| **Government IDs** | `SSN` |
| **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
| **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
| **Vehicle** | `VIN`, `VRM` |
| **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
| **Auth** | `PASSWORD` |
The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54 categories
(4 × 54 + 1 = 217). The `id2label` mapping is shipped with the model.
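Both numbers are easy to sanity-check from the shipped config (a minimal
sketch; it assumes the `id2label` values follow the usual `B-`/`I-`/`E-`/`S-`
prefix convention):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "OpenMed/privacy-filter-multilingual", trust_remote_code=True
)

labels = list(config.id2label.values())
print(len(labels))  # 217 = 4 * 54 + 1

# Strip the BIOES prefix to recover the 54 underlying categories.
categories = {lbl.split("-", 1)[1] for lbl in labels if lbl != "O"}
print(len(categories))  # 54
```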
## Limitations & intended use

- **Multilingual but uneven.** Strongest on languages with rich PII training
  data (German, Spanish, French, Italian, Hindi, Telugu, English). CJK languages
  (Japanese, Korean, Chinese) and some morphologically marked low-resource
  languages remain the main bottleneck on the current training mix.
- **Synthetic training data.** The AI4Privacy datasets are template-synthesized;
  real clinical notes, legal documents, and web text may show different
  surface forms. For high-stakes deployments, collect a domain-specific eval
  set and re-calibrate thresholds (see the sketch after this list).
- **Not a substitute for legal compliance review.** Use alongside a governance
  layer (human review, deterministic regex pre-filters, etc.).
- **Not a clinical PHI model.** Healthcare-specific PHI and clinical entity
  training is planned as a separate branch.
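A minimal sketch of that re-calibration step, reusing the `extract_pii` API
from the quick start (the threshold values below are placeholders to tune on
your own eval set, not recommendations):

```python
from openmed import extract_pii

# Placeholder per-label thresholds -- tune these on a domain-specific eval set.
THRESHOLDS = {"EMAIL": 0.50, "PHONE": 0.60}
DEFAULT_THRESHOLD = 0.85

text = "Call 415-555-0123 or write to sarah.johnson@example.com."
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual")

kept = [
    ent
    for ent in result.entities
    if ent.confidence >= THRESHOLDS.get(ent.label, DEFAULT_THRESHOLD)
]
```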
## Training notes

**Head initialization**: `opf`'s default "copy-from-matching-base" head init.
Of the 217 new BIOES classes, the few with exact base-vocabulary matches
(`O`, `B/I/E/S-account_name`, etc.) were copied directly; the rest were copied
from semantically adjacent coarse rows and fine-tuned end-to-end.
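Schematically, the copy-from-matching-base idea looks like the sketch below.
This is an illustration only, not `opf`'s actual code; it assumes plain linear
classification heads, and `coarse_fallback` (mapping each fine label to the
base row it borrows) is a hypothetical helper:

```python
import torch

def init_head_from_base(new_head, base_head, new_id2label, base_label2id, coarse_fallback):
    """Seed a fine-grained classifier head from a coarse base head.

    Rows with an exact label match in the base vocabulary are copied directly;
    every other row starts from a semantically adjacent coarse row.
    """
    with torch.no_grad():
        for idx, label in new_id2label.items():
            source = base_label2id.get(label)
            if source is None:
                source = base_label2id[coarse_fallback[label]]
            new_head.weight[idx] = base_head.weight[source]
            new_head.bias[idx] = base_head.bias[source]
```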
**Router**: base model has 128 MoE experts per layer with top-4 routing.
Routers were kept trainable during full fine-tuning; no collapse was observed.
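For intuition, top-4 routing over 128 experts looks roughly like this generic
MoE sketch (toy batch and hidden sizes; not this model's actual implementation):

```python
import torch

d_model, n_experts, k = 512, 128, 4           # toy hidden size; real expert count / top-k
router = torch.nn.Linear(d_model, n_experts)  # one router per MoE layer

hidden = torch.randn(2, 16, d_model)          # (batch, tokens, d_model)
probs = router(hidden).softmax(dim=-1)        # (2, 16, 128) expert probabilities
weights, experts = torch.topk(probs, k=k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the top-4
# Each token's output is the weighted mix of its 4 selected experts' outputs.
```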
## Credits & Acknowledgements

This model wouldn't exist without two open-source releases – sincere thanks
to both teams:

- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
  (architecture, modeling code, and the `opf` training/eval CLI). Everything in
  this repo is a fine-tune on top of that release.
- **AI4Privacy** for releasing the multilingual PII masking datasets used as
  training data:
  [`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k),
  [`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k),
  and [`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).

Additional thanks to the **Hugging Face** team for the `transformers` /
`huggingface_hub` ecosystem this model ships through.
## License

Apache 2.0.
## Citation

If you use this model, please cite **this model**, the organization behind it
(**OpenMed**), and the upstream base model + datasets:

```bibtex
@misc{openmed_privacy_filter_multilingual_2026,
  author       = {OpenMed},
  title        = {{OpenMed/privacy-filter-multilingual}: multilingual fine-grained PII extraction across 16 languages and 54 categories},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed/privacy-filter-multilingual}}
}

@misc{openmed_2026,
  author       = {OpenMed},
  title        = {{OpenMed}: open models and resources for healthcare NLP},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed}}
}

@misc{openai_privacy_filter_2025,
  author       = {OpenAI},
  title        = {{openai/privacy-filter}},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
}

@misc{ai4privacy_pii_masking,
  author       = {AI4Privacy},
  title        = {{AI4Privacy PII Masking Datasets}},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ai4privacy}}
}
```