Release v2.1 · model + card

	---
	license: mit
	language:
	- de
	base_model: openai/privacy-filter
	pipeline_tag: token-classification
	library_name: opf
	tags:
	- pii
	- privacy
	- ner
	- token-classification
	- german
	- de
	- privacy-filter
	- opf
	datasets:
	- ai4privacy/open-pii-masking-500k-ai4privacy
	metrics:
	- f1
	model-index:
	- name: digitflow/privacy-filter-de-ft
	  results:
	  - task:
	      type: token-classification
	      name: PII detection (German)
	    dataset:
	      name: ai4privacy/open-pii-masking-500k-ai4privacy (de validation, n=1,000)
	      type: ai4privacy/open-pii-masking-500k-ai4privacy
	      split: validation
	      args:
	        language: de
	    metrics:
	    - type: f1
	      value: 0.8706
	      name: OPF-containment F1 (char-level, label-agnostic)
	    - type: f1
	      value: 0.8368
	      name: Char-coverage F1 (label-aware)
	    - type: f1
	      value: 0.6445
	      name: Strict span F1
	---

	# digitflow/privacy-filter-de-ft

	A German-language fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter).
	It exposes the same inference API and OPF label space as the base
	model, so existing OPF call sites work without changes on German
	input.

	Caveat. This model is not a perfect redactor for German PII. No
	warranty is provided and Digitflow accepts no legal responsibility
	for decisions made on its output. Use at your own risk. For
	non-German text, use [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
	directly.

	## Benchmark

	Evaluated on a 1,000-row German slice (`language == 'de'`, the first
	1,000 matching rows) of the
	[`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
	validation split. The headline metric is OPF-containment F1, the
	char-level, label-agnostic completeness metric from the OPF reference
	scoring code. 95 % confidence intervals come from 1,000 bootstrap
	resamples drawn with replacement, taking the 2.5th and 97.5th
	percentiles of the resulting F1 distribution.

	| Metric | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` | Δ |
	|---|---:|---:|---:|
	| OPF-containment F1 | 0.8437 | 0.8706 | +0.027 |
	| Leak rate (1 − char recall, label-agnostic) | 23.05 % | 20.49 % | −2.56 pp |
	| Char-coverage F1, label-aware | 0.6791 | 0.8368 | +0.158 |
	| Strict span F1 | 0.4348 | 0.6445 | +0.210 |
	| Strict span precision | 0.5645 | 0.7518 | +0.187 |
	| Strict span recall | 0.3536 | 0.5640 | +0.210 |

	| Model | OPF-containment F1 | 95 % bootstrap CI |
	|---|---:|---|
	| `openai/privacy-filter` | 0.8437 | [0.8294, 0.8579] |
	| `digitflow/privacy-filter-de-ft` | 0.8706 | [0.8585, 0.8812] |

	The intervals do not overlap, so the +0.027 lift is unlikely to be an
	artifact of sampling noise on this single 1,000-row slice.
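
The percentile bootstrap described in this section can be sketched as follows. This is a minimal illustration, not the script used to produce the numbers above. It assumes each evaluated row has already been reduced to character counts `(tp, fp, fn)`, and the names `micro_f1` and `bootstrap_ci` are hypothetical.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from per-row (tp, fp, fn) character counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(counts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample rows with replacement, rescore."""
    rng = random.Random(seed)
    n = len(counts)
    scores = sorted(
        micro_f1([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Nearest-rank percentiles (no interpolation between order statistics).
    lo = scores[int((alpha / 2) * (n_resamples - 1))]
    hi = scores[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


# Toy per-row counts, purely illustrative (not taken from the benchmark).
toy_counts = [(10, 1, 2), (7, 0, 3), (12, 2, 0), (5, 1, 1), (9, 0, 4)]
point = micro_f1(toy_counts)
low, high = bootstrap_ci(toy_counts)
print(f"F1={point:.4f}  95% CI=[{low:.4f}, {high:.4f}]")
```

Resampling whole rows, rather than individual characters, keeps the characters of one document together, which matches the idea of estimating noise from the choice of evaluation rows.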

	## Examples

	Output of `m.redact(text)`, formatted as `label:'redacted text'`.
	`(none)` means the model returned no spans.

	| Input | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` |
	|---|---|---|
	| Mein Name ist Jürgen Müller und ich wohne in Hamburg. | `(none)` | `private_person:'Jürgen Müller'`, `private_address:'Hamburg'` |
	| Mein Passwort lautet SicherPasswort123! | `(none)` | `secret:'SicherPasswort123!'` |
	| Senden Sie das Paket an Hauptstraße 25, 10115 Berlin. | `(none)` | `private_address:'Hauptstraße 25, 10115 Berlin'` |
	| Hans-Jürgen Brömmelmeyer hat den Termin bestätigt. | `(none)` | `private_person:'Hans-Jürgen Brömmelmeyer'` |
	| Server-Status: https://intern.firma.de/health. | `(none)` | `private_url:'https://intern.firma.de/health'` |
	| Termin mit Mariella von Schönefeld-Brixius um 15:00. | `private_person:'Mariella von Schönefeld-Brixius'` | `private_person:'Mariella von Schönefeld-Brixius'`, `private_date:'15:00'` |
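
The `label:'text'` rendering used in the table can be reproduced with a small helper along these lines. This is a sketch: `Span` is a stand-in dataclass that mirrors the `label`, `start`, and `end` attributes used in the usage snippet on this card, and `format_spans` and `span_for` are hypothetical names, not part of `opf`.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for a detected span: a label plus character offsets."""
    label: str
    start: int
    end: int  # exclusive


def span_for(text, label, needle):
    """Build a Span covering the first occurrence of `needle` in `text`."""
    start = text.index(needle)
    return Span(label, start, start + len(needle))


def format_spans(text, spans):
    """Render spans as label:'matched text', or (none) when empty."""
    if not spans:
        return "(none)"
    ordered = sorted(spans, key=lambda s: s.start)
    return ", ".join(f"{s.label}:'{text[s.start:s.end]}'" for s in ordered)


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
spans = [
    span_for(text, "private_person", "Jürgen Müller"),
    span_for(text, "private_address", "Hamburg"),
]
print(format_spans(text, spans))
# private_person:'Jürgen Müller', private_address:'Hamburg'
print(format_spans(text, []))
# (none)
```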

	## How it was built

	The fine-tune adapts the base model to German PII through slot-filled
	augmentation of public German carriers.
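
This card does not document the augmentation pipeline, so the following is only a generic sketch of what slot filling can look like, not the method used here. The `{label}` template syntax, the value pools, and the function name `fill_slots` are all assumptions made for illustration.

```python
import random
import re

SLOT = re.compile(r"\{(\w+)\}")  # hypothetical slot syntax: {label}


def fill_slots(carrier, pools, rng):
    """Fill each {label} slot; return (text, [(label, start, end), ...])."""
    parts, spans = [], []
    pos = 0      # write position in the output text
    cursor = 0   # read position in the carrier template
    for match in SLOT.finditer(carrier):
        literal = carrier[cursor:match.start()]
        parts.append(literal)
        pos += len(literal)
        label = match.group(1)
        value = rng.choice(pools[label])
        parts.append(value)
        spans.append((label, pos, pos + len(value)))
        pos += len(value)
        cursor = match.end()
    parts.append(carrier[cursor:])
    return "".join(parts), spans


# Illustrative pools; the values are borrowed from the examples on this card.
pools = {
    "private_person": ["Jürgen Müller", "Hans-Jürgen Brömmelmeyer"],
    "private_address": ["Hamburg", "Hauptstraße 25, 10115 Berlin"],
}
carrier = "Mein Name ist {private_person} und ich wohne in {private_address}."
filled, filled_spans = fill_slots(carrier, pools, random.Random(0))
print(filled)
for label, start, end in filled_spans:
    print(label, repr(filled[start:end]))
```

Tracking the offsets while filling means the gold spans come for free, so no separate annotation pass is needed on the generated text.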

	The augmented data is supplemented by a hand-authored curriculum
	spanning real-world text registers. Training ran on a single NVIDIA
	Jetson Orin.

	The training set is screened against the evaluation slice for
	contamination before training begins.
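
The screening method is not specified on this card. One simple way to do such a check, shown here purely as an assumption, is exact matching on normalized text. The helpers `fingerprint` and `drop_contaminated` are hypothetical names, and a real pipeline may use stricter checks such as n-gram overlap.

```python
import hashlib
import unicodedata


def fingerprint(text):
    """Hash of the text after Unicode, case, and whitespace normalization."""
    norm = unicodedata.normalize("NFKC", text).casefold()
    norm = " ".join(norm.split())
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()


def drop_contaminated(train_texts, eval_texts):
    """Keep only training texts whose fingerprint is absent from eval."""
    eval_prints = {fingerprint(t) for t in eval_texts}
    return [t for t in train_texts if fingerprint(t) not in eval_prints]


train = ["Mein Name ist Jürgen Müller.", "Der Termin ist um 15:00."]
evals = ["mein  name ist JÜRGEN MÜLLER."]
clean = drop_contaminated(train, evals)
print(clean)
# ['Der Termin ist um 15:00.']
```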

	## How to use it

	The OPF Python API is unchanged. Fetch the checkpoint with
	`huggingface_hub.snapshot_download(...)` and pass the resulting local
	path to `opf.OPF`.

	```python
	from huggingface_hub import snapshot_download
	import opf

	path = snapshot_download("digitflow/privacy-filter-de-ft")

	m = opf.OPF(
	    model=path,
	    device="cuda",
	    output_mode="typed",
	    decode_mode="viterbi",
	)

	text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
	result = m.redact(text)
	for span in result.detected_spans:
	    print(f"{span.label}: {text[span.start:span.end]!r}")
	# private_person: 'Jürgen Müller'
	# private_address: 'Hamburg'
	```

	`snapshot_download` caches the weights under `~/.cache/huggingface/`
	by default, so subsequent calls reuse the local copy instead of
	downloading again. The current `opf` release does not resolve a Hub
	repo id directly; it expects a local checkpoint directory.

	### Reproducing the benchmark

	```python
	from datasets import load_dataset
	from huggingface_hub import snapshot_download
	import opf
	# ... plus shared.span_prf and metrics.char_coverage_prf from the
	# openai/privacy-filter reference scoring code.

	ds = load_dataset(
	    "ai4privacy/open-pii-masking-500k-ai4privacy",
	    split="validation",
	)
	de = ds.filter(lambda r: r["language"] == "de").select(range(1000))

	ft_path = snapshot_download("digitflow/privacy-filter-de-ft")
	m_base = opf.OPF(device="cuda", output_mode="typed", decode_mode="viterbi")
	m_ft = opf.OPF(
	    model=ft_path,
	    device="cuda",
	    output_mode="typed",
	    decode_mode="viterbi",
	)

	# Run m.redact() per row, collect predicted spans, score against gold
	# with `char_coverage_prf(predictions, golds, label_aware=False)`.
	# Report the __micro__.f1 as OPF-containment F1.
	```
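
To make the character-level metrics concrete, here is a simplified, label-agnostic scorer for a single text. It is a sketch and not the reference `char_coverage_prf`, whose exact behavior may differ. The name `char_prf` is hypothetical, and spans are assumed to be `(start, end)` pairs with an exclusive end.

```python
def char_prf(pred_spans, gold_spans):
    """Label-agnostic char-level precision, recall, and F1 for one text."""
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Gold covers a 13-char name and a 7-char city; the prediction finds the name.
gold = [(14, 27), (45, 52)]
pred = [(14, 27)]
p, r, f1 = char_prf(pred, gold)
leak_rate = 1 - r  # share of gold PII characters left unmasked
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f} leak={leak_rate:.2f}")
# P=1.00 R=0.65 F1=0.79 leak=0.35
```

The benchmark numbers on this card are micro-aggregated over all rows, so a faithful reproduction would sum character counts across rows before computing the ratios rather than averaging per-row scores.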

	## License and citations

	License. [MIT](./LICENSE).

	[`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
	was used as the source of training carriers (with augmentation) and
	as the validation slice for the benchmark above.

	[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
	is the base model (Apache 2.0).