---
license: mit
language:
- de
base_model: openai/privacy-filter
pipeline_tag: token-classification
library_name: opf
tags:
- pii
- privacy
- ner
- token-classification
- german
- de
- privacy-filter
- opf
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
metrics:
- f1
model-index:
- name: digitflow/privacy-filter-de-ft
  results:
  - task:
      type: token-classification
      name: PII detection (German)
    dataset:
      name: ai4privacy/open-pii-masking-500k-ai4privacy (de validation, n=1,000)
      type: ai4privacy/open-pii-masking-500k-ai4privacy
      split: validation
      args:
        language: de
    metrics:
    - type: f1
      value: 0.8706
      name: OPF-containment F1 (char-level, label-agnostic)
    - type: f1
      value: 0.8368
      name: Char-coverage F1 (label-aware)
    - type: f1
      value: 0.6445
      name: Strict span F1
---

# digitflow/privacy-filter-de-ft

A German-language fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter). It exposes the same inference API and OPF label space as the base model, so existing OPF call sites work without changes on German input.

**Caveat.** This model is not a perfect redactor for German PII. No warranty is provided and Digitflow accepts no legal responsibility for decisions made on its output. Use at your own risk.

For non-German text, use [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) directly.

## Benchmark

Evaluated on the German subset (`language == 'de'`, n = 1,000) of the [`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) validation split, scored with OPF-containment F1 (the char-level, label-agnostic completeness metric from the OPF reference scoring code). 95 % confidence intervals are estimated by 1,000-sample bootstrap resampling with replacement, taking the 2.5th and 97.5th percentiles of the resulting F1 distribution.

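The bootstrap procedure described above can be sketched as follows. This is a minimal illustration, not the evaluation script behind the published numbers: it assumes the metric is micro-averaged over per-document character counts `(tp, fp, fn)`, that each of the 1,000 replicates resamples whole documents and is the same size as the original slice, and the helper names (`micro_f1`, `percentile`, `bootstrap_ci`) and the demo counts are hypothetical.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from per-document (tp, fp, fn) character counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def bootstrap_ci(counts, n_resamples=1000, seed=0):
    """95 % percentile-bootstrap interval for micro F1, resampling documents."""
    rng = random.Random(seed)
    n = len(counts)
    scores = sorted(
        micro_f1([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    return percentile(scores, 2.5), percentile(scores, 97.5)


# Hypothetical per-document counts, for illustration only (not real results).
demo = [(18, 1, 4), (25, 0, 2), (0, 0, 7), (12, 3, 0), (30, 2, 5)] * 40
low, high = bootstrap_ci(demo)
print(f"F1 = {micro_f1(demo):.4f}, 95 % CI = [{low:.4f}, {high:.4f}]")
```

Resampling at the document level keeps each document's characters together, so the interval reflects variation across documents instead of treating every character as an independent observation.
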

| Metric | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` | Δ |
|---|---:|---:|---:|
| **OPF-containment F1** | 0.8437 | **0.8706** | **+0.027** |
| Leak rate (1 − char recall, label-agnostic) | 23.05 % | **20.49 %** | **−2.56 pp** |
| Char-coverage F1, label-aware | 0.6791 | **0.8368** | **+0.158** |
| Strict span F1 | 0.4348 | **0.6445** | **+0.210** |
| Strict span precision | 0.5645 | **0.7518** | +0.187 |
| Strict span recall | 0.3536 | **0.5640** | +0.210 |

| Model | OPF-containment F1 | 95 % bootstrap CI |
|---|---:|---|
| `openai/privacy-filter` | 0.8437 | [0.8294, 0.8579] |
| `digitflow/privacy-filter-de-ft` | 0.8706 | [0.8585, 0.8812] |

The intervals do not overlap, although the gap is narrow (0.8579 vs. 0.8585), so the +0.027 lift is unlikely to be explained by sampling noise within this single evaluation slice.

## Examples

Output of `m.redact(text)`, formatted as `label:'redacted text'`. `(none)` means the model returned no spans.

| Input | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` |
|---|---|---|
| Mein Name ist Jürgen Müller und ich wohne in Hamburg. | `(none)` | `private_person:'Jürgen Müller'`, `private_address:'Hamburg'` |
| Mein Passwort lautet SicherPasswort123! | `(none)` | `secret:'SicherPasswort123!'` |
| Senden Sie das Paket an Hauptstraße 25, 10115 Berlin. | `(none)` | `private_address:'Hauptstraße 25, 10115 Berlin'` |
| Hans-Jürgen Brömmelmeyer hat den Termin bestätigt. | `(none)` | `private_person:'Hans-Jürgen Brömmelmeyer'` |
| Server-Status: https://intern.firma.de/health. | `(none)` | `private_url:'https://intern.firma.de/health'` |
| Termin mit Mariella von Schönefeld-Brixius um 15:00. | `private_person:'Mariella von Schönefeld-Brixius'` | `private_person:'Mariella von Schönefeld-Brixius'`, `private_date:'15:00'` |

## How it was built

The fine-tune adapts the base model to German PII through slot-filled augmentation of public German carriers. The augmented data is supplemented by a hand-authored curriculum spanning real-world text registers. Training ran on a single NVIDIA Jetson Orin.

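To make the term concrete, here is a minimal sketch of what slot-filled augmentation can look like. It is an assumption about the general technique, not the pipeline used for this model: the carrier templates, the slot values, and the `fill` helper are all hypothetical, and only labels that appear in the examples above are used. Each `{label}` placeholder in a carrier sentence is replaced by a sampled value, and the character offsets of the inserted value are recorded as a gold span.

```python
import random
import re

# Hypothetical carriers and slot values, for illustration only.
CARRIERS = [
    "Mein Name ist {private_person} und ich wohne in {private_address}.",
    "Server-Status: {private_url}.",
]
SLOT_VALUES = {
    "private_person": ["Jürgen Müller", "Hans-Jürgen Brömmelmeyer"],
    "private_address": ["Hamburg", "Hauptstraße 25, 10115 Berlin"],
    "private_url": ["https://intern.firma.de/health"],
}
SLOT = re.compile(r"\{(\w+)\}")


def fill(carrier, rng):
    """Fill every {label} slot; return the text and (label, start, end) spans."""
    parts, spans, cursor, length = [], [], 0, 0
    for match in SLOT.finditer(carrier):
        literal = carrier[cursor:match.start()]
        label = match.group(1)
        value = rng.choice(SLOT_VALUES[label])
        parts.append(literal)
        length += len(literal)
        # Offsets refer to the filled text, not to the template.
        spans.append((label, length, length + len(value)))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(carrier[cursor:])
    return "".join(parts), spans


rng = random.Random(0)
for carrier in CARRIERS:
    text, spans = fill(carrier, rng)
    print(text)
    for label, start, end in spans:
        print(f"  {label}: {text[start:end]!r}")
```

Computing the offsets while the text is assembled avoids searching for the value afterwards, which would mislabel the span whenever the same string also occurs in the carrier itself.
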
The training set is screened against the evaluation slice for contamination before training begins.

## How to use it

The OPF Python API is unchanged. Fetch the checkpoint with `huggingface_hub.snapshot_download(...)` and pass the resulting local path to `opf.OPF`.

```python
from huggingface_hub import snapshot_download
import opf

path = snapshot_download("digitflow/privacy-filter-de-ft")
m = opf.OPF(
    model=path,
    device="cuda",
    output_mode="typed",
    decode_mode="viterbi",
)

text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
result = m.redact(text)
for span in result.detected_spans:
    print(f"{span.label}: {text[span.start:span.end]!r}")
# private_person: 'Jürgen Müller'
# private_address: 'Hamburg'
```

`snapshot_download` caches the weights under `~/.cache/huggingface/`, so subsequent calls reuse the cached files instead of downloading them again. The current `opf` release does not resolve a Hub repo id directly; it expects a local checkpoint directory.

### Reproducing the benchmark

```python
from datasets import load_dataset
from huggingface_hub import snapshot_download
import opf
# ... plus shared.span_prf and metrics.char_coverage_prf from the
# openai/privacy-filter reference scoring code.

ds = load_dataset(
    "ai4privacy/open-pii-masking-500k-ai4privacy",
    split="validation",
)
de = ds.filter(lambda r: r["language"] == "de").select(range(1000))

ft_path = snapshot_download("digitflow/privacy-filter-de-ft")
m_base = opf.OPF(device="cuda", output_mode="typed", decode_mode="viterbi")
m_ft = opf.OPF(model=ft_path, device="cuda", output_mode="typed", decode_mode="viterbi")

# Run m.redact() per row, collect predicted spans, score against gold
# with `char_coverage_prf(predictions, golds, label_aware=False)`.
# Report the __micro__.f1 as OPF-containment F1.
```

## License and citations

**License.** [MIT](./LICENSE).

[`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) was used as the source of training carriers (with augmentation) and as the validation slice for the benchmark above.

[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) is the base model (Apache 2.0).