---
license: mit
language:
- de
base_model: openai/privacy-filter
pipeline_tag: token-classification
library_name: opf
tags:
- pii
- privacy
- ner
- token-classification
- german
- de
- privacy-filter
- opf
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
metrics:
- f1
model-index:
- name: digitflow/privacy-filter-de-ft
  results:
  - task:
      type: token-classification
      name: PII detection (German)
    dataset:
      name: ai4privacy/open-pii-masking-500k-ai4privacy (de validation, n=1,000)
      type: ai4privacy/open-pii-masking-500k-ai4privacy
      split: validation
      args:
        language: de
    metrics:
    - type: f1
      value: 0.8706
      name: OPF-containment F1 (char-level, label-agnostic)
    - type: f1
      value: 0.8368
      name: Char-coverage F1 (label-aware)
    - type: f1
      value: 0.6445
      name: Strict span F1
---

# digitflow/privacy-filter-de-ft

A German-language fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter).
It exposes the same inference API and OPF label space as the base
model, so existing OPF call sites work without changes on German
input.

**Caveat.** This model is not a perfect redactor for German PII. No
warranty is provided and Digitflow accepts no legal responsibility
for decisions made on its output. Use at your own risk. For
non-German text, use [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
directly.

## Benchmark

Evaluated on the German subset (`language == 'de'`, n = 1,000) of the
[`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
validation split, scored with OPF-containment F1 (the char-level,
label-agnostic completeness metric from the OPF reference scoring
code). 95 % confidence intervals are estimated by 1,000-sample
bootstrap resampling with replacement, taking the 2.5th and 97.5th
percentiles of the resulting F1 distribution.

| Metric | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` | Δ |
|---|---:|---:|---:|
| **OPF-containment F1** | 0.8437 | **0.8706** | **+0.027** |
| Leak rate (1 − char recall, label-agnostic) | 23.05 % | **20.49 %** | **−2.56 pp** |
| Char-coverage F1, label-aware | 0.6791 | **0.8368** | **+0.158** |
| Strict span F1 | 0.4348 | **0.6445** | **+0.210** |
| Strict span precision | 0.5645 | **0.7518** | +0.187 |
| Strict span recall | 0.3536 | **0.5640** | +0.210 |
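
To make the gap between the char-level and strict rows concrete, here is a minimal sketch of the two scoring styles on one example. It is not the OPF reference scoring code: `char_prf` and `strict_prf` are hypothetical helpers. It assumes spans are `(start, end, label)` tuples, that "strict" means an exact match on all three fields, and that label-agnostic char scoring compares sets of covered character offsets.

```python
def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from a true-positive count and the two totals."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def covered(spans):
    """Set of character offsets covered by any span, ignoring labels."""
    return {i for start, end, _ in spans for i in range(start, end)}


def char_prf(pred, gold):
    """Char-level, label-agnostic scores (assumed definition)."""
    pc, gc = covered(pred), covered(gold)
    return prf(len(pc & gc), len(pc), len(gc))


def strict_prf(pred, gold):
    """Exact (start, end, label) match scores (assumed definition)."""
    ps, gs = set(pred), set(gold)
    return prf(len(ps & gs), len(ps), len(gs))


text = "Senden Sie das Paket an Hauptstraße 25, 10115 Berlin."
address = "Hauptstraße 25, 10115 Berlin"
start = text.index(address)
gold = [(start, start + len(address), "private_address")]
# Hypothetical prediction that stops after the house number.
pred = [(start, start + len("Hauptstraße 25"), "private_address")]

p, r, f1 = char_prf(pred, gold)
print(f"char-level: P={p:.2f} R={r:.2f} F1={f1:.2f} leak={1 - r:.2f}")
# char-level: P=1.00 R=0.50 F1=0.67 leak=0.50
print("strict F1:", strict_prf(pred, gold)[2])
# strict F1: 0.0
```

A partially correct span still earns char-level credit but counts as a complete miss under strict matching, which is why the strict numbers in the table are lower for both models.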

| Model | OPF-containment F1 | 95 % bootstrap CI |
|---|---:|---|
| `openai/privacy-filter` | 0.8437 | [0.8294, 0.8579] |
| `digitflow/privacy-filter-de-ft` | 0.8706 | [0.8585, 0.8812] |

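The intervals do not overlap, although the gap is narrow (upper bound
0.8579 for the base model, lower bound 0.8585 for the fine-tune). The
+0.027 lift therefore exceeds the sampling noise estimated on this
single 1,000-example slice.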
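
The percentile bootstrap described above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the table: it assumes the resampling unit is one evaluation example, that each example is summarised by hypothetical `(tp, n_pred, n_gold)` counts, that F1 is micro-averaged over the resample, and that percentiles are taken by simple nearest rank. The counts below are made up.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 over a list of (tp, n_pred, n_gold) tuples."""
    tp = sum(c[0] for c in counts)
    n_pred = sum(c[1] for c in counts)
    n_gold = sum(c[2] for c in counts)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def bootstrap_ci(counts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample examples with replacement,
    recompute micro F1 each time, then read off the tail percentiles."""
    rng = random.Random(seed)
    n = len(counts)
    scores = sorted(
        micro_f1([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = scores[round(alpha / 2 * n_resamples)]
    hi = scores[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up per-example counts, for illustration only.
counts = [(8, 10, 10), (3, 5, 9), (0, 2, 4), (6, 6, 7)] * 25
low, high = bootstrap_ci(counts)
print(f"F1 = {micro_f1(counts):.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```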

## Examples

Output of `m.redact(text)`, formatted as `label:'redacted text'`.
`(none)` means the model returned no spans.

| Input | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` |
|---|---|---|
| Mein Name ist Jürgen Müller und ich wohne in Hamburg. | `(none)` | `private_person:'Jürgen Müller'`, `private_address:'Hamburg'` |
| Mein Passwort lautet SicherPasswort123! | `(none)` | `secret:'SicherPasswort123!'` |
| Senden Sie das Paket an Hauptstraße 25, 10115 Berlin. | `(none)` | `private_address:'Hauptstraße 25, 10115 Berlin'` |
| Hans-Jürgen Brömmelmeyer hat den Termin bestätigt. | `(none)` | `private_person:'Hans-Jürgen Brömmelmeyer'` |
| Server-Status: https://intern.firma.de/health. | `(none)` | `private_url:'https://intern.firma.de/health'` |
| Termin mit Mariella von Schönefeld-Brixius um 15:00. | `private_person:'Mariella von Schönefeld-Brixius'` | `private_person:'Mariella von Schönefeld-Brixius'`, `private_date:'15:00'` |

## How it was built

The fine-tune adapts the base model to German PII through slot-filled
augmentation of public German carriers.
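
As an illustration only of what slot-filled augmentation can look like, the sketch below fills `{label}` placeholders in a carrier sentence and records the character span of each inserted value. The carrier, the slot syntax, the value lists and the `fill_slots` helper are all hypothetical; the actual templates and fillers used for this model are not described in this card.

```python
import random
import re

SLOT = re.compile(r"\{(\w+)\}")


def fill_slots(carrier, values, rng):
    """Fill {label} placeholders; return (text, [(start, end, label), ...])."""
    parts, spans = [], []
    pos = 0      # read position in the carrier
    cursor = 0   # length of the output text built so far
    for m in SLOT.finditer(carrier):
        literal = carrier[pos:m.start()]
        parts.append(literal)
        cursor += len(literal)
        label = m.group(1)
        value = rng.choice(values[label])
        parts.append(value)
        spans.append((cursor, cursor + len(value), label))
        cursor += len(value)
        pos = m.end()
    parts.append(carrier[pos:])
    return "".join(parts), spans


# Hypothetical carrier and filler values.
carrier = "Mein Name ist {private_person} und ich wohne in {private_address}."
values = {
    "private_person": ["Jürgen Müller", "Hans-Jürgen Brömmelmeyer"],
    "private_address": ["Hamburg", "Berlin"],
}

text, spans = fill_slots(carrier, values, random.Random(0))
print(text)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Because the spans are computed while the text is assembled, the gold offsets stay exact regardless of how long each filler is.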

The augmented data is supplemented by a hand-authored curriculum
spanning real-world text registers. Training ran on a single NVIDIA
Jetson Orin.

The training set is screened against the evaluation slice for
contamination before training begins.
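
The card does not say how the screening is done. One common approach, sketched here purely as an assumption, drops any training text that matches an evaluation text after normalisation or shares a word n-gram with one. The `screen` helper and the n-gram length are hypothetical choices.

```python
import re


def normalize(text):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.casefold()).strip()


def ngrams(text, n):
    """Set of word n-grams of the normalised text."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def screen(train_texts, eval_texts, n=5):
    """Split train_texts into (kept, dropped) by overlap with eval_texts."""
    eval_exact = {normalize(t) for t in eval_texts}
    eval_grams = set().union(*(ngrams(t, n) for t in eval_texts))
    kept, dropped = [], []
    for t in train_texts:
        if normalize(t) in eval_exact or ngrams(t, n) & eval_grams:
            dropped.append(t)
        else:
            kept.append(t)
    return kept, dropped


eval_texts = ["Senden Sie das Paket an Hauptstraße 25, 10115 Berlin."]
train_texts = [
    "senden sie das  paket an hauptstraße 25, 10115 berlin.",   # exact after normalising
    "Mein Passwort lautet SicherPasswort123!",                   # unrelated
    "Bitte senden Sie das Paket an Hauptstraße 25, heute noch.", # shares a 5-gram
]
kept, dropped = screen(train_texts, eval_texts)
print(len(kept), "kept,", len(dropped), "dropped")
# 1 kept, 2 dropped
```

Texts shorter than `n` words produce no n-grams, so in this sketch they are only caught by the exact-match check.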

## How to use it

The OPF Python API is unchanged. Fetch the checkpoint with
`huggingface_hub.snapshot_download(...)` and pass the resulting local
path to `opf.OPF`.

```python
from huggingface_hub import snapshot_download
import opf

path = snapshot_download("digitflow/privacy-filter-de-ft")

m = opf.OPF(
    model=path,
    device="cuda",
    output_mode="typed",
    decode_mode="viterbi",
)

text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
result = m.redact(text)
for span in result.detected_spans:
    print(f"{span.label}: {text[span.start:span.end]!r}")
# private_person: 'Jürgen Müller'
# private_address: 'Hamburg'
```

`snapshot_download` caches the weights under `~/.cache/huggingface/`
by default, so subsequent calls reuse the local copy instead of
downloading again. The current `opf` release does not resolve a Hub
repo id directly; it expects a local checkpoint directory.
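
If you need your own placeholder format, the detected spans can be applied to the text by hand. The sketch below is a hypothetical helper, not part of `opf`; it relies only on the `label`, `start` and `end` attributes used above, passed in as plain `(start, end, label)` tuples so it runs without a model.

```python
def mask_spans(text, spans):
    """Replace each (start, end, label) span with a [LABEL] placeholder."""
    out, pos = [], 0
    for start, end, label in sorted(spans):
        if start < pos:
            # Overlaps a span that was already masked; skip it.
            continue
        out.append(text[pos:start])
        out.append(f"[{label.upper()}]")
        pos = end
    out.append(text[pos:])
    return "".join(out)


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
# Spans written out by hand here; offsets are looked up with str.index.
name, city = "Jürgen Müller", "Hamburg"
spans = [
    (text.index(name), text.index(name) + len(name), "private_person"),
    (text.index(city), text.index(city) + len(city), "private_address"),
]
print(mask_spans(text, spans))
# Mein Name ist [PRIVATE_PERSON] und ich wohne in [PRIVATE_ADDRESS].
```

With a real model, the tuples would come from something like `[(s.start, s.end, s.label) for s in result.detected_spans]`.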

### Reproducing the benchmark

```python
from datasets import load_dataset
from huggingface_hub import snapshot_download
import opf
# ... plus shared.span_prf and metrics.char_coverage_prf from the
# openai/privacy-filter reference scoring code.

ds = load_dataset(
    "ai4privacy/open-pii-masking-500k-ai4privacy",
    split="validation",
)
de = ds.filter(lambda r: r["language"] == "de").select(range(1000))

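ft_path = snapshot_download("digitflow/privacy-filter-de-ft")
# Base model: no `model` argument, i.e. opf's default checkpoint.
m_base = opf.OPF(device="cuda", output_mode="typed", decode_mode="viterbi")
# Fine-tune: same settings, pointed at the downloaded local checkpoint.
m_ft = opf.OPF(model=ft_path,
               device="cuda", output_mode="typed", decode_mode="viterbi")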

# Run m.redact() per row, collect predicted spans, score against gold
# with `char_coverage_prf(predictions, golds, label_aware=False)`.
# Report the __micro__.f1 as OPF-containment F1.
```
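
The per-row loop mentioned in the comments could look roughly like this. `collect_spans` is a hypothetical helper that only assumes `redact(text)` returns an object with `detected_spans`, each carrying `start`, `end` and `label` as in the usage example. A stand-in `fake_redact` is used so the sketch runs without a model.

```python
from types import SimpleNamespace


def collect_spans(redact, texts):
    """Run `redact` on each text; return one list of (start, end, label) per text."""
    return [
        [(s.start, s.end, s.label) for s in redact(t).detected_spans]
        for t in texts
    ]


def fake_redact(text):
    """Stand-in for m.redact that flags the literal string 'Hamburg'."""
    i = text.find("Hamburg")
    spans = []
    if i >= 0:
        spans.append(SimpleNamespace(start=i, end=i + len("Hamburg"),
                                     label="private_address"))
    return SimpleNamespace(detected_spans=spans)


texts = ["Ich wohne in Hamburg.", "Kein Treffer hier."]
preds = collect_spans(fake_redact, texts)
print(preds)
```

For the benchmark, `redact` would be `m_base.redact` or `m_ft.redact`, and `texts` the text column of `de`. The name of that column and the mapping from the dataset's gold labels to OPF labels are not given here and need to be checked against the dataset card.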

## License and citations

**License.** [MIT](./LICENSE).

[`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
was used as the source of training carriers (with augmentation) and
as the validation slice for the benchmark above.

[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
is the base model (Apache 2.0).