| --- |
| license: mit |
| language: |
| - de |
| base_model: openai/privacy-filter |
| pipeline_tag: token-classification |
| library_name: opf |
| tags: |
| - pii |
| - privacy |
| - ner |
| - token-classification |
| - german |
| - de |
| - privacy-filter |
| - opf |
| datasets: |
| - ai4privacy/open-pii-masking-500k-ai4privacy |
| metrics: |
| - f1 |
| model-index: |
| - name: digitflow/privacy-filter-de-ft |
|   results: |
|   - task: |
|       type: token-classification |
|       name: PII detection (German) |
|     dataset: |
|       name: ai4privacy/open-pii-masking-500k-ai4privacy (de validation, n=1,000) |
|       type: ai4privacy/open-pii-masking-500k-ai4privacy |
|       split: validation |
|       args: |
|         language: de |
|     metrics: |
|     - type: f1 |
|       value: 0.8706 |
|       name: OPF-containment F1 (char-level, label-agnostic) |
|     - type: f1 |
|       value: 0.8368 |
|       name: Char-coverage F1 (label-aware) |
|     - type: f1 |
|       value: 0.6445 |
|       name: Strict span F1 |
| --- |
| |
| # digitflow/privacy-filter-de-ft |
|
|
| A German-language fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter). |
| It exposes the same inference API and OPF label space as the base |
| model, so existing OPF call sites work without changes on German |
| input. |
|
|
| **Caveat.** This model is not a perfect redactor for German PII. No |
| warranty is provided and Digitflow accepts no legal responsibility |
| for decisions made on its output. Use at your own risk. For |
| non-German text, use [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) |
| directly. |
|
|
| ## Benchmark |
|
|
| Evaluated on the German subset (`language == 'de'`, n = 1,000) of the |
| [`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) |
| validation split, scored with OPF-containment F1 (the char-level, |
| label-agnostic completeness metric from the OPF reference scoring |
| code). 95 % confidence intervals are estimated by 1,000-sample |
| bootstrap resampling with replacement, taking the 2.5th and 97.5th |
| percentiles of the resulting F1 distribution. |
|
|
| | Metric | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` | Δ | |
| |---|---:|---:|---:| |
| | **OPF-containment F1** | 0.8437 | **0.8706** | **+0.027** | |
| | Leak rate (1 − char recall, label-agnostic) | 23.05 % | **20.49 %** | **−2.56 pp** | |
| | Char-coverage F1, label-aware | 0.6791 | **0.8368** | **+0.158** | |
| | Strict span F1 | 0.4348 | **0.6445** | **+0.210** | |
| Strict span precision | 0.5645 | **0.7518** | **+0.187** | |
| Strict span recall | 0.3536 | **0.5640** | **+0.210** | |
|
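To make the character-level rows concrete, here is a minimal sketch of a label-agnostic character-overlap score and the leak rate derived from it. It is not the OPF reference scoring code: the function names `covered_chars` and `char_prf` are hypothetical, spans are assumed to be half-open `(start, end)` character offsets, and the "prediction" is invented to show a miss.

```python
def covered_chars(spans):
    """Set of character offsets covered by half-open (start, end) spans."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_prf(pred_spans, gold_spans):
    """Label-agnostic character-level precision, recall and F1."""
    pred = covered_chars(pred_spans)
    gold = covered_chars(gold_spans)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
name_start = text.index("Jürgen Müller")
city_start = text.index("Hamburg")

# Gold marks both the name and the city; the invented prediction misses the city.
gold_spans = [
    (name_start, name_start + len("Jürgen Müller")),
    (city_start, city_start + len("Hamburg")),
]
pred_spans = [(name_start, name_start + len("Jürgen Müller"))]

precision, recall, f1 = char_prf(pred_spans, gold_spans)
leak_rate = 1.0 - recall  # share of gold PII characters left uncovered
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.4f} leak={leak_rate:.2f}")
```

In this toy case 13 of the 20 gold characters are covered, so recall is 0.65 and the leak rate is 0.35, even though precision is perfect.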
|
| | Model | OPF-containment F1 | 95 % bootstrap CI | |
| |---|---:|---| |
| | `openai/privacy-filter` | 0.8437 | [0.8294, 0.8579] | |
| | `digitflow/privacy-filter-de-ft` | 0.8706 | [0.8585, 0.8812] | |
|
|
| The intervals do not overlap, so the +0.027 lift is unlikely to be an |
| artifact of sampling noise on this single 1,000-example slice. |
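The percentile bootstrap described in this section can be sketched as follows. This is an illustration under stated assumptions, not the script used for the numbers above: it assumes that each evaluation example has been reduced to `(tp, fp, fn)` character counts, that "1,000-sample" means 1,000 resamples, and that a simple nearest-rank percentile is acceptable. The names `micro_f1`, `bootstrap_ci` and `toy_counts` are hypothetical, and the counts are invented.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from per-example (tp, fp, fn) character counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(counts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample examples with replacement."""
    rng = random.Random(seed)
    n = len(counts)
    scores = sorted(
        micro_f1([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Nearest-rank approximation of the 2.5th and 97.5th percentiles.
    lo_idx = int(alpha / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return scores[lo_idx], scores[hi_idx]


# Toy per-example counts, invented for illustration only.
toy_counts = [(8, 1, 2), (5, 0, 1), (0, 2, 3), (10, 1, 0)] * 25
point = micro_f1(toy_counts)
low, high = bootstrap_ci(toy_counts)
print(f"F1 = {point:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

Resampling whole examples rather than individual characters keeps the characters of one text together, which matches the idea of sampling noise over the evaluation slice.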
|
|
| ## Examples |
|
|
| Output of `m.redact(text)`, formatted as `label:'redacted text'`. |
| `(none)` means the model returned no spans. |
|
|
| | Input | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` | |
| |---|---|---| |
| | Mein Name ist Jürgen Müller und ich wohne in Hamburg. | `(none)` | `private_person:'Jürgen Müller'`, `private_address:'Hamburg'` | |
| | Mein Passwort lautet SicherPasswort123! | `(none)` | `secret:'SicherPasswort123!'` | |
| | Senden Sie das Paket an Hauptstraße 25, 10115 Berlin. | `(none)` | `private_address:'Hauptstraße 25, 10115 Berlin'` | |
| | Hans-Jürgen Brömmelmeyer hat den Termin bestätigt. | `(none)` | `private_person:'Hans-Jürgen Brömmelmeyer'` | |
| | Server-Status: https://intern.firma.de/health. | `(none)` | `private_url:'https://intern.firma.de/health'` | |
| | Termin mit Mariella von Schönefeld-Brixius um 15:00. | `private_person:'Mariella von Schönefeld-Brixius'` | `private_person:'Mariella von Schönefeld-Brixius'`, `private_date:'15:00'` | |
|
|
| ## How it was built |
|
|
| The fine-tune adapts the base model to German PII through slot-filled |
| augmentation of public German carriers. |
|
|
| The augmented data is supplemented by a hand-authored curriculum |
| spanning real-world text registers. Training ran on a single NVIDIA |
| Jetson Orin. |
|
|
| The training set is screened against the evaluation slice for |
| contamination before training begins. |
|
|
| ## How to use it |
|
|
| The OPF Python API is unchanged. Fetch the checkpoint with |
| `huggingface_hub.snapshot_download(...)` and pass the resulting local |
| path to `opf.OPF`. |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| import opf |
| |
| path = snapshot_download("digitflow/privacy-filter-de-ft") |
| |
| m = opf.OPF( |
|     model=path, |
|     device="cuda", |
|     output_mode="typed", |
|     decode_mode="viterbi", |
| ) |
| |
| text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg." |
| result = m.redact(text) |
| for span in result.detected_spans: |
|     print(f"{span.label}: {text[span.start:span.end]!r}") |
| # private_person: 'Jürgen Müller' |
| # private_address: 'Hamburg' |
| ``` |
|
|
| `snapshot_download` caches the weights under `~/.cache/huggingface/` |
| by default, so subsequent calls reuse the local copy instead of |
| downloading again. The current `opf` release does not resolve a Hub |
| repo id directly; it expects a local checkpoint directory. |
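If you need masked text rather than a list of spans, the detected offsets can be applied directly. The sketch below relies only on the `label`, `start` and `end` attributes used in the snippet above; `Span` is a hypothetical stand-in for whatever object `opf` returns, `mask_text` is an illustrative helper rather than part of the `opf` API, and spans are assumed not to overlap. The result object may already expose redacted text, in which case that is the better choice.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in exposing the fields used in the snippet above."""
    label: str
    start: int
    end: int


def mask_text(text, spans):
    """Replace each span with a [label] placeholder, working right to left."""
    masked = text
    # Right-to-left replacement keeps earlier offsets valid as the text changes.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        masked = masked[:span.start] + f"[{span.label}]" + masked[span.end:]
    return masked


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
name_start = text.index("Jürgen Müller")
city_start = text.index("Hamburg")
spans = [
    Span("private_person", name_start, name_start + len("Jürgen Müller")),
    Span("private_address", city_start, city_start + len("Hamburg")),
]
masked = mask_text(text, spans)
print(masked)
# Mein Name ist [private_person] und ich wohne in [private_address].
```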
|
|
| ### Reproducing the benchmark |
|
|
| ```python |
| from datasets import load_dataset |
| from huggingface_hub import snapshot_download |
| import opf |
| # ... plus shared.span_prf and metrics.char_coverage_prf from the |
| # openai/privacy-filter reference scoring code. |
| |
| ds = load_dataset( |
|     "ai4privacy/open-pii-masking-500k-ai4privacy", |
|     split="validation", |
| ) |
| de = ds.filter(lambda r: r["language"] == "de").select(range(1000)) |
| |
| ft_path = snapshot_download("digitflow/privacy-filter-de-ft") |
| m_base = opf.OPF(device="cuda", output_mode="typed", decode_mode="viterbi") |
| m_ft = opf.OPF(model=ft_path, |
|                device="cuda", output_mode="typed", decode_mode="viterbi") |
| |
| # Run m.redact() per row, collect predicted spans, score against gold |
| # with `char_coverage_prf(predictions, golds, label_aware=False)`. |
| # Report the __micro__.f1 as OPF-containment F1. |
| ``` |
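For the strict span rows of the benchmark table, a common definition counts a prediction as correct only when start, end and label all match a gold span exactly. The sketch below assumes that definition; the reference `span_prf` may differ in detail, and the name `strict_span_prf` and the example offsets are hypothetical.

```python
def strict_span_prf(pred_spans, gold_spans):
    """Precision, recall and F1 over exact (start, end, label) matches."""
    pred = set(pred_spans)
    gold = set(gold_spans)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
name_start = text.index("Jürgen Müller")
city_start = text.index("Hamburg")

gold_spans = [
    (name_start, name_start + len("Jürgen Müller"), "private_person"),
    (city_start, city_start + len("Hamburg"), "private_address"),
]
# Invented prediction: the city span swallows the trailing period,
# so it covers all PII characters but fails the exact-boundary test.
pred_spans = [
    (name_start, name_start + len("Jürgen Müller"), "private_person"),
    (city_start, city_start + len("Hamburg") + 1, "private_address"),
]

precision, recall, f1 = strict_span_prf(pred_spans, gold_spans)
print(f"strict P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The off-by-one boundary costs a full span here, which is why strict span scores can sit well below character-level scores for the same predictions.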
|
|
| ## License and citations |
|
|
| **License.** [MIT](./LICENSE). |
|
|
| [`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) |
| was used as the source of training carriers (with augmentation) and |
| as the validation slice for the benchmark above. |
|
|
| [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) |
| is the base model (Apache 2.0). |
|
|