---
license: mit
language:
- de
base_model: openai/privacy-filter
pipeline_tag: token-classification
library_name: opf
tags:
- pii
- privacy
- ner
- token-classification
- german
- de
- privacy-filter
- opf
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
metrics:
- f1
model-index:
- name: digitflow/privacy-filter-de-ft
  results:
  - task:
      type: token-classification
      name: PII detection (German)
    dataset:
      name: ai4privacy/open-pii-masking-500k-ai4privacy (de validation, n=1,000)
      type: ai4privacy/open-pii-masking-500k-ai4privacy
      split: validation
      args:
        language: de
    metrics:
    - type: f1
      value: 0.8706
      name: OPF-containment F1 (char-level, label-agnostic)
    - type: f1
      value: 0.8368
      name: Char-coverage F1 (label-aware)
    - type: f1
      value: 0.6445
      name: Strict span F1
---
# digitflow/privacy-filter-de-ft
A German-language fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter).
It exposes the same inference API and OPF label space as the base
model, so existing OPF call sites work on German input once they
load this checkpoint.
**Caveat.** This model is not a perfect redactor for German PII. No
warranty is provided and Digitflow accepts no legal responsibility
for decisions made on its output. Use at your own risk. For
non-German text, use [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
directly.
## Benchmark
Evaluated on the German subset (`language == 'de'`, n = 1,000) of the
[`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
validation split, scored with OPF-containment F1 (the char-level,
label-agnostic completeness metric from the OPF reference scoring
code). 95 % confidence intervals are estimated by 1,000-sample
bootstrap resampling with replacement, taking the 2.5th and 97.5th
percentiles of the resulting F1 distribution.
| Metric | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` | Δ |
|---|---:|---:|---:|
| **OPF-containment F1** | 0.8437 | **0.8706** | **+0.027** |
| Leak rate (1 − char recall, label-agnostic) | 23.05 % | **20.49 %** | **−2.56 pp** |
| Char-coverage F1, label-aware | 0.6791 | **0.8368** | **+0.158** |
| Strict span F1 | 0.4348 | **0.6445** | **+0.210** |
| Strict span precision | 0.5645 | **0.7518** | +0.187 |
| Strict span recall | 0.3536 | **0.5640** | +0.210 |
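To make the metrics in the table concrete, the following is a minimal sketch of how character-level, label-agnostic precision, recall, F1 and leak rate, and a strict span F1, could be computed for a single text. It is not the OPF reference scoring code. The function names, the `(label, start, end)` tuple format, and the hypothetical prediction are assumptions made for illustration.

```python
def covered_chars(spans):
    """Return the set of character offsets covered by (label, start, end) spans.

    Labels are ignored here, which is what makes the metric label-agnostic.
    """
    chars = set()
    for _label, start, end in spans:
        chars.update(range(start, end))
    return chars


def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from a true-positive count and two totals."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def char_prf(pred, gold):
    """Character-level, label-agnostic scores."""
    p, g = covered_chars(pred), covered_chars(gold)
    return prf(len(p & g), len(p), len(g))


def strict_span_prf(pred, gold):
    """Strict scores: a span counts only if label, start and end all match."""
    p, g = set(pred), set(gold)
    return prf(len(p & g), len(p), len(g))


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
gold = [("private_person", 14, 27), ("private_address", 45, 52)]
pred = [("private_person", 14, 27)]  # hypothetical output that misses the city

char_p, char_r, char_f1 = char_prf(pred, gold)
leak_rate = 1 - char_r  # share of gold PII characters left unmasked
span_p, span_r, span_f1 = strict_span_prf(pred, gold)
print(f"char F1 = {char_f1:.4f}, leak rate = {leak_rate:.2%}, strict F1 = {span_f1:.4f}")
```

In this toy case 13 of the 20 gold characters are covered, so character recall is 0.65 and the leak rate is 35 %, while the strict span recall is only 0.5 because one of two spans is missing entirely.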

| Model | OPF-containment F1 | 95 % bootstrap CI |
|---|---:|---|
| `openai/privacy-filter` | 0.8437 | [0.8294, 0.8579] |
| `digitflow/privacy-filter-de-ft` | 0.8706 | [0.8585, 0.8812] |
The intervals do not overlap, although the gap is narrow (0.8579 vs.
0.8585), so the +0.027 lift is unlikely to be an artifact of sampling
noise on this single evaluation slice.
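The percentile bootstrap described above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the reported numbers: it assumes that whole documents are resampled, that each document contributes character-level `(tp, fp, fn)` counts to a micro-averaged F1, and that "1,000-sample" means 1,000 resamples. The toy counts are randomly generated and hypothetical.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from per-document (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(counts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling documents with replacement."""
    rng = random.Random(seed)
    n = len(counts)
    scores = []
    for _ in range(n_resamples):
        sample = [counts[rng.randrange(n)] for _ in range(n)]
        scores.append(micro_f1(sample))
    scores.sort()
    # Nearest-rank approximation of the 2.5th and 97.5th percentiles.
    lo_idx = int(alpha / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return scores[lo_idx], scores[hi_idx]


# Hypothetical per-document counts, for illustration only.
toy_rng = random.Random(42)
toy_counts = [
    (toy_rng.randint(5, 30), toy_rng.randint(0, 5), toy_rng.randint(0, 8))
    for _ in range(200)
]
point = micro_f1(toy_counts)
low, high = bootstrap_ci(toy_counts)
print(f"F1 = {point:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

Because both models are scored on the same rows, a paired bootstrap over the per-resample F1 difference would test the lift more directly than comparing two separate intervals.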
## Examples
Output of `m.redact(text)`, formatted as `label:'redacted text'`.
`(none)` means the model returned no spans.
| Input | `openai/privacy-filter` | `digitflow/privacy-filter-de-ft` |
|---|---|---|
| Mein Name ist Jürgen Müller und ich wohne in Hamburg. | `(none)` | `private_person:'Jürgen Müller'`, `private_address:'Hamburg'` |
| Mein Passwort lautet SicherPasswort123! | `(none)` | `secret:'SicherPasswort123!'` |
| Senden Sie das Paket an Hauptstraße 25, 10115 Berlin. | `(none)` | `private_address:'Hauptstraße 25, 10115 Berlin'` |
| Hans-Jürgen Brömmelmeyer hat den Termin bestätigt. | `(none)` | `private_person:'Hans-Jürgen Brömmelmeyer'` |
| Server-Status: https://intern.firma.de/health. | `(none)` | `private_url:'https://intern.firma.de/health'` |
| Termin mit Mariella von Schönefeld-Brixius um 15:00. | `private_person:'Mariella von Schönefeld-Brixius'` | `private_person:'Mariella von Schönefeld-Brixius'`, `private_date:'15:00'` |
## How it was built
The fine-tune adapts the base model to German PII through slot-filled
augmentation of public German carrier sentences.
The augmented data is supplemented by a hand-authored curriculum
spanning real-world text registers. Training ran on a single NVIDIA
Jetson Orin.
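As a rough illustration of slot-filled augmentation, the sketch below fills labelled placeholders in a carrier sentence and records the character span of every inserted value. The `{label}` placeholder syntax, the `fill_slots` helper, and the filler lists are hypothetical; the actual augmentation pipeline is not described in this card.

```python
import random
import re

SLOT = re.compile(r"\{(\w+)\}")  # assumed placeholder syntax, e.g. {private_person}


def fill_slots(template, fillers, rng):
    """Fill each {label} slot and return (text, [(label, start, end), ...])."""
    parts, spans = [], []
    pos = 0   # current length of the output text
    last = 0  # end of the previous match in the template
    for match in SLOT.finditer(template):
        literal = template[last:match.start()]
        parts.append(literal)
        pos += len(literal)
        label = match.group(1)
        value = rng.choice(fillers[label])
        parts.append(value)
        spans.append((label, pos, pos + len(value)))
        pos += len(value)
        last = match.end()
    parts.append(template[last:])
    return "".join(parts), spans


carrier = "Mein Name ist {private_person} und ich wohne in {private_address}."
fillers = {
    "private_person": ["Jürgen Müller"],  # one option each keeps the demo deterministic
    "private_address": ["Hamburg"],
}
text, spans = fill_slots(carrier, fillers, random.Random(0))
print(text)
print(spans)
```

Recording offsets while the text is assembled avoids searching for the values afterwards, which would be ambiguous when a filler string also appears elsewhere in the carrier.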
The training set is screened against the evaluation slice for
contamination before training begins.
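Since the training carriers and the benchmark slice come from the same dataset, a screen of this kind matters. The card does not say which check is used, so the following is only a minimal sketch of the simplest variant, exact match after normalization; `normalize` and `screen_contamination` are hypothetical helpers.

```python
import re
import unicodedata


def normalize(text):
    """Unicode-normalize, case-fold and collapse whitespace for comparison."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def screen_contamination(train_texts, eval_texts):
    """Drop training texts whose normalized form also occurs in the eval set."""
    eval_keys = {normalize(t) for t in eval_texts}
    return [t for t in train_texts if normalize(t) not in eval_keys]


train_texts = ["Mein Name ist  Jürgen Müller.", "Der Termin ist morgen."]
eval_texts = ["mein name ist jürgen müller."]
clean = screen_contamination(train_texts, eval_texts)
print(clean)
```

An exact-match screen would not catch a carrier whose PII values were swapped by augmentation; a stricter variant could compare texts after replacing annotated spans with their labels, or use n-gram overlap.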
## How to use it
The OPF Python API is unchanged. Fetch the checkpoint with
`huggingface_hub.snapshot_download(...)` and pass the resulting local
path to `opf.OPF`.
```python
from huggingface_hub import snapshot_download
import opf

# Download the fine-tuned checkpoint (or reuse the cached copy).
path = snapshot_download("digitflow/privacy-filter-de-ft")

m = opf.OPF(
    model=path,
    device="cuda",
    output_mode="typed",
    decode_mode="viterbi",
)

text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
result = m.redact(text)
for span in result.detected_spans:
    print(f"{span.label}: {text[span.start:span.end]!r}")
# private_person: 'Jürgen Müller'
# private_address: 'Hamburg'
```
`snapshot_download` caches the weights under `~/.cache/huggingface/`
by default, so subsequent calls reuse the local copy instead of
downloading again. The current `opf` release does not resolve a Hub
repo id directly; it expects a local checkpoint directory.
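If placeholder-masked text is needed downstream, the detected spans can be applied to the input by hand. The sketch below assumes only the `label`, `start` and `end` attributes used in the snippet above; the `Span` dataclass is a stand-in for the OPF span object, and `mask_spans` is a hypothetical helper, not part of `opf`. It also assumes the spans do not overlap.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for an OPF detected span (only the fields used here)."""
    label: str
    start: int
    end: int


def mask_spans(text, spans):
    """Replace each span with a [LABEL] placeholder.

    Spans are applied from right to left so earlier offsets stay valid.
    """
    out = text
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + f"[{span.label.upper()}]" + out[span.end:]
    return out


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
spans = [Span("private_person", 14, 27), Span("private_address", 45, 52)]
masked = mask_spans(text, spans)
print(masked)
# Mein Name ist [PRIVATE_PERSON] und ich wohne in [PRIVATE_ADDRESS].
```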
### Reproducing the benchmark
```python
from datasets import load_dataset
from huggingface_hub import snapshot_download
import opf
# ... plus shared.span_prf and metrics.char_coverage_prf from the
# openai/privacy-filter reference scoring code.

ds = load_dataset(
    "ai4privacy/open-pii-masking-500k-ai4privacy",
    split="validation",
)
# German rows only, truncated to the first 1,000.
de = ds.filter(lambda r: r["language"] == "de").select(range(1000))

ft_path = snapshot_download("digitflow/privacy-filter-de-ft")

# Base model and fine-tune, loaded with identical decoding settings.
m_base = opf.OPF(device="cuda", output_mode="typed", decode_mode="viterbi")
m_ft = opf.OPF(
    model=ft_path,
    device="cuda",
    output_mode="typed",
    decode_mode="viterbi",
)

# Run m.redact() per row, collect predicted spans, score against gold
# with `char_coverage_prf(predictions, golds, label_aware=False)`.
# Report the __micro__.f1 as OPF-containment F1.
```
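The per-row prediction step mentioned in the comments could be sketched like this. `predict_spans` is a hypothetical helper that relies only on `redact(...)` and `detected_spans` as used earlier in this card; `_StubModel` exists purely so the sketch runs without `opf` or a GPU. How gold spans are read from the dataset rows and passed to the reference scorer is left out, because the scorer's input format is not shown here.

```python
from types import SimpleNamespace


def predict_spans(model, texts):
    """Run model.redact on each text; return per-text (label, start, end) lists."""
    predictions = []
    for text in texts:
        result = model.redact(text)
        predictions.append(
            [(s.label, s.start, s.end) for s in result.detected_spans]
        )
    return predictions


class _StubModel:
    """Hypothetical stand-in for opf.OPF that tags one fixed name."""

    def redact(self, text):
        name = "Jürgen Müller"
        start = text.find(name)
        spans = []
        if start != -1:
            spans.append(
                SimpleNamespace(label="private_person", start=start, end=start + len(name))
            )
        return SimpleNamespace(detected_spans=spans)


texts = ["Mein Name ist Jürgen Müller.", "Der Termin ist morgen."]
preds = predict_spans(_StubModel(), texts)
print(preds)
```

With the real models, the same function would be called once with `m_base` and once with `m_ft` on the same list of texts, so both are scored on identical inputs.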
## License and citations
**License.** [MIT](./LICENSE).
[`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
was used as the source of training carriers (with augmentation) and
as the validation slice for the benchmark above.
[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
is the base model (Apache 2.0).