screenpipe-pii-redactor

A screenpipe project.

screenpipe's own PII redactor, purpose-built for the three surfaces an AI agent actually sees a user's machine through:

  1. Accessibility-tree dumps — the structured AX hierarchy macOS / Windows expose to assistive tech. Short, structured, full of labels like AXButton[Send to marcus@helios-ai.io].
  2. OCR'd screen text — what tools like screenpipe extract from screen recordings: window-title-shaped artifacts, app chrome, and the occasional long-form email or doc.
  3. Computer-use traces — what an agentic model (Claude Computer Use, GPT operators, etc.) reads when it drives a desktop.

These surfaces are short, sparse-context, and full of identifiers that slip past redactors trained on chat-style prose. This is a compact, multilingual token classifier trained in-house specifically for them. It is not OpenAI's Privacy Filter, and it is not a fine-tune of one — it's screenpipe's own model. 278 MB INT8 ONNX, ~9 ms p50 on CPU, runs fully offline.

License: CC BY-NC 4.0 (non-commercial). For commercial use — production redaction, SaaS / API embedding, AI-agent privacy middleware, custom fine-tunes — contact louis@screenpi.pe. See LICENSE.

Headline numbers

On ScreenLeak, our open benchmark for PII redaction on screen telemetry — n=422 hand-labelled desktop-telemetry strings, 13 categories, strict per-string zero-leak (every PII span in the string must be caught):

Model                 Zero-leak   Notes
Gemini 3.1 Pro        91.0%       cloud API
GPT-5.5               90.7%       cloud API
Claude Opus 4.7       87.8%       cloud API
this model            86.7%       local · 278 MB · ~9 ms CPU · $0/call
Google Cloud DLP      37.7%       cloud API
Microsoft Presidio    35.4%       local OSS

Within 1.1 to 4.3 points of the frontier APIs and roughly 50 points above Google Cloud DLP and Microsoft Presidio, at zero per-call cost and fully offline. Full methodology, confidence intervals, and per-framework (HIPAA / GDPR / PCI DSS / …) breakdowns: github.com/screenpipe/screenleak. Try it in your browser: screenpipe.github.io/screenleak/demo.
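
The strict per-string zero-leak criterion can be sketched as below. This is an illustrative reading of the metric as described in this section, not the ScreenLeak scoring code. The coverage rule (a gold span counts as caught only when predicted spans cover every character of it) and all function names are assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

def is_covered(gold: Span, predicted: List[Span]) -> bool:
    """Assumed rule: a gold span is caught only if every character in it
    falls inside at least one predicted span."""
    start, end = gold
    covered = [False] * (end - start)
    for p_start, p_end in predicted:
        for i in range(max(start, p_start), min(end, p_end)):
            covered[i - start] = True
    return all(covered)

def zero_leak(gold: List[Span], predicted: List[Span]) -> bool:
    """A string passes only when every gold PII span is caught."""
    return all(is_covered(g, predicted) for g in gold)

def zero_leak_rate(examples: List[Tuple[List[Span], List[Span]]]) -> float:
    """Fraction of strings with no leaked span."""
    if not examples:
        return 0.0
    return sum(zero_leak(g, p) for g, p in examples) / len(examples)
```

Under this reading, one missed span fails the whole string, which is why the numbers look harsher than a token-level or span-level F1 would.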

What it does

Span-level redaction. Given a string, returns the regions it thinks are PII, each classified into one of 12 canonical categories:

private_person, private_email, private_phone, private_address,
private_url, private_company, private_repo, private_handle,
private_channel, private_id, private_date, secret

secret covers passwords, API keys, JWTs, DB connection strings, PRIVATE-KEY block markers, etc.

Tip: screenpipe redacts each captured fragment independently (one AX node / OCR line / window title at a time) — that's the distribution the model is tuned for. Feeding a giant multi-entity blob in a single call degrades recall; split on natural boundaries first.
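
A minimal sketch of per-fragment redaction, assuming newlines are the natural boundary and that the detector returns `(start, end, label, text)` tuples in the same shape as the Python `redact` function under Inference. The names `redact_fragments` and `fake_redact` are hypothetical; `fake_redact` is a regex stand-in so the example runs without the model.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, label, matched_text)

def redact_fragments(blob: str, redact_fn: Callable[[str], List[Span]]) -> List[Span]:
    """Run redact_fn on each non-empty line and shift spans back to blob offsets."""
    spans: List[Span] = []
    for match in re.finditer(r"[^\n]+", blob):
        fragment = match.group()
        if not fragment.strip():
            continue  # skip whitespace-only lines
        base = match.start()
        for start, end, label, text in redact_fn(fragment):
            spans.append((base + start, base + end, label, text))
    return spans

# Stand-in detector for demonstration only; the real model call goes here.
def fake_redact(fragment: str) -> List[Span]:
    return [(m.start(), m.end(), "private_email", m.group())
            for m in re.finditer(r"[\w.+-]+@[\w-]+\.[\w.-]+", fragment)]
```

Each fragment is scored on its own, and the returned offsets index into the original blob, so downstream masking can operate on the full text.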

Inference

Browser (transformers.js):

import { pipeline } from "@huggingface/transformers";
const pii = await pipeline("token-classification", "screenpipe/pii-redactor", { dtype: "q8" });
const out = await pii("export OPENAI_API_KEY=sk-proj-Ab12Cd34Ef56Gh78");
// out: per-token tags; group consecutive B-/I- tags into spans

Python (transformers):

# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tok   = AutoTokenizer.from_pretrained("screenpipe/pii-redactor")
model = AutoModelForTokenClassification.from_pretrained("screenpipe/pii-redactor").eval()
id2label = model.config.id2label

def redact(text):
    enc = tok(text, return_offsets_mapping=True, return_tensors="pt", truncation=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        pred = model(**enc).logits.argmax(-1)[0].tolist()
    # decode BIO from offsets + argmax (aggregation_strategy is unreliable
    # for this tokenizer — walk the offsets yourself)
    spans, cur = [], None
    for (s, e), p in zip(offsets, pred):
        if s == e:                       # special token: zero-width offset
            cur = None
            continue
        lab = id2label[p]
        if lab == "O":                   # outside any entity: close the open span
            cur = None
            continue
        base = lab.split("-", 1)[-1]     # strip the B-/I- prefix
        if cur and cur["label"] == base and not lab.startswith("B-"):
            cur["end"] = e               # continuation: extend the open span
        else:
            cur = {"start": s, "end": e, "label": base}
            spans.append(cur)
    return [(d["start"], d["end"], d["label"], text[d["start"]:d["end"]]) for d in spans]

print(redact("export OPENAI_API_KEY=sk-proj-Ab12Cd34Ef56Gh78"))
# -> [(..., ..., 'secret', 'sk-proj-Ab12Cd34Ef56Gh78')]
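
The function above returns spans rather than redacted text. One way to turn them into a masked string is sketched here, assuming non-overlapping spans and a `[LABEL]` placeholder format. Both the placeholder convention and the name `apply_redactions` are hypothetical, not something this model card specifies.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, label, matched_text), as returned by redact()

def apply_redactions(text: str, spans: List[Span]) -> str:
    """Replace each span with [LABEL], working right to left so earlier
    offsets stay valid while the string changes length."""
    out = text
    for start, end, label, _ in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + f"[{label.upper()}]" + out[end:]
    return out
```

Usage would look like `apply_redactions(text, redact(text))`.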

Production INT8 ONNX weights are in onnx/ — load with onnxruntime on any platform (CoreML / DirectML / CUDA / CPU baseline); the same file ships everywhere.
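
A hedged sketch of choosing an execution provider with onnxruntime. The provider strings are onnxruntime's standard identifiers; the preference order and the helper names are assumptions. The exact filename inside onnx/ is not stated here, so it is left as the `model_path` parameter.

```python
from typing import List, Sequence

# Preference order is an assumption: hardware accelerators first, CPU last.
PREFERRED = (
    "CoreMLExecutionProvider",  # macOS
    "DmlExecutionProvider",     # DirectML on Windows
    "CUDAExecutionProvider",    # NVIDIA GPUs
    "CPUExecutionProvider",     # always-available baseline
)

def pick_providers(available: Sequence[str]) -> List[str]:
    """Keep the preferred providers that this onnxruntime build actually offers."""
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]

def load_session(model_path: str):
    """Open the ONNX file with the best available provider list."""
    import onnxruntime as ort  # pip install onnxruntime
    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Keeping `CPUExecutionProvider` at the end of the list gives onnxruntime a fallback when an accelerator cannot run part of the graph.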

Multilingual

Handles 6 languages (en, fr, de, it, es, nl). English is strongest; Dutch is the weakest and is flagged as a known gap. Validate on your locale before deploying.

Limitations

  1. Sudo / login password prompts can leak. For [sudo] password for alice: hunter2, the model may redact the username while the password survives. Pair with an OS-level keystroke-suppression policy.
  2. Multi-entity blobs degrade recall — redact per captured fragment (see tip above), not one giant concatenated string.
  3. Synthetic training data only. No real user data was used. Validate on YOUR data before deploying.
  4. Over-redaction (oversmash). The model errs toward redacting — good for privacy-first deployments; flag it if you need clean text downstream.
  5. Strict zero-leak metric. Absolute numbers depend on the evaluator's taxonomy and metric; macro-F1 is a more lenient lens.
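
For the first limitation, a rule-based pre-pass can serve as an extra guard alongside the model. This is a sketch of one possible complement, not part of the model or of screenpipe's pipeline. The regex covers only prompts shaped like the `[sudo] password for alice: hunter2` example, and the `[SECRET]` placeholder and function name are hypothetical choices.

```python
import re

# Matches "[sudo] password for <user>: <typed text>" and plain "Password: <typed text>".
PROMPT_RE = re.compile(
    r"((?:\[sudo\]\s+)?password(?:\s+for\s+\S+)?:[ \t]*)(\S.*)",
    re.IGNORECASE,
)

def mask_password_prompts(text: str) -> str:
    """Mask whatever follows a password prompt on the same line."""
    return PROMPT_RE.sub(lambda m: m.group(1) + "[SECRET]", text)
```

The username is left in place on purpose, since the model is expected to handle it; this pass only targets the text typed after the prompt.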

License

CC BY-NC 4.0 — non-commercial use only. See NOTICE for third-party component attributions.

For commercial licensing (production deployment, redistribution, SaaS / API embedding, custom fine-tunes for your domain): louis@screenpi.pe.

Citation

@misc{screenpipe-pii-redactor-2026,
  title  = {screenpipe-pii-redactor: a PII redactor for accessibility
            trees, OCR'd screen text, and computer-use traces},
  author = {{screenpipe}},
  year   = {2026},
  url    = {https://huggingface.co/screenpipe/pii-redactor}
}