louis030195 committed on
Commit
b25b954
·
verified ·
1 Parent(s): d324f94

docs: rewrite card — screenpipe's own detector, v11 numbers, drop training-data internals

Files changed (1)
  1. README.md +72 -216
README.md CHANGED
@@ -9,9 +9,7 @@ tags:
9
  - privacy
10
  - redaction
11
  - object-detection
12
- - rf-detr
13
  - screen-capture
14
- - accessibility
15
  - computer-use
16
  - agentic
17
  - screenpipe
@@ -31,25 +29,14 @@ extra_gated_prompt: >-
31
  > A [screenpipe](https://screenpi.pe) project. The image-modality
32
  > companion to [`screenpipe/pii-redactor`](https://huggingface.co/screenpipe/pii-redactor).
33
 
34
- A fine-tuned **image PII detector** for the same three surfaces an AI
35
- agent sees a user's machine through:
36
-
37
- 1. **Screen captures**: JPGs / PNGs of the user's screen, rendered
38
- text and structured chrome (Slack, Outlook, Cursor, Terminal,
39
- Confluence, GitHub, 1Password, calendars, browsers).
40
- 2. **Computer-use traces**: the visual frames an agentic model
41
- (Claude Computer Use, GPT operator, etc.) reads when it controls a
42
- desktop.
43
- 3. **Accessibility-tree visualizations** — when an agent screenshots
44
- what it inferred from the AX tree to debug a tool call.
45
-
46
- These surfaces are **dense, multi-PII, semi-structured** in ways no
47
- prose-trained PII detector handles well. Returns pixel-space bounding
48
- boxes for 12 canonical PII categories.
49
-
50
- ONNX, ~108 MB. Same `.onnx` ships across macOS / Windows / Linux —
51
- the user's ONNX Runtime selects the Execution Provider at load time
52
- (CoreML, DirectML, CUDA, or CPU baseline).
53
 
54
  > **License: CC BY-NC 4.0** (non-commercial). For commercial use —
55
  > production redaction, SaaS / API embedding, AI-agent privacy
@@ -58,65 +45,34 @@ the user's ONNX Runtime selects the Execution Provider at load time
58
 
59
  ## Headline numbers
60
 
61
- `rfdetr_v8` on a held-out 221-image validation split (190 PII-bearing,
62
- 31 hard negatives) of the [screenpipe-pii-bench-image](https://github.com/screenpipe/screenpipe-pii-bench-image)
63
- corpus, IoU ≥ 0.30:
64
-
65
- | metric | this model | regex+OCR floor | Microsoft Presidio (published OSS) |
66
- |---|---:|---:|---:|
67
- | **zero-leak** (every gold span caught) | **95.3%** | 2.6% | 0.5% |
68
- | **oversmash** (false-fire on negatives) | **0.0%** | 3.2% | 48.4% |
69
- | micro-precision | 99% | 87% | 47% |
70
- | micro-recall | 97% | 26% | 42% |
71
- | macro-F1 | 0.871 | 0.318 | 0.190 |
72
-
73
- Per-label recall (a few highlights): `private_person` 0.99 ·
74
- `private_company` 1.00 · `private_repo` 1.00 · `private_url` 1.00 ·
75
- `secret` 0.99 · `private_email` 0.98 · `private_phone` 0.92 ·
76
- `private_address` 0.92.
77
-
78
- ### Latency (rfdetr_v8, 320×320 input, FP32)
79
-
80
- | platform | EP | p50 |
81
- |-------------------------------|-----------|----------:|
82
- | macOS Apple Silicon (M-series) | CoreML | **66 ms** ([real-screen sample](https://github.com/screenpipe/screenpipe-pii-bench-image)) |
83
- | macOS Apple Silicon (M-series) | CPU | 163 ms |
84
- | Windows + DX12 GPU | DirectML | ~30-60 ms (estimated) |
85
- | Linux + NVIDIA | CUDA | ~10-20 ms (estimated) |
86
- | Linux/Windows CPU-only | CPU | ~140 ms |
87
-
88
- Same `.onnx` everywhere — Execution Provider is selected at load time
89
- by the user's ONNX Runtime build. **No CUDA / Vulkan / GPU vendor SDKs
90
- required at the consumer.**
91
-
92
- ## Why this exists (vs Presidio Image Redactor and friends)
93
-
94
- The published baselines are trained on prose / generic-document
95
- imagery. A typical screenpipe frame looks nothing like that:
96
-
97
- - A Slack channel sidebar with 8 names, 12 channel mentions, 3 emails,
98
- and 1 pasted AWS key — all in 1440×900 px at 14 px font.
99
- - A 1Password vault entry with structured `[Username | Password |
100
- Server | One-time password]` rows, half of which are masked dots.
101
- - A Cursor workspace open on `.env.production` with five secret-shaped
102
- values stacked top-to-bottom.
103
-
104
- These images are **dense** (10-20 PII spans per frame), **structured**
105
- (rows / columns / aligned chrome), and **layout-cued** (a thing in the
106
- "Username" cell is a username regardless of its surface text). A
107
- generic NER-on-OCR pipeline misfires by over-redacting UI chrome
108
- (48% false-fire on negatives in our bench, vs. 0% for this model).
109
-
110
- If you're building an **agentic system that reads screen state** — a
111
- desktop-control agent, a memory layer for browsing, anything that
112
- streams screen captures into an LLM — this is the redactor designed
113
- for that pipe.
114
 
115
  ## What it does
116
 
117
- Per-image **object detection**. Given a JPG or PNG, returns
118
- `[(bbox, label, score)]` where each detection is a region the model
119
- thinks is PII, classified into one of the 12 canonical categories
120
  shared with [`screenpipe/pii-redactor`](https://huggingface.co/screenpipe/pii-redactor):
121
 
122
  ```
@@ -126,170 +82,70 @@ private_channel, private_id, private_date, secret
126
  ```
127
 
128
  `secret` covers passwords, API keys, JWTs, DB connection strings,
129
- PRIVATE-KEY block markers, etc. — same coverage as the text model.
130
 
131
  ## Inference
132
 
133
  ```python
134
  # pip install onnxruntime pillow numpy
135
- import numpy as np
136
- import onnxruntime as ort
137
- from PIL import Image
138
 
139
- CLASSES = [
140
- "private_person", "private_email", "private_phone",
141
- "private_address", "private_url", "private_company",
142
- "private_repo", "private_handle", "private_channel",
143
- "private_id", "private_date", "secret",
144
- ]
145
- INPUT_SIZE = 320 # rfdetr_v8 was exported at 320x320
146
- THRESHOLD = 0.30
147
 
148
  sess = ort.InferenceSession(
149
- "rfdetr_v8.onnx",
150
  providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
151
  )
152
 
153
- img = Image.open("screenshot.png").convert("RGB")
154
- W, H = img.size
155
- resized = img.resize((INPUT_SIZE, INPUT_SIZE), Image.BILINEAR)
156
- arr = np.asarray(resized, dtype=np.float32) / 255.0
157
- arr = (arr - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
158
- arr = arr.transpose(2, 0, 1)[None].astype(np.float32) # NCHW
159
 
160
  boxes, logits = sess.run(None, {sess.get_inputs()[0].name: arr})
161
- boxes = boxes[0] # (300, 4) cx, cy, w, h normalized
162
- logits = logits[0] # (300, 13) last channel is "no-object"
163
-
164
- probs = 1.0 / (1.0 + np.exp(-logits[:, :12])) # per-class sigmoid
165
- best_class = probs.argmax(axis=1)
166
- best_score = probs[np.arange(300), best_class]
167
- keep = best_score >= THRESHOLD
168
 
169
- for q in np.where(keep)[0]:
170
      cx, cy, bw, bh = boxes[q]
171
-     x1 = (cx - bw / 2) * W
172
-     y1 = (cy - bh / 2) * H
173
-     print(f" {CLASSES[best_class[q]]:18} score={best_score[q]:.2f} "
174
-           f"bbox=[{int(x1)}, {int(y1)}, {int(bw*W)}, {int(bh*H)}]")
175
- ```
176
-
177
- Full example with image overlay → `examples/inference.py`.
178
-
179
- For Rust integration via the `ort` crate, see the
180
- [`rust_smoke/`](https://github.com/screenpipe/screenpipe-pii-bench-image/tree/main/rust_smoke)
181
- prototype and the production wiring in PR
182
- [`screenpipe/screenpipe#3188`](https://github.com/screenpipe/screenpipe/pull/3188).
183
-
184
- ## Redacting the image (vs. just detecting)
185
-
186
- This model **detects**. To actually remove the PII, draw a solid
187
- rectangle over each detected bbox. Solid black, **not blur** — blur
188
- is reversible by super-resolution attacks; opaque rectangles aren't.
189
-
190
- ```python
191
- from PIL import ImageDraw
192
- draw = ImageDraw.Draw(img)
193
- for det in detections: # from the snippet above
194
- x, y, w, h = det.bbox
195
- draw.rectangle([x, y, x + w, y + h], fill=(0, 0, 0))
196
  img.save("screenshot_redacted.png")
197
  ```
198
 
199
- That's the entire redactor wrapper. ~5 lines.
200
-
201
- ## Architecture
202
-
203
- - Base: [RF-DETR-Nano](https://github.com/roboflow/rf-detr) (Roboflow,
204
- ICLR 2026) — DINOv2-backbone real-time detection transformer, ~25 M
205
- params, claims first real-time model to break 60 mAP on COCO.
206
- - Fine-tuned at 320×320 input on a 2,833-image synthetic + WebPII
207
- union (synthetic via DOM-truth bbox extraction; WebPII via the
208
- [arxiv 2603.17357 release](https://arxiv.org/abs/2603.17357)).
209
- - Output head: 300 detection queries × 13 channels (12 PII classes +
210
- no-object). Per-class sigmoid (NOT softmax — RF-DETR uses
211
- independent classification per query).
212
- - Trained on a single A100 80 GB; ~100 minutes wall-clock for the
213
- best-EMA epoch.
214
-
215
- ## What was the training data
216
-
217
- | source | size | labels | notes |
218
- |---|---:|---|---|
219
- | **synthetic bench** | 2,206 imgs | DOM-truth bboxes (pixel-perfect) | 9 templates rendered via headless Chromium with `data-span` attributes — labels come from the same DOM tree the browser laid out. |
220
- | **WebPII** | 500 imgs (balanced sample) | bbox-labeled by the original authors | March 2026 release, e-commerce screenshots. Class-imbalance capped at 2× our synthetic frequency. |
221
- | **cascade auto-labels** | 100 imgs | OCR + text-PII model alignment | Old screenshots from this project's own bench, weakly labeled. |
222
-
223
- **No real user data was used during fine-tuning.** Membership
224
- inference attacks recover no real-user content because no real-user
225
- content was in the training set. If you discover a failure mode on
226
- your real screens, the project's recipe is to add a new SYNTHETIC
227
- template that reproduces it — the screenshot becomes a bug report,
228
- never a training row.
229
 
230
  ## Limitations
231
 
232
- 1. **Hand-curated gold set is small**: bench `data/` has 5
233
- manually-built cases. Larger-scale held-out evaluation depends on
234
- the synthetic corpus, which is in-distribution by construction.
235
- 2. **`private_handle` and `private_id` recall are 0%** in the
236
- reference numbers because the val split has only 2 and 1 examples
237
- respectively. Don't deploy without a domain-specific eval pass.
238
- 3. **Synthetic-template ceiling.** 95.3% zero-leak is the bench's
239
- stable ceiling at this corpus size. Gains beyond come from training
240
- on more real-screen failure modes (tracked in the bench's backlog).
241
- 4. **WebPII is e-commerce-heavy.** Adding the full WebPII split
242
- actually *hurt* dev-app accuracy in our experiments (rfdetr_v4 at
243
- 90.5% zero-leak vs. v8's 95.3%). The 500-image balanced sample is
244
- our best-of-both compromise.
245
- 5. **CPU-only floors at ~140 ms p50.** INT8 quantization (planned)
246
- gets that under 100 ms, but the FP32 release is what's on this
247
- page today.
248
- 6. **English-only.** Synthetic templates render Latin-script text;
249
- the WebPII supplement is English. CJK / Arabic / Cyrillic not
250
- evaluated — don't deploy without a locale-specific eval.
251
- 7. **Adversarial robustness not tested.** A user who knows the
252
- detector exists could craft layouts that confuse it (handwritten
253
- PII, embedded-image PII, partial occlusion). Use this for
254
- honest-user privacy, not as a security boundary.
255
-
256
- ## Files
257
-
258
- ```
259
- rfdetr_v8.onnx 108 MB · the model · sha256 below
260
- README.md this file
261
- LICENSE CC BY-NC 4.0
262
- NOTICE attribution to base model + datasets
263
- examples/
264
- inference.py the snippet above, runnable
265
- ```
266
-
267
- SHA-256 of `rfdetr_v8.onnx`:
268
- `431acc0f0beb22a39572b7a50af4fc446e799840fb71320dc124fbd79a121eb3`
269
-
270
- ## Reproducing inference
271
-
272
- ```bash
273
- git clone https://huggingface.co/screenpipe/pii-image-redactor
274
- cd pii-image-redactor
275
- git lfs pull
276
- pip install onnxruntime pillow numpy
277
- python examples/inference.py path/to/your_screenshot.png
278
- ```
279
-
280
- Reproducing the eval scores requires the screenpipe-pii-bench-image
281
- benchmark, which is not redistributed (it's the training corpus).
282
- Contact **louis@screenpi.pe** for benchmark access or commercial
283
- licensing.
284
 
285
  ## License
286
 
287
- [CC BY-NC 4.0](LICENSE) — non-commercial use only. The base model
288
- (RF-DETR) is Apache-2.0; obligations are preserved (see
289
- [`NOTICE`](NOTICE)).
290
 
291
- For commercial licensing (production deployment, redistribution
292
- rights, SaaS / API embedding, custom fine-tunes for your domain):
293
  **louis@screenpi.pe**.
294
 
295
  ## Citation
 
9
  - privacy
10
  - redaction
11
  - object-detection
 
12
  - screen-capture
 
13
  - computer-use
14
  - agentic
15
  - screenpipe
 
29
  > A [screenpipe](https://screenpi.pe) project. The image-modality
30
  > companion to [`screenpipe/pii-redactor`](https://huggingface.co/screenpipe/pii-redactor).
31
 
32
+ screenpipe's own **image PII detector**: it finds and boxes PII
33
+ *regions* directly in a screenshot, for the surfaces an AI agent sees a
34
+ user's machine through: screen captures, computer-use frames, and app
35
+ UIs (chat, terminals, settings panes, CRMs, browsers, password managers).
36
+ It is screenpipe's own model, trained in-house for this task. Returns
37
+ pixel-space bounding boxes for 12 canonical PII categories. ~109 MB ONNX;
38
+ the same file runs on macOS / Windows / Linux (CoreML / DirectML / CUDA /
39
+ CPU, selected at load time; no GPU vendor SDKs required at the consumer).
 
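The load-time provider selection described in the added paragraph could be sketched as follows. This is a minimal illustration, not the project's wiring: `pick_providers` and the `PREFERRED` ordering are hypothetical names and assumptions; only the provider strings and `onnxruntime.get_available_providers()` are part of ONNX Runtime.

```python
# Assumed preference order (hypothetical): accelerated providers first, CPU last.
PREFERRED = [
    "CoreMLExecutionProvider",  # macOS
    "DmlExecutionProvider",     # Windows / DirectML
    "CUDAExecutionProvider",    # Linux or Windows with NVIDIA
    "CPUExecutionProvider",     # baseline, always kept as fallback
]


def pick_providers(available):
    """Return the preferred providers that this ONNX Runtime build offers."""
    chosen = [p for p in PREFERRED if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")  # guarantee a CPU fallback
    return chosen


# Usage (requires onnxruntime and the model file from this repo):
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   sess = ort.InferenceSession("rfdetr_v11.onnx", providers=providers)
```

Passing a filtered list avoids warnings about providers the installed build does not include, while the same `.onnx` file is used on every platform.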
40
 
41
  > **License: CC BY-NC 4.0** (non-commercial). For commercial use —
42
  > production redaction, SaaS / API embedding, AI-agent privacy
 
45
 
46
  ## Headline numbers
47
 
48
+ On [**ScreenLeak**](https://github.com/screenpipe/screenleak),
49
+ PII-bearing screenshots, region match at IoU ≥ 0.30:
50
+
51
+ | Model | Region zero-leak | Oversmash |
52
+ |---|---:|---:|
53
+ | **this model** ⭐ local | **98.9%** | **0.0%** |
54
+ | Gemini 3.1 Pro | 4.2% | 9.7% |
55
+ | GPT-5.5 | 3.2% | 22.6% |
56
+ | Google Cloud DLP | 2.6% | 19.4% |
57
+ | Claude Opus 4.7 | 2.1% | 35.5% |
58
+ | Microsoft Presidio | 0.5% | 48.4% |
59
+
60
+ Frontier vision models can *name* what they see, but can't draw boxes
61
+ tight enough to count at IoU 0.30; a small specialized detector
62
+ pulls decisively ahead. ~120 ms p50 on Apple Silicon (CoreML). Full
63
+ methodology + confidence intervals:
64
+ [github.com/screenpipe/screenleak](https://github.com/screenpipe/screenleak).
65
+ **Try it in your browser:** [screenpipe.github.io/screenleak/demo](https://screenpipe.github.io/screenleak/demo/).
66
+
67
+ > **In-distribution caveat.** The headline is measured on a held-out
68
+ > split matched to the model's training conditions — an upper bound, not
69
+ > a real-screen guarantee. It is strongest on clean, standard app UIs;
70
+ > unusual or low-quality screens may be missed or over-boxed.
 
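To make the two table columns concrete, here is a rough sketch of how region zero-leak and oversmash could be computed at IoU ≥ 0.30. It assumes zero-leak is the share of PII-bearing images in which every gold box is matched by some prediction, and oversmash is the share of negative images with any prediction at all (the removed card glossed them as "every gold span caught" and "false-fire on negatives"). The function names and the `images` record layout are hypothetical; the ScreenLeak repository's own scoring is authoritative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_zero_leak(gold, pred, thr=0.30):
    """True when every gold box overlaps some predicted box at IoU >= thr."""
    return all(any(iou(g, p) >= thr for p in pred) for g in gold)


def score(images, thr=0.30):
    """Return (zero_leak_rate, oversmash_rate) over a list of image records."""
    pii = [im for im in images if im["gold"]]      # PII-bearing images
    neg = [im for im in images if not im["gold"]]  # hard negatives
    zero_leak = sum(is_zero_leak(im["gold"], im["pred"], thr) for im in pii) / len(pii) if pii else 0.0
    oversmash = sum(bool(im["pred"]) for im in neg) / len(neg) if neg else 0.0
    return zero_leak, oversmash


images = [
    {"gold": [(0, 0, 10, 10)], "pred": [(1, 1, 10, 10)]},                    # caught
    {"gold": [(0, 0, 10, 10), (20, 20, 30, 30)], "pred": [(0, 0, 10, 10)]},  # one box leaks
    {"gold": [], "pred": []},                                                # clean negative
    {"gold": [], "pred": [(0, 0, 5, 5)]},                                    # false fire
]
print(score(images))  # (0.5, 0.5)
```

Under this definition a single missed box makes the whole image count as a leak, which is why a model with loose boxes can score near zero even when it names the PII correctly.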
71
 
72
  ## What it does
73
 
74
+ Per-image object detection, returning `[(bbox, label, score)]`, where each
75
+ detection is a region classified into one of the 12 canonical categories
 
76
  shared with [`screenpipe/pii-redactor`](https://huggingface.co/screenpipe/pii-redactor):
77
 
78
  ```
 
82
  ```
83
 
84
  `secret` covers passwords, API keys, JWTs, DB connection strings,
85
+ PRIVATE-KEY block markers, etc.
86
 
87
  ## Inference
88
 
89
  ```python
90
  # pip install onnxruntime pillow numpy
91
+ import numpy as np, onnxruntime as ort
92
+ from PIL import Image, ImageDraw
 
93
 
94
+ CLASSES = ["private_person","private_email","private_phone","private_address",
95
+ "private_url","private_company","private_repo","private_handle",
96
+ "private_channel","private_id","private_date","secret"]
97
+ SIZE, THRESH = 512, 0.30
 
 
 
 
98
 
99
  sess = ort.InferenceSession(
100
+ "rfdetr_v11.onnx",
101
  providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
102
  )
103
 
104
+ img = Image.open("screenshot.png").convert("RGB"); W, H = img.size
105
+ arr = np.asarray(img.resize((SIZE, SIZE), Image.BILINEAR), np.float32) / 255.0
106
+ arr = ((arr - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]).transpose(2, 0, 1)[None].astype(np.float32)
 
 
 
107
 
108
  boxes, logits = sess.run(None, {sess.get_inputs()[0].name: arr})
109
+ boxes, logits = boxes[0], logits[0] # (Q,4) cxcywh normalized · (Q,13)
110
+ probs = 1.0 / (1.0 + np.exp(-logits[:, :12])) # per-class sigmoid (NOT softmax)
111
+ score = probs.max(1)
 
 
 
 
112
 
113
+ draw = ImageDraw.Draw(img) # redact = draw opaque boxes
114
+ for q in np.where(score >= THRESH)[0]:
115
      cx, cy, bw, bh = boxes[q]
116
+     x1, y1 = (cx - bw / 2) * W, (cy - bh / 2) * H
117
+     draw.rectangle([x1, y1, x1 + bw * W, y1 + bh * H], fill=(0, 0, 0))
118
  img.save("screenshot_redacted.png")
119
  ```
120
 
121
+ Use **solid black, not blur**: blur is reversible by super-resolution
122
+ attacks; opaque rectangles aren't. Output is 300 detection queries × 13
123
+ channels (12 PII classes + a no-object channel), per-class sigmoid.
124
+ Latest weights: `rfdetr_v11.onnx` (512×512 input).
 
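The output layout described above (queries × 13 channels, per-class sigmoid, normalized `cx, cy, w, h` boxes) can be exercised without the model file. The sketch below is an illustration under those stated shapes: `decode` is a hypothetical helper, and clamping boxes to the image bounds is an added assumption, not something the card specifies.

```python
import numpy as np


def decode(boxes, logits, width, height, thresh=0.30):
    """Convert raw outputs into pixel-space (x1, y1, x2, y2, score) tuples.

    boxes:  (Q, 4) normalized cx, cy, w, h
    logits: (Q, 13); the last channel is no-object and is dropped
    """
    probs = 1.0 / (1.0 + np.exp(-logits[:, :-1]))  # per-class sigmoid, not softmax
    score = probs.max(axis=1)                      # best class score per query
    out = []
    for q in np.where(score >= thresh)[0]:
        cx, cy, bw, bh = boxes[q]
        # Clamp to the image so a box near the edge stays drawable (assumption).
        x1 = max(0.0, float((cx - bw / 2) * width))
        y1 = max(0.0, float((cy - bh / 2) * height))
        x2 = min(float(width), float((cx + bw / 2) * width))
        y2 = min(float(height), float((cy + bh / 2) * height))
        out.append((x1, y1, x2, y2, float(score[q])))
    return out


# Synthetic example: two queries, only the first is confident.
demo_boxes = np.array([[0.5, 0.5, 0.2, 0.1], [0.1, 0.1, 0.4, 0.4]])
demo_logits = np.full((2, 13), -10.0)
demo_logits[0, 3] = 4.0
detections = decode(demo_boxes, demo_logits, width=1000, height=500)
```

The class index is deliberately not returned, so every region above the threshold gets redacted regardless of its predicted category.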
125
 
126
  ## Limitations
127
 
128
+ 1. **In-distribution headline.** 98.9% is the held-out ceiling under
129
+ matched conditions; real, unusual screens will score lower.
130
+ 2. **It's a localizer; don't filter on its class label.** It reliably
131
+ *finds* PII regions, but its per-region *category* prediction is not
132
+ reliable on out-of-distribution screens. Redact every detected region
133
+ rather than filtering by predicted class.
134
+ 3. **Synthetic training data only**, no real user data. Validate on your
135
+ screens before deploying.
136
+ 4. **English / Latin-script** evaluated; CJK / Arabic / Cyrillic not —
137
+ run a locale-specific eval first.
138
+ 5. **Not a security boundary.** Built for honest-user privacy; an
139
+ adversary who knows the detector exists can craft layouts to evade it
140
+ (handwritten, embedded-image, or occluded PII).
141
 
142
  ## License
143
 
144
+ [CC BY-NC 4.0](LICENSE) — non-commercial use only. See [`NOTICE`](NOTICE)
145
+ for third-party component attributions.
 
146
 
147
+ For commercial licensing (production deployment, redistribution rights,
148
+ SaaS / API embedding, custom fine-tunes for your domain):
149
  **louis@screenpi.pe**.
150
 
151
  ## Citation