karmaUI committed on
Commit
463bf05
·
verified ·
1 Parent(s): 3e164eb

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +84 -0
README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ license: apache-2.0
+ language: en
+ tags:
+ - ner
+ - pii
+ - privacy
+ - token-classification
+ - deberta
+ - onnx
+ library_name: onnxruntime
+ pipeline_tag: token-classification
+ ---
+
+ # Shade V5 — On-Device PII Detection
+
+ A fast, accurate PII (personally identifiable information) detection model for privacy-preserving AI pipelines. It detects 12 entity types with a 97.6% F1 score in-distribution and 97.3% out-of-distribution.
+
+ ## Quick Start
+
+ ```bash
+ pip install veil-phantom
+ ```
+
+ ```python
+ from veil_phantom import VeilClient
+
+ veil = VeilClient()  # auto-downloads this model
+ result = veil.redact("John Smith sent $5M to john@acme.com")
+ result.sanitized  # "[PERSON_1] sent [AMOUNT_1] to [EMAIL_1]"
+ ```
+
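The `sanitized` string in the Quick Start uses numbered placeholders such as `[PERSON_1]`. As a rough illustration of that output format, the sketch below replaces already-detected character spans with per-type numbered placeholders. It is not VeilPhantom's implementation: `redact_spans`, the `Span` tuple layout, and the returned mapping are hypothetical names chosen for this example, and the spans are assumed not to overlap.

```python
from typing import Dict, List, Tuple

# Hypothetical span layout for this sketch: (start, end, entity_type), end exclusive.
Span = Tuple[int, int, str]


def redact_spans(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace each span with a numbered placeholder like [PERSON_1].

    Returns the redacted text and a placeholder -> original-text mapping.
    Assumes the spans do not overlap.
    """
    counters: Dict[str, int] = {}
    mapping: Dict[str, str] = {}
    pieces: List[str] = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        # Number placeholders per entity type, in order of appearance.
        counters[entity_type] = counters.get(entity_type, 0) + 1
        placeholder = f"[{entity_type}_{counters[entity_type]}]"
        mapping[placeholder] = text[start:end]
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # untouched tail
    return "".join(pieces), mapping


text = "John Smith sent $5M to john@acme.com"
# Spans are located by substring search here; a real pipeline would get them from the model.
spans = [
    (text.index(sub), text.index(sub) + len(sub), label)
    for sub, label in [("John Smith", "PERSON"), ("$5M", "AMOUNT"), ("john@acme.com", "EMAIL")]
]
sanitized, mapping = redact_spans(text, spans)
print(sanitized)
```

Keeping the placeholder-to-text mapping alongside the redacted string is one way a caller could restore the original values later, if that is needed.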
+ ## Model Details
+
+ | Property | Value |
+ |----------|-------|
+ | Architecture | DeBERTa-v3-xsmall |
+ | Parameters | 22M (backbone, excluding embeddings) |
+ | Format | ONNX |
+ | Size | 270 MB |
+ | Inference | < 50 ms on CPU |
+ | F1 (in-distribution) | 97.6% |
+ | F1 (out-of-distribution) | 97.3% |
+ | Task | BIO token classification |
+ | Labels | 25 (12 entity types × 2 for B-/I-, plus O) |
+
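The label count in the table follows from the BIO scheme: one `O` tag plus a `B-` and an `I-` tag for each of the 12 entity types. The sketch below builds such a label set and groups per-token tags into entity spans. The ordering of `ENTITY_TYPES` and the helper names are assumptions made for illustration; the model's actual label order is whatever `shade_label_map.json` defines.

```python
from typing import List, Tuple

# The 12 entity types named in this README; the order here is an assumption.
ENTITY_TYPES = ["PERSON", "ORG", "EMAIL", "PHONE", "MONEY", "DATE",
                "ADDRESS", "GOVID", "BANKACCT", "CARD", "IPADDR", "CASE"]


def build_bio_labels(entity_types: List[str]) -> List[str]:
    """Return "O" plus a B- and an I- tag for every entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


def decode_bio(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group per-token BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    current_type = None
    start = 0
    for i, tag in enumerate(tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "I" and entity == current_type:
            continue  # the open span keeps going
        if current_type is not None:
            spans.append((current_type, start, i))  # close the open span
            current_type = None
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated leniently as a new span.
            current_type = entity
            start = i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


labels = build_bio_labels(ENTITY_TYPES)
print(len(labels))
print(decode_bio(["B-PERSON", "I-PERSON", "O", "B-MONEY", "O", "B-EMAIL"]))
```

The spans are token indices; mapping them back to character offsets would need the tokenizer's offset information.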
+ ## Entity Types
+
+ | Type | F1 | Examples |
+ |------|-----|----------|
+ | PERSON | 96.3% | Names (Western, African, Asian, South African) |
+ | ORG | 97.6% | Companies, institutions |
+ | EMAIL | 100% | Email addresses |
+ | PHONE | 98.4% | Phone numbers (international formats) |
+ | MONEY | 99.6% | Monetary amounts |
+ | DATE | 97.8% | Dates, times, schedules |
+ | ADDRESS | 99.4% | Street addresses |
+ | GOVID | 97.7% | SSN, SA ID, passport |
+ | BANKACCT | 92.9% | Bank account numbers, IBAN |
+ | CARD | 100% | Credit/debit card numbers |
+ | IPADDR | 100% | IP addresses |
+ | CASE | 97.8% | Legal case numbers |
+
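To summarize the per-type scores in the table, the sketch below computes their unweighted (macro) mean and finds the weakest type. The README does not say how the 97.6% headline figure is aggregated. The unweighted mean of these rows is about 98.1%, so the headline is presumably weighted differently, for example by how often each type occurs. That reading is an assumption, not something the source states.

```python
from typing import Dict

# Per-type F1 scores (percent), copied from the Entity Types table.
PER_TYPE_F1: Dict[str, float] = {
    "PERSON": 96.3, "ORG": 97.6, "EMAIL": 100.0, "PHONE": 98.4,
    "MONEY": 99.6, "DATE": 97.8, "ADDRESS": 99.4, "GOVID": 97.7,
    "BANKACCT": 92.9, "CARD": 100.0, "IPADDR": 100.0, "CASE": 97.8,
}


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-type scores."""
    return sum(scores.values()) / len(scores)


def weakest_type(scores: Dict[str, float]) -> str:
    """Entity type with the lowest score."""
    return min(scores, key=scores.get)


print(f"macro F1: {macro_f1(PER_TYPE_F1):.3f}")
print(f"weakest type: {weakest_type(PER_TYPE_F1)}")
```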
+ ## Training
+
+ - **Base model**: microsoft/deberta-v3-xsmall
+ - **Training data**: 116K examples from business meetings, legal proceedings, and financial transactions
+ - **Tokenizer**: Unigram (128K vocab)
+ - **OOD gap**: 0.3 percentage points of F1 (97.6% → 97.3%)
+
+ ## Files
+
+ - `ShadeV5.onnx` — ONNX model (270 MB)
+ - `tokenizer.json` — HuggingFace fast tokenizer
+ - `tokenizer_config.json` — Tokenizer configuration
+ - `shade_label_map.json` — BIO label → entity type mapping
+
+ ## License
+
+ Apache 2.0
+
+ ## Part of VeilPhantom
+
+ This model powers [VeilPhantom](https://github.com/veil-privacy/veil-phantom), an open-source PII redaction SDK for agentic AI pipelines.