Token Classification
ONNX
Safetensors
English
bert
MikeG27 committed on
Commit 31a8d0e · verified · 1 Parent(s): 2a11785

Update README.md

Files changed (1)
  1. README.md +31 -10
README.md CHANGED
@@ -2,6 +2,8 @@
2
  license: apache-2.0
3
  datasets:
4
  - beki/privy
 
 
5
  language:
6
  - en
7
  base_model:
@@ -10,8 +12,17 @@ pipeline_tag: token-classification
10
  ---
11
  # gravitee-io/bert-small-pii-detection 🚀
12
 
13
- **A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the [`beki/privy`](https://huggingface.co/datasets/beki/privy) dataset.
14
- Compared to the [`bert-tiny`](https://huggingface.co/gravitee-io/bert-tiny-pii-detection) variant, this model is larger and slower, but significantly improves precision and recall across most entity types.
15
 
16
  ### Label Set
17
 
@@ -21,8 +32,6 @@ IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NU
21
  TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
22
  ```
23
 
24
- ---
25
-
26
  ## How to Use
27
 
28
  ### Quick start (pipeline)
@@ -39,11 +48,9 @@ text = ""
39
  pipe(text)
40
  ```
41
 
42
- ---
43
 
44
  ## Evaluation
45
 
46
- **Test set:** `beki/privy` held-out split
47
  **Metric:** precision / recall / F1 per entity, micro/macro averages
48
 
49
  | Entity | Precision | Recall | F1-score | Support |
@@ -77,14 +84,12 @@ pipe(text)
77
  | **weighted avg** | 0.8785 | 0.8141 | 0.8446 | 19532 |
78
 
79
 
80
- ---
81
-
82
  ## Intended Uses & Limitations
83
 
84
  **Use this model for:**
85
 
 
86
  * Redacting PII in customer support logs, dev/test environments, API traces and articles
87
- * Pre-screening documents before storage or external sharing
88
  * Real-time hints in form fields or data entry systems
89
 
90
  **Limitations:**
@@ -92,7 +97,6 @@ pipe(text)
92
  * English-focused; accuracy on other languages will degrade
93
  * Domain drift is real: audit on your own data
94
 
95
-
96
  ---
97
 
98
  ## Citation
@@ -125,6 +129,7 @@ If you use the model, please consider citing the papers:
125
  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
126
  biburl = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
127
  bibsource = {dblp computer science bibliography, https://dblp.org}
 
128
 
129
  @online{WinNT,
130
  author = {Benjamin Kilimnik},
@@ -132,4 +137,20 @@ If you use the model, please consider citing the papers:
132
  year = 2022,
133
  url = {https://huggingface.co/datasets/beki/privy},
134
  }
135
  ```
 
2
  license: apache-2.0
3
  datasets:
4
  - beki/privy
5
+ - gretelai/synthetic_pii_finance_multilingual
6
+ - eriktks/conll2003
7
  language:
8
  - en
9
  base_model:
 
12
  ---
13
  # gravitee-io/bert-small-pii-detection 🚀
14
 
15
+ **A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the datasets listed in the metadata.
16
+
17
+
18
+ ### About the datasets
19
+
20
+ We combined several datasets to cover a wide range of document formats, such as:
21
+ 1. JSON
22
+ 2. HTML
23
+ 3. XML
24
+ 4. SQL
25
+ 5. Documents
26
 
27
  ### Label Set
28
 
 
32
  TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
33
  ```
34
 
 
 
35
  ## How to Use
36
 
37
  ### Quick start (pipeline)
 
48
  pipe(text)
49
  ```
50
 
 
51
 
52
  ## Evaluation
53
 
 
54
  **Metric:** precision / recall / F1 per entity, micro/macro averages
55
 
56
  | Entity | Precision | Recall | F1-score | Support |
 
84
  | **weighted avg** | 0.8785 | 0.8141 | 0.8446 | 19532 |
85
 
86
 
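To make the averaging rows in the table easier to interpret, here is a minimal sketch of how micro, macro, and support-weighted averages differ. It is not the evaluation script used for this model, which the README does not show. The helper names `prf` and `averages` are hypothetical, and the counts are made up for illustration.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def averages(counts: Dict[str, Counts]) -> Dict[str, Tuple[float, ...]]:
    """Hypothetical helper: micro, macro and weighted averages over entities."""
    if not counts:
        raise ValueError("need at least one entity type")
    per_entity = {name: prf(*c) for name, c in counts.items()}
    # Micro: pool the counts of all entity types, then score once.
    micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: unweighted mean of the per-entity scores.
    macro = tuple(
        sum(s[i] for s in per_entity.values()) / len(per_entity) for i in range(3)
    )
    # Weighted: mean of per-entity scores weighted by support (tp + fn).
    support = {name: c[0] + c[2] for name, c in counts.items()}
    total = sum(support.values())
    weighted = tuple(
        sum(per_entity[n][i] * support[n] for n in counts) / total for i in range(3)
    )
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Made-up counts, NOT taken from the table above.
example = {"PERSON": (8, 2, 2), "URL": (1, 0, 3)}
for name, scores in averages(example).items():
    print(name, [round(x, 4) for x in scores])
```

Rare entity types pull the macro average down more than the weighted one, which is why the two rows can differ noticeably on an imbalanced label set.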
 
 
87
  ## Intended Uses & Limitations
88
 
89
  **Use this model for:**
90
 
91
+ * **Low-resource environments**
92
  * Redacting PII in customer support logs, dev/test environments, API traces and articles
 
93
  * Real-time hints in form fields or data entry systems
94
 
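The redaction use case listed above can be sketched as follows. This is a minimal sketch, not the authors' implementation: `redact` is a hypothetical helper, and it assumes predictions shaped like the output of a `transformers` token-classification pipeline run with an aggregation strategy (dicts with `entity_group`, `start`, and `end`). The spans below are hand-written for illustration and are not model output.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each predicted span with a [LABEL] placeholder (hypothetical helper)."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written spans for illustration only; real spans would come from the model.
sample = "Contact Jane Doe at 10.0.0.1"
spans = [
    {"entity_group": "PERSON", "start": 8, "end": 16},
    {"entity_group": "IP_ADDRESS", "start": 20, "end": 28},
]
print(redact(sample, spans))
```

The sketch assumes non-overlapping spans; overlapping predictions would need to be merged first.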
95
  **Limitations:**
 
97
  * English-focused; accuracy on other languages will degrade
98
  * Domain drift is real: audit on your own data
99
 
 
100
  ---
101
 
102
  ## Citation
 
129
  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
130
  biburl = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
131
  bibsource = {dblp computer science bibliography, https://dblp.org}
132
+ }
133
 
134
  @online{WinNT,
135
  author = {Benjamin Kilimnik},
 
137
  year = 2022,
138
  url = {https://huggingface.co/datasets/beki/privy},
139
  }
140
+
141
+ @online{gretel2023,
142
+ author = {Gretel.ai},
143
+ title = {{Synthetic PII Finance Multilingual Dataset}},
144
+ year = 2023,
145
+ url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
146
+ }
147
+
148
+ @inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
149
+ title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
150
+ author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
151
+ booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
152
+ year = "2003",
153
+ url = "https://aclanthology.org/W03-0419",
154
+ }
156
  ```