barflyman committed · Commit ef73998 · verified · 1 Parent(s): 8ce95a7

Update README.md

Files changed (1)
  1. README.md +154 -155
README.md CHANGED
@@ -1,156 +1,155 @@
- ---
- license: apache-2.0
- datasets:
- - beki/privy
- - gretelai/synthetic_pii_finance_multilingual
- - eriktks/conll2003
- language:
- - en
- base_model:
- - prajjwal1/bert-small
- pipeline_tag: token-classification
- ---
- # gravitee-io/bert-small-pii-detection 🚀
-
- **A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the datasets listed in the metadata.
-
-
- ### About the dataset:
-
- We combined several datasets to cover a wide range of document formats, such as:
- 1. JSON,
- 2. HTML,
- 3. XML,
- 4. SQL,
- 5. Documents
-
- ### Label Set
-
- ```
- AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL, IBAN_CODE, IMEI,
- IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER,
- TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
- ```
-
- ## How to Use
-
- ### Quick start (pipeline)
-
- ```python
- from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
-
- repo = "gravitee-io/bert-small-pii-detection"
- tok = AutoTokenizer.from_pretrained(repo)
- model = AutoModelForTokenClassification.from_pretrained(repo)
-
- pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
- text = ""
- pipe(text)
- ```
-
-
- ## Evaluation
-
- **Metric:** precision / recall / F1 per entity, micro/macro averages
-
- | Entity | Precision | Recall | F1-score | Support |
- |--------------------|-----------|--------|----------|---------|
- | AGE | 0.9898 | 0.8858 | 0.9349 | 219 |
- | COORDINATE | 0.9627 | 0.8738 | 0.9161 | 325 |
- | CREDIT_CARD | 0.9273 | 0.8870 | 0.9067 | 115 |
- | DATE_TIME | 0.8598 | 0.7364 | 0.7933 | 3255 |
- | EMAIL_ADDRESS | 0.9428 | 0.8941 | 0.9178 | 387 |
- | FINANCIAL | 0.9862 | 0.9565 | 0.9711 | 299 |
- | IBAN_CODE | 0.9577 | 0.9252 | 0.9412 | 147 |
- | IMEI | 0.9885 | 0.9663 | 0.9773 | 89 |
- | IP_ADDRESS | 0.9338 | 0.8812 | 0.9068 | 160 |
- | LOCATION | 0.8849 | 0.8222 | 0.8524 | 4264 |
- | MAC_ADDRESS | 0.9889 | 1.0000 | 0.9944 | 89 |
- | NRP | 1.0000 | 0.9818 | 0.9908 | 494 |
- | ORGANIZATION | 0.7454 | 0.6688 | 0.7051 | 3551 |
- | PASSWORD | 0.8384 | 0.8137 | 0.8259 | 102 |
- | PERSON | 0.9123 | 0.8826 | 0.8972 | 4454 |
- | PHONE_NUMBER | 0.9462 | 0.8199 | 0.8785 | 322 |
- | TITLE | 0.9887 | 0.9734 | 0.9810 | 451 |
- | URL | 1.0000 | 0.9787 | 0.9892 | 188 |
- | US_BANK_NUMBER | 1.0000 | 0.9579 | 0.9785 | 95 |
- | US_DRIVER_LICENSE | 0.9167 | 0.9167 | 0.9167 | 120 |
- | US_ITIN | 0.9659 | 0.8763 | 0.9189 | 97 |
- | US_LICENSE_PLATE | 1.0000 | 0.9000 | 0.9474 | 90 |
- | US_PASSPORT | 0.9200 | 0.9200 | 0.9200 | 100 |
- | US_SSN | 0.9744 | 0.9580 | 0.9661 | 119 |
- | **micro avg** | 0.8804 | 0.8141 | 0.8460 | 19532 |
- | **macro avg** | 0.9429 | 0.8948 | 0.9178 | 19532 |
- | **weighted avg** | 0.8785 | 0.8141 | 0.8446 | 19532 |
-
-
- ## Intended Uses & Limitations
-
- **Use this model for:**
-
- * **Low-resource environments**
- * Redacting PII in customer support logs, dev/test environments, API traces and articles
- * Real-time hints in form fields or data entry systems
-
- **Limitations:**
-
- * English-focused; accuracy will degrade on other languages
- * Domain drift is real: audit on your own data
-
- ---
-
- ## Citation
-
- If you use the model, please consider citing the papers:
-
- ```
- @misc{bhargava2021generalization,
- title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
- author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
- year={2021},
- eprint={2110.01518},
- archivePrefix={arXiv},
- primaryClass={cs.CL}
- }
-
- @article{DBLP:journals/corr/abs-1908-08962,
- author = {Iulia Turc and
- Ming{-}Wei Chang and
- Kenton Lee and
- Kristina Toutanova},
- title = {Well-Read Students Learn Better: The Impact of Student Initialization
- on Knowledge Distillation},
- journal = {CoRR},
- volume = {abs/1908.08962},
- year = {2019},
- url = {http://arxiv.org/abs/1908.08962},
- eprinttype = {arXiv},
- eprint = {1908.08962},
- timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
- biburl = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
- bibsource = {dblp computer science bibliography, https://dblp.org}
- }
-
- @online{WinNT,
- author = {Benjamin Kilimnik},
- title = {{Privy} Synthetic PII Protocol Trace Dataset},
- year = 2022,
- url = {https://huggingface.co/datasets/beki/privy},
- }
-
- @online{gretel2023,
- author = {Gretel.ai},
- title = {{Synthetic PII Finance Multilingual Dataset}},
- year = 2023,
- url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
- }
-
- @inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
- title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
- author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
- booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
- year = "2003",
- url = "https://aclanthology.org/W03-0419",
- }
- }
+ ---
+ license: apache-2.0
+ datasets:
+ - beki/privy
+ - gretelai/synthetic_pii_finance_multilingual
+ - eriktks/conll2003
+ language:
+ - en
+ base_model:
+ - prajjwal1/bert-small
+ pipeline_tag: token-classification
+ ---
+
+ **A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the datasets listed in the metadata.
+
+
+ ### About the dataset:
+
+ We combined several datasets to cover a wide range of document formats, such as:
+ 1. JSON,
+ 2. HTML,
+ 3. XML,
+ 4. SQL,
+ 5. Documents
+
+ ### Label Set
+
+ ```
+ AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL, IBAN_CODE, IMEI,
+ IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER,
+ TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
+ ```
+
+ ## How to Use
+
+ ### Quick start (pipeline)
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+
+ repo = "gravitee-io/bert-small-pii-detection"
+ tok = AutoTokenizer.from_pretrained(repo)
+ model = AutoModelForTokenClassification.from_pretrained(repo)
+
+ pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
+ text = ""
+ pipe(text)
+ ```
+
+
+ ## Evaluation
+
+ **Metric:** precision / recall / F1 per entity, micro/macro averages
+
+ | Entity | Precision | Recall | F1-score | Support |
+ |--------------------|-----------|--------|----------|---------|
+ | AGE | 0.9898 | 0.8858 | 0.9349 | 219 |
+ | COORDINATE | 0.9627 | 0.8738 | 0.9161 | 325 |
+ | CREDIT_CARD | 0.9273 | 0.8870 | 0.9067 | 115 |
+ | DATE_TIME | 0.8598 | 0.7364 | 0.7933 | 3255 |
+ | EMAIL_ADDRESS | 0.9428 | 0.8941 | 0.9178 | 387 |
+ | FINANCIAL | 0.9862 | 0.9565 | 0.9711 | 299 |
+ | IBAN_CODE | 0.9577 | 0.9252 | 0.9412 | 147 |
+ | IMEI | 0.9885 | 0.9663 | 0.9773 | 89 |
+ | IP_ADDRESS | 0.9338 | 0.8812 | 0.9068 | 160 |
+ | LOCATION | 0.8849 | 0.8222 | 0.8524 | 4264 |
+ | MAC_ADDRESS | 0.9889 | 1.0000 | 0.9944 | 89 |
+ | NRP | 1.0000 | 0.9818 | 0.9908 | 494 |
+ | ORGANIZATION | 0.7454 | 0.6688 | 0.7051 | 3551 |
+ | PASSWORD | 0.8384 | 0.8137 | 0.8259 | 102 |
+ | PERSON | 0.9123 | 0.8826 | 0.8972 | 4454 |
+ | PHONE_NUMBER | 0.9462 | 0.8199 | 0.8785 | 322 |
+ | TITLE | 0.9887 | 0.9734 | 0.9810 | 451 |
+ | URL | 1.0000 | 0.9787 | 0.9892 | 188 |
+ | US_BANK_NUMBER | 1.0000 | 0.9579 | 0.9785 | 95 |
+ | US_DRIVER_LICENSE | 0.9167 | 0.9167 | 0.9167 | 120 |
+ | US_ITIN | 0.9659 | 0.8763 | 0.9189 | 97 |
+ | US_LICENSE_PLATE | 1.0000 | 0.9000 | 0.9474 | 90 |
+ | US_PASSPORT | 0.9200 | 0.9200 | 0.9200 | 100 |
+ | US_SSN | 0.9744 | 0.9580 | 0.9661 | 119 |
+ | **micro avg** | 0.8804 | 0.8141 | 0.8460 | 19532 |
+ | **macro avg** | 0.9429 | 0.8948 | 0.9178 | 19532 |
+ | **weighted avg** | 0.8785 | 0.8141 | 0.8446 | 19532 |
+
+
+ ## Intended Uses & Limitations
+
+ **Use this model for:**
+
+ * **Low-resource environments**
+ * Redacting PII in customer support logs, dev/test environments, API traces and articles
+ * Real-time hints in form fields or data entry systems
+
+ **Limitations:**
+
+ * English-focused; accuracy will degrade on other languages
+ * Domain drift is real: audit on your own data
+
+ ---
+
+ ## Citation
+
+ If you use the model, please consider citing the papers:
+
+ ```
+ @misc{bhargava2021generalization,
+ title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
+ author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
+ year={2021},
+ eprint={2110.01518},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+
+ @article{DBLP:journals/corr/abs-1908-08962,
+ author = {Iulia Turc and
+ Ming{-}Wei Chang and
+ Kenton Lee and
+ Kristina Toutanova},
+ title = {Well-Read Students Learn Better: The Impact of Student Initialization
+ on Knowledge Distillation},
+ journal = {CoRR},
+ volume = {abs/1908.08962},
+ year = {2019},
+ url = {http://arxiv.org/abs/1908.08962},
+ eprinttype = {arXiv},
+ eprint = {1908.08962},
+ timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
+ biburl = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
+ }
+
+ @online{WinNT,
+ author = {Benjamin Kilimnik},
+ title = {{Privy} Synthetic PII Protocol Trace Dataset},
+ year = 2022,
+ url = {https://huggingface.co/datasets/beki/privy},
+ }
+
+ @online{gretel2023,
+ author = {Gretel.ai},
+ title = {{Synthetic PII Finance Multilingual Dataset}},
+ year = 2023,
+ url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
+ }
+
+ @inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
+ title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
+ author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
+ booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
+ year = "2003",
+ url = "https://aclanthology.org/W03-0419",
+ }
+ }
  ```
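
Note on the README's Quick start and the "Redacting PII" use case: the snippet in the README stops at `pipe(text)`. A minimal sketch of one way to turn that output into redacted text follows. It assumes the result format that `aggregation_strategy="simple"` produces (a list of dicts with `entity_group`, `start` and `end` character offsets). The `redact` helper and the hand-written `sample_entities` list are hypothetical and are not part of the model repository; real predictions and spans will differ.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is assumed to look like the output of a token-classification
    pipeline with aggregation_strategy="simple": dicts carrying
    `entity_group`, `start` and `end` character offsets.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + label + text[ent["end"]:]
    return text


# Hand-written stand-in for pipe(sample_text); not actual model output.
sample_text = "Contact Jane Doe at jane.doe@example.com"
sample_entities = [
    {"entity_group": "PERSON", "start": 8, "end": 16},
    {"entity_group": "EMAIL_ADDRESS", "start": 20, "end": 40},
]

redacted = redact(sample_text, sample_entities)
print(redacted)
```

With the real model, `sample_entities` would be replaced by `pipe(sample_text)`. Given the per-entity recall figures in the Evaluation table, spans can be missed, so redacted output should still be audited on your own data.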