rm0013 committed · Commit 17e302d · verified · 1 Parent(s): d2c5991

Model files

Files changed (5)
  1. README.md +92 -0
  2. classification_report.txt +59 -0
  3. config.json +182 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +16 -0
README.md ADDED
@@ -0,0 +1,92 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - ner
+ - pii
+ - pci
+ - token-classification
+ - roberta
+ datasets:
+ - ai4privacy/pii-masking-200k
+ metrics:
+ - f1
+ model-index:
+ - name: roberta-pii-ner-en
+   results:
+   - task:
+       type: token-classification
+       name: Named Entity Recognition
+     dataset:
+       name: ai4privacy/pii-masking-200k
+       type: ai4privacy/pii-masking-200k
+     metrics:
+     - type: f1
+       value: 0.95
+       name: Micro F1
+ ---
+
+ # roberta-pii-ner-en
+
+ Fine-tuned [roberta-base](https://huggingface.co/roberta-base) for detecting Personally Identifiable Information (PII) and Payment Card Industry (PCI) data in English text.
+
+ **GitHub:** [rakmohan/pii-ner-en](https://github.com/rakmohan/pii-ner-en)
+
+ ## Model Performance
+
+ | Metric | Score |
+ |--------|-------|
+ | Micro avg F1 | **0.95** |
+ | Macro avg F1 | 0.94 |
+ | Weighted avg F1 | 0.95 |
+
+ Per-entity metrics are available in `classification_report.txt`.
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ ner = pipeline(
+     "token-classification",
+     model="rm0013/roberta-pii-ner-en",
+     aggregation_strategy="simple"
+ )
+
+ result = ner("Send the invoice to john.smith@acme.com, card 4111-1111-1111-1111 CVV 123.")
+ for entity in result:
+     print(f"{entity['word']:30s} → {entity['entity_group']} ({entity['score']:.2f})")
+ ```
+
62
+ ## Supported Entities (54 types)
63
+
64
+ **PII:** `PERSON_NAME` `EMAIL` `PHONE_NUMBER` `SSN` `ADDRESS` `SECONDARYADDRESS` `DATE_OF_BIRTH` `DATE` `TIME` `AGE` `GENDER` `USERNAME` `PASSWORD` `IP_ADDRESS` `URL` `API_KEY` `PASSPORT_NUMBER` `DRIVER_LICENSE` `ORGANIZATION` `COMPANYNAME` `ACCOUNTNAME` `JOBAREA` `JOBTITLE` `JOBTYPE` `HEIGHT` `EYECOLOR` `ORDINALDIRECTION` `GPS_COORDINATES` `NEARBYGPSCOORDINATE` `USERAGENT` `DEVICE_ID` `VEHICLE_ID` `VEHICLEVIN` `VEHICLEVRM` `PHONEIMEI`
65
+
66
+ **PCI / Financial:** `CREDIT_CARD` `CREDIT_CARD_CVV` `CREDIT_CARD_EXPIRY` `PIN` `BANK_ACCOUNT` `BANK_ROUTING` `BIC` `AMOUNT` `CURRENCY` `CURRENCYCODE` `CURRENCYNAME` `CURRENCYSYMBOL` `MASKEDNUMBER` `BITCOINADDRESS` `ETHEREUMADDRESS` `LITECOINADDRESS`
67
+
68
+ ## Training Data
69
+
70
+ 1. [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k) — 200k annotated PII examples
71
+ 2. Synthetic PCI data generated with [Faker](https://faker.readthedocs.io/) for rare entities (CVV, PIN, routing numbers)
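The generation script for the synthetic PCI data is not included in this commit, so the following is only a rough sketch of how templated sentences with token-level BIO tags might be produced for rare entities. It swaps in the standard-library `random` module for Faker so that it runs without extra dependencies; the templates, the `make_example` helper, and the value formats are hypothetical. The tag names (`B-CREDIT_CARD_CVV`, `B-PIN`, `B-BANK_ROUTING`) match labels in this model's `config.json`.

```python
import random

# (template, entity label, value generator): all illustrative, not the author's templates.
TEMPLATES = [
    ("The CVV on my card is {value}", "CREDIT_CARD_CVV", lambda rng: f"{rng.randrange(1000):03d}"),
    ("Use PIN {value} at the ATM", "PIN", lambda rng: f"{rng.randrange(10000):04d}"),
    # Random 9-digit string; it is not checked against the ABA routing checksum.
    ("Wire it to routing number {value}", "BANK_ROUTING", lambda rng: f"{rng.randrange(10**9):09d}"),
]


def make_example(rng: random.Random):
    """Return (tokens, tags) for one synthetic sentence with a single tagged value."""
    template, label, generate = rng.choice(TEMPLATES)
    value = generate(rng)
    tokens, tags = [], []
    for word in template.split():
        if word == "{value}":
            tokens.append(value)
            tags.append(f"B-{label}")  # single-token entity, so only a B- tag is needed
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags


rng = random.Random(0)  # fixed seed so repeated runs produce the same examples
for _ in range(3):
    tokens, tags = make_example(rng)
    print(list(zip(tokens, tags)))
```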
+
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base model | roberta-base |
+ | Epochs | 10 (early stopping patience 3) |
+ | Batch size | 32 |
+ | Learning rate | 2e-5 |
+ | Max sequence length | 256 |
+ | Mixed precision | FP16 |
+
+ ## Limitations
+
+ - English text only
+ - Performance varies by entity type: the currency-related entities `CURRENCYNAME` (F1 0.11) and `CURRENCY` (F1 0.73) have lower accuracy due to limited training signal, and `DATE_OF_BIRTH` (F1 0.57), `MASKEDNUMBER` (F1 0.61), and `LITECOINADDRESS` (F1 0.78) also fall well below the 0.95 micro average
+ - Not tested on non-standard text formats (code, structured data)
+
+ ## License
+
+ MIT
classification_report.txt ADDED
@@ -0,0 +1,59 @@
+                      precision    recall  f1-score   support
+
+ ACCOUNTNAME               0.99      1.00      1.00      1257
+ ADDRESS                   0.98      0.95      0.96      7679
+ AGE                       0.95      0.95      0.95      1330
+ AMOUNT                    0.95      0.98      0.96       422
+ API_KEY                   1.00      1.00      1.00        89
+ BANK_ACCOUNT              0.99      1.00      0.99      2454
+ BANK_ROUTING              1.00      1.00      1.00       192
+ BIC                       0.98      1.00      0.99       337
+ BITCOINADDRESS            0.89      1.00      0.94      1097
+ COMPANYNAME               0.99      0.99      0.99      1229
+ CREDITCARDISSUER          0.98      1.00      0.99       645
+ CREDIT_CARD               0.81      0.90      0.85      2100
+ CREDIT_CARD_CVV           0.96      0.98      0.97       905
+ CREDIT_CARD_EXPIRY        1.00      1.00      1.00       357
+ CURRENCY                  0.60      0.94      0.73       764
+ CURRENCYCODE              0.88      0.84      0.86       297
+ CURRENCYNAME              0.35      0.06      0.11       392
+ CURRENCYSYMBOL            0.94      0.99      0.96      1207
+ DATE                      0.71      0.97      0.82      2153
+ DATE_OF_BIRTH             0.87      0.42      0.57      1455
+ DEVICE_ID                 1.00      1.00      1.00       545
+ DRIVER_LICENSE            0.98      1.00      0.99       107
+ EMAIL                     1.00      1.00      1.00      1905
+ ETHEREUMADDRESS           1.00      1.00      1.00       704
+ EYECOLOR                  0.95      0.99      0.97       464
+ GENDER                    0.99      0.99      0.99      1240
+ GPS_COORDINATES           1.00      1.00      1.00        59
+ HEIGHT                    0.98      0.99      0.98       471
+ IP_ADDRESS                1.00      1.00      1.00      3422
+ JOBAREA                   0.97      0.99      0.98      1305
+ JOBTITLE                  0.98      1.00      0.99      1292
+ JOBTYPE                   0.99      0.99      0.99      1266
+ LITECOINADDRESS           1.00      0.64      0.78       371
+ MASKEDNUMBER              0.69      0.54      0.61       952
+ NEARBYGPSCOORDINATE       1.00      1.00      1.00       911
+ ORDINALDIRECTION          0.98      1.00      0.99       599
+ ORGANIZATION              0.99      0.98      0.99       107
+ PASSPORT_NUMBER           1.00      1.00      1.00        91
+ PASSWORD                  1.00      1.00      1.00      1322
+ PERSON_NAME               0.98      0.99      0.98     12137
+ PHONEIMEI                 1.00      1.00      1.00       917
+ PHONE_NUMBER              0.98      0.99      0.99      1313
+ PIN                       0.86      0.96      0.91       655
+ SECONDARYADDRESS          0.98      1.00      0.99      1204
+ SEX                       1.00      1.00      1.00      1340
+ SSN                       0.99      0.98      0.99      1178
+ TIME                      0.91      1.00      0.95      1335
+ URL                       1.00      1.00      1.00      1188
+ USERAGENT                 1.00      1.00      1.00      1055
+ USERNAME                  1.00      0.99      0.99      1410
+ VEHICLEVIN                0.98      0.99      0.98       352
+ VEHICLEVRM                0.99      0.99      0.99       386
+ VEHICLE_ID                0.96      0.99      0.98        83
+
+ micro avg                 0.95      0.96      0.95     68047
+ macro avg                 0.94      0.94      0.94     68047
+ weighted avg              0.95      0.96      0.95     68047
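The report above is plain text in the layout of scikit-learn's `classification_report`, so it can be parsed to find weak entity types and to see how the macro and weighted averages relate to the per-entity rows. The sketch below is one possible way to do that, assuming whitespace-separated columns. The `parse_report` and `summarize` helpers and the 0.80 threshold are illustrative choices, and the embedded excerpt holds only three rows of the report, so its averages differ from the full-report figures.

```python
import re

# One report row: name, precision, recall, f1-score, support.
# A leading "+" is tolerated so rows copied straight from a diff also parse.
ROW = re.compile(r"^\+?\s*(.+?)\s+(\d\.\d+)\s+(\d\.\d+)\s+(\d\.\d+)\s+(\d+)\s*$")


def parse_report(report: str) -> dict:
    """Map each row name to its precision, recall, f1, and support."""
    rows = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:  # the header and blank lines do not match and are skipped
            name, p, r, f1, support = match.groups()
            rows[name] = {"precision": float(p), "recall": float(r),
                          "f1": float(f1), "support": int(support)}
    return rows


def summarize(rows: dict, threshold: float = 0.80):
    """Return (macro F1, support-weighted F1, entity names with F1 below threshold)."""
    entities = {k: v for k, v in rows.items() if not k.endswith(" avg")}
    total = sum(v["support"] for v in entities.values())
    macro = sum(v["f1"] for v in entities.values()) / len(entities)
    weighted = sum(v["f1"] * v["support"] for v in entities.values()) / total
    weak = sorted((n for n, v in entities.items() if v["f1"] < threshold),
                  key=lambda n: entities[n]["f1"])
    return macro, weighted, weak


excerpt = """\
              precision recall f1-score support
CURRENCYNAME 0.35 0.06 0.11 392
DATE_OF_BIRTH 0.87 0.42 0.57 1455
EMAIL 1.00 1.00 1.00 1905
micro avg 0.95 0.96 0.95 68047
"""
macro, weighted, weak = summarize(parse_report(excerpt))
print(f"macro F1 {macro:.2f}, weighted F1 {weighted:.2f}, below threshold: {weak}")
```

Because the report rounds every value to two decimals, averages recomputed this way can differ slightly from the ones printed in the report.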
config.json ADDED
@@ -0,0 +1,182 @@
+ {
+   "add_cross_attention": false,
+   "architectures": [
+     "RobertaForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "O",
+     "1": "B-ACCOUNTNAME",
+     "2": "B-ADDRESS",
+     "3": "B-AGE",
+     "4": "B-AMOUNT",
+     "5": "B-API_KEY",
+     "6": "B-BANK_ACCOUNT",
+     "7": "B-BANK_ROUTING",
+     "8": "B-BIC",
+     "9": "B-BITCOINADDRESS",
+     "10": "B-COMPANYNAME",
+     "11": "B-CREDITCARDISSUER",
+     "12": "B-CREDIT_CARD",
+     "13": "B-CREDIT_CARD_CVV",
+     "14": "B-CREDIT_CARD_EXPIRY",
+     "15": "B-CURRENCY",
+     "16": "B-CURRENCYCODE",
+     "17": "B-CURRENCYNAME",
+     "18": "B-CURRENCYSYMBOL",
+     "19": "B-DATE",
+     "20": "B-DATE_OF_BIRTH",
+     "21": "B-DEVICE_ID",
+     "22": "B-DRIVER_LICENSE",
+     "23": "B-EMAIL",
+     "24": "B-ETHEREUMADDRESS",
+     "25": "B-EYECOLOR",
+     "26": "B-GENDER",
+     "27": "B-GPS_COORDINATES",
+     "28": "B-HEIGHT",
+     "29": "B-IP_ADDRESS",
+     "30": "B-JOBAREA",
+     "31": "B-JOBTITLE",
+     "32": "B-JOBTYPE",
+     "33": "B-LITECOINADDRESS",
+     "34": "B-MASKEDNUMBER",
+     "35": "B-NEARBYGPSCOORDINATE",
+     "36": "B-ORDINALDIRECTION",
+     "37": "B-ORGANIZATION",
+     "38": "B-PASSPORT_NUMBER",
+     "39": "B-PASSWORD",
+     "40": "B-PERSON_NAME",
+     "41": "B-PHONEIMEI",
+     "42": "B-PHONE_NUMBER",
+     "43": "B-PIN",
+     "44": "B-SECONDARYADDRESS",
+     "45": "B-SEX",
+     "46": "B-SSN",
+     "47": "B-TIME",
+     "48": "B-URL",
+     "49": "B-USERAGENT",
+     "50": "B-USERNAME",
+     "51": "B-VEHICLEVIN",
+     "52": "B-VEHICLEVRM",
+     "53": "B-VEHICLE_ID",
+     "54": "I-ACCOUNTNAME",
+     "55": "I-ADDRESS",
+     "56": "I-AGE",
+     "57": "I-AMOUNT",
+     "58": "I-COMPANYNAME",
+     "59": "I-CURRENCY",
+     "60": "I-CURRENCYNAME",
+     "61": "I-DATE",
+     "62": "I-DATE_OF_BIRTH",
+     "63": "I-EYECOLOR",
+     "64": "I-GENDER",
+     "65": "I-GPS_COORDINATES",
+     "66": "I-HEIGHT",
+     "67": "I-JOBTITLE",
+     "68": "I-ORGANIZATION",
+     "69": "I-PERSON_NAME",
+     "70": "I-PHONE_NUMBER",
+     "71": "I-SECONDARYADDRESS",
+     "72": "I-SSN",
+     "73": "I-TIME",
+     "74": "I-USERAGENT"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "is_decoder": false,
+   "label2id": {
+     "B-ACCOUNTNAME": 1,
+     "B-ADDRESS": 2,
+     "B-AGE": 3,
+     "B-AMOUNT": 4,
+     "B-API_KEY": 5,
+     "B-BANK_ACCOUNT": 6,
+     "B-BANK_ROUTING": 7,
+     "B-BIC": 8,
+     "B-BITCOINADDRESS": 9,
+     "B-COMPANYNAME": 10,
+     "B-CREDITCARDISSUER": 11,
+     "B-CREDIT_CARD": 12,
+     "B-CREDIT_CARD_CVV": 13,
+     "B-CREDIT_CARD_EXPIRY": 14,
+     "B-CURRENCY": 15,
+     "B-CURRENCYCODE": 16,
+     "B-CURRENCYNAME": 17,
+     "B-CURRENCYSYMBOL": 18,
+     "B-DATE": 19,
+     "B-DATE_OF_BIRTH": 20,
+     "B-DEVICE_ID": 21,
+     "B-DRIVER_LICENSE": 22,
+     "B-EMAIL": 23,
+     "B-ETHEREUMADDRESS": 24,
+     "B-EYECOLOR": 25,
+     "B-GENDER": 26,
+     "B-GPS_COORDINATES": 27,
+     "B-HEIGHT": 28,
+     "B-IP_ADDRESS": 29,
+     "B-JOBAREA": 30,
+     "B-JOBTITLE": 31,
+     "B-JOBTYPE": 32,
+     "B-LITECOINADDRESS": 33,
+     "B-MASKEDNUMBER": 34,
+     "B-NEARBYGPSCOORDINATE": 35,
+     "B-ORDINALDIRECTION": 36,
+     "B-ORGANIZATION": 37,
+     "B-PASSPORT_NUMBER": 38,
+     "B-PASSWORD": 39,
+     "B-PERSON_NAME": 40,
+     "B-PHONEIMEI": 41,
+     "B-PHONE_NUMBER": 42,
+     "B-PIN": 43,
+     "B-SECONDARYADDRESS": 44,
+     "B-SEX": 45,
+     "B-SSN": 46,
+     "B-TIME": 47,
+     "B-URL": 48,
+     "B-USERAGENT": 49,
+     "B-USERNAME": 50,
+     "B-VEHICLEVIN": 51,
+     "B-VEHICLEVRM": 52,
+     "B-VEHICLE_ID": 53,
+     "I-ACCOUNTNAME": 54,
+     "I-ADDRESS": 55,
+     "I-AGE": 56,
+     "I-AMOUNT": 57,
+     "I-COMPANYNAME": 58,
+     "I-CURRENCY": 59,
+     "I-CURRENCYNAME": 60,
+     "I-DATE": 61,
+     "I-DATE_OF_BIRTH": 62,
+     "I-EYECOLOR": 63,
+     "I-GENDER": 64,
+     "I-GPS_COORDINATES": 65,
+     "I-HEIGHT": 66,
+     "I-JOBTITLE": 67,
+     "I-ORGANIZATION": 68,
+     "I-PERSON_NAME": 69,
+     "I-PHONE_NUMBER": 70,
+     "I-SECONDARYADDRESS": 71,
+     "I-SSN": 72,
+     "I-TIME": 73,
+     "I-USERAGENT": 74,
+     "O": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "tie_word_embeddings": true,
+   "transformers_version": "5.0.0",
+   "type_vocab_size": 1,
+   "use_cache": false,
+   "vocab_size": 50265
+ }
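The `id2label` map in this config holds one `O` label, 53 `B-` labels (ids 1 to 53), and 21 `I-` labels (ids 54 to 74), so the set of entity types and the subset that has no continuation tag can be derived from it directly. A small sketch of that derivation is shown below on a hand-picked excerpt of the map. The `entity_types` helper is a hypothetical name, and loading the real file with `json.load(open("config.json"))` assumes a local copy of it.

```python
import json

# Excerpt of the id2label map from config.json; the full map has 75 entries.
config_excerpt = json.loads("""
{"id2label": {"0": "O", "1": "B-ACCOUNTNAME", "2": "B-ADDRESS", "23": "B-EMAIL",
              "54": "I-ACCOUNTNAME", "55": "I-ADDRESS"}}
""")


def entity_types(id2label: dict):
    """Return (all entity types, entity types that have a B- tag but no I- tag)."""
    begin = {label[2:] for label in id2label.values() if label.startswith("B-")}
    inside = {label[2:] for label in id2label.values() if label.startswith("I-")}
    return sorted(begin), sorted(begin - inside)


all_types, no_inside_tag = entity_types(config_excerpt["id2label"])
print(all_types)       # entity types present in the excerpt
print(no_inside_tag)   # types that can only ever start a span
```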
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "add_prefix_space": true,
+   "backend": "tokenizers",
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "is_local": false,
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }