smoh committed on
Commit 98ca2c3 · verified · 1 parent: 6ba9c65

v1.4: epoch 3 best checkpoint (F1=0.889, 41 entity types, 241K training examples)

Files changed (3):
  1. README.md +50 -144
  2. config.json +169 -180
  3. model.safetensors +2 -2
README.md CHANGED
@@ -1,178 +1,84 @@
  ---
- library_name: transformers
  license: apache-2.0
  language:
  - en
  tags:
- - token-classification
- - ner
  - pii
  - privacy
  - deberta
- - crf
  datasets:
- - ai4privacy/internationalised_pii_dataset
  - gretelai/gretel-pii-masking-en-v1
  pipeline_tag: token-classification
  model-index:
- - name: datafog-pii-small-en
  results:
  - task:
  type: token-classification
- name: Named Entity Recognition
  metrics:
- - type: f1
- value: 0.9071
- name: Overall F1
- - type: precision
- value: 0.8981
- name: Overall Precision
- - type: recall
- value: 0.9162
- name: Overall Recall
  ---

- # DataFog PII-NER v1.3

- A lightweight token classification model for detecting **Personally Identifiable Information (PII)** in English text. Built on DeBERTa-v3-xsmall with character-level CNN features and a CRF decoding head for structured BIO tag prediction.
-
- **v1.3** is the fourth iteration, achieving the best overall F1 (0.9071) across all versions through early backbone freezing and progressive tier weight reduction.
-
- ## Model Details
-
- | Property | Value |
- |----------|-------|
- | Architecture | DeBERTa-v3-xsmall + CharCNN + GatingFusion + CRF |
- | Parameters | ~22.7M total |
- | Labels | 89 BIO tags (44 entity types) |
- | Max sequence length | 256 tokens |
- | Training data | ~169K examples from 3 datasets (with Tier 1 oversampling) |
- | Training hardware | NVIDIA H100 PCIe (80GB), BF16 mixed precision |
- | Training time | 20 hours (10 epochs) |
- | Framework | Transformers 4.49, PyTorch 2.7 |

  ## Architecture

- The **CharCNN** captures structural PII patterns (SSN: XXX-XX-XXXX, credit cards: XXXX-XXXX-XXXX-XXXX) while **DeBERTa** provides contextual understanding. The **gating fusion** dynamically weights character vs. contextual features per token. The **CRF head** enforces valid BIO tag sequences at the sequence level.
-
- ## Supported Entity Types (44 types, 4 tiers)
-
- ### Tier 1 -- Critical PII (target: 0.98 recall)
- SSN, Credit Card, Bank Account, Passport Number, Drivers License, Tax ID
-
- ### Tier 2 -- High Sensitivity (target: 0.95 recall)
- Person, Email, Phone, Date of Birth, Street Address, IP Address
-
- ### Tier 3 -- Moderate Sensitivity (target: 0.90 recall)
- Username, Date, Location, Organization, URL, License Plate, Age, Nationality, Gender, Ethnicity, Religion, Marital Status
-
- ### Tier 4 -- Domain-Specific (target: 0.85 recall)
- Medical Record, Employee ID, Student ID, Account Number, PIN, Password, Biometric, Vehicle ID, Device ID, Crypto Wallet, IBAN, Swift Code, Insurance Number, Salary, Criminal Record, Political Affiliation, Sexual Orientation, Health Condition, Genetic Data, Trade Union
-
- ## Test Set Results
-
- ### Overall Metrics
-
- | Metric | V1.3 | V1.2 | V1.1 | V1 |
- |--------|------|------|------|-----|
- | **Overall F1** | **0.9071** | 0.9005 | 0.9005 | 0.904 |
- | Precision | 0.8981 | 0.9050 | 0.9062 | 0.907 |
- | **Recall** | **0.9162** | 0.8960 | 0.8950 | 0.902 |
-
- ### Tier Recall
-
- | Tier | V1.3 | V1.2 | Target | Status |
- |------|------|------|--------|--------|
- | Tier 1 (Critical) | 0.823 | 0.841 | 0.98 | FAIL |
- | Tier 2 (High) | **0.945** | 0.936 | 0.95 | FAIL |
- | Tier 3 (Moderate) | **0.930** | 0.911 | 0.90 | PASS |
- | Tier 4 (Domain) | **0.868** | 0.845 | 0.85 | PASS |
-
- ### Per-Entity F1 (Top 20)
-
- | Entity Type | F1 |
- |-------------|------|
- | URL | 0.994 |
- | Biometric | 0.992 |
- | IP Address | 0.988 |
- | Date of Birth | 0.981 |
- | Vehicle ID | 0.976 |
- | Email | 0.968 |
- | Phone | 0.966 |
- | License Plate | 0.952 |
- | Gender | 0.946 |
- | Employee ID | 0.940 |
- | IBAN | 0.935 |
- | Username | 0.930 |
- | SSN | 0.930 |
- | Location | 0.929 |
- | Account Number | 0.923 |
- | Organization | 0.902 |
- | Drivers License | 0.881 |
- | Password | 0.880 |
- | Date | 0.877 |
- | Person | 0.875 |
-
- ## Training Details
-
- ### V1.3 Approach: Early Freeze + Progressive Tier Weights
-
- Two key innovations based on learnings from V1-V1.2:
-
- 1. **Backbone freeze after epoch 3**: DeBERTa weights are frozen after epoch 3 to preserve clean representations before training instability occurs.
-
- 2. **Progressive tier weight reduction**: CRF loss weights start at 3x/2x/1.5x/1x (Tier 1-4) for epochs 1-2, then reduce to 2x/1.5x/1.25x/1x from epoch 3 onward. This limits gradient amplification buildup while giving a strong initial learning signal.
-
- ### Hyperparameters
-
- | Parameter | Value |
- |-----------|-------|
- | Backbone LR | 1e-5 (with AdamW eps=1.0) |
- | Head LR | 1e-3 (100x faster) |
- | LR Schedule | Cosine |
- | Warmup | 500 steps |
- | Epochs | 10 (3 full + 7 head-only) |
- | Effective batch size | 32 (8 x 4 gradient accumulation) |
- | Mixed precision | BF16 |
- | Best checkpoint | Epoch 3 |
-
- ### Training Data
-
- ~169K examples from three open-licensed datasets:
- - [AI4Privacy PII Dataset](https://huggingface.co/datasets/ai4privacy/internationalised_pii_dataset) (~43K English examples, Apache 2.0)
- - [NVIDIA Nemotron PII](https://huggingface.co/datasets/ai4privacy/pii-masking-400k) (~100K examples, CC-BY-4.0)
- - [Gretel PII Masking](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1) (~26K examples, Apache 2.0)
-
- Tier 1 entity examples are oversampled 3x to address the 323x frequency imbalance between common entities (DATE: 170K) and rare critical entities (PASSPORT: 526).
-
- ## Version History
-
- | Version | F1 | Tier 1 Recall | Key Change |
- |---------|------|--------------|------------|
- | V1 | 0.904 | 0.722 | Baseline |
- | V1.1 | 0.9005 | 0.771 | Tier-weighted loss + oversampling |
- | V1.2 | 0.9005 | 0.841 | Backbone freeze after epoch 4 |
- | **V1.3** | **0.907** | 0.823 | Early freeze (epoch 3) + progressive tier weights |

- ## Limitations

- - Tier 1 recall (0.823) is below the 0.98 target -- critical PII types like Passport Number (only 526 training examples) remain challenging
- - 16 entity types have zero training examples (Nationality, Ethnicity, Religion, etc.) and cannot be detected
- - English-only
- - Max 256 tokens per input (longer documents need chunking)
- - Custom architecture requires the source code for loading

- ## Links

- - **Code**: [github.com/DataFog/datafog-labs](https://github.com/DataFog/datafog-labs)
- - **Training Chronicle**: [Full training log](https://github.com/DataFog/datafog-labs/blob/main/pii-ner-v1/docs/training_chronicle.md)
- - **WandB Run**: [V1.3 training metrics](https://wandb.ai/datafog/huggingface/runs/a66aw6sb)

- ## Citation
-

  ## License
 
  ---
  license: apache-2.0
  language:
  - en
  tags:
  - pii
+ - ner
+ - token-classification
  - privacy
  - deberta
  datasets:
+ - ai4privacy/pii-masking-200k
+ - nvidia/Nemotron-PII
+ - gretelai/synthetic_pii_finance_multilingual
  - gretelai/gretel-pii-masking-en-v1
+ metrics:
+ - f1
+ - precision
+ - recall
  pipeline_tag: token-classification
  model-index:
+ - name: pii-small-en
  results:
  - task:
  type: token-classification
+ name: PII NER
  metrics:
+ - name: F1
+ type: f1
+ value: 0.8894
+ - name: Precision
+ type: precision
+ value: 0.8698
+ - name: Recall
+ type: recall
+ value: 0.9098
  ---

+ # DataFog PII-Small-EN (v1.4)

+ A compact PII (Personally Identifiable Information) Named Entity Recognition model for English text.

  ## Architecture

+ - **Backbone:** DeBERTa-v3-xsmall (22M params)
+ - **Head:** CharCNN + CRF
+ - **Total params:** ~45M
+ - **Entity types:** 41 PII categories across 4 sensitivity tiers

+ ## Performance (v1.4, Epoch 3)

+ | Tier | Recall | Target |
+ |------|--------|--------|
+ | T1 (Critical: SSN, Credit Card, etc.) | 0.814 | >=0.98 |
+ | T2 (High: Person, Email, Phone, etc.) | 0.937 | >=0.95 |
+ | T3 (Moderate: Username, Date, Location, etc.) | 0.945 | >=0.90 |
+ | T4 (Domain-specific: Employee ID, Crypto, etc.) | 0.937 | >=0.85 |
+ | **Overall F1** | **0.889** | |

+ ## Training Data

+ - AI4Privacy (~43K examples, English subset)
+ - NVIDIA Nemotron-PII (100K examples)
+ - Gretel Synthetic PII Finance (26K examples)
+ - Gretel PII Masking EN v1 (50K examples)
+ - Synthetic data for rare entity types (22K examples)
+ - **Total: ~241K examples**

+ ## Entity Types (41)

+ ### Tier 1 — Critical PII
+ SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID

+ ### Tier 2 — High Sensitivity
+ PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS

+ ### Tier 3 — Moderate Sensitivity
+ USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS

+ ### Tier 4 — Domain-Specific
+ MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION

  ## License
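The model card above describes CRF-decoded BIO tagging over the entity types. As an illustration of what consuming that output looks like, here is a minimal, library-free sketch that collapses a BIO tag sequence into entity spans; the `decode_bio` helper and the example tokens are hypothetical, though the tag names follow the model's label set:

```python
def decode_bio(tokens, tags):
    """Collapse parallel token/BIO-tag lists into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], [token]]          # open a new span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)              # continue the open span
        else:                                     # "O", or an I- tag that doesn't continue
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["Contact", "Jane", "Doe", "at", "jane@example.com"]
tags = ["O", "B-PERSON", "I-PERSON", "O", "B-EMAIL"]
print(decode_bio(tokens, tags))  # [('PERSON', 'Jane Doe'), ('EMAIL', 'jane@example.com')]
```

The CRF head makes the `I-` continuation checks largely redundant at inference time (it only emits valid transitions), but a defensive decoder like this also handles tag sequences from non-CRF sources.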
config.json CHANGED
@@ -17,189 +17,178 @@
  "char_vocab_size": 256,
  "dropout": 0.1,
  "id2label": {
- "0": "LABEL_0",
- "1": "LABEL_1",
- "2": "LABEL_2",
- "3": "LABEL_3",
- "4": "LABEL_4",
- "5": "LABEL_5",
- "6": "LABEL_6",
- "7": "LABEL_7",
- "8": "LABEL_8",
- "9": "LABEL_9",
- "10": "LABEL_10",
- "11": "LABEL_11",
- "12": "LABEL_12",
- "13": "LABEL_13",
- "14": "LABEL_14",
- "15": "LABEL_15",
- "16": "LABEL_16",
- "17": "LABEL_17",
- "18": "LABEL_18",
- "19": "LABEL_19",
- "20": "LABEL_20",
- "21": "LABEL_21",
- "22": "LABEL_22",
- "23": "LABEL_23",
- "24": "LABEL_24",
- "25": "LABEL_25",
- "26": "LABEL_26",
- "27": "LABEL_27",
- "28": "LABEL_28",
- "29": "LABEL_29",
- "30": "LABEL_30",
- "31": "LABEL_31",
- "32": "LABEL_32",
- "33": "LABEL_33",
- "34": "LABEL_34",
- "35": "LABEL_35",
- "36": "LABEL_36",
- "37": "LABEL_37",
- "38": "LABEL_38",
- "39": "LABEL_39",
- "40": "LABEL_40",
- "41": "LABEL_41",
- "42": "LABEL_42",
- "43": "LABEL_43",
- "44": "LABEL_44",
- "45": "LABEL_45",
- "46": "LABEL_46",
- "47": "LABEL_47",
- "48": "LABEL_48",
- "49": "LABEL_49",
- "50": "LABEL_50",
- "51": "LABEL_51",
- "52": "LABEL_52",
- "53": "LABEL_53",
- "54": "LABEL_54",
- "55": "LABEL_55",
- "56": "LABEL_56",
- "57": "LABEL_57",
- "58": "LABEL_58",
- "59": "LABEL_59",
- "60": "LABEL_60",
- "61": "LABEL_61",
- "62": "LABEL_62",
- "63": "LABEL_63",
- "64": "LABEL_64",
- "65": "LABEL_65",
- "66": "LABEL_66",
- "67": "LABEL_67",
- "68": "LABEL_68",
- "69": "LABEL_69",
- "70": "LABEL_70",
- "71": "LABEL_71",
- "72": "LABEL_72",
- "73": "LABEL_73",
- "74": "LABEL_74",
- "75": "LABEL_75",
- "76": "LABEL_76",
- "77": "LABEL_77",
- "78": "LABEL_78",
- "79": "LABEL_79",
- "80": "LABEL_80",
- "81": "LABEL_81",
- "82": "LABEL_82",
- "83": "LABEL_83",
- "84": "LABEL_84",
- "85": "LABEL_85",
- "86": "LABEL_86",
- "87": "LABEL_87",
- "88": "LABEL_88"
  },
  "label2id": {
- "LABEL_0": 0,
- "LABEL_1": 1,
- "LABEL_10": 10,
- "LABEL_11": 11,
- "LABEL_12": 12,
- "LABEL_13": 13,
- "LABEL_14": 14,
- "LABEL_15": 15,
- "LABEL_16": 16,
- "LABEL_17": 17,
- "LABEL_18": 18,
- "LABEL_19": 19,
- "LABEL_2": 2,
- "LABEL_20": 20,
- "LABEL_21": 21,
- "LABEL_22": 22,
- "LABEL_23": 23,
- "LABEL_24": 24,
- "LABEL_25": 25,
- "LABEL_26": 26,
- "LABEL_27": 27,
- "LABEL_28": 28,
- "LABEL_29": 29,
- "LABEL_3": 3,
- "LABEL_30": 30,
- "LABEL_31": 31,
- "LABEL_32": 32,
- "LABEL_33": 33,
- "LABEL_34": 34,
- "LABEL_35": 35,
- "LABEL_36": 36,
- "LABEL_37": 37,
- "LABEL_38": 38,
- "LABEL_39": 39,
- "LABEL_4": 4,
- "LABEL_40": 40,
- "LABEL_41": 41,
- "LABEL_42": 42,
- "LABEL_43": 43,
- "LABEL_44": 44,
- "LABEL_45": 45,
- "LABEL_46": 46,
- "LABEL_47": 47,
- "LABEL_48": 48,
- "LABEL_49": 49,
- "LABEL_5": 5,
- "LABEL_50": 50,
- "LABEL_51": 51,
- "LABEL_52": 52,
- "LABEL_53": 53,
- "LABEL_54": 54,
- "LABEL_55": 55,
- "LABEL_56": 56,
- "LABEL_57": 57,
- "LABEL_58": 58,
- "LABEL_59": 59,
- "LABEL_6": 6,
- "LABEL_60": 60,
- "LABEL_61": 61,
- "LABEL_62": 62,
- "LABEL_63": 63,
- "LABEL_64": 64,
- "LABEL_65": 65,
- "LABEL_66": 66,
- "LABEL_67": 67,
- "LABEL_68": 68,
- "LABEL_69": 69,
- "LABEL_7": 7,
- "LABEL_70": 70,
- "LABEL_71": 71,
- "LABEL_72": 72,
- "LABEL_73": 73,
- "LABEL_74": 74,
- "LABEL_75": 75,
- "LABEL_76": 76,
- "LABEL_77": 77,
- "LABEL_78": 78,
- "LABEL_79": 79,
- "LABEL_8": 8,
- "LABEL_80": 80,
- "LABEL_81": 81,
- "LABEL_82": 82,
- "LABEL_83": 83,
- "LABEL_84": 84,
- "LABEL_85": 85,
- "LABEL_86": 86,
- "LABEL_87": 87,
- "LABEL_88": 88,
- "LABEL_9": 9
  },
  "max_char_len": 20,
  "model_type": "pii_ner",
  "torch_dtype": "float32",
- "transformers_version": "4.49.0"
- }
  "char_vocab_size": 256,
  "dropout": 0.1,
  "id2label": {
+ "0": "O",
+ "1": "B-SSN",
+ "2": "I-SSN",
+ "3": "B-CREDIT_CARD",
+ "4": "I-CREDIT_CARD",
+ "5": "B-BANK_ACCOUNT",
+ "6": "I-BANK_ACCOUNT",
+ "7": "B-PASSPORT_NUMBER",
+ "8": "I-PASSPORT_NUMBER",
+ "9": "B-DRIVERS_LICENSE",
+ "10": "I-DRIVERS_LICENSE",
+ "11": "B-TAX_ID",
+ "12": "I-TAX_ID",
+ "13": "B-PERSON",
+ "14": "I-PERSON",
+ "15": "B-EMAIL",
+ "16": "I-EMAIL",
+ "17": "B-PHONE",
+ "18": "I-PHONE",
+ "19": "B-DATE_OF_BIRTH",
+ "20": "I-DATE_OF_BIRTH",
+ "21": "B-STREET_ADDRESS",
+ "22": "I-STREET_ADDRESS",
+ "23": "B-IP_ADDRESS",
+ "24": "I-IP_ADDRESS",
+ "25": "B-USERNAME",
+ "26": "I-USERNAME",
+ "27": "B-DATE",
+ "28": "I-DATE",
+ "29": "B-LOCATION",
+ "30": "I-LOCATION",
+ "31": "B-ORGANIZATION",
+ "32": "I-ORGANIZATION",
+ "33": "B-URL",
+ "34": "I-URL",
+ "35": "B-LICENSE_PLATE",
+ "36": "I-LICENSE_PLATE",
+ "37": "B-AGE",
+ "38": "I-AGE",
+ "39": "B-NATIONALITY",
+ "40": "I-NATIONALITY",
+ "41": "B-GENDER",
+ "42": "I-GENDER",
+ "43": "B-RELIGION",
+ "44": "I-RELIGION",
+ "45": "B-MARITAL_STATUS",
+ "46": "I-MARITAL_STATUS",
+ "47": "B-MEDICAL_RECORD",
+ "48": "I-MEDICAL_RECORD",
+ "49": "B-EMPLOYEE_ID",
+ "50": "I-EMPLOYEE_ID",
+ "51": "B-STUDENT_ID",
+ "52": "I-STUDENT_ID",
+ "53": "B-ACCOUNT_NUMBER",
+ "54": "I-ACCOUNT_NUMBER",
+ "55": "B-PIN",
+ "56": "I-PIN",
+ "57": "B-PASSWORD",
+ "58": "I-PASSWORD",
+ "59": "B-BIOMETRIC",
+ "60": "I-BIOMETRIC",
+ "61": "B-VEHICLE_ID",
+ "62": "I-VEHICLE_ID",
+ "63": "B-DEVICE_ID",
+ "64": "I-DEVICE_ID",
+ "65": "B-CRYPTO_WALLET",
+ "66": "I-CRYPTO_WALLET",
+ "67": "B-IBAN",
+ "68": "I-IBAN",
+ "69": "B-SWIFT_CODE",
+ "70": "I-SWIFT_CODE",
+ "71": "B-INSURANCE_NUMBER",
+ "72": "I-INSURANCE_NUMBER",
+ "73": "B-SALARY",
+ "74": "I-SALARY",
+ "75": "B-CRIMINAL_RECORD",
+ "76": "I-CRIMINAL_RECORD",
+ "77": "B-POLITICAL_AFFILIATION",
+ "78": "I-POLITICAL_AFFILIATION",
+ "79": "B-SEXUAL_ORIENTATION",
+ "80": "I-SEXUAL_ORIENTATION",
+ "81": "B-HEALTH_CONDITION",
+ "82": "I-HEALTH_CONDITION"
  },
  "label2id": {
+ "O": 0,
+ "B-SSN": 1,
+ "I-SSN": 2,
+ "B-CREDIT_CARD": 3,
+ "I-CREDIT_CARD": 4,
+ "B-BANK_ACCOUNT": 5,
+ "I-BANK_ACCOUNT": 6,
+ "B-PASSPORT_NUMBER": 7,
+ "I-PASSPORT_NUMBER": 8,
+ "B-DRIVERS_LICENSE": 9,
+ "I-DRIVERS_LICENSE": 10,
+ "B-TAX_ID": 11,
+ "I-TAX_ID": 12,
+ "B-PERSON": 13,
+ "I-PERSON": 14,
+ "B-EMAIL": 15,
+ "I-EMAIL": 16,
+ "B-PHONE": 17,
+ "I-PHONE": 18,
+ "B-DATE_OF_BIRTH": 19,
+ "I-DATE_OF_BIRTH": 20,
+ "B-STREET_ADDRESS": 21,
+ "I-STREET_ADDRESS": 22,
+ "B-IP_ADDRESS": 23,
+ "I-IP_ADDRESS": 24,
+ "B-USERNAME": 25,
+ "I-USERNAME": 26,
+ "B-DATE": 27,
+ "I-DATE": 28,
+ "B-LOCATION": 29,
+ "I-LOCATION": 30,
+ "B-ORGANIZATION": 31,
+ "I-ORGANIZATION": 32,
+ "B-URL": 33,
+ "I-URL": 34,
+ "B-LICENSE_PLATE": 35,
+ "I-LICENSE_PLATE": 36,
+ "B-AGE": 37,
+ "I-AGE": 38,
+ "B-NATIONALITY": 39,
+ "I-NATIONALITY": 40,
+ "B-GENDER": 41,
+ "I-GENDER": 42,
+ "B-RELIGION": 43,
+ "I-RELIGION": 44,
+ "B-MARITAL_STATUS": 45,
+ "I-MARITAL_STATUS": 46,
+ "B-MEDICAL_RECORD": 47,
+ "I-MEDICAL_RECORD": 48,
+ "B-EMPLOYEE_ID": 49,
+ "I-EMPLOYEE_ID": 50,
+ "B-STUDENT_ID": 51,
+ "I-STUDENT_ID": 52,
+ "B-ACCOUNT_NUMBER": 53,
+ "I-ACCOUNT_NUMBER": 54,
+ "B-PIN": 55,
+ "I-PIN": 56,
+ "B-PASSWORD": 57,
+ "I-PASSWORD": 58,
+ "B-BIOMETRIC": 59,
+ "I-BIOMETRIC": 60,
+ "B-VEHICLE_ID": 61,
+ "I-VEHICLE_ID": 62,
+ "B-DEVICE_ID": 63,
+ "I-DEVICE_ID": 64,
+ "B-CRYPTO_WALLET": 65,
+ "I-CRYPTO_WALLET": 66,
+ "B-IBAN": 67,
+ "I-IBAN": 68,
+ "B-SWIFT_CODE": 69,
+ "I-SWIFT_CODE": 70,
+ "B-INSURANCE_NUMBER": 71,
+ "I-INSURANCE_NUMBER": 72,
+ "B-SALARY": 73,
+ "I-SALARY": 74,
+ "B-CRIMINAL_RECORD": 75,
+ "I-CRIMINAL_RECORD": 76,
+ "B-POLITICAL_AFFILIATION": 77,
+ "I-POLITICAL_AFFILIATION": 78,
+ "B-SEXUAL_ORIENTATION": 79,
+ "I-SEXUAL_ORIENTATION": 80,
+ "B-HEALTH_CONDITION": 81,
+ "I-HEALTH_CONDITION": 82
  },
  "max_char_len": 20,
  "model_type": "pii_ner",
  "torch_dtype": "float32",
+ "transformers_version": "4.45.2",
+ "num_labels": 83
+ }
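The new label maps follow a strict BIO scheme: one `O` tag plus a `B-`/`I-` pair for each of the 41 entity types, which is where the `"num_labels": 83` value comes from, and `label2id` is simply the inverse of `id2label`. A small sketch of that arithmetic and inversion (the two-type `id2label` excerpt here is a truncated stand-in for the full 83-entry map):

```python
# Excerpt of id2label from config.json (the full map has 83 entries, ids 0..82).
id2label = {0: "O", 1: "B-SSN", 2: "I-SSN", 3: "B-CREDIT_CARD", 4: "I-CREDIT_CARD"}

# label2id is the inverse mapping, as stored alongside id2label in the config.
label2id = {label: i for i, label in id2label.items()}
assert label2id["B-SSN"] == 1

# BIO arithmetic for the full config: 41 entity types -> 2 * 41 + 1 = 83 labels.
num_entity_types = 41
num_labels = 2 * num_entity_types + 1
print(num_labels)  # 83
```

Replacing the v1.3 `LABEL_0`..`LABEL_88` placeholders with real tag names means downstream code can read entity types straight from the config instead of carrying a separate mapping file.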
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:de2fdce39e2098e4936063aa68caa580739a451bfdb7988298d5870df64db5f9
- size 284508924
+ oid sha256:8dbbaa83a0b63f307452ad2f790640d8e3d42aee568feb806f7229e250616808
+ size 284495484
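`model.safetensors` is versioned as a Git LFS pointer: the `oid sha256:` line is the SHA-256 digest of the actual weights file and `size` is its byte count, so a download can be verified locally. A hedged sketch of that check (the `parse_lfs_pointer`/`verify` helpers are illustrative, and the demo hashes a tiny toy payload rather than the real checkpoint):

```python
import hashlib

def parse_lfs_pointer(text):
    """Extract the sha256 oid and byte size from a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return fields["oid"].removeprefix("sha256:"), int(fields["size"])

def verify(payload: bytes, oid: str, size: int) -> bool:
    """True if the payload matches the pointer's declared size and digest."""
    return len(payload) == size and hashlib.sha256(payload).hexdigest() == oid

# Toy pointer built around a 5-byte payload, mirroring the real pointer format.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:{}
size 5""".format(hashlib.sha256(b"hello").hexdigest())

oid, size = parse_lfs_pointer(pointer)
print(verify(b"hello", oid, size))  # True
```

For the actual checkpoint, the same check would hash the downloaded 284,495,484-byte file and compare it against the `8dbbaa83...` oid above.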