v1.4: epoch 3 best checkpoint (F1=0.889, 41 entity types, 241K training examples)
- README.md +50 -144
- config.json +169 -180
- model.safetensors +2 -2
README.md
CHANGED
|
@@ -1,178 +1,84 @@
|
|
| 1 |
---
|
| 2 |
-
library_name: transformers
|
| 3 |
license: apache-2.0
|
| 4 |
language:
|
| 5 |
- en
|
| 6 |
tags:
|
| 7 |
-
- token-classification
|
| 8 |
-
- ner
|
| 9 |
- pii
|
| 10 |
- privacy
|
| 11 |
- deberta
|
| 12 |
-
- crf
|
| 13 |
datasets:
|
| 14 |
-
- ai4privacy/
|
| 15 |
- gretelai/gretel-pii-masking-en-v1
|
| 16 |
pipeline_tag: token-classification
|
| 17 |
model-index:
|
| 18 |
-
- name:
|
| 19 |
results:
|
| 20 |
- task:
|
| 21 |
type: token-classification
|
| 22 |
-
name:
|
| 23 |
metrics:
|
| 24 |
-
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
---
|
| 34 |
|
| 35 |
-
# DataFog PII-
|
| 36 |
|
| 37 |
-
A
|
| 38 |
-
|
| 39 |
-
**v1.3** is the fourth iteration, achieving the best overall F1 (0.9071) across all versions through early backbone freezing and progressive tier weight reduction.
|
| 40 |
-
|
| 41 |
-
## Model Details
|
| 42 |
-
|
| 43 |
-
| Property | Value |
|
| 44 |
-
|----------|-------|
|
| 45 |
-
| Architecture | DeBERTa-v3-xsmall + CharCNN + GatingFusion + CRF |
|
| 46 |
-
| Parameters | ~22.7M total |
|
| 47 |
-
| Labels | 89 BIO tags (44 entity types) |
|
| 48 |
-
| Max sequence length | 256 tokens |
|
| 49 |
-
| Training data | ~169K examples from 3 datasets (with Tier 1 oversampling) |
|
| 50 |
-
| Training hardware | NVIDIA H100 PCIe (80GB), BF16 mixed precision |
|
| 51 |
-
| Training time | 20 hours (10 epochs) |
|
| 52 |
-
| Framework | Transformers 4.49, PyTorch 2.7 |
|
| 53 |
|
| 54 |
## Architecture
|
| 55 |
|
| 56 |
|
|
|
|
| 57 |
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
### Tier 2 -- High Sensitivity (target: 0.95 recall)
|
| 66 |
-
Person, Email, Phone, Date of Birth, Street Address, IP Address
|
| 67 |
-
|
| 68 |
-
### Tier 3 -- Moderate Sensitivity (target: 0.90 recall)
|
| 69 |
-
Username, Date, Location, Organization, URL, License Plate, Age, Nationality, Gender, Ethnicity, Religion, Marital Status
|
| 70 |
-
|
| 71 |
-
### Tier 4 -- Domain-Specific (target: 0.85 recall)
|
| 72 |
-
Medical Record, Employee ID, Student ID, Account Number, PIN, Password, Biometric, Vehicle ID, Device ID, Crypto Wallet, IBAN, Swift Code, Insurance Number, Salary, Criminal Record, Political Affiliation, Sexual Orientation, Health Condition, Genetic Data, Trade Union
|
| 73 |
-
|
| 74 |
-
## Test Set Results
|
| 75 |
-
|
| 76 |
-
### Overall Metrics
|
| 77 |
-
|
| 78 |
-
| Metric | V1.3 | V1.2 | V1.1 | V1 |
|
| 79 |
-
|--------|------|------|------|-----|
|
| 80 |
-
| **Overall F1** | **0.9071** | 0.9005 | 0.9005 | 0.904 |
|
| 81 |
-
| Precision | 0.8981 | 0.9050 | 0.9062 | 0.907 |
|
| 82 |
-
| **Recall** | **0.9162** | 0.8960 | 0.8950 | 0.902 |
|
| 83 |
-
|
| 84 |
-
### Tier Recall
|
| 85 |
-
|
| 86 |
-
| Tier | V1.3 | V1.2 | Target | Status |
|
| 87 |
-
|------|------|------|--------|--------|
|
| 88 |
-
| Tier 1 (Critical) | 0.823 | 0.841 | 0.98 | FAIL |
|
| 89 |
-
| Tier 2 (High) | **0.945** | 0.936 | 0.95 | FAIL |
|
| 90 |
-
| Tier 3 (Moderate) | **0.930** | 0.911 | 0.90 | PASS |
|
| 91 |
-
| Tier 4 (Domain) | **0.868** | 0.845 | 0.85 | PASS |
|
| 92 |
-
|
| 93 |
-
### Per-Entity F1 (Top 20)
|
| 94 |
-
|
| 95 |
-
| Entity Type | F1 |
|
| 96 |
-
|-------------|------|
|
| 97 |
-
| URL | 0.994 |
|
| 98 |
-
| Biometric | 0.992 |
|
| 99 |
-
| IP Address | 0.988 |
|
| 100 |
-
| Date of Birth | 0.981 |
|
| 101 |
-
| Vehicle ID | 0.976 |
|
| 102 |
-
| Email | 0.968 |
|
| 103 |
-
| Phone | 0.966 |
|
| 104 |
-
| License Plate | 0.952 |
|
| 105 |
-
| Gender | 0.946 |
|
| 106 |
-
| Employee ID | 0.940 |
|
| 107 |
-
| IBAN | 0.935 |
|
| 108 |
-
| Username | 0.930 |
|
| 109 |
-
| SSN | 0.930 |
|
| 110 |
-
| Location | 0.929 |
|
| 111 |
-
| Account Number | 0.923 |
|
| 112 |
-
| Organization | 0.902 |
|
| 113 |
-
| Drivers License | 0.881 |
|
| 114 |
-
| Password | 0.880 |
|
| 115 |
-
| Date | 0.877 |
|
| 116 |
-
| Person | 0.875 |
|
| 117 |
-
|
| 118 |
-
## Training Details
|
| 119 |
-
|
| 120 |
-
### V1.3 Approach: Early Freeze + Progressive Tier Weights
|
| 121 |
-
|
| 122 |
-
Two key innovations based on learnings from V1-V1.2:
|
| 123 |
-
|
| 124 |
-
1. **Backbone freeze after epoch 3**: DeBERTa weights are frozen after epoch 3 to preserve clean representations before training instability occurs.
|
| 125 |
-
|
| 126 |
-
2. **Progressive tier weight reduction**: CRF loss weights start at 3x/2x/1.5x/1x (Tier 1-4) for epochs 1-2, then reduce to 2x/1.5x/1.25x/1x from epoch 3 onward. This limits gradient amplification buildup while giving a strong initial learning signal.
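The two V1.3 rules above (backbone freeze after epoch 3, tier weights that step down from epoch 3 onward) can be sketched as plain lookups. This is only an illustration: `tier_loss_weight` and `backbone_trainable` are hypothetical helper names, epochs are assumed to be 1-indexed, and the actual training code may be organised differently.

```python
# Hypothetical helpers illustrating the V1.3 schedule described above.
# Assumption: epochs are 1-indexed, tiers are numbered 1-4.
EARLY_WEIGHTS = {1: 3.0, 2: 2.0, 3: 1.5, 4: 1.0}    # epochs 1-2
LATE_WEIGHTS = {1: 2.0, 2: 1.5, 3: 1.25, 4: 1.0}    # epoch 3 onward


def tier_loss_weight(tier: int, epoch: int) -> float:
    """Return the CRF loss weight for a sensitivity tier at a given epoch."""
    if epoch < 1:
        raise ValueError("epoch is 1-indexed")
    table = EARLY_WEIGHTS if epoch <= 2 else LATE_WEIGHTS
    return table[tier]


def backbone_trainable(epoch: int, freeze_after: int = 3) -> bool:
    """The backbone is updated through `freeze_after`, then head-only training."""
    return epoch <= freeze_after
```

With 10 epochs this gives 3 full epochs and 7 head-only epochs, matching the hyperparameter table.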
|
| 127 |
-
|
| 128 |
-
### Hyperparameters
|
| 129 |
-
|
| 130 |
-
| Parameter | Value |
|
| 131 |
-
|-----------|-------|
|
| 132 |
-
| Backbone LR | 1e-5 (with AdamW eps=1.0) |
|
| 133 |
-
| Head LR | 1e-3 (100x faster) |
|
| 134 |
-
| LR Schedule | Cosine |
|
| 135 |
-
| Warmup | 500 steps |
|
| 136 |
-
| Epochs | 10 (3 full + 7 head-only) |
|
| 137 |
-
| Effective batch size | 32 (8 x 4 gradient accumulation) |
|
| 138 |
-
| Mixed precision | BF16 |
|
| 139 |
-
| Best checkpoint | Epoch 3 |
|
| 140 |
-
|
| 141 |
-
### Training Data
|
| 142 |
-
|
| 143 |
-
~169K examples from three open-licensed datasets:
|
| 144 |
-
- [AI4Privacy PII Dataset](https://huggingface.co/datasets/ai4privacy/internationalised_pii_dataset) (~43K English examples, Apache 2.0)
|
| 145 |
-
- [NVIDIA Nemotron PII](https://huggingface.co/datasets/ai4privacy/pii-masking-400k) (~100K examples, CC-BY-4.0)
|
| 146 |
-
- [Gretel PII Masking](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1) (~26K examples, Apache 2.0)
|
| 147 |
-
|
| 148 |
-
Tier 1 entity examples are oversampled 3x to address the 323x frequency imbalance between common entities (DATE: 170K) and rare critical entities (PASSPORT: 526).
|
| 149 |
-
|
| 150 |
-
## Version History
|
| 151 |
-
|
| 152 |
-
| Version | F1 | Tier 1 Recall | Key Change |
|
| 153 |
-
|---------|------|--------------|------------|
|
| 154 |
-
| V1 | 0.904 | 0.722 | Baseline |
|
| 155 |
-
| V1.1 | 0.9005 | 0.771 | Tier-weighted loss + oversampling |
|
| 156 |
-
| V1.2 | 0.9005 | 0.841 | Backbone freeze after epoch 4 |
|
| 157 |
-
| **V1.3** | **0.907** | 0.823 | Early freeze (epoch 3) + progressive tier weights |
|
| 158 |
|
| 159 |
-
##
|
| 160 |
|
| 161 |
-
-
|
| 162 |
-
-
|
| 163 |
-
-
|
| 164 |
-
-
|
| 165 |
-
-
|
|
|
|
| 166 |
|
| 167 |
-
##
|
| 168 |
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
- **WandB Run**: [V1.3 training metrics](https://wandb.ai/datafog/huggingface/runs/a66aw6sb)
|
| 172 |
|
| 173 |
-
|
|
|
|
| 174 |
|
|
|
|
|
|
|
| 175 |
|
|
|
|
|
|
|
| 176 |
|
| 177 |
## License
|
| 178 |
|
|
|
|
| 1 |
---
|
|
|
|
| 2 |
license: apache-2.0
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
tags:
|
|
|
|
|
|
|
| 6 |
- pii
|
| 7 |
+
- ner
|
| 8 |
+
- token-classification
|
| 9 |
- privacy
|
| 10 |
- deberta
|
|
|
|
| 11 |
datasets:
|
| 12 |
+
- ai4privacy/pii-masking-200k
|
| 13 |
+
- nvidia/Nemotron-PII
|
| 14 |
+
- gretelai/synthetic_pii_finance_multilingual
|
| 15 |
- gretelai/gretel-pii-masking-en-v1
|
| 16 |
+
metrics:
|
| 17 |
+
- f1
|
| 18 |
+
- precision
|
| 19 |
+
- recall
|
| 20 |
pipeline_tag: token-classification
|
| 21 |
model-index:
|
| 22 |
+
- name: pii-small-en
|
| 23 |
results:
|
| 24 |
- task:
|
| 25 |
type: token-classification
|
| 26 |
+
name: PII NER
|
| 27 |
metrics:
|
| 28 |
+
- name: F1
|
| 29 |
+
type: f1
|
| 30 |
+
value: 0.8894
|
| 31 |
+
- name: Precision
|
| 32 |
+
type: precision
|
| 33 |
+
value: 0.8698
|
| 34 |
+
- name: Recall
|
| 35 |
+
type: recall
|
| 36 |
+
value: 0.9098
|
| 37 |
---
|
| 38 |
|
| 39 |
+
# DataFog PII-Small-EN (v1.4)
|
| 40 |
|
| 41 |
+
A compact PII (Personally Identifiable Information) Named Entity Recognition model for English text.
|
| 42 |
|
| 43 |
## Architecture
|
| 44 |
|
| 45 |
+
- **Backbone:** DeBERTa-v3-xsmall (22M params)
|
| 46 |
+
- **Head:** CharCNN + CRF
|
| 47 |
+
- **Total params:** ~45M
|
| 48 |
+
- **Entity types:** 41 PII categories across 4 sensitivity tiers
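A CRF head picks the tag sequence with the best combined emission and transition score, rather than the best tag per token. The sketch below shows generic Viterbi decoding in plain Python on a toy three-tag set. It is not the model's own implementation, and the scores are made up for illustration.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    score = list(emissions[0])           # best score ending in each tag so far
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for arriving at tag j.
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
        score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final tag.
    best_last = max(range(num_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy tag set (illustrative only): 0 = O, 1 = B-X, 2 = I-X.
NEG = -1e4
toy_transitions = [
    [0.0, 0.0, NEG],   # O   -> I-X is heavily penalised
    [0.0, 0.0, 0.0],   # B-X -> anything
    [0.0, 0.0, 0.0],   # I-X -> anything
]
toy_emissions = [
    [1.0, 0.2, 0.0],   # position 0 prefers O
    [0.0, 0.4, 0.9],   # position 1 prefers I-X, which cannot follow O
    [0.0, 0.0, 1.0],   # position 2 prefers I-X
]
best_path = viterbi_decode(toy_emissions, toy_transitions)
```

A per-token argmax on these scores would give O, I-X, I-X, which is not a valid BIO sequence. The transition penalty steers the decoder to O, B-X, I-X instead.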
|
| 49 |
|
| 50 |
+
## Performance (v1.4, Epoch 3)
|
| 51 |
|
| 52 |
+
| Tier | Recall | Target |
|
| 53 |
+
|------|--------|--------|
|
| 54 |
+
| T1 (Critical: SSN, Credit Card, etc.) | 0.814 | >=0.98 |
|
| 55 |
+
| T2 (High: Person, Email, Phone, etc.) | 0.937 | >=0.95 |
|
| 56 |
+
| T3 (Moderate: Username, Date, Location, etc.) | 0.945 | >=0.90 |
|
| 57 |
+
| T4 (Domain-specific: Employee ID, Crypto, etc.) | 0.937 | >=0.85 |
|
| 58 |
+
| **Overall F1** | **0.889** | |
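The table can be checked mechanically. The sketch below compares each tier's recall with its target and recomputes F1 from the precision and recall values in the model-index metadata. The function names and the PASS/FAIL labels are illustrative, not part of the repository.

```python
# Values copied from the v1.4 table and the model-index metadata in this README.
TIER_RECALL = {"T1": 0.814, "T2": 0.937, "T3": 0.945, "T4": 0.937}
TIER_TARGET = {"T1": 0.98, "T2": 0.95, "T3": 0.90, "T4": 0.85}


def tier_status(recall: dict, target: dict) -> dict:
    """Mark each tier PASS if its recall meets the target, else FAIL."""
    return {tier: ("PASS" if recall[tier] >= target[tier] else "FAIL")
            for tier in target}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


status = tier_status(TIER_RECALL, TIER_TARGET)
overall_f1 = f1_score(precision=0.8698, recall=0.9098)
```

On these numbers, T3 and T4 meet their targets while T1 and T2 fall short, and the recomputed F1 agrees with the reported 0.8894 to within rounding.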
|
| 59 |
|
| 60 |
+
## Training Data
|
| 61 |
|
| 62 |
+
- AI4Privacy (~43K examples, English subset)
|
| 63 |
+
- NVIDIA Nemotron-PII (100K examples)
|
| 64 |
+
- Gretel Synthetic PII Finance (26K examples)
|
| 65 |
+
- Gretel PII Masking EN v1 (50K examples)
|
| 66 |
+
- Synthetic data for rare entity types (22K examples)
|
| 67 |
+
- **Total: ~241K examples**
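As a quick arithmetic check, the per-source counts listed above add up to the stated total. The sketch also derives each source's share of the mix. The dictionary keys are shortened labels chosen here for illustration.

```python
# Approximate example counts in thousands, as listed above.
DATASET_SIZES_K = {
    "AI4Privacy (English subset)": 43,
    "NVIDIA Nemotron-PII": 100,
    "Gretel Synthetic PII Finance": 26,
    "Gretel PII Masking EN v1": 50,
    "Synthetic rare-entity data": 22,
}

total_k = sum(DATASET_SIZES_K.values())
# Fraction of the training mix contributed by each source.
shares = {name: size / total_k for name, size in DATASET_SIZES_K.items()}
```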
|
| 68 |
|
| 69 |
+
## Entity Types (41)
|
| 70 |
|
| 71 |
+
### Tier 1 — Critical PII
|
| 72 |
+
SSN, CREDIT_CARD, BANK_ACCOUNT, PASSPORT_NUMBER, DRIVERS_LICENSE, TAX_ID
|
|
|
|
| 73 |
|
| 74 |
+
### Tier 2 — High Sensitivity
|
| 75 |
+
PERSON, EMAIL, PHONE, DATE_OF_BIRTH, STREET_ADDRESS, IP_ADDRESS
|
| 76 |
|
| 77 |
+
### Tier 3 — Moderate Sensitivity
|
| 78 |
+
USERNAME, DATE, LOCATION, ORGANIZATION, URL, LICENSE_PLATE, AGE, NATIONALITY, GENDER, RELIGION, MARITAL_STATUS
|
| 79 |
|
| 80 |
+
### Tier 4 — Domain-Specific
|
| 81 |
+
MEDICAL_RECORD, EMPLOYEE_ID, STUDENT_ID, ACCOUNT_NUMBER, PIN, PASSWORD, BIOMETRIC, VEHICLE_ID, DEVICE_ID, CRYPTO_WALLET, IBAN, SWIFT_CODE, INSURANCE_NUMBER, SALARY, CRIMINAL_RECORD, POLITICAL_AFFILIATION, SEXUAL_ORIENTATION, HEALTH_CONDITION
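The 41 entity types map to 83 BIO labels: one `O` tag plus a `B-` and `I-` tag per type. A minimal sketch follows, assuming labels are numbered in tier order exactly as listed above. `build_bio_labels` is a hypothetical helper, not a function from the repository.

```python
# Entity types per tier, copied from the lists above.
TIERS = {
    1: ["SSN", "CREDIT_CARD", "BANK_ACCOUNT", "PASSPORT_NUMBER",
        "DRIVERS_LICENSE", "TAX_ID"],
    2: ["PERSON", "EMAIL", "PHONE", "DATE_OF_BIRTH", "STREET_ADDRESS",
        "IP_ADDRESS"],
    3: ["USERNAME", "DATE", "LOCATION", "ORGANIZATION", "URL",
        "LICENSE_PLATE", "AGE", "NATIONALITY", "GENDER", "RELIGION",
        "MARITAL_STATUS"],
    4: ["MEDICAL_RECORD", "EMPLOYEE_ID", "STUDENT_ID", "ACCOUNT_NUMBER",
        "PIN", "PASSWORD", "BIOMETRIC", "VEHICLE_ID", "DEVICE_ID",
        "CRYPTO_WALLET", "IBAN", "SWIFT_CODE", "INSURANCE_NUMBER", "SALARY",
        "CRIMINAL_RECORD", "POLITICAL_AFFILIATION", "SEXUAL_ORIENTATION",
        "HEALTH_CONDITION"],
}


def build_bio_labels(tiers: dict) -> list:
    """Return ["O", "B-<type>", "I-<type>", ...] in tier order."""
    labels = ["O"]
    for tier in sorted(tiers):
        for entity in tiers[tier]:
            labels.append(f"B-{entity}")
            labels.append(f"I-{entity}")
    return labels


labels = build_bio_labels(TIERS)
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}
```

Built this way, the mapping has 83 entries and lines up with the `id2label` block in the updated `config.json` (for example `1 -> B-SSN` and `82 -> I-HEALTH_CONDITION`).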
|
| 82 |
|
| 83 |
## License
|
| 84 |
|
config.json
CHANGED
|
@@ -17,189 +17,178 @@
|
|
| 17 |
"char_vocab_size": 256,
|
| 18 |
"dropout": 0.1,
|
| 19 |
"id2label": {
|
| 20 |
-
"0": "
|
| 21 |
-
"1": "
|
| 22 |
-
"2": "
|
| 23 |
-
"3": "
|
| 24 |
-
"4": "
|
| 25 |
-
"5": "
|
| 26 |
-
"6": "
|
| 27 |
-
"7": "
|
| 28 |
-
"8": "
|
| 29 |
-
"9": "
|
| 30 |
-
"10": "
|
| 31 |
-
"11": "
|
| 32 |
-
"12": "
|
| 33 |
-
"13": "
|
| 34 |
-
"14": "
|
| 35 |
-
"15": "
|
| 36 |
-
"16": "
|
| 37 |
-
"17": "
|
| 38 |
-
"18": "
|
| 39 |
-
"19": "
|
| 40 |
-
"20": "
|
| 41 |
-
"21": "
|
| 42 |
-
"22": "
|
| 43 |
-
"23": "
|
| 44 |
-
"24": "
|
| 45 |
-
"25": "
|
| 46 |
-
"26": "
|
| 47 |
-
"27": "
|
| 48 |
-
"28": "
|
| 49 |
-
"29": "
|
| 50 |
-
"30": "
|
| 51 |
-
"31": "
|
| 52 |
-
"32": "
|
| 53 |
-
"33": "
|
| 54 |
-
"34": "
|
| 55 |
-
"35": "
|
| 56 |
-
"36": "
|
| 57 |
-
"37": "
|
| 58 |
-
"38": "
|
| 59 |
-
"39": "
|
| 60 |
-
"40": "
|
| 61 |
-
"41": "
|
| 62 |
-
"42": "
|
| 63 |
-
"43": "
|
| 64 |
-
"44": "
|
| 65 |
-
"45": "
|
| 66 |
-
"46": "
|
| 67 |
-
"47": "
|
| 68 |
-
"48": "
|
| 69 |
-
"49": "
|
| 70 |
-
"50": "
|
| 71 |
-
"51": "
|
| 72 |
-
"52": "
|
| 73 |
-
"53": "
|
| 74 |
-
"54": "
|
| 75 |
-
"55": "
|
| 76 |
-
"56": "
|
| 77 |
-
"57": "
|
| 78 |
-
"58": "
|
| 79 |
-
"59": "
|
| 80 |
-
"60": "
|
| 81 |
-
"61": "
|
| 82 |
-
"62": "
|
| 83 |
-
"63": "
|
| 84 |
-
"64": "
|
| 85 |
-
"65": "
|
| 86 |
-
"66": "
|
| 87 |
-
"67": "
|
| 88 |
-
"68": "
|
| 89 |
-
"69": "
|
| 90 |
-
"70": "
|
| 91 |
-
"71": "
|
| 92 |
-
"72": "
|
| 93 |
-
"73": "
|
| 94 |
-
"74": "
|
| 95 |
-
"75": "
|
| 96 |
-
"76": "
|
| 97 |
-
"77": "
|
| 98 |
-
"78": "
|
| 99 |
-
"79": "
|
| 100 |
-
"80": "
|
| 101 |
-
"81": "
|
| 102 |
-
"82": "
|
| 103 |
-
"83": "LABEL_83",
|
| 104 |
-
"84": "LABEL_84",
|
| 105 |
-
"85": "LABEL_85",
|
| 106 |
-
"86": "LABEL_86",
|
| 107 |
-
"87": "LABEL_87",
|
| 108 |
-
"88": "LABEL_88"
|
| 109 |
},
|
| 110 |
"label2id": {
|
| 111 |
-
"
|
| 112 |
-
"
|
| 113 |
-
"
|
| 114 |
-
"
|
| 115 |
-
"
|
| 116 |
-
"
|
| 117 |
-
"
|
| 118 |
-
"
|
| 119 |
-
"
|
| 120 |
-
"
|
| 121 |
-
"
|
| 122 |
-
"
|
| 123 |
-
"
|
| 124 |
-
"
|
| 125 |
-
"
|
| 126 |
-
"
|
| 127 |
-
"
|
| 128 |
-
"
|
| 129 |
-
"
|
| 130 |
-
"
|
| 131 |
-
"
|
| 132 |
-
"
|
| 133 |
-
"
|
| 134 |
-
"
|
| 135 |
-
"
|
| 136 |
-
"
|
| 137 |
-
"
|
| 138 |
-
"
|
| 139 |
-
"
|
| 140 |
-
"
|
| 141 |
-
"
|
| 142 |
-
"
|
| 143 |
-
"
|
| 144 |
-
"
|
| 145 |
-
"
|
| 146 |
-
"
|
| 147 |
-
"
|
| 148 |
-
"
|
| 149 |
-
"
|
| 150 |
-
"
|
| 151 |
-
"
|
| 152 |
-
"
|
| 153 |
-
"
|
| 154 |
-
"
|
| 155 |
-
"
|
| 156 |
-
"
|
| 157 |
-
"
|
| 158 |
-
"
|
| 159 |
-
"
|
| 160 |
-
"
|
| 161 |
-
"
|
| 162 |
-
"
|
| 163 |
-
"
|
| 164 |
-
"
|
| 165 |
-
"
|
| 166 |
-
"
|
| 167 |
-
"
|
| 168 |
-
"
|
| 169 |
-
"
|
| 170 |
-
"
|
| 171 |
-
"
|
| 172 |
-
"
|
| 173 |
-
"
|
| 174 |
-
"
|
| 175 |
-
"
|
| 176 |
-
"
|
| 177 |
-
"
|
| 178 |
-
"
|
| 179 |
-
"
|
| 180 |
-
"
|
| 181 |
-
"
|
| 182 |
-
"
|
| 183 |
-
"
|
| 184 |
-
"
|
| 185 |
-
"
|
| 186 |
-
"
|
| 187 |
-
"
|
| 188 |
-
"
|
| 189 |
-
"
|
| 190 |
-
"
|
| 191 |
-
"
|
| 192 |
-
"
|
| 193 |
-
"
|
| 194 |
-
"LABEL_84": 84,
|
| 195 |
-
"LABEL_85": 85,
|
| 196 |
-
"LABEL_86": 86,
|
| 197 |
-
"LABEL_87": 87,
|
| 198 |
-
"LABEL_88": 88,
|
| 199 |
-
"LABEL_9": 9
|
| 200 |
},
|
| 201 |
"max_char_len": 20,
|
| 202 |
"model_type": "pii_ner",
|
| 203 |
"torch_dtype": "float32",
|
| 204 |
-
"transformers_version": "4.
|
| 205 |
-
|
|
|
|
|
|
| 17 |
"char_vocab_size": 256,
|
| 18 |
"dropout": 0.1,
|
| 19 |
"id2label": {
|
| 20 |
+
"0": "O",
|
| 21 |
+
"1": "B-SSN",
|
| 22 |
+
"2": "I-SSN",
|
| 23 |
+
"3": "B-CREDIT_CARD",
|
| 24 |
+
"4": "I-CREDIT_CARD",
|
| 25 |
+
"5": "B-BANK_ACCOUNT",
|
| 26 |
+
"6": "I-BANK_ACCOUNT",
|
| 27 |
+
"7": "B-PASSPORT_NUMBER",
|
| 28 |
+
"8": "I-PASSPORT_NUMBER",
|
| 29 |
+
"9": "B-DRIVERS_LICENSE",
|
| 30 |
+
"10": "I-DRIVERS_LICENSE",
|
| 31 |
+
"11": "B-TAX_ID",
|
| 32 |
+
"12": "I-TAX_ID",
|
| 33 |
+
"13": "B-PERSON",
|
| 34 |
+
"14": "I-PERSON",
|
| 35 |
+
"15": "B-EMAIL",
|
| 36 |
+
"16": "I-EMAIL",
|
| 37 |
+
"17": "B-PHONE",
|
| 38 |
+
"18": "I-PHONE",
|
| 39 |
+
"19": "B-DATE_OF_BIRTH",
|
| 40 |
+
"20": "I-DATE_OF_BIRTH",
|
| 41 |
+
"21": "B-STREET_ADDRESS",
|
| 42 |
+
"22": "I-STREET_ADDRESS",
|
| 43 |
+
"23": "B-IP_ADDRESS",
|
| 44 |
+
"24": "I-IP_ADDRESS",
|
| 45 |
+
"25": "B-USERNAME",
|
| 46 |
+
"26": "I-USERNAME",
|
| 47 |
+
"27": "B-DATE",
|
| 48 |
+
"28": "I-DATE",
|
| 49 |
+
"29": "B-LOCATION",
|
| 50 |
+
"30": "I-LOCATION",
|
| 51 |
+
"31": "B-ORGANIZATION",
|
| 52 |
+
"32": "I-ORGANIZATION",
|
| 53 |
+
"33": "B-URL",
|
| 54 |
+
"34": "I-URL",
|
| 55 |
+
"35": "B-LICENSE_PLATE",
|
| 56 |
+
"36": "I-LICENSE_PLATE",
|
| 57 |
+
"37": "B-AGE",
|
| 58 |
+
"38": "I-AGE",
|
| 59 |
+
"39": "B-NATIONALITY",
|
| 60 |
+
"40": "I-NATIONALITY",
|
| 61 |
+
"41": "B-GENDER",
|
| 62 |
+
"42": "I-GENDER",
|
| 63 |
+
"43": "B-RELIGION",
|
| 64 |
+
"44": "I-RELIGION",
|
| 65 |
+
"45": "B-MARITAL_STATUS",
|
| 66 |
+
"46": "I-MARITAL_STATUS",
|
| 67 |
+
"47": "B-MEDICAL_RECORD",
|
| 68 |
+
"48": "I-MEDICAL_RECORD",
|
| 69 |
+
"49": "B-EMPLOYEE_ID",
|
| 70 |
+
"50": "I-EMPLOYEE_ID",
|
| 71 |
+
"51": "B-STUDENT_ID",
|
| 72 |
+
"52": "I-STUDENT_ID",
|
| 73 |
+
"53": "B-ACCOUNT_NUMBER",
|
| 74 |
+
"54": "I-ACCOUNT_NUMBER",
|
| 75 |
+
"55": "B-PIN",
|
| 76 |
+
"56": "I-PIN",
|
| 77 |
+
"57": "B-PASSWORD",
|
| 78 |
+
"58": "I-PASSWORD",
|
| 79 |
+
"59": "B-BIOMETRIC",
|
| 80 |
+
"60": "I-BIOMETRIC",
|
| 81 |
+
"61": "B-VEHICLE_ID",
|
| 82 |
+
"62": "I-VEHICLE_ID",
|
| 83 |
+
"63": "B-DEVICE_ID",
|
| 84 |
+
"64": "I-DEVICE_ID",
|
| 85 |
+
"65": "B-CRYPTO_WALLET",
|
| 86 |
+
"66": "I-CRYPTO_WALLET",
|
| 87 |
+
"67": "B-IBAN",
|
| 88 |
+
"68": "I-IBAN",
|
| 89 |
+
"69": "B-SWIFT_CODE",
|
| 90 |
+
"70": "I-SWIFT_CODE",
|
| 91 |
+
"71": "B-INSURANCE_NUMBER",
|
| 92 |
+
"72": "I-INSURANCE_NUMBER",
|
| 93 |
+
"73": "B-SALARY",
|
| 94 |
+
"74": "I-SALARY",
|
| 95 |
+
"75": "B-CRIMINAL_RECORD",
|
| 96 |
+
"76": "I-CRIMINAL_RECORD",
|
| 97 |
+
"77": "B-POLITICAL_AFFILIATION",
|
| 98 |
+
"78": "I-POLITICAL_AFFILIATION",
|
| 99 |
+
"79": "B-SEXUAL_ORIENTATION",
|
| 100 |
+
"80": "I-SEXUAL_ORIENTATION",
|
| 101 |
+
"81": "B-HEALTH_CONDITION",
|
| 102 |
+
"82": "I-HEALTH_CONDITION"
|
| 103 |
},
|
| 104 |
"label2id": {
|
| 105 |
+
"O": 0,
|
| 106 |
+
"B-SSN": 1,
|
| 107 |
+
"I-SSN": 2,
|
| 108 |
+
"B-CREDIT_CARD": 3,
|
| 109 |
+
"I-CREDIT_CARD": 4,
|
| 110 |
+
"B-BANK_ACCOUNT": 5,
|
| 111 |
+
"I-BANK_ACCOUNT": 6,
|
| 112 |
+
"B-PASSPORT_NUMBER": 7,
|
| 113 |
+
"I-PASSPORT_NUMBER": 8,
|
| 114 |
+
"B-DRIVERS_LICENSE": 9,
|
| 115 |
+
"I-DRIVERS_LICENSE": 10,
|
| 116 |
+
"B-TAX_ID": 11,
|
| 117 |
+
"I-TAX_ID": 12,
|
| 118 |
+
"B-PERSON": 13,
|
| 119 |
+
"I-PERSON": 14,
|
| 120 |
+
"B-EMAIL": 15,
|
| 121 |
+
"I-EMAIL": 16,
|
| 122 |
+
"B-PHONE": 17,
|
| 123 |
+
"I-PHONE": 18,
|
| 124 |
+
"B-DATE_OF_BIRTH": 19,
|
| 125 |
+
"I-DATE_OF_BIRTH": 20,
|
| 126 |
+
"B-STREET_ADDRESS": 21,
|
| 127 |
+
"I-STREET_ADDRESS": 22,
|
| 128 |
+
"B-IP_ADDRESS": 23,
|
| 129 |
+
"I-IP_ADDRESS": 24,
|
| 130 |
+
"B-USERNAME": 25,
|
| 131 |
+
"I-USERNAME": 26,
|
| 132 |
+
"B-DATE": 27,
|
| 133 |
+
"I-DATE": 28,
|
| 134 |
+
"B-LOCATION": 29,
|
| 135 |
+
"I-LOCATION": 30,
|
| 136 |
+
"B-ORGANIZATION": 31,
|
| 137 |
+
"I-ORGANIZATION": 32,
|
| 138 |
+
"B-URL": 33,
|
| 139 |
+
"I-URL": 34,
|
| 140 |
+
"B-LICENSE_PLATE": 35,
|
| 141 |
+
"I-LICENSE_PLATE": 36,
|
| 142 |
+
"B-AGE": 37,
|
| 143 |
+
"I-AGE": 38,
|
| 144 |
+
"B-NATIONALITY": 39,
|
| 145 |
+
"I-NATIONALITY": 40,
|
| 146 |
+
"B-GENDER": 41,
|
| 147 |
+
"I-GENDER": 42,
|
| 148 |
+
"B-RELIGION": 43,
|
| 149 |
+
"I-RELIGION": 44,
|
| 150 |
+
"B-MARITAL_STATUS": 45,
|
| 151 |
+
"I-MARITAL_STATUS": 46,
|
| 152 |
+
"B-MEDICAL_RECORD": 47,
|
| 153 |
+
"I-MEDICAL_RECORD": 48,
|
| 154 |
+
"B-EMPLOYEE_ID": 49,
|
| 155 |
+
"I-EMPLOYEE_ID": 50,
|
| 156 |
+
"B-STUDENT_ID": 51,
|
| 157 |
+
"I-STUDENT_ID": 52,
|
| 158 |
+
"B-ACCOUNT_NUMBER": 53,
|
| 159 |
+
"I-ACCOUNT_NUMBER": 54,
|
| 160 |
+
"B-PIN": 55,
|
| 161 |
+
"I-PIN": 56,
|
| 162 |
+
"B-PASSWORD": 57,
|
| 163 |
+
"I-PASSWORD": 58,
|
| 164 |
+
"B-BIOMETRIC": 59,
|
| 165 |
+
"I-BIOMETRIC": 60,
|
| 166 |
+
"B-VEHICLE_ID": 61,
|
| 167 |
+
"I-VEHICLE_ID": 62,
|
| 168 |
+
"B-DEVICE_ID": 63,
|
| 169 |
+
"I-DEVICE_ID": 64,
|
| 170 |
+
"B-CRYPTO_WALLET": 65,
|
| 171 |
+
"I-CRYPTO_WALLET": 66,
|
| 172 |
+
"B-IBAN": 67,
|
| 173 |
+
"I-IBAN": 68,
|
| 174 |
+
"B-SWIFT_CODE": 69,
|
| 175 |
+
"I-SWIFT_CODE": 70,
|
| 176 |
+
"B-INSURANCE_NUMBER": 71,
|
| 177 |
+
"I-INSURANCE_NUMBER": 72,
|
| 178 |
+
"B-SALARY": 73,
|
| 179 |
+
"I-SALARY": 74,
|
| 180 |
+
"B-CRIMINAL_RECORD": 75,
|
| 181 |
+
"I-CRIMINAL_RECORD": 76,
|
| 182 |
+
"B-POLITICAL_AFFILIATION": 77,
|
| 183 |
+
"I-POLITICAL_AFFILIATION": 78,
|
| 184 |
+
"B-SEXUAL_ORIENTATION": 79,
|
| 185 |
+
"I-SEXUAL_ORIENTATION": 80,
|
| 186 |
+
"B-HEALTH_CONDITION": 81,
|
| 187 |
+
"I-HEALTH_CONDITION": 82
|
| 188 |
},
|
| 189 |
"max_char_len": 20,
|
| 190 |
"model_type": "pii_ner",
|
| 191 |
"torch_dtype": "float32",
|
| 192 |
+
"transformers_version": "4.45.2",
|
| 193 |
+
"num_labels": 83
|
| 194 |
+
}
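With named labels in `id2label`, predicted ids can be turned into entity spans. The sketch below groups BIO tags into `(entity_type, start, end)` token spans. `bio_to_spans` is a hypothetical helper, only a subset of the mapping is reproduced, and the predicted ids are invented for the example.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    current_type: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open span, then open a new one (orphan I- tags
            # are treated leniently as span starts).
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type, start = tag[2:], i
        elif tag == "O":
            if current_type is not None:
                spans.append((current_type, start, i))
            current_type = None
        # Otherwise the tag is an I- tag continuing the open span.
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Subset of the id2label mapping from config.json above.
ID2LABEL_SUBSET = {0: "O", 13: "B-PERSON", 14: "I-PERSON",
                   15: "B-EMAIL", 16: "I-EMAIL"}
predicted_ids = [0, 13, 14, 0, 15, 16, 16]   # made-up model output
tags = [ID2LABEL_SUBSET[i] for i in predicted_ids]
spans = bio_to_spans(tags)
```

Treating an orphan `I-` tag as the start of a span is a lenient choice. A stricter decoder could discard such tags instead.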
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8dbbaa83a0b63f307452ad2f790640d8e3d42aee568feb806f7229e250616808
|
| 3 |
+
size 284495484
|