MaziyarPanahi committed on
Commit ff919cb · verified · 1 Parent(s): 5c0c8d8

Upload PII detection model OpenMed-PII-ClinicalE5-Base-109M-v1
README.md ADDED
---
language:
- en
license: apache-2.0
base_model: intfloat/e5-base-v2
tags:
- token-classification
- ner
- pii
- pii-detection
- de-identification
- privacy
- healthcare
- medical
- clinical
- phi
- hipaa
- pytorch
- transformers
- openmed
datasets:
- nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: transformers
metrics:
- f1
- precision
- recall
model-index:
- name: OpenMed-PII-ClinicalE5-Base-109M-v1
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: nvidia/Nemotron-PII (test_strat)
      type: nvidia/Nemotron-PII
      split: test
    metrics:
    - type: f1
      value: 0.9487
      name: F1 (micro)
    - type: precision
      value: 0.9520
      name: Precision
    - type: recall
      value: 0.9454
      name: Recall
widget:
- text: "Dr. Sarah Johnson (SSN: 123-45-6789) can be reached at sarah.johnson@hospital.org or 555-123-4567. She lives at 123 Oak Street, Boston, MA 02108."
  example_title: Clinical Note with PII
---

# OpenMed-PII-ClinicalE5-Base-109M-v1

**PII Detection Model** | 109M Parameters | Open Source

[![F1 Score](https://img.shields.io/badge/F1-94.87%25-brightgreen)]() [![Precision](https://img.shields.io/badge/Precision-95.20%25-blue)]() [![Recall](https://img.shields.io/badge/Recall-94.54%25-orange)]()

## Model Description

**OpenMed-PII-ClinicalE5-Base-109M-v1** is a transformer-based token classification model fine-tuned for **Personally Identifiable Information (PII) detection** in text. This model identifies and classifies **54 types of sensitive information**, including names, addresses, SSNs, medical record numbers, and more.

### Key Features

- **High Accuracy**: Achieves strong F1 scores across diverse PII categories
- **Comprehensive Coverage**: Detects 50+ entity types spanning personal, financial, medical, and contact information
- **Privacy-Focused**: Designed for de-identification and compliance with HIPAA, GDPR, and other privacy regulations
- **Production-Ready**: Optimized for real-world text processing pipelines

## Performance

Evaluated on a stratified 2,000-sample test set from NVIDIA Nemotron-PII:

| Metric | Score |
|:---|:---:|
| **Micro F1** | **0.9487** |
| Precision | 0.9520 |
| Recall | 0.9454 |
| Macro F1 | 0.9478 |
| Weighted F1 | 0.9478 |
| Accuracy | 0.9931 |

+
84
+ ### Top 10 PII Models
85
+
86
+ | Rank | Model | F1 | Precision | Recall |
87
+ |:---:|:---|:---:|:---:|:---:|
88
+ | 1 | [OpenMed-PII-SuperClinical-Large-434M-v1](https://huggingface.co/openmed/OpenMed-PII-SuperClinical-Large-434M-v1) | 0.9608 | 0.9685 | 0.9532 |
89
+ | 2 | [OpenMed-PII-BigMed-Large-560M-v1](https://huggingface.co/openmed/OpenMed-PII-BigMed-Large-560M-v1) | 0.9604 | 0.9644 | 0.9565 |
90
+ | 3 | [OpenMed-PII-EuroMed-210M-v1](https://huggingface.co/openmed/OpenMed-PII-EuroMed-210M-v1) | 0.9600 | 0.9681 | 0.9521 |
91
+ | 4 | [OpenMed-PII-SnowflakeMed-568M-v1](https://huggingface.co/openmed/OpenMed-PII-SnowflakeMed-568M-v1) | 0.9594 | 0.9640 | 0.9548 |
92
+ | 5 | [OpenMed-PII-SuperMedical-Large-355M-v1](https://huggingface.co/openmed/OpenMed-PII-SuperMedical-Large-355M-v1) | 0.9592 | 0.9632 | 0.9553 |
93
+ | 6 | [OpenMed-PII-ClinicalBGE-568M-v1](https://huggingface.co/openmed/OpenMed-PII-ClinicalBGE-568M-v1) | 0.9587 | 0.9636 | 0.9538 |
94
+ | 7 | [OpenMed-PII-mClinicalE5-Large-560M-v1](https://huggingface.co/openmed/OpenMed-PII-mClinicalE5-Large-560M-v1) | 0.9582 | 0.9631 | 0.9533 |
95
+ | 8 | [OpenMed-PII-ModernMed-Large-395M-v1](https://huggingface.co/openmed/OpenMed-PII-ModernMed-Large-395M-v1) | 0.9579 | 0.9639 | 0.9520 |
96
+ | 9 | [OpenMed-PII-BioClinicalModern-Large-395M-v1](https://huggingface.co/openmed/OpenMed-PII-BioClinicalModern-Large-395M-v1) | 0.9579 | 0.9656 | 0.9502 |
97
+ | 10 | [OpenMed-PII-ClinicalE5-Large-335M-v1](https://huggingface.co/openmed/OpenMed-PII-ClinicalE5-Large-335M-v1) | 0.9577 | 0.9604 | 0.9550 |
98
+
99
### Best Performing Entities

| Entity | F1 | Precision | Recall | Support |
|:---|:---:|:---:|:---:|:---:|
| `biometric_identifier` | 1.000 | 1.000 | 1.000 | 234 |
| `email` | 0.995 | 0.996 | 0.993 | 763 |
| `health_plan_beneficiary_number` | 0.993 | 0.986 | 1.000 | 216 |
| `date_of_birth` | 0.993 | 0.986 | 1.000 | 273 |
| `blood_type` | 0.993 | 0.985 | 1.000 | 135 |

### Challenging Entities

These entity types have lower performance and may benefit from additional post-processing:

| Entity | F1 | Precision | Recall | Support |
|:---|:---:|:---:|:---:|:---:|
| `pin` | 0.872 | 0.892 | 0.853 | 136 |
| `education_level` | 0.869 | 0.894 | 0.845 | 200 |
| `sexuality` | 0.857 | 0.815 | 0.904 | 83 |
| `gender` | 0.797 | 0.744 | 0.858 | 190 |
| `occupation` | 0.655 | 0.710 | 0.608 | 724 |

## Supported Entity Types

This model detects **54 PII entity types** organized into categories:

<details>
<summary><strong>Identifiers</strong> (16 types)</summary>

| Entity | Description |
|:---|:---|
| `account_number` | Account Number |
| `api_key` | API Key |
| `bank_routing_number` | Bank Routing Number |
| `certificate_license_number` | Certificate/License Number |
| `credit_debit_card` | Credit/Debit Card |
| `cvv` | CVV |
| `employee_id` | Employee ID |
| `health_plan_beneficiary_number` | Health Plan Beneficiary Number |
| `mac_address` | MAC Address |
| `medical_record_number` | Medical Record Number |
| ... | *and 6 more* |

</details>

<details>
<summary><strong>Personal Info</strong> (14 types)</summary>

| Entity | Description |
|:---|:---|
| `age` | Age |
| `biometric_identifier` | Biometric Identifier |
| `blood_type` | Blood Type |
| `date_of_birth` | Date of Birth |
| `education_level` | Education Level |
| `first_name` | First Name |
| `last_name` | Last Name |
| `gender` | Gender |
| `language` | Language |
| `occupation` | Occupation |
| ... | *and 4 more* |

</details>

<details>
<summary><strong>Contact Info</strong> (4 types)</summary>

| Entity | Description |
|:---|:---|
| `email` | Email |
| `phone_number` | Phone Number |
| `fax_number` | Fax Number |
| `url` | URL |

</details>

<details>
<summary><strong>Location</strong> (6 types)</summary>

| Entity | Description |
|:---|:---|
| `city` | City |
| `coordinate` | Coordinate |
| `country` | Country |
| `county` | County |
| `state` | State |
| `street_address` | Street Address |

</details>

<details>
<summary><strong>Network Info</strong> (3 types)</summary>

| Entity | Description |
|:---|:---|
| `device_identifier` | Device Identifier |
| `ipv4` | IPv4 Address |
| `ipv6` | IPv6 Address |

</details>

<details>
<summary><strong>Temporal</strong> (3 types)</summary>

| Entity | Description |
|:---|:---|
| `date` | Date |
| `date_time` | Date/Time |
| `time` | Time |

</details>

<details>
<summary><strong>Organization</strong> (1 type)</summary>

| Entity | Description |
|:---|:---|
| `company_name` | Company Name |

</details>

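Since each category above shows only a subset, the full inventory can be recovered from the model's `id2label` map by stripping the BIO prefixes. A minimal, self-contained sketch (using a toy label map in place of the real 106-entry one from `config.json`):

```python
# Recover unique entity types from a BIO label map by stripping the
# B-/I- prefixes. Toy map shown here; model.config.id2label carries
# the full 106-entry inventory for this model.
id2label = {0: "O", 1: "B-ssn", 2: "B-email", 3: "I-email", 4: "B-phone_number"}

entity_types = sorted({label.split("-", 1)[1] for label in id2label.values() if label != "O"})
print(entity_types)  # ['email', 'phone_number', 'ssn']
```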
## Usage

### Quick Start

```python
from transformers import pipeline

# Load the PII detection pipeline
ner = pipeline("ner", model="openmed/OpenMed-PII-ClinicalE5-Base-109M-v1", aggregation_strategy="simple")

text = """
Patient John Smith (DOB: 03/15/1985, SSN: 123-45-6789) was seen today.
Contact: john.smith@email.com, Phone: (555) 123-4567.
Address: 456 Oak Street, Boston, MA 02108.
"""

entities = ner(text)
for entity in entities:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.3f})")
```

### De-identification Example

```python
def redact_pii(text, entities):
    """Replace each detected PII span with its entity-type placeholder."""
    # Sort entities by start position (descending) so earlier offsets stay valid
    sorted_entities = sorted(entities, key=lambda x: x['start'], reverse=True)
    redacted = text
    for ent in sorted_entities:
        redacted = redacted[:ent['start']] + f"[{ent['entity_group']}]" + redacted[ent['end']:]
    return redacted

# Apply de-identification
redacted_text = redact_pii(text, entities)
print(redacted_text)
```

### Batch Processing

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "openmed/OpenMed-PII-ClinicalE5-Base-109M-v1"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Contact Dr. Jane Doe at jane.doe@hospital.org",
    "Patient SSN: 987-65-4321, MRN: 12345678",
]

inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
```

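The `predictions` tensor holds one label id per token; mapping ids back to tag strings goes through the model's `id2label` map. A self-contained sketch with illustrative token strings (in practice, use `model.config.id2label` and `tokenizer.convert_ids_to_tokens` on the real outputs):

```python
# Decode per-token label ids into (token, tag) pairs, skipping special
# tokens. Label ids 45/96 match B-ssn/I-ssn in this model's config.
id2label = {0: "O", 45: "B-ssn", 96: "I-ssn"}
tokens   = ["[CLS]", "SSN", ":", "987", "-", "65", "[SEP]"]
pred_ids = [0, 0, 0, 45, 96, 96, 0]

pairs = [(tok, id2label[pid])
         for tok, pid in zip(tokens, pred_ids)
         if tok not in ("[CLS]", "[SEP]", "[PAD]")]
print(pairs)  # [('SSN', 'O'), (':', 'O'), ('987', 'B-ssn'), ('-', 'I-ssn'), ('65', 'I-ssn')]
```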
## Training Details

### Dataset

- **Source**: [NVIDIA Nemotron-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII)
- **Format**: BIO-tagged token classification
- **Labels**: 106 total (54 `B-` tags, 51 `I-` tags, plus `O`; single-token entity types have no `I-` tag)
- **Splits**: 50K train / 5K validation / 45K test

### Training Configuration

- **Max Sequence Length**: 384 tokens
- **Label Strategy**: First token only (`label_all_tokens=False`)
- **Framework**: Hugging Face Transformers + Trainer API

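With the first-token-only strategy, each word's label sits on its first subword while continuation pieces are masked with `-100` so the loss ignores them. A sketch of the alignment step (hypothetical helper; `word_ids` as produced by a fast tokenizer's `word_ids()` method):

```python
def align_labels(word_ids, word_labels):
    """Assign each word's label to its first subword; mask the rest with -100."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:              # special tokens ([CLS], [SEP], padding)
            aligned.append(-100)
        elif wid != prev:            # first subword keeps the word label
            aligned.append(word_labels[wid])
        else:                        # continuation subwords are ignored by the loss
            aligned.append(-100)
        prev = wid
    return aligned

# A surname splitting into two subwords -> second piece masked
word_ids    = [None, 0, 1, 1, None]  # [CLS] Dr. John ##son [SEP]
word_labels = [0, 25]                # O, B-first_name (id 25 per this model's config)
print(align_labels(word_ids, word_labels))  # [-100, 0, 25, -100, -100]
```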
## Intended Use & Limitations

### Intended Use

- **De-identification**: Automated redaction of PII in clinical notes, medical records, and documents
- **Compliance**: Supporting HIPAA, GDPR, and privacy regulation compliance
- **Data Preprocessing**: Preparing datasets for research by removing sensitive information
- **Audit Support**: Identifying PII in document collections

### Limitations

⚠️ **Important**: This model is intended as an **assistive tool**, not a replacement for human review.

- **False Negatives**: Some PII may go undetected; always verify outputs in critical applications
- **Context Sensitivity**: Performance may vary with domain-specific terminology
- **Challenging Categories**: `occupation`, `gender`, and `sexuality` have lower F1 scores
- **Language**: Primarily trained on English text

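For the weaker categories, one pragmatic mitigation is to filter detections by confidence and tune thresholds per entity type. A minimal sketch over pipeline-style entity dicts (the threshold values here are illustrative, not tuned):

```python
def filter_entities(entities, min_score=0.5, per_type=None):
    """Drop detections below a confidence threshold, with optional per-type overrides."""
    per_type = per_type or {}
    return [e for e in entities
            if e["score"] >= per_type.get(e["entity_group"], min_score)]

entities = [
    {"entity_group": "occupation", "word": "nurse", "score": 0.42, "start": 0, "end": 5},
    {"entity_group": "email", "word": "a@b.org", "score": 0.99, "start": 10, "end": 17},
]
# Require higher confidence for the weakest category
kept = filter_entities(entities, per_type={"occupation": 0.8})
print([e["entity_group"] for e in kept])  # ['email']
```

Note the trade-off: raising a threshold improves precision for that category at the cost of recall, so thresholds should be validated against held-out data.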
## Citation

```bibtex
@misc{openmed-pii-2026,
  title = {OpenMed-PII-ClinicalE5-Base-109M-v1: PII Detection Model},
  author = {OpenMed Science},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/openmed/OpenMed-PII-ClinicalE5-Base-109M-v1}
}
```

## Links

- **Organization**: [OpenMed](https://huggingface.co/OpenMed)
all_results.json ADDED
{
  "epoch": 3.0,
  "eval_accuracy": 0.994285048763938,
  "eval_f1": 0.9540518164994642,
  "eval_loss": 0.023267928510904312,
  "eval_precision": 0.9574556473290651,
  "eval_recall": 0.9506721017130101,
  "eval_runtime": 14.8896,
  "eval_samples_per_second": 335.804,
  "eval_steps_per_second": 5.306,
  "test_accuracy": 0.9942960950558528,
  "test_f1": 0.9545491518553041,
  "test_loss": 0.022622624412178993,
  "test_precision": 0.9572360415390746,
  "test_recall": 0.9518773037460972,
  "test_runtime": 189.6487,
  "test_samples_per_second": 237.281,
  "test_steps_per_second": 3.712,
  "total_flos": 1.8767762625368064e+16,
  "train_loss": 0.1060635477882836,
  "train_runtime": 842.9188,
  "train_samples_per_second": 177.953,
  "train_steps_per_second": 5.563
}
config.json ADDED
{
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-account_number",
    "2": "B-age",
    "3": "B-api_key",
    "4": "B-bank_routing_number",
    "5": "B-biometric_identifier",
    "6": "B-blood_type",
    "7": "B-certificate_license_number",
    "8": "B-city",
    "9": "B-company_name",
    "10": "B-coordinate",
    "11": "B-country",
    "12": "B-county",
    "13": "B-credit_debit_card",
    "14": "B-customer_id",
    "15": "B-cvv",
    "16": "B-date",
    "17": "B-date_of_birth",
    "18": "B-date_time",
    "19": "B-device_identifier",
    "20": "B-education_level",
    "21": "B-email",
    "22": "B-employee_id",
    "23": "B-employment_status",
    "24": "B-fax_number",
    "25": "B-first_name",
    "26": "B-gender",
    "27": "B-health_plan_beneficiary_number",
    "28": "B-http_cookie",
    "29": "B-ipv4",
    "30": "B-ipv6",
    "31": "B-language",
    "32": "B-last_name",
    "33": "B-license_plate",
    "34": "B-mac_address",
    "35": "B-medical_record_number",
    "36": "B-occupation",
    "37": "B-password",
    "38": "B-phone_number",
    "39": "B-pin",
    "40": "B-political_view",
    "41": "B-postcode",
    "42": "B-race_ethnicity",
    "43": "B-religious_belief",
    "44": "B-sexuality",
    "45": "B-ssn",
    "46": "B-state",
    "47": "B-street_address",
    "48": "B-swift_bic",
    "49": "B-tax_id",
    "50": "B-time",
    "51": "B-unique_id",
    "52": "B-url",
    "53": "B-user_name",
    "54": "B-vehicle_identifier",
    "55": "I-account_number",
    "56": "I-api_key",
    "57": "I-biometric_identifier",
    "58": "I-blood_type",
    "59": "I-certificate_license_number",
    "60": "I-city",
    "61": "I-company_name",
    "62": "I-coordinate",
    "63": "I-country",
    "64": "I-county",
    "65": "I-credit_debit_card",
    "66": "I-customer_id",
    "67": "I-date",
    "68": "I-date_of_birth",
    "69": "I-date_time",
    "70": "I-device_identifier",
    "71": "I-education_level",
    "72": "I-email",
    "73": "I-employee_id",
    "74": "I-employment_status",
    "75": "I-fax_number",
    "76": "I-first_name",
    "77": "I-gender",
    "78": "I-health_plan_beneficiary_number",
    "79": "I-http_cookie",
    "80": "I-ipv4",
    "81": "I-ipv6",
    "82": "I-language",
    "83": "I-last_name",
    "84": "I-license_plate",
    "85": "I-mac_address",
    "86": "I-medical_record_number",
    "87": "I-occupation",
    "88": "I-password",
    "89": "I-phone_number",
    "90": "I-pin",
    "91": "I-political_view",
    "92": "I-postcode",
    "93": "I-race_ethnicity",
    "94": "I-religious_belief",
    "95": "I-sexuality",
    "96": "I-ssn",
    "97": "I-state",
    "98": "I-street_address",
    "99": "I-swift_bic",
    "100": "I-tax_id",
    "101": "I-time",
    "102": "I-unique_id",
    "103": "I-url",
    "104": "I-user_name",
    "105": "I-vehicle_identifier"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-account_number": 1,
    "B-age": 2,
    "B-api_key": 3,
    "B-bank_routing_number": 4,
    "B-biometric_identifier": 5,
    "B-blood_type": 6,
    "B-certificate_license_number": 7,
    "B-city": 8,
    "B-company_name": 9,
    "B-coordinate": 10,
    "B-country": 11,
    "B-county": 12,
    "B-credit_debit_card": 13,
    "B-customer_id": 14,
    "B-cvv": 15,
    "B-date": 16,
    "B-date_of_birth": 17,
    "B-date_time": 18,
    "B-device_identifier": 19,
    "B-education_level": 20,
    "B-email": 21,
    "B-employee_id": 22,
    "B-employment_status": 23,
    "B-fax_number": 24,
    "B-first_name": 25,
    "B-gender": 26,
    "B-health_plan_beneficiary_number": 27,
    "B-http_cookie": 28,
    "B-ipv4": 29,
    "B-ipv6": 30,
    "B-language": 31,
    "B-last_name": 32,
    "B-license_plate": 33,
    "B-mac_address": 34,
    "B-medical_record_number": 35,
    "B-occupation": 36,
    "B-password": 37,
    "B-phone_number": 38,
    "B-pin": 39,
    "B-political_view": 40,
    "B-postcode": 41,
    "B-race_ethnicity": 42,
    "B-religious_belief": 43,
    "B-sexuality": 44,
    "B-ssn": 45,
    "B-state": 46,
    "B-street_address": 47,
    "B-swift_bic": 48,
    "B-tax_id": 49,
    "B-time": 50,
    "B-unique_id": 51,
    "B-url": 52,
    "B-user_name": 53,
    "B-vehicle_identifier": 54,
    "I-account_number": 55,
    "I-api_key": 56,
    "I-biometric_identifier": 57,
    "I-blood_type": 58,
    "I-certificate_license_number": 59,
    "I-city": 60,
    "I-company_name": 61,
    "I-coordinate": 62,
    "I-country": 63,
    "I-county": 64,
    "I-credit_debit_card": 65,
    "I-customer_id": 66,
    "I-date": 67,
    "I-date_of_birth": 68,
    "I-date_time": 69,
    "I-device_identifier": 70,
    "I-education_level": 71,
    "I-email": 72,
    "I-employee_id": 73,
    "I-employment_status": 74,
    "I-fax_number": 75,
    "I-first_name": 76,
    "I-gender": 77,
    "I-health_plan_beneficiary_number": 78,
    "I-http_cookie": 79,
    "I-ipv4": 80,
    "I-ipv6": 81,
    "I-language": 82,
    "I-last_name": 83,
    "I-license_plate": 84,
    "I-mac_address": 85,
    "I-medical_record_number": 86,
    "I-occupation": 87,
    "I-password": 88,
    "I-phone_number": 89,
    "I-pin": 90,
    "I-political_view": 91,
    "I-postcode": 92,
    "I-race_ethnicity": 93,
    "I-religious_belief": 94,
    "I-sexuality": 95,
    "I-ssn": 96,
    "I-state": 97,
    "I-street_address": 98,
    "I-swift_bic": 99,
    "I-tax_id": 100,
    "I-time": 101,
    "I-unique_id": 102,
    "I-url": 103,
    "I-user_name": 104,
    "I-vehicle_identifier": 105,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
eval_results.json ADDED
{
  "epoch": 3.0,
  "eval_accuracy": 0.994285048763938,
  "eval_f1": 0.9540518164994642,
  "eval_loss": 0.023267928510904312,
  "eval_precision": 0.9574556473290651,
  "eval_recall": 0.9506721017130101,
  "eval_runtime": 14.8896,
  "eval_samples_per_second": 335.804,
  "eval_steps_per_second": 5.306
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:9369e0498e743fd2563ac1aee389eea156d6543b4d9858c54188e1e3400d605b
size 435915992
special_tokens_map.json ADDED
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
test_results.json ADDED
{
  "test_accuracy": 0.9942960950558528,
  "test_f1": 0.9545491518553041,
  "test_loss": 0.022622624412178993,
  "test_precision": 0.9572360415390746,
  "test_recall": 0.9518773037460972,
  "test_runtime": 189.6487,
  "test_samples_per_second": 237.281,
  "test_steps_per_second": 3.712
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
train_results.json ADDED
{
  "epoch": 3.0,
  "total_flos": 1.8767762625368064e+16,
  "train_loss": 0.1060635477882836,
  "train_runtime": 842.9188,
  "train_samples_per_second": 177.953,
  "train_steps_per_second": 5.563
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff