gravitee-io/bert-small-pii-detection

@@ -2,6 +2,8 @@
 license: apache-2.0
 datasets:
 - beki/privy
 language:
 - en
 base_model:
@@ -10,8 +12,17 @@ pipeline_tag: token-classification
 ---
 # gravitee-io/bert-small-pii-detection 🚀
-**A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the [`beki/privy`](https://huggingface.co/datasets/beki/privy) dataset.
-Compared to the [`bert-tiny`](https://huggingface.co/gravitee-io/bert-tiny-pii-detection) variant, this model is larger and slower, but significantly improves precision and recall across most entity types.
 ### Label Set
@@ -21,8 +32,6 @@ IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NU
 TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
 ```
----
 ## How to Use
 ### Quick start (pipeline)
@@ -39,11 +48,9 @@ text = ""
 pipe(text)
 ```
----
 ## Evaluation
-**Test set:** `beki/privy` held-out split
 **Metric:** precision / recall / F1 per entity, micro/macro averages
 | Entity             | Precision | Recall | F1-score | Support |
@@ -77,14 +84,12 @@ pipe(text)
 | **weighted avg**   | 0.8785    | 0.8141 | 0.8446   | 19532   |
----
 ## Intended Uses & Limitations
 **Use this model for:**
 * Redacting PII in customer support logs, dev/test environments, API traces and articles
-* Pre-screening documents before storage or external sharing
 * Real-time hints in form fields or data entry systems
 **Limitations:**
@@ -92,7 +97,6 @@ pipe(text)
 * English-focused; other languages will degrade
 * Domain drift is real: audit on your own data
 ---
 ## Citation
@@ -125,6 +129,7 @@ If you use the model, please consider citing the papers:
   timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
   biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
   bibsource = {dblp computer science bibliography, https://dblp.org}
 @online{WinNT,
   author = {Benjamin Kilimnik},
@@ -132,4 +137,20 @@ If you use the model, please consider citing the papers:
   year = 2022,
   url = {https://huggingface.co/datasets/beki/privy},
 }
 ```

 license: apache-2.0
 datasets:
 - beki/privy
+- gretelai/synthetic_pii_finance_multilingual
+- eriktks/conll2003
 language:
 - en
 base_model:
 ---
 # gravitee-io/bert-small-pii-detection 🚀
+**A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the datasets listed in the model card metadata.
+### About the datasets
+We combined several datasets to cover a wide range of document formats, such as:
+1. JSON
+2. HTML
+3. XML
+4. SQL
+5. Documents
 ### Label Set
 TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
 ```
 ## How to Use
 ### Quick start (pipeline)
 pipe(text)
 ```
 ## Evaluation
 **Metric:** precision / recall / F1 per entity, micro/macro averages
 | Entity             | Precision | Recall | F1-score | Support |
 | **weighted avg**   | 0.8785    | 0.8141 | 0.8446   | 19532   |
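
The difference between the micro, macro, and weighted averages mentioned above can be illustrated with a minimal sketch. The `Counts` class, the helper names, and the per-entity counts below are hypothetical and chosen only for illustration; they are not taken from this model's evaluation.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-entity confusion counts (not from the model card)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def prf(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    p = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    r = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


def f1_averages(per_entity: dict[str, Counts]) -> dict[str, float]:
    """Compute macro, micro, and support-weighted F1 over all entities."""
    f1s = {name: prf(c)[2] for name, c in per_entity.items()}
    support = {name: c.tp + c.fn for name, c in per_entity.items()}

    # Macro: unweighted mean of per-entity F1, so rare entities count equally.
    macro = sum(f1s.values()) / len(f1s)

    # Micro: pool all counts first, then compute a single F1.
    pooled = Counts(
        tp=sum(c.tp for c in per_entity.values()),
        fp=sum(c.fp for c in per_entity.values()),
        fn=sum(c.fn for c in per_entity.values()),
    )
    micro = prf(pooled)[2]

    # Weighted: per-entity F1 weighted by support (gold occurrences).
    total = sum(support.values())
    weighted = sum(f1s[n] * support[n] for n in f1s) / total

    return {"macro_f1": macro, "micro_f1": micro, "weighted_f1": weighted}


# Illustrative counts only: one frequent entity, one rare entity.
example = {
    "PERSON": Counts(tp=90, fp=10, fn=10),
    "US_SSN": Counts(tp=5, fp=5, fn=5),
}
print(f1_averages(example))
```

With these made-up counts, the macro F1 is about 0.70 while the micro and weighted F1 are about 0.86, which shows how a rare, poorly detected entity lowers the macro average much more than the other two.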
 ## Intended Uses & Limitations
 **Use this model for:**
+* **Low-resource environments**
 * Redacting PII in customer support logs, dev/test environments, API traces and articles
 * Real-time hints in form fields or data entry systems
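
As a rough sketch of the redaction use case, the helper below replaces detected spans with their label. It assumes entities shaped like the output of a `transformers` token-classification pipeline created with `aggregation_strategy="simple"` (dicts with `entity_group`, `score`, `start`, `end`). The `redact` function, the `min_score` threshold, and the sample entities are hypothetical, and the sketch assumes spans do not overlap.

```python
def redact(text: str, entities: list[dict], min_score: float = 0.5) -> str:
    """Replace each detected span with a [LABEL] placeholder.

    Assumes `entities` are non-overlapping dicts with the keys
    `entity_group`, `score`, `start`, and `end` (character offsets).
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Process spans right to left so earlier offsets stay valid after edits.
    for e in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['entity_group']}]" + text[e["end"]:]
    return text


# Hand-written entities for illustration; in practice they would come from
# calling the pipeline on `sample`.
sample = "Jane Doe logged in from 10.0.0.7"
detected = [
    {"entity_group": "PERSON", "score": 0.99, "start": 0, "end": 8},
    {"entity_group": "IP_ADDRESS", "score": 0.97, "start": 24, "end": 32},
]
print(redact(sample, detected))
```

Raising `min_score` trades recall for precision; given the domain-drift caveat, the threshold is worth tuning on your own data.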
 **Limitations:**
 * English-focused; other languages will degrade
 * Domain drift is real: audit on your own data
 ---
 ## Citation
   timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
   biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
   bibsource = {dblp computer science bibliography, https://dblp.org}
+}
 @online{WinNT,
   author = {Benjamin Kilimnik},
   year = 2022,
   url = {https://huggingface.co/datasets/beki/privy},
 }
+@online{gretel2023,
+  author = {Gretel.ai},
+  title = {{Synthetic PII Finance Multilingual Dataset}},
+  year = 2023,
+  url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
+}
+@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
+    title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
+    author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
+    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
+    year = "2003",
+    url = "https://aclanthology.org/W03-0419",
+}
 ```