barflyman/BertSmallPII

@@ -1,156 +1,155 @@
----
-license: apache-2.0
-datasets:
-- beki/privy
-- gretelai/synthetic_pii_finance_multilingual
-- eriktks/conll2003
-language:
-- en
-base_model:
-- prajjwal1/bert-small
-pipeline_tag: token-classification
----
-# gravitee-io/bert-small-pii-detection 🚀
-**A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the datasets described in metatada.
-### About the dataset:
-We combined various datasets in order to cover wide range of document formats like:
-1. JSON,
-2. HTML,
-3. XML,
-4. SQL
-5. Documents
-### Label Set
-```
-AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL, IBAN_CODE, IMEI,
-IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER,
-TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
-```
-## How to Use
-### Quick start (pipeline)
-```python
-from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
-repo = "gravitee-io/bert-small-pii-detection"
-tok = AutoTokenizer.from_pretrained(repo)
-model = AutoModelForTokenClassification.from_pretrained(repo)
-pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
-text = ""
-pipe(text)
-```
-## Evaluation
-**Metric:** precision / recall / F1 per entity, micro/macro averages
-| Entity             | Precision | Recall | F1-score | Support |
-|--------------------|-----------|--------|----------|---------|
-| AGE                | 0.9898    | 0.8858 | 0.9349   | 219     |
-| COORDINATE         | 0.9627    | 0.8738 | 0.9161   | 325     |
-| CREDIT_CARD        | 0.9273    | 0.8870 | 0.9067   | 115     |
-| DATE_TIME          | 0.8598    | 0.7364 | 0.7933   | 3255    |
-| EMAIL_ADDRESS      | 0.9428    | 0.8941 | 0.9178   | 387     |
-| FINANCIAL          | 0.9862    | 0.9565 | 0.9711   | 299     |
-| IBAN_CODE          | 0.9577    | 0.9252 | 0.9412   | 147     |
-| IMEI               | 0.9885    | 0.9663 | 0.9773   | 89      |
-| IP_ADDRESS         | 0.9338    | 0.8812 | 0.9068   | 160     |
-| LOCATION           | 0.8849    | 0.8222 | 0.8524   | 4264    |
-| MAC_ADDRESS        | 0.9889    | 1.0000 | 0.9944   | 89      |
-| NRP                | 1.0000    | 0.9818 | 0.9908   | 494     |
-| ORGANIZATION       | 0.7454    | 0.6688 | 0.7051   | 3551    |
-| PASSWORD           | 0.8384    | 0.8137 | 0.8259   | 102     |
-| PERSON             | 0.9123    | 0.8826 | 0.8972   | 4454    |
-| PHONE_NUMBER       | 0.9462    | 0.8199 | 0.8785   | 322     |
-| TITLE              | 0.9887    | 0.9734 | 0.9810   | 451     |
-| URL                | 1.0000    | 0.9787 | 0.9892   | 188     |
-| US_BANK_NUMBER     | 1.0000    | 0.9579 | 0.9785   | 95      |
-| US_DRIVER_LICENSE  | 0.9167    | 0.9167 | 0.9167   | 120     |
-| US_ITIN            | 0.9659    | 0.8763 | 0.9189   | 97      |
-| US_LICENSE_PLATE   | 1.0000    | 0.9000 | 0.9474   | 90      |
-| US_PASSPORT        | 0.9200    | 0.9200 | 0.9200   | 100     |
-| US_SSN             | 0.9744    | 0.9580 | 0.9661   | 119     |
-| **micro avg**      | 0.8804    | 0.8141 | 0.8460   | 19532   |
-| **macro avg**      | 0.9429    | 0.8948 | 0.9178   | 19532   |
-| **weighted avg**   | 0.8785    | 0.8141 | 0.8446   | 19532   |
-## Intended Uses & Limitations
-**Use this model for:**
-* **Low resource environmens**
-* Redacting PII in customer support logs, dev/test environments, API traces and articles
-* Real-time hints in form fields or data entry systems
-**Limitations:**
-* English-focused; other languages will degrade
-* Domain drift is real: audit on your own data
----
-## Citation
-If you use the model, please consider citing the papers:
-```
-@misc{bhargava2021generalization,
-      title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
-      author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
-      year={2021},
-      eprint={2110.01518},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
-}
-@article{DBLP:journals/corr/abs-1908-08962,
-  author    = {Iulia Turc and
-               Ming{-}Wei Chang and
-               Kenton Lee and
-               Kristina Toutanova},
-  title     = {Well-Read Students Learn Better: The Impact of Student Initialization
-               on Knowledge Distillation},
-  journal   = {CoRR},
-  volume    = {abs/1908.08962},
-  year      = {2019},
-  url       = {http://arxiv.org/abs/1908.08962},
-  eprinttype = {arXiv},
-  eprint    = {1908.08962},
-  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
-  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
-  bibsource = {dblp computer science bibliography, https://dblp.org}
-}
-@online{WinNT,
-  author = {Benjamin Kilimnik},
-  title = {{Privy} Synthetic PII Protocol Trace Dataset},
-  year = 2022,
-  url = {https://huggingface.co/datasets/beki/privy},
-}
-@online{gretel2023,
-  author = {Gretel.ai},
-  title = {{Synthetic PII Finance Multilingual Dataset}},
-  year = 2023,
-  url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
-}
-@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
-    title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
-    author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
-    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
-    year = "2003",
-    url = "https://aclanthology.org/W03-0419",
-}
-}
 ```

+---
+license: apache-2.0
+datasets:
+- beki/privy
+- gretelai/synthetic_pii_finance_multilingual
+- eriktks/conll2003
+language:
+- en
+base_model:
+- prajjwal1/bert-small
+pipeline_tag: token-classification
+---
+**A more accurate PII detector** fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the datasets listed in the metadata above.
+### About the dataset
+We combined several datasets to cover a wide range of document formats, such as:
+1. JSON
+2. HTML
+3. XML
+4. SQL
+5. Documents
+### Label Set
+```
+AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL, IBAN_CODE, IMEI,
+IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER,
+TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
+```
+## How to Use
+### Quick start (pipeline)
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+repo = "gravitee-io/bert-small-pii-detection"
+tok = AutoTokenizer.from_pretrained(repo)
+model = AutoModelForTokenClassification.from_pretrained(repo)
+pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
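+# Sample input for illustration; replace with your own text.
+text = "Contact Jane Doe at jane@example.com"
+print(pipe(text))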
+```
+## Evaluation
+**Metrics:** precision, recall, and F1 per entity, with micro, macro, and weighted averages
+| Entity             | Precision | Recall | F1-score | Support |
+|--------------------|-----------|--------|----------|---------|
+| AGE                | 0.9898    | 0.8858 | 0.9349   | 219     |
+| COORDINATE         | 0.9627    | 0.8738 | 0.9161   | 325     |
+| CREDIT_CARD        | 0.9273    | 0.8870 | 0.9067   | 115     |
+| DATE_TIME          | 0.8598    | 0.7364 | 0.7933   | 3255    |
+| EMAIL_ADDRESS      | 0.9428    | 0.8941 | 0.9178   | 387     |
+| FINANCIAL          | 0.9862    | 0.9565 | 0.9711   | 299     |
+| IBAN_CODE          | 0.9577    | 0.9252 | 0.9412   | 147     |
+| IMEI               | 0.9885    | 0.9663 | 0.9773   | 89      |
+| IP_ADDRESS         | 0.9338    | 0.8812 | 0.9068   | 160     |
+| LOCATION           | 0.8849    | 0.8222 | 0.8524   | 4264    |
+| MAC_ADDRESS        | 0.9889    | 1.0000 | 0.9944   | 89      |
+| NRP                | 1.0000    | 0.9818 | 0.9908   | 494     |
+| ORGANIZATION       | 0.7454    | 0.6688 | 0.7051   | 3551    |
+| PASSWORD           | 0.8384    | 0.8137 | 0.8259   | 102     |
+| PERSON             | 0.9123    | 0.8826 | 0.8972   | 4454    |
+| PHONE_NUMBER       | 0.9462    | 0.8199 | 0.8785   | 322     |
+| TITLE              | 0.9887    | 0.9734 | 0.9810   | 451     |
+| URL                | 1.0000    | 0.9787 | 0.9892   | 188     |
+| US_BANK_NUMBER     | 1.0000    | 0.9579 | 0.9785   | 95      |
+| US_DRIVER_LICENSE  | 0.9167    | 0.9167 | 0.9167   | 120     |
+| US_ITIN            | 0.9659    | 0.8763 | 0.9189   | 97      |
+| US_LICENSE_PLATE   | 1.0000    | 0.9000 | 0.9474   | 90      |
+| US_PASSPORT        | 0.9200    | 0.9200 | 0.9200   | 100     |
+| US_SSN             | 0.9744    | 0.9580 | 0.9661   | 119     |
+| **micro avg**      | 0.8804    | 0.8141 | 0.8460   | 19532   |
+| **macro avg**      | 0.9429    | 0.8948 | 0.9178   | 19532   |
+| **weighted avg**   | 0.8785    | 0.8141 | 0.8446   | 19532   |
+## Intended Uses & Limitations
+**Use this model for:**
+* **Low-resource environments**
+* Redacting PII in customer support logs, dev/test environments, API traces and articles
+* Real-time hints in form fields or data entry systems
+**Limitations:**
+* English-focused; accuracy on other languages will degrade
+* Domain drift is real: audit on your own data
+---
+## Citation
+If you use the model, please consider citing the following papers and datasets:
+```
+@misc{bhargava2021generalization,
+      title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
+      author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
+      year={2021},
+      eprint={2110.01518},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+@article{DBLP:journals/corr/abs-1908-08962,
+  author    = {Iulia Turc and
+               Ming{-}Wei Chang and
+               Kenton Lee and
+               Kristina Toutanova},
+  title     = {Well-Read Students Learn Better: The Impact of Student Initialization
+               on Knowledge Distillation},
+  journal   = {CoRR},
+  volume    = {abs/1908.08962},
+  year      = {2019},
+  url       = {http://arxiv.org/abs/1908.08962},
+  eprinttype = {arXiv},
+  eprint    = {1908.08962},
+  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+@online{WinNT,
+  author = {Benjamin Kilimnik},
+  title = {{Privy} Synthetic PII Protocol Trace Dataset},
+  year = 2022,
+  url = {https://huggingface.co/datasets/beki/privy},
+}
+@online{gretel2023,
+  author = {Gretel.ai},
+  title = {{Synthetic PII Finance Multilingual Dataset}},
+  year = 2023,
+  url = {https://huggingface.co/datasets/gretelai/synthetic_pii_finance_multilingual},
+}
+@inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
+    title = "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition",
+    author = "Tjong Kim Sang, Erik F. and De Meulder, Fien",
+    booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003",
+    year = "2003",
+    url = "https://aclanthology.org/W03-0419",
+}
 ```
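
The quick-start snippet in the README above stops at `pipe(text)`. As a minimal sketch of the redaction use case listed under "Intended Uses & Limitations", the function below replaces each detected span with its label. It assumes the aggregated output shape of the `transformers` token-classification pipeline (a list of dicts with `entity_group`, `score`, `start`, and `end`). The `redact` helper, the `min_score` threshold, and the sample entities are hypothetical and not part of the model card.

```python
from typing import Iterable, Mapping


def redact(text: str, entities: Iterable[Mapping], min_score: float = 0.5) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Keep confident detections only; min_score is an illustrative default.
    kept = [e for e in entities if e["score"] >= min_score]
    # Replace from right to left so earlier character offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + label + text[ent["end"]:]
    return text


# Hand-written entities in the shape returned by
# pipeline(..., aggregation_strategy="simple"); not real model output.
sample_text = "Contact Jane Doe at jane@example.com"
sample_entities = [
    {"entity_group": "PERSON", "score": 0.98, "word": "jane doe", "start": 8, "end": 16},
    {"entity_group": "EMAIL_ADDRESS", "score": 0.95, "word": "jane@example.com", "start": 20, "end": 36},
]
print(redact(sample_text, sample_entities))
```

With the real model, the call would be `redact(text, pipe(text))`. The sketch does not merge overlapping spans, and given the per-entity recall figures in the evaluation table, some PII will go undetected, so auditing on your own data still applies.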