+---
+license: apache-2.0
+datasets:
+- beki/privy
+language:
+- en
+base_model:
+- prajjwal1/bert-small
+pipeline_tag: token-classification
+---
+# gravitee-io/bert-small-pii-detection 🚀
+**A detector for personally identifiable information (PII)**, fine-tuned from [`prajjwal1/bert-small`](https://huggingface.co/prajjwal1/bert-small) on the [`beki/privy`](https://huggingface.co/datasets/beki/privy) dataset.
+Compared to the [`bert-tiny`](https://huggingface.co/gravitee-io/bert-tiny-pii-detection) variant, this model is larger and slower, but it achieves significantly higher precision and recall across most entity types.
+### Label Set
+```
+AGE, COORDINATE, CREDIT_CARD, DATE_TIME, EMAIL_ADDRESS, FINANCIAL, IBAN_CODE, IMEI,
+IP_ADDRESS, LOCATION, MAC_ADDRESS, NRP, ORGANIZATION, PASSWORD, PERSON, PHONE_NUMBER,
+TITLE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_LICENSE_PLATE, US_PASSPORT, US_SSN
+```
+---
+## How to Use
+### Quick start (pipeline)
+```python
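+from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+repo = "gravitee-io/bert-small-pii-detection"
+tok = AutoTokenizer.from_pretrained(repo)
+model = AutoModelForTokenClassification.from_pretrained(repo)
+# "simple" aggregation merges word pieces into whole entity spans
+pipe = pipeline("token-classification", model=model, tokenizer=tok, aggregation_strategy="simple")
+# Illustrative input; replace with your own text
+text = "Contact Jane Doe at jane.doe@example.com or +1 555 010 0199."
+for entity in pipe(text):
+    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))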
+```
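With `aggregation_strategy="simple"`, the pipeline returns a list of dicts, each carrying an `entity_group` label, a `score`, and `start`/`end` character offsets into the input. The sketch below shows one way to turn that output into redacted text. The `redact` helper, its `min_score` threshold, and the sample entities are hypothetical illustrations, not part of this model or of `transformers`.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is assumed to have the shape returned by the
    token-classification pipeline with aggregation_strategy="simple".
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written stand-in for pipe(text) output; real spans and scores will differ.
sample_text = "Email jane@example.com from 10.0.0.1"
sample_entities = [
    {"entity_group": "EMAIL_ADDRESS", "score": 0.98, "start": 6, "end": 22},
    {"entity_group": "IP_ADDRESS", "score": 0.95, "start": 28, "end": 36},
]
redacted = redact(sample_text, sample_entities)
print(redacted)
```

In practice you would call `redact(text, pipe(text))`. The threshold trades recall for precision, so it is worth tuning on your own data rather than keeping the placeholder value used here.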
+---
+## Evaluation
+**Test set:** `beki/privy` held-out split
+**Metric:** precision, recall, and F1 per entity, plus micro, macro, and weighted averages
+| Entity             | Precision | Recall | F1-score | Support |
+|--------------------|-----------|--------|----------|---------|
+| AGE                | 0.9898    | 0.8858 | 0.9349   | 219     |
+| COORDINATE         | 0.9627    | 0.8738 | 0.9161   | 325     |
+| CREDIT_CARD        | 0.9273    | 0.8870 | 0.9067   | 115     |
+| DATE_TIME          | 0.8598    | 0.7364 | 0.7933   | 3255    |
+| EMAIL_ADDRESS      | 0.9428    | 0.8941 | 0.9178   | 387     |
+| FINANCIAL          | 0.9862    | 0.9565 | 0.9711   | 299     |
+| IBAN_CODE          | 0.9577    | 0.9252 | 0.9412   | 147     |
+| IMEI               | 0.9885    | 0.9663 | 0.9773   | 89      |
+| IP_ADDRESS         | 0.9338    | 0.8812 | 0.9068   | 160     |
+| LOCATION           | 0.8849    | 0.8222 | 0.8524   | 4264    |
+| MAC_ADDRESS        | 0.9889    | 1.0000 | 0.9944   | 89      |
+| NRP                | 1.0000    | 0.9818 | 0.9908   | 494     |
+| ORGANIZATION       | 0.7454    | 0.6688 | 0.7051   | 3551    |
+| PASSWORD           | 0.8384    | 0.8137 | 0.8259   | 102     |
+| PERSON             | 0.9123    | 0.8826 | 0.8972   | 4454    |
+| PHONE_NUMBER       | 0.9462    | 0.8199 | 0.8785   | 322     |
+| TITLE              | 0.9887    | 0.9734 | 0.9810   | 451     |
+| URL                | 1.0000    | 0.9787 | 0.9892   | 188     |
+| US_BANK_NUMBER     | 1.0000    | 0.9579 | 0.9785   | 95      |
+| US_DRIVER_LICENSE  | 0.9167    | 0.9167 | 0.9167   | 120     |
+| US_ITIN            | 0.9659    | 0.8763 | 0.9189   | 97      |
+| US_LICENSE_PLATE   | 1.0000    | 0.9000 | 0.9474   | 90      |
+| US_PASSPORT        | 0.9200    | 0.9200 | 0.9200   | 100     |
+| US_SSN             | 0.9744    | 0.9580 | 0.9661   | 119     |
+| **micro avg**      | 0.8804    | 0.8141 | 0.8460   | 19532   |
+| **macro avg**      | 0.9429    | 0.8948 | 0.9178   | 19532   |
+| **weighted avg**   | 0.8785    | 0.8141 | 0.8446   | 19532   |
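The macro and weighted rows can be reproduced from the per-entity rows: the macro average is the unweighted mean over the 24 entity types, while the weighted average weights each entity by its support. The sketch below recomputes the total support, macro precision, and weighted recall from the values in the table; the variable names are illustrative. The micro average pools raw true/false positive counts, which the table does not list, so it is not recomputed here.

```python
# (entity, precision, recall, support), copied from the evaluation table
ROWS = [
    ("AGE", 0.9898, 0.8858, 219),
    ("COORDINATE", 0.9627, 0.8738, 325),
    ("CREDIT_CARD", 0.9273, 0.8870, 115),
    ("DATE_TIME", 0.8598, 0.7364, 3255),
    ("EMAIL_ADDRESS", 0.9428, 0.8941, 387),
    ("FINANCIAL", 0.9862, 0.9565, 299),
    ("IBAN_CODE", 0.9577, 0.9252, 147),
    ("IMEI", 0.9885, 0.9663, 89),
    ("IP_ADDRESS", 0.9338, 0.8812, 160),
    ("LOCATION", 0.8849, 0.8222, 4264),
    ("MAC_ADDRESS", 0.9889, 1.0000, 89),
    ("NRP", 1.0000, 0.9818, 494),
    ("ORGANIZATION", 0.7454, 0.6688, 3551),
    ("PASSWORD", 0.8384, 0.8137, 102),
    ("PERSON", 0.9123, 0.8826, 4454),
    ("PHONE_NUMBER", 0.9462, 0.8199, 322),
    ("TITLE", 0.9887, 0.9734, 451),
    ("URL", 1.0000, 0.9787, 188),
    ("US_BANK_NUMBER", 1.0000, 0.9579, 95),
    ("US_DRIVER_LICENSE", 0.9167, 0.9167, 120),
    ("US_ITIN", 0.9659, 0.8763, 97),
    ("US_LICENSE_PLATE", 1.0000, 0.9000, 90),
    ("US_PASSPORT", 0.9200, 0.9200, 100),
    ("US_SSN", 0.9744, 0.9580, 119),
]

total_support = sum(s for _, _, _, s in ROWS)
# Macro: every entity type counts equally, regardless of how often it occurs.
macro_precision = sum(p for _, p, _, _ in ROWS) / len(ROWS)
# Weighted: each entity type counts in proportion to its support.
weighted_recall = sum(r * s for _, _, r, s in ROWS) / total_support

print(total_support, round(macro_precision, 4), round(weighted_recall, 4))
# 19532 0.9429 0.8141
```

The gap between the macro and micro rows follows from this weighting: the high-support classes (ORGANIZATION, DATE_TIME, LOCATION) score lower than the rare, pattern-like ones, so they pull the support-sensitive averages down.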
+---
+## Intended Uses & Limitations
+**Use this model for:**
+* Redacting PII in customer support logs, dev/test environments, API traces and articles
+* Pre-screening documents before storage or external sharing
+* Real-time hints in form fields or data entry systems
+**Limitations:**
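+* English-focused; expect accuracy to degrade on other languages
+* Trained on `beki/privy`, a synthetic dataset, so domain drift is likely; evaluate on your own data before relying on the scores above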
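One way to audit the model on your own data is to compare its predicted spans with hand-labeled ones, counting a prediction as correct only when start, end, and label all match. The sketch below is a minimal version of that check; `span_prf` and the sample spans are hypothetical, and since this card does not state the exact matching scheme behind its evaluation table, the resulting numbers may not be directly comparable to it.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # spans found with the right offsets and label
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical annotations for one document: (start, end, label)
gold_spans = [(0, 8, "PERSON"), (20, 36, "EMAIL_ADDRESS"), (40, 52, "PHONE_NUMBER")]
pred_spans = [(0, 8, "PERSON"), (20, 36, "EMAIL_ADDRESS"), (60, 64, "ORGANIZATION")]

precision, recall, f1 = span_prf(gold_spans, pred_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Predicted spans in this shape can be built from the pipeline output with `[(e["start"], e["end"], e["entity_group"]) for e in pipe(text)]`. Exact matching is strict; a partial-overlap criterion would be more forgiving of boundary errors.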
+---
+## Citation
+If you use this model, please consider citing the base-model papers and the dataset:
+```
+@misc{bhargava2021generalization,
+      title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
+      author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
+      year={2021},
+      eprint={2110.01518},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+@article{DBLP:journals/corr/abs-1908-08962,
+  author    = {Iulia Turc and
+               Ming{-}Wei Chang and
+               Kenton Lee and
+               Kristina Toutanova},
+  title     = {Well-Read Students Learn Better: The Impact of Student Initialization
+               on Knowledge Distillation},
+  journal   = {CoRR},
+  volume    = {abs/1908.08962},
+  year      = {2019},
+  url       = {http://arxiv.org/abs/1908.08962},
+  eprinttype = {arXiv},
+  eprint    = {1908.08962},
+  timestamp = {Thu, 29 Aug 2019 16:32:34 +0200},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+@online{WinNT,
+  author = {Benjamin Kilimnik},
+  title = {{Privy} Synthetic PII Protocol Trace Dataset},
+  year = 2022,
+  url = {https://huggingface.co/datasets/beki/privy},
+}
+```