Upload 12 files

- Bert_base_NER_PII43k/.gitattributes +35 -0
- Bert_base_NER_PII43k/README.md +54 -0
- Bert_base_NER_PII43k/config.json +84 -0
- Bert_base_NER_PII43k/model.safetensors +3 -0
- Bert_base_NER_PII43k/special_tokens_map.json +7 -0
- Bert_base_NER_PII43k/tokenizer.json +0 -0
- Bert_base_NER_PII43k/tokenizer_config.json +55 -0
- Bert_base_NER_PII43k/training_args.bin +3 -0
- Bert_base_NER_PII43k/vocab.txt +0 -0
- README.md +83 -7
- app.py +202 -0
- requirements.txt +4 -0
Bert_base_NER_PII43k/.gitattributes
ADDED
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
Bert_base_NER_PII43k/README.md
ADDED
@@ -0,0 +1,54 @@
+---
+license: apache-2.0
+base_model: google-bert/bert-base-uncased
+tags:
+- generated_from_trainer
+model-index:
+- name: Bert_base_NER_PII43k
+  results: []
+---
+
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+
+[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/tuevu_smu/huggingface/runs/5vg4k8gw)
+# Bert_base_NER_PII43k
+
+This model is a fine-tuned version of [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) on the [ai4privacy/pii-masking-43k](https://huggingface.co/datasets/ai4privacy/pii-masking-43k) dataset.
+
+## Model description
+
+More information needed
+
+## Intended uses & limitations
+
+More information needed
+
+## Training and evaluation data
+
+More information needed
+
+## Training procedure
+
+### Training hyperparameters
+
+The following hyperparameters were used during training:
+- learning_rate: 5e-05
+- train_batch_size: 16
+- eval_batch_size: 64
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_steps: 500
+- num_epochs: 3
+
+### Training results
+
+
+
+### Framework versions
+
+- Transformers 4.41.0.dev0
+- Pytorch 1.13.1
+- Datasets 2.18.0
+- Tokenizers 0.19.1
Bert_base_NER_PII43k/config.json
ADDED
@@ -0,0 +1,84 @@
+{
+  "_name_or_path": "google-bert/bert-base-uncased",
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "ACCOUNTNAME",
+    "1": "ACCOUNTNUM",
+    "2": "USERNAME",
+    "3": "NEARBYGPSCOORDINATE",
+    "4": "ORDINALDIRECTION",
+    "5": "COINADDRESS",
+    "6": "CREDITCARDISSUER",
+    "7": "CREDITCARDNUM",
+    "8": "CURRENCY",
+    "9": "DISPLAYNAME",
+    "10": "EMAIL",
+    "11": "GENDER",
+    "12": "GEO",
+    "13": "IBAN",
+    "14": "ZIPCODE",
+    "15": "IP",
+    "16": "JOB",
+    "17": "MAC",
+    "18": "NAME",
+    "19": "NUM",
+    "20": "PASSWORD",
+    "21": "STREET",
+    "22": "URL",
+    "23": "USERAGENT",
+    "24": "ADDRESS",
+    "25": "BIC",
+    "26": "O"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "ACCOUNTNAME": 0,
+    "ACCOUNTNUM": 1,
+    "ADDRESS": 24,
+    "BIC": 25,
+    "COINADDRESS": 5,
+    "CREDITCARDISSUER": 6,
+    "CREDITCARDNUM": 7,
+    "CURRENCY": 8,
+    "DISPLAYNAME": 9,
+    "EMAIL": 10,
+    "GENDER": 11,
+    "GEO": 12,
+    "IBAN": 13,
+    "IP": 15,
+    "JOB": 16,
+    "MAC": 17,
+    "NAME": 18,
+    "NEARBYGPSCOORDINATE": 3,
+    "NUM": 19,
+    "O": 26,
+    "ORDINALDIRECTION": 4,
+    "PASSWORD": 20,
+    "STREET": 21,
+    "URL": 22,
+    "USERAGENT": 23,
+    "USERNAME": 2,
+    "ZIPCODE": 14
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.41.0.dev0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
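An aside on the config above: `id2label` and `label2id` must be mutual inverses, or `AutoModelForTokenClassification` will map logits to the wrong names. A quick sanity-check sketch (over a hypothetical subset of the 27 labels; note Hugging Face configs store `id2label` keys as strings):

```python
# Subset of the maps from config.json above, for illustration only.
id2label = {"0": "ACCOUNTNAME", "1": "ACCOUNTNUM", "2": "USERNAME", "14": "ZIPCODE", "26": "O"}
label2id = {"ACCOUNTNAME": 0, "ACCOUNTNUM": 1, "USERNAME": 2, "ZIPCODE": 14, "O": 26}

# Each direction must round-trip through the other map.
assert all(label2id[name] == int(idx) for idx, name in id2label.items())
assert all(id2label[str(idx)] == name for name, idx in label2id.items())
print("label maps are consistent")
```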
Bert_base_NER_PII43k/model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0a55d6ddc6954f3e07a94d6211ac7f9668ffd2aa8cc824d02ae463edc025148f
+size 134
Bert_base_NER_PII43k/special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
Bert_base_NER_PII43k/tokenizer.json
ADDED
The diff for this file is too large to render.
Bert_base_NER_PII43k/tokenizer_config.json
ADDED
@@ -0,0 +1,55 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
Bert_base_NER_PII43k/training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:244bc12c4638255ef18230e73685dca6d51a1931705da8015039fdd9a6225f9e
+size 129
Bert_base_NER_PII43k/vocab.txt
ADDED
The diff for this file is too large to render.
README.md
CHANGED
@@ -1,14 +1,90 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: PII Detection with BERT
+emoji: π
+colorFrom: blue
+colorTo: purple
 sdk: gradio
-sdk_version:
+sdk_version: 4.44.0
 app_file: app.py
 pinned: false
 license: apache-2.0
-short_description: Train Bert base model with PII data
 ---
 
-
+# PII Detection with BERT
+
+This Space demonstrates a BERT model fine-tuned for detecting Personal Identifiable Information (PII) in text.
+
+## Model Details
+
+- **Base Model**: [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
+- **Training Dataset**: [ai4privacy/pii-masking-43k](https://huggingface.co/datasets/ai4privacy/pii-masking-43k)
+- **Task**: Token Classification / Named Entity Recognition (NER)
+- **Number of Entity Types**: 27
+
+## Detectable PII Types
+
+The model can identify 27 different types of personal information:
+
+### Identity Information
+- NAME, USERNAME, DISPLAYNAME, GENDER, JOB
+
+### Contact Information
+- EMAIL, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE
+
+### Financial Information
+- CREDITCARDNUM, CREDITCARDISSUER, IBAN, BIC
+- ACCOUNTNAME, ACCOUNTNUM, CURRENCY, COINADDRESS
+
+### Technical Information
+- IP, MAC, URL, USERAGENT, PASSWORD
+
+### Other
+- NUM, ORDINALDIRECTION
+
+## How It Works
+
+1. **Input**: User provides text that may contain personal information
+2. **Tokenization**: Text is split into tokens using BERT tokenizer
+3. **Classification**: Each token is classified into one of 27 entity types or "O" (no entity)
+4. **Visualization**: Detected entities are highlighted with different colors
+
+## Training Details
+
+- Learning Rate: 5e-05
+- Batch Size: 16 (train), 64 (eval)
+- Epochs: 3
+- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
+- Warmup Steps: 500
+
+## Use Cases
+
+- **Data Privacy**: Identify PII before sharing documents
+- **Data Anonymization**: Find information that needs masking
+- **Compliance**: Help meet GDPR, CCPA requirements
+- **Security**: Detect sensitive information leaks
+
+## Limitations
+
+- Maximum input length: 512 tokens
+- Optimized for English text
+- May not detect all variations of PII
+- Performance depends on text format and quality
+
+## Example Usage
+
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+model_name = "your-username/your-space-name"  # Update after deployment
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+
+text = "My name is John Smith and my email is john@example.com"
+inputs = tokenizer(text, return_tensors="pt")
+outputs = model(**inputs)
+```
+
+## License
+
+Apache 2.0
+
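The "Example Usage" snippet in the README stops at raw `outputs`; turning the logits into label names takes an `argmax` plus the config's `id2label` map, as `app.py` below does. A minimal sketch with dummy logits, so it runs without downloading the model (label ids 10/18/26 are EMAIL/NAME/O per config.json; the token texts in the comments are illustrative):

```python
import torch

# Hypothetical subset of the model's id2label map (full map in config.json).
id2label = {10: "EMAIL", 18: "NAME", 26: "O"}

# Dummy logits for 4 tokens over 27 classes, standing in for outputs.logits.
logits = torch.full((1, 4, 27), -1.0)
logits[0, 0, 26] = 5.0  # e.g. "my"   -> O
logits[0, 1, 18] = 5.0  # e.g. "john" -> NAME
logits[0, 2, 26] = 5.0  # e.g. "is"   -> O
logits[0, 3, 10] = 5.0  # e.g. "john@example.com" -> EMAIL

# Highest-scoring class per token, then map ids to label names.
predictions = torch.argmax(logits, dim=2)
labels = [id2label[p.item()] for p in predictions[0]]
print(labels)  # ['O', 'NAME', 'O', 'EMAIL']
```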
app.py
ADDED
@@ -0,0 +1,202 @@
+"""
+HuggingFace Space App for PII Detection
+This app uses a BERT model to identify Personal Identifiable Information in text.
+"""
+
+import gradio as gr
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+
+# Load the model and tokenizer
+MODEL_PATH = "./Bert_base_NER_PII43k"
+tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH)
+
+# Entity label colors for visualization
+ENTITY_COLORS = {
+    "NAME": "#FF6B6B",
+    "EMAIL": "#4ECDC4",
+    "CREDITCARDNUM": "#FFE66D",
+    "IP": "#95E1D3",
+    "PASSWORD": "#F38181",
+    "STREET": "#AA96DA",
+    "ACCOUNTNAME": "#FCBAD3",
+    "ACCOUNTNUM": "#FFFFD2",
+    "USERNAME": "#A8E6CF",
+    "ZIPCODE": "#FFD3B6",
+    "IBAN": "#FFAAA5",
+    "URL": "#FF8B94",
+    "JOB": "#C7CEEA",
+    "GENDER": "#FFDAC1",
+    "ADDRESS": "#B5EAD7",
+    "MAC": "#C9CBA3",
+    "GEO": "#FFE2E2",
+    "NEARBYGPSCOORDINATE": "#F7D9C4",
+    "COINADDRESS": "#FAACA8",
+    "CREDITCARDISSUER": "#DCD6F7",
+    "CURRENCY": "#A6D9F7",
+    "DISPLAYNAME": "#FAD9A1",
+    "NUM": "#D4F1F4",
+    "BIC": "#FFB6B9",
+    "USERAGENT": "#C2E9FB",
+    "ORDINALDIRECTION": "#F6EAC2",
+}
+
+
+def detect_pii(text):
+    """
+    Detect PII entities in the input text.
+
+    Args:
+        text (str): Input text to analyze
+
+    Returns:
+        list: Highlighted entities for Gradio display
+        str: Summary of detected entities
+    """
+    if not text.strip():
+        return None, "Please enter some text to analyze."
+
+    # Tokenize input
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+
+    # Get predictions
+    with torch.no_grad():
+        outputs = model(**inputs)
+        predictions = torch.argmax(outputs.logits, dim=2)
+
+    # Convert tokens back to words and align with predictions
+    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+    predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
+
+    # Reconstruct words and their labels
+    highlighted_entities = []
+    current_word = ""
+    current_label = None
+
+    for token, label in zip(tokens, predicted_labels):
+        # Skip special tokens
+        if token in ["[CLS]", "[SEP]", "[PAD]"]:
+            continue
+
+        # Handle subword tokens (starting with ##)
+        if token.startswith("##"):
+            current_word += token[2:]
+        else:
+            # Save previous word if it exists
+            if current_word:
+                if current_label and current_label != "O":
+                    highlighted_entities.append((current_word, current_label))
+                else:
+                    highlighted_entities.append((current_word, None))
+            current_word = " "  # Add space between words
+
+            current_word += token
+        current_label = label
+
+    # Add the last word
+    if current_word.strip():
+        if current_label and current_label != "O":
+            highlighted_entities.append((current_word, current_label))
+        else:
+            highlighted_entities.append((current_word, None))
+
+    # Create summary
+    detected_entities = {}
+    for word, label in highlighted_entities:
+        if label and label != "O":
+            if label not in detected_entities:
+                detected_entities[label] = []
+            detected_entities[label].append(word.strip())
+
+    if detected_entities:
+        summary = "**Detected PII:**\n\n"
+        for entity_type, words in detected_entities.items():
+            summary += f"- **{entity_type}**: {', '.join(words)}\n"
+    else:
+        summary = "No PII detected in the text."
+
+    return highlighted_entities, summary
+
+
+# Example texts for users to try
+examples = [
+    ["My name is John Smith and my email is john.smith@example.com. I live at 123 Main Street."],
+    ["Please send the payment to IBAN GB29 NWBK 6016 1331 9268 19 or call me at my office."],
+    ["Contact Sarah Johnson at sarah.j@company.org for more details about the project."],
+    ["My credit card number is 4532-1234-5678-9010 and my username is mike_user123."],
+]
+
+# Create Gradio interface
+with gr.Blocks(title="PII Detection with BERT", theme=gr.themes.Soft()) as demo:
+    gr.Markdown(
+        """
+        # π Personal Identifiable Information (PII) Detector
+
+        This tool uses a fine-tuned BERT model to automatically detect and highlight personal information in text.
+        It can identify **27 different types** of PII including names, emails, addresses, credit cards, and more.
+
+        ### How to use:
+        1. Enter or paste text in the box below
+        2. Click "Detect PII" to analyze
+        3. View highlighted entities and summary
+        """
+    )
+
+    with gr.Row():
+        with gr.Column():
+            input_text = gr.Textbox(
+                label="Input Text",
+                placeholder="Enter text to analyze for PII...",
+                lines=6,
+            )
+            detect_btn = gr.Button("π Detect PII", variant="primary")
+
+        with gr.Column():
+            output_highlighted = gr.HighlightedText(
+                label="Highlighted PII Entities",
+                combine_adjacent=True,
+                color_map=ENTITY_COLORS,
+            )
+            output_summary = gr.Markdown(label="Summary")
+
+    gr.Markdown("### π Try these examples:")
+    gr.Examples(
+        examples=examples,
+        inputs=input_text,
+    )
+
+    gr.Markdown(
+        """
+        ### π·οΈ Detectable Entity Types:
+
+        **Identity**: NAME, USERNAME, DISPLAYNAME, GENDER, JOB
+        **Contact**: EMAIL, STREET, ADDRESS, ZIPCODE, GEO, NEARBYGPSCOORDINATE
+        **Financial**: CREDITCARDNUM, CREDITCARDISSUER, IBAN, BIC, ACCOUNTNAME, ACCOUNTNUM, CURRENCY, COINADDRESS
+        **Technical**: IP, MAC, URL, USERAGENT, PASSWORD
+        **Other**: NUM, ORDINALDIRECTION
+
+        ---
+        **Model**: BERT-base fine-tuned on [ai4privacy/pii-masking-43k](https://huggingface.co/datasets/ai4privacy/pii-masking-43k) dataset
+        **Base Model**: [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
+        """
+    )
+
+    # Connect the button to the function
+    detect_btn.click(
+        fn=detect_pii,
+        inputs=input_text,
+        outputs=[output_highlighted, output_summary]
+    )
+
+    # Also trigger on Enter key
+    input_text.submit(
+        fn=detect_pii,
+        inputs=input_text,
+        outputs=[output_highlighted, output_summary]
+    )
+
+# Launch the app
+if __name__ == "__main__":
+    demo.launch()
+
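The word-reconstruction loop in `detect_pii` above is the subtlest part of the app: BERT's WordPiece tokenizer splits rare words into `##`-prefixed continuation pieces that must be glued back together before highlighting. A simplified standalone sketch of that merging idea (not the exact `app.py` logic; here each merged word simply keeps the label of its last piece, and the token/label lists are made up for illustration):

```python
def merge_wordpieces(tokens, labels):
    """Merge BERT WordPiece tokens ('##' continuations) back into whole words,
    skipping special tokens; each word keeps the label of its last piece."""
    words = []
    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # special tokens carry no surface text
        if token.startswith("##") and words:
            prev_text, _ = words[-1]
            words[-1] = (prev_text + token[2:], label)  # glue continuation piece
        else:
            words.append((token, label))
    return words

# Illustrative tokenization of "johnson @ example . com"
tokens = ["[CLS]", "john", "##son", "@", "example", ".", "com", "[SEP]"]
labels = ["O", "NAME", "NAME", "EMAIL", "EMAIL", "EMAIL", "EMAIL", "O"]
merged = merge_wordpieces(tokens, labels)
print(merged)  # [('johnson', 'NAME'), ('@', 'EMAIL'), ('example', 'EMAIL'), ('.', 'EMAIL'), ('com', 'EMAIL')]
```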
requirements.txt
ADDED
@@ -0,0 +1,4 @@
+gradio==4.44.0
+transformers==4.45.0
+torch==2.1.0
+