---
library_name: transformers
tags: []
---

# Model Card

This model is used for PII detection. It covers 29 classes across 7 languages: English, Spanish, Swedish, German, Italian, Dutch, and French.

## Model Details

The model is built on top of the decoder transformer model (Qwen2-0.5B).

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** Betterdata.ai
- **Model type:** Decoder Transformer
- **License:** apache-2.0
- **Finetuned from model:** Qwen2-0.5B

## Uses

The PII model shouldn't add too much latency and should be able to take in long documents, so we used Qwen2-0.5B as the base model. Another consideration for the model size was that we felt a model for privacy should be easy to run even on CPUs. We do have larger models in house with better performance. Coverage for Southeast Asian languages, as well as the ability for users to define custom PII classes, is part of our plans.

We will constantly be improving this model, so always pull the latest version.

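Since long documents are a stated use case for a small model like this, one simple pattern is to split a long document into overlapping chunks and run detection on each chunk separately. The helper below is our own illustrative sketch (`chunk_words` and its `max_words`/`overlap` values are not part of the model or its tooling):

```python
def chunk_words(text, max_words=300, overlap=30):
    """Split text into overlapping word-based chunks.

    max_words and overlap are illustrative values, not settings
    prescribed by this model card.
    """
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    step = max_words - overlap  # advance by less than a full chunk to create overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be formatted into the PII prompt shown below and the per-chunk results merged; the overlap reduces the chance of an entity being cut in half at a chunk boundary.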
### Out-of-Scope Use

Use the code below to get started with the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer; run on GPU when available, else CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("betterdataai/PII_DETECTION_MODEL").to(device)
tokenizer = AutoTokenizer.from_pretrained("betterdataai/PII_DETECTION_MODEL")

# The 29 PII classes the model can detect.
classes_list = ['<pin>', '<api_key>', '<bank_routing_number>', '<bban>', '<company>',
                '<credit_card_number>', '<credit_card_security_code>', '<customer_id>',
                '<date>', '<date_of_birth>', '<date_time>', '<driver_license_number>',
                '<email>', '<employee_id>', '<first_name>', '<iban>', '<ipv4>', '<ipv6>',
                '<last_name>', '<local_latlng>', '<name>', '<passport_number>', '<password>',
                '<phone_number>', '<social_security_number>', '<street_address>',
                '<swift_bic_code>', '<time>', '<user_name>']

# Prompt template, kept verbatim: the fine-tuned model may expect this exact wording.
prompt = """You are an AI assistant who is responisble for identifying Personal Identifiable information (PII). You will be given a passage of text and you have to \
identify the PII data present in the passage. You should only identify the data based on the classes provided and not make up any class on your own.

```PII Classes```
{classes}

The given text is:
{text}

The PII data are:
"""

user_input = "Write an email to Julia indicating I won't be coming to office on the 29th of June"
new_prompt = prompt.format(classes="\n".join(classes_list), text=user_input)
tokenized_input = tokenizer(new_prompt, return_tensors="pt").to(device)

output = model.generate(**tokenized_input, max_new_tokens=6000)
# Everything after "The PII data are:" is the model's prediction.
pii_classes = tokenizer.decode(output[0], skip_special_tokens=True).split("The PII data are:\n")[1]

print(pii_classes)

# Output:
# <name> : ['Julia']
# <date> : ['the 29th of June']
```
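The model prints one line per detected class in the form `<class> : [values]`. If you want the detections as a structured object, they can be parsed back into a dictionary; the helper below is our own sketch based on the sample output above, not an official part of the model's tooling:

```python
import ast

def parse_pii_output(raw):
    """Parse lines like "<name> : ['Julia']" into {'<name>': ['Julia']}.

    Illustrative helper, assuming the line format shown in the sample
    output above; lines without a colon are skipped.
    """
    entities = {}
    for line in raw.strip().splitlines():
        if ":" not in line:
            continue
        label, _, values = line.partition(":")  # split on the first colon only
        entities[label.strip()] = ast.literal_eval(values.strip())
    return entities

sample = "<name> : ['Julia']\n<date> : ['the 29th of June']"
print(parse_pii_output(sample))
# {'<name>': ['Julia'], '<date>': ['the 29th of June']}
```

Splitting only on the first colon keeps values such as times (`12:30`) intact, and `ast.literal_eval` safely evaluates the Python-style list on the right-hand side.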

## Evaluation