srini98 committed on
Commit 402c4c7 · 1 Parent(s): 9cdaf6e

read me update

Files changed (1)
  1. README.md +46 -6
README.md CHANGED
@@ -3,9 +3,9 @@ library_name: transformers
  tags: []
  ---
 
- # Model Card for Model ID
+ # Model Card
 
- This model is used for PII detection in the financial domain. It covers 29 classes across 7 languages (English , Spanish, Swedish, German, Italian, Dutch, French)
+ This model is used for PII detection. It covers 29 classes across 7 languages (English , Spanish, Swedish, German, Italian, Dutch, French)
 
 
  ## Model Details
@@ -18,9 +18,9 @@ The model is built on top of the decoder transformer model (Qwen2-0.5B). We used
  This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
 
  - **Developed by:** Betterdata.ai
- - **Model type:** Deocder Transformer
+ - **Model type:** Decoder Transformer
  - **License:** [apache-2.0]
- - **Finetuned from model [optional]:** Qwen2-0.5B
+ - **Finetuned from model :** Qwen2-0.5B
 
  ## Uses
 
@@ -31,7 +31,10 @@ With the advent of chatGPT , professionals and organizations use public chat int
 
  ![image/png](https://huggingface.co/betterdataai/PII_DETECTION_MODEL/blob/main/pii_image.png")
 
- The PII model shouldn't add too much latency and be able to take in long documents, therefore we used the Qwen 0.5B as the base model. We do have larger models in house with better performance. We have coverage for south east asian langauges as well as giving the user the ability to define custom user classes as part of our plans.
+ The PII model shouldn't add too much latency and be able to take in long documents, therefore we used the Qwen 0.5B as the base model. Another consideration for the model size was that we felt a model for privacy should be easy to run even on CPUs. We do have larger models in house with better performance. We have coverage for south east asian langauges as well as giving the user the ability to define custom user classes as part of our plans.
+
+
+ We will constantly be improving this model so always pull the latest version of the model.
 
 
  ### Out-of-Scope Use
@@ -48,7 +51,44 @@ The PII model shouldn't add too much latency and be able to take in long documen
  Use the code below to get started with the model.
 
  ```
- from transformers import AutoTokenizer , AutoModelForCausalLM
+ from transformers import AutoModelForCausalLM , AutoTokenizer
+ import torch
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = AutoModelForCausalLM.from_pretrained("betterdataai/PII_DETECTION_MODEL").to(device)
+ tokenizer = AutoTokenizer.from_pretrained("betterdataai/PII_DETECTION_MODEL")
+
+ classes_list = ['<pin>','<api_key>','<bank_routing_number>','<bban>','<company>','<credit_card_number>','<credit_card_security_code>','<customer_id>','<date>','<date_of_birth>','<date_time>','<driver_license_number>','<email>','<employee_id>','<first_name>','<iban>','<ipv4>','<ipv6>','<last_name>','<local_latlng>','<name>','<passport_number>','<password>','<phone_number>','<social_security_number>','<street_address>','<swift_bic_code>','<time>','<user_name>']
+
+
+ prompt = """You are an AI assistant who is responisble for identifying Personal Identifiable information (PII). You will be given a passage of text and you have to \
+ identify the PII data present in the passage. You should only identify the data based on the classes provided and not make up any class on your own.
+
+ ```PII Classes```
+ {classes}
+
+ The given text is:
+ {text}
+
+ The PII data are:
+ """
+
+
+
+ user_input = "Write an email to Julia indicating I won't be coming to office on the 29th of June"
+ new_prompt = prompt.format(classes="\n".join(classes_list) , text=user_input)
+ tokenized_input = tokenizer(new_prompt , return_tensors="pt").to(device)
+
+ output = model.generate(**tokenized_input , max_new_tokens=6000)
+ pii_classes = tokenizer.decode(output[0] , skip_special_tokens=True).split("The PII data are:\n")[1]
+
+ print(pii_classes)
+
+ ##output
+ """
+ <name> : ['Julia']
+ <date> : ['the 29th of June']
+ """
  ```
 
  ## Evaluation
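
Note on the added example: the snippet in this commit prints the model's answer as plain text, one line per class in the form `<name> : ['Julia']`. The following is a minimal sketch of how that text might be turned into a dictionary and used to mask the original input. It assumes every output line follows exactly the shape shown in the README's sample output, which the commit does not guarantee. The helper names `parse_pii_output` and `mask_pii` are hypothetical and are not part of the model repository or of `transformers`.

```python
from __future__ import annotations

import ast
import re

# Assumed line shape, taken from the README's sample output: <label> : ['value', ...]
LINE_PATTERN = re.compile(r"^\s*<(\w+)>\s*:\s*(\[.*\])\s*$")


def parse_pii_output(raw: str) -> dict[str, list[str]]:
    """Parse lines such as "<name> : ['Julia']" into {"name": ["Julia"]}."""
    found: dict[str, list[str]] = {}
    for line in raw.splitlines():
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # skip blank lines and anything not in the assumed shape
        label, values_literal = match.groups()
        try:
            values = ast.literal_eval(values_literal)  # safe literal parsing, no eval()
        except (ValueError, SyntaxError):
            continue  # the bracketed part was not a valid Python list literal
        if isinstance(values, list):
            found.setdefault(label, []).extend(str(v) for v in values)
    return found


def mask_pii(text: str, found: dict[str, list[str]]) -> str:
    """Replace each detected value in `text` with its <label> placeholder."""
    pairs = [(value, label) for label, values in found.items() for value in values if value]
    # Replace longer values first so a short value cannot split a longer one.
    for value, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, f"<{label}>")
    return text


# Sample text copied from the README's "##output" block.
sample_output = "<name> : ['Julia']\n<date> : ['the 29th of June']\n"
user_input = "Write an email to Julia indicating I won't be coming to office on the 29th of June"

detected = parse_pii_output(sample_output)
masked = mask_pii(user_input, detected)
print(detected)  # {'name': ['Julia'], 'date': ['the 29th of June']}
print(masked)    # Write an email to <name> indicating I won't be coming to office on <date>
```

In practice `sample_output` would be the `pii_classes` string produced by the README snippet. The masking is plain substring replacement, so it will also replace occurrences of a detected value elsewhere in the text; whether that is desirable depends on the use case.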