vrashad committed on
Commit
23d39ea
·
verified ·
1 Parent(s): eee840b

Update README.md

Files changed (1)
  1. README.md +197 -3
README.md CHANGED
@@ -1,3 +1,197 @@
- ---
- license: cc-by-nc-nd-4.0
- ---
+ ---
+ license: cc-by-nc-nd-4.0
+ language:
+ - az
+ base_model:
+ - FacebookAI/xlm-roberta-base
+ pipeline_tag: token-classification
+ tags:
+ - personally
+ - identifiable
+ - information
+ - recognition
+ - ner
+ ---
+
+ # LocalDoc Privacy NER Azerbaijani
+
+ **Privacy NER Azerbaijani** is a Named Entity Recognition (NER) model fine-tuned from XLM-RoBERTa (`FacebookAI/xlm-roberta-base`). It is trained on Azerbaijani privacy data to extract personal information such as names, dates of birth, cities, addresses, and phone numbers from text.
+
+ ## Model Details
+
+ - **Base Model:** XLM-RoBERTa
+ - **Training Metrics:**
+   - **Epoch 1:** Training Loss: 0.156, Validation Loss: 0.1309, Precision: 0.7794, Recall: 0.7940, F1: 0.7866, Accuracy: 0.9590
+   - **Epoch 2:** Training Loss: 0.1196, Validation Loss: 0.1172, Precision: 0.8042, Recall: 0.8078, F1: 0.8060, Accuracy: 0.9618
+   - **Epoch 3:** Training Loss: 0.1069, Validation Loss: 0.1129, Precision: 0.8096, Recall: 0.8213, F1: 0.8154, Accuracy: 0.9639
+
+ - **Test Metrics:**
+   - Loss: 0.11616, Precision: 0.80187, Recall: 0.80821, F1: 0.80503, Accuracy: 0.96264
+
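As a quick consistency check on the metrics above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values. The helper name `f1_score` is illustrative and not part of the model's code, and small differences are expected because the reported inputs are rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the metrics listed above.
reported = {
    "epoch 1": (0.7794, 0.7940, 0.7866),
    "epoch 2": (0.8042, 0.8078, 0.8060),
    "epoch 3": (0.8096, 0.8213, 0.8154),
    "test": (0.80187, 0.80821, 0.80503),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.4f}, reported F1 = {f1:.4f}")
```

The recomputed values should agree with the reported ones to within rounding.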
+ ## Entities (id2label)
+
+ ```python
+ {
+     0: "O",
+     1: "VEHICLEVRM",
+     2: "HEIGHT",
+     3: "USERNAME",
+     4: "FIRSTNAME",
+     5: "BUILDINGNUMBER",
+     6: "SEX",
+     7: "PHONENUMBER",
+     8: "CURRENCY",
+     9: "CREDITCARDISSUER",
+     10: "CURRENCYNAME",
+     11: "MAC",
+     12: "MIDDLENAME",
+     13: "TIME",
+     14: "EYECOLOR",
+     15: "CURRENCYSYMBOL",
+     16: "GENDER",
+     17: "URL",
+     18: "CURRENCYCODE",
+     19: "ZIPCODE",
+     20: "CREDITCARDCVV",
+     21: "JOBTITLE",
+     22: "PHONEIMEI",
+     23: "COUNTY",
+     24: "JOBTYPE",
+     25: "LITECOINADDRESS",
+     26: "COMPANYNAME",
+     27: "ORDINALDIRECTION",
+     28: "MASKEDNUMBER",
+     29: "USERAGENT",
+     30: "LASTNAME",
+     31: "SSN",
+     32: "STREET",
+     33: "SECONDARYADDRESS",
+     34: "STATE",
+     35: "ETHEREUMADDRESS",
+     36: "AMOUNT",
+     37: "ACCOUNTNUMBER",
+     38: "CITY",
+     39: "CREDITCARDNUMBER",
+     40: "BIC",
+     41: "EMAIL",
+     42: "NEARBYGPSCOORDINATE",
+     43: "PIN",
+     44: "ACCOUNTNAME",
+     45: "VEHICLEVIN",
+     46: "PREFIX",
+     47: "JOBAREA",
+     48: "AGE",
+     49: "PASSWORD",
+     50: "DOB",
+     51: "BITCOINADDRESS",
+     52: "IBAN",
+     53: "IP",
+     54: "DATE"
+ }
+ ```
+
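To illustrate how the mapping above is typically used, here is a minimal sketch that decodes predicted class ids into label strings and builds the inverse `label2id` mapping. It uses only a five-entry excerpt of the table, and the `decode` helper is a hypothetical name introduced for this example. In practice the full mapping is available from `model.config.id2label`.

```python
# Excerpt of the id2label table above (only a few of the 55 entries).
id2label = {0: "O", 4: "FIRSTNAME", 7: "PHONENUMBER", 30: "LASTNAME", 50: "DOB"}

# Inverse mapping, handy when converting label strings back to integer ids.
label2id = {label: idx for idx, label in id2label.items()}


def decode(pred_ids, mapping=id2label):
    """Translate a sequence of predicted class ids into label strings."""
    return [mapping[i] for i in pred_ids]


print(decode([0, 4, 30, 0]))
print(label2id["DOB"])
```

Note that the labels carry no `B-`/`I-` prefixes, so adjacent entities of the same type cannot be told apart from the labels alone.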
+ ## Usage
+
+ To use the model for named entity recognition:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+
+ model_id = "LocalDoc/private_ner_azerbaijani"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForTokenClassification.from_pretrained(model_id)
+
+ test_text = (
+     "Salam, mənim adım Əli Hüseynovdur. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, Nizami küçəsində, 25/31 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir."
+ )
+
+ inputs = tokenizer(test_text, return_tensors="pt", return_offsets_mapping=True)
+ # Extract offset_mapping and remove it from inputs so it is not passed to the model
+ offset_mapping = inputs.pop("offset_mapping")
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ predictions = torch.argmax(outputs.logits, dim=2)
+
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+ offset_mapping = offset_mapping[0].tolist()  # convert to a list
+ predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
+ word_ids = inputs.word_ids(batch_index=0)
+
+ # Group sub-word tokens into words; each word takes the label of its first token
+ aggregated = []
+ prev_word_id = None
+ for idx, word_id in enumerate(word_ids):
+     if word_id is None:  # special tokens have no word id
+         continue
+     if word_id != prev_word_id:
+         aggregated.append({
+             "word_id": word_id,
+             "tokens": [tokens[idx]],
+             "offsets": [offset_mapping[idx]],
+             "label": predicted_labels[idx]
+         })
+     else:
+         aggregated[-1]["tokens"].append(tokens[idx])
+         aggregated[-1]["offsets"].append(offset_mapping[idx])
+     prev_word_id = word_id
+
+ # Merge consecutive words that share a label into a single entity span
+ entities = []
+ current_entity = None
+ for word in aggregated:
+     if word["label"] == "O":
+         if current_entity is not None:
+             entities.append(current_entity)
+             current_entity = None
+     else:
+         if current_entity is None:
+             current_entity = {
+                 "type": word["label"],
+                 "start": word["offsets"][0][0],
+                 "end": word["offsets"][-1][1]
+             }
+         else:
+             if word["label"] == current_entity["type"]:
+                 current_entity["end"] = word["offsets"][-1][1]
+             else:
+                 entities.append(current_entity)
+                 current_entity = {
+                     "type": word["label"],
+                     "start": word["offsets"][0][0],
+                     "end": word["offsets"][-1][1]
+                 }
+ if current_entity is not None:
+     entities.append(current_entity)
+
+ # Recover each entity's text from its character offsets
+ for entity in entities:
+     entity["text"] = test_text[entity["start"]:entity["end"]]
+
+ for entity in entities:
+     print(entity)
+ ```
+
+ The script prints one dictionary per detected entity. Labels are assigned per whitespace-delimited word, so a span can include attached suffixes and trailing punctuation:
+
+ ```python
+ {'type': 'FIRSTNAME', 'start': 18, 'end': 21, 'text': 'Əli'}
+ {'type': 'LASTNAME', 'start': 22, 'end': 34, 'text': 'Hüseynovdur.'}
+ {'type': 'DOB', 'start': 49, 'end': 64, 'text': '15.05.1990-dır.'}
+ {'type': 'STREET', 'start': 81, 'end': 87, 'text': 'Nizami'}
+ {'type': 'BUILDINGNUMBER', 'start': 99, 'end': 104, 'text': '25/31'}
+ {'type': 'PHONENUMBER', 'start': 141, 'end': 159, 'text': '+994552345678-dir.'}
+ ```
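The entity dictionaries shown above carry character offsets, which makes it straightforward to mask the detected spans. The following is a minimal sketch of such a post-processing step, not part of the model card's own code: `trim_trailing_punct` and `redact` are hypothetical helper names, and the sample entities are hand-made in the same shape as the script's output. It trims only trailing ASCII punctuation and whitespace, so a leading `+` in a phone number stays inside the span, and grammatical suffixes such as `-dır` are not separated from the entity.

```python
import string


def trim_trailing_punct(text: str, start: int, end: int) -> int:
    """Return an end offset that excludes trailing ASCII punctuation and whitespace."""
    strip_chars = string.punctuation + string.whitespace
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return end


def redact(text: str, entities: list) -> str:
    """Replace each entity span with a [TYPE] placeholder."""
    out = text
    # Work right to left so that earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        end = trim_trailing_punct(text, ent["start"], ent["end"])
        out = out[: ent["start"]] + f"[{ent['type']}]" + out[end:]
    return out


# Hand-made example entities, shaped like the dictionaries printed above.
sample = "Adım Əli Hüseynovdur."
first = sample.index("Əli")
sample_entities = [
    {"type": "FIRSTNAME", "start": first, "end": first + len("Əli")},
    {"type": "LASTNAME", "start": sample.index("Hüseynovdur."), "end": len(sample)},
]

print(redact(sample, sample_entities))
```

Because the spans are assumed not to overlap, replacing from the end of the text toward the start keeps every remaining offset correct without recomputing positions.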
+
+ ## License
+
+ This model is licensed under the CC BY-NC-ND 4.0 license.
+ What does this license require?
+
+ - **Attribution:** You must give appropriate credit, provide a link to the license, and indicate if changes were made.
+ - **Non-Commercial:** You may not use the material for commercial purposes.
+ - **No Derivatives:** If you remix, transform, or build upon the material, you may not distribute the modified material.
+
+ For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND 4.0 license</a>.
+
+ ## Contact
+
+ For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].