---
license: cc-by-4.0
language:
- az
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- personally identifiable information
- pii
- ner
- azerbaijan
datasets:
- LocalDoc/pii_ner_azerbaijani
---

# PII NER Azerbaijani v2

**PII NER Azerbaijani v2** is the second version of a fine-tuned Named Entity Recognition (NER) model based on XLM-RoBERTa (first version: <a target="_blank" href="https://huggingface.co/LocalDoc/private_ner_azerbaijani">PII NER Azerbaijani</a>).
It is trained on Azerbaijani PII data to detect and classify personally identifiable information in text, such as names, dates of birth, cities, addresses, and phone numbers.

## Model Details

- **Base Model:** XLM-RoBERTa (`FacebookAI/xlm-roberta-base`)
- **Training Metrics:**

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       |
|-------|---------------|-----------------|-----------|----------|----------|
| 1     | 0.029100      | 0.025319        | 0.963367  | 0.962449 | 0.962907 |
| 2     | 0.019900      | 0.023291        | 0.964567  | 0.968474 | 0.966517 |
| 3     | 0.015400      | 0.018993        | 0.969536  | 0.967555 | 0.968544 |
| 4     | 0.012700      | 0.017730        | 0.971919  | 0.969768 | 0.970842 |
| 5     | 0.011100      | 0.018095        | 0.973056  | 0.970075 | 0.971563 |

- **Test Metrics:**
  - **Precision:** 0.9760
  - **Recall:** 0.9732
  - **F1 Score:** 0.9746

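As a quick sanity check on the numbers above, the F1 score can be recomputed as the harmonic mean of precision and recall. The snippet below is a minimal sketch, not part of the original training code; the function name `f1_score` is introduced here purely for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set precision and recall as reported in this card.
test_precision = 0.9760
test_recall = 0.9732

test_f1 = f1_score(test_precision, test_recall)
print(f"Recomputed test F1: {test_f1:.4f}")
```

Rounded to four decimals, the recomputed value agrees with the reported test F1 of 0.9746.
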
## Detailed Test Classification Report

| Entity           | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| AGE              | 0.98      | 0.98   | 0.98     | 509     |
| BUILDINGNUM      | 0.97      | 0.75   | 0.85     | 1285    |
| CITY             | 1.00      | 1.00   | 1.00     | 2100    |
| CREDITCARDNUMBER | 0.99      | 0.98   | 0.99     | 249     |
| DATE             | 0.85      | 0.92   | 0.88     | 1576    |
| DRIVERLICENSENUM | 0.98      | 0.98   | 0.98     | 258     |
| EMAIL            | 0.98      | 1.00   | 0.99     | 1485    |
| GIVENNAME        | 0.99      | 1.00   | 0.99     | 9926    |
| IDCARDNUM        | 0.99      | 0.99   | 0.99     | 1174    |
| PASSPORTNUM      | 0.99      | 0.99   | 0.99     | 426     |
| STREET           | 0.94      | 0.98   | 0.96     | 1480    |
| SURNAME          | 1.00      | 1.00   | 1.00     | 3357    |
| TAXNUM           | 0.99      | 1.00   | 0.99     | 240     |
| TELEPHONENUM     | 0.97      | 0.95   | 0.96     | 2175    |
| TIME             | 0.96      | 0.96   | 0.96     | 2216    |
| ZIPCODE          | 0.97      | 0.97   | 0.97     | 520     |

### Averages

| Metric           | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| **Micro avg**    | 0.98      | 0.97   | 0.97     | 28976   |
| **Macro avg**    | 0.97      | 0.96   | 0.97     | 28976   |
| **Weighted avg** | 0.98      | 0.97   | 0.97     | 28976   |

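To make the difference between the macro and weighted averages concrete, the sketch below recomputes the precision averages from the per-entity rows of the report. It works from the rounded two-decimal values in the table, so the results are approximations; the variable names are illustrative and not taken from the original evaluation code.

```python
# (precision, support) per entity, copied from the classification report.
report = {
    "AGE": (0.98, 509),
    "BUILDINGNUM": (0.97, 1285),
    "CITY": (1.00, 2100),
    "CREDITCARDNUMBER": (0.99, 249),
    "DATE": (0.85, 1576),
    "DRIVERLICENSENUM": (0.98, 258),
    "EMAIL": (0.98, 1485),
    "GIVENNAME": (0.99, 9926),
    "IDCARDNUM": (0.99, 1174),
    "PASSPORTNUM": (0.99, 426),
    "STREET": (0.94, 1480),
    "SURNAME": (1.00, 3357),
    "TAXNUM": (0.99, 240),
    "TELEPHONENUM": (0.97, 2175),
    "TIME": (0.96, 2216),
    "ZIPCODE": (0.97, 520),
}

total_support = sum(support for _, support in report.values())

# Macro average: every entity type counts equally, regardless of frequency.
macro_precision = sum(p for p, _ in report.values()) / len(report)

# Weighted average: each entity type is weighted by its number of test examples.
weighted_precision = sum(p * s for p, s in report.values()) / total_support

print(f"Total support:      {total_support}")
print(f"Macro precision:    {macro_precision:.2f}")
print(f"Weighted precision: {weighted_precision:.2f}")
```

Frequent, high-scoring types such as GIVENNAME pull the weighted average up, while the macro average is more sensitive to weaker classes such as DATE.
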
## Entities the Model Can Recognize

```python
[
    "AGE",
    "BUILDINGNUM",
    "CITY",
    "CREDITCARDNUMBER",
    "DATE",
    "DRIVERLICENSENUM",
    "EMAIL",
    "GIVENNAME",
    "IDCARDNUM",
    "PASSPORTNUM",
    "STREET",
    "SURNAME",
    "TAXNUM",
    "TELEPHONENUM",
    "TIME",
    "ZIPCODE"
]
```

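The usage example in this card hard-codes a 33-entry id-to-label mapping in BIO format: `O`, then a `B-` tag for each entity type, then an `I-` tag for each entity type. As a sketch, the same mapping can be derived from the entity list instead of typed out by hand. The names `ENTITY_TYPES`, `id_to_label`, and `label_to_id` are illustrative.

```python
ENTITY_TYPES = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER", "DATE",
    "DRIVERLICENSENUM", "EMAIL", "GIVENNAME", "IDCARDNUM", "PASSPORTNUM",
    "STREET", "SURNAME", "TAXNUM", "TELEPHONENUM", "TIME", "ZIPCODE",
]

# Index 0 is the "outside" tag, followed by all B- tags, then all I- tags,
# matching the ordering of the hard-coded mapping in the usage example.
labels = (
    ["O"]
    + [f"B-{entity}" for entity in ENTITY_TYPES]
    + [f"I-{entity}" for entity in ENTITY_TYPES]
)

id_to_label = dict(enumerate(labels))
label_to_id = {label: idx for idx, label in id_to_label.items()}

print(len(id_to_label))   # number of classes predicted per token
print(id_to_label[8], id_to_label[24])
```
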
## Usage

To detect and anonymize PII with the model, use the example class in this section.

The model is trained on lowercase text. The example code lowercases the input automatically; if you write custom inference code, lowercase the text yourself before tokenizing.

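One caveat when lowercasing in Python: `str.lower()` maps the dotted capital "İ" (U+0130), which occurs in Azerbaijani, to two code points ("i" plus a combining dot), so the lowercased string can be longer than the original. Entity offsets computed on the lowercased text may then be shifted relative to the original text. The helper below is a hypothetical check, not part of the model's code, for detecting such inputs before relying on offsets.

```python
def lowercase_preserves_length(text: str) -> bool:
    """Return True if str.lower() keeps the number of code points unchanged.

    When this is False, character offsets computed on text.lower() do not
    line up one-to-one with positions in the original text.
    """
    return len(text.lower()) == len(text)


print(lowercase_preserves_length("Bakı şəhəri"))  # no dotted capital I
print(lowercase_preserves_length("İsmayıllı"))    # starts with U+0130
```

If the check fails, one option is to apply offsets to the lowercased text rather than to the original input.
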
```python
import torch
from transformers import AutoModelForTokenClassification, XLMRobertaTokenizerFast
from typing import List, Dict, Tuple


class AzerbaijaniNER:
    def __init__(self, model_name_or_path="LocalDoc/private_ner_azerbaijani_v2"):
        self.model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)
        self.tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")

        self.model.eval()

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

        # Class index -> BIO tag.
        self.id_to_label = {
            0: "O",
            1: "B-AGE", 2: "B-BUILDINGNUM", 3: "B-CITY", 4: "B-CREDITCARDNUMBER",
            5: "B-DATE", 6: "B-DRIVERLICENSENUM", 7: "B-EMAIL", 8: "B-GIVENNAME",
            9: "B-IDCARDNUM", 10: "B-PASSPORTNUM", 11: "B-STREET", 12: "B-SURNAME",
            13: "B-TAXNUM", 14: "B-TELEPHONENUM", 15: "B-TIME", 16: "B-ZIPCODE",
            17: "I-AGE", 18: "I-BUILDINGNUM", 19: "I-CITY", 20: "I-CREDITCARDNUMBER",
            21: "I-DATE", 22: "I-DRIVERLICENSENUM", 23: "I-EMAIL", 24: "I-GIVENNAME",
            25: "I-IDCARDNUM", 26: "I-PASSPORTNUM", 27: "I-STREET", 28: "I-SURNAME",
            29: "I-TAXNUM", 30: "I-TELEPHONENUM", 31: "I-TIME", 32: "I-ZIPCODE"
        }

        # Entity label -> human-readable name.
        self.entity_types = {
            "AGE": "Age",
            "BUILDINGNUM": "Building Number",
            "CITY": "City",
            "CREDITCARDNUMBER": "Credit Card Number",
            "DATE": "Date",
            "DRIVERLICENSENUM": "Driver License Number",
            "EMAIL": "Email",
            "GIVENNAME": "Given Name",
            "IDCARDNUM": "ID Card Number",
            "PASSPORTNUM": "Passport Number",
            "STREET": "Street",
            "SURNAME": "Surname",
            "TAXNUM": "Tax ID Number",
            "TELEPHONENUM": "Phone Number",
            "TIME": "Time",
            "ZIPCODE": "Zip Code"
        }

    def predict(self, text: str, max_length: int = 512) -> List[Dict]:
        # The model expects lowercase input. Offsets and values in the
        # returned entities refer to this lowercased text.
        text = text.lower()

        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            padding="max_length",
            truncation=True,  # input beyond max_length tokens is ignored
            return_offsets_mapping=True
        )

        # Character span of each token; not a model input, so remove it.
        offset_mapping = inputs.pop("offset_mapping").numpy()[0]

        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = outputs.logits.argmax(dim=2)

        predictions = predictions[0].cpu().numpy()

        entities = []
        current_entity = None

        for idx, (offset, pred_id) in enumerate(zip(offset_mapping, predictions)):
            # Special and padding tokens have an empty (0, 0) span.
            if offset[0] == 0 and offset[1] == 0:
                continue

            pred_label = self.id_to_label[pred_id]

            if pred_label.startswith("B-"):
                # A new entity begins; close the one in progress, if any.
                if current_entity:
                    entities.append(current_entity)

                entity_type = pred_label[2:]
                current_entity = {
                    "label": entity_type,
                    "name": self.entity_types.get(entity_type, entity_type),
                    "start": int(offset[0]),
                    "end": int(offset[1]),
                    "value": text[offset[0]:offset[1]]
                }

            elif pred_label.startswith("I-") and current_entity is not None:
                entity_type = pred_label[2:]

                if entity_type == current_entity["label"]:
                    # Continuation of the current entity: extend its span.
                    current_entity["end"] = int(offset[1])
                    current_entity["value"] = text[current_entity["start"]:current_entity["end"]]
                else:
                    # Type mismatch: close the current entity.
                    entities.append(current_entity)
                    current_entity = None

            elif pred_label == "O" and current_entity is not None:
                entities.append(current_entity)
                current_entity = None

        if current_entity:
            entities.append(current_entity)

        return entities

    def anonymize_text(self, text: str, replacement_char: str = "X") -> Tuple[str, List[Dict]]:
        entities = self.predict(text)

        if not entities:
            return text, []

        # Replace from right to left so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)

        anonymized_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            length = end - start
            anonymized_text = anonymized_text[:start] + replacement_char * length + anonymized_text[end:]

        entities.sort(key=lambda x: x["start"])

        return anonymized_text, entities

    def highlight_entities(self, text: str) -> str:
        entities = self.predict(text)

        if not entities:
            return text

        # Insert markers from right to left so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)

        highlighted_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            entity_value = entity["value"]
            entity_type = entity["name"]

            highlighted_text = (
                highlighted_text[:start] +
                f"[{entity_type}: {entity_value}]" +
                highlighted_text[end:]
            )

        return highlighted_text


if __name__ == "__main__":
    ner = AzerbaijaniNER()

    test_text = """Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?"""

    print("=== Original Text ===")
    print(test_text)
    print("\n=== Found Entities ===")

    entities = ner.predict(test_text)
    for entity in entities:
        print(f"{entity['name']}: {entity['value']} (positions {entity['start']}-{entity['end']})")

    print("\n=== Text with Highlighted Entities ===")
    highlighted_text = ner.highlight_entities(test_text)
    print(highlighted_text)

    print("\n=== Anonymized Text ===")
    anonymized_text, _ = ner.anonymize_text(test_text)
    print(anonymized_text)
```

```
=== Original Text ===
Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

=== Found Entities ===
Given Name: əli (positions 18-21)
Surname: hüseynov (positions 22-30)
Date: 15.05.1990 (positions 48-58)
City: bakı (positions 64-68)
Street: 28 may küçəsi (positions 80-93)
Building Number: 4 (positions 94-95)
Phone Number: +994552345678 (positions 132-145)
Credit Card Number: 4169741358254152 (positions 155-171)

=== Text with Highlighted Entities ===
Salam, mənim adım [Given Name: əli] [Surname: hüseynov]du. Doğum tarixim [Date: 15.05.1990]-dır. [City: bakı] şəhərində, [Street: 28 may küçəsi] [Building Number: 4] ünvanında yaşayıram. Telefon nömrəm [Phone Number: +994552345678]-dir. Mən [Credit Card Number: 4169741358254152] nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

=== Anonymized Text ===
Salam, mənim adım XXX XXXXXXXXdu. Doğum tarixim XXXXXXXXXX-dır. XXXX şəhərində, XXXXXXXXXXXXX X ünvanında yaşayıram. Telefon nömrəm XXXXXXXXXXXXX-dir. Mən XXXXXXXXXXXXXXXX nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
```

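The entity dictionaries returned by `predict` (keys `label`, `name`, `start`, `end`, `value`) can also drive other redaction styles. As a sketch, the hypothetical helper below replaces each detected span with its entity label instead of a run of `X` characters. It is not part of the model's code and assumes the entities do not overlap and that their offsets are valid for the given text.

```python
from typing import Dict, List


def redact_with_labels(text: str, entities: List[Dict]) -> str:
    """Replace each entity span with a [LABEL] placeholder.

    Spans are processed from right to left so that replacing one span
    does not shift the offsets of spans that come earlier in the text.
    """
    redacted = text
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label']}]"
        redacted = redacted[:entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


# Hand-written entities in the same shape as the predict() output shown above.
sample_text = "Salam, mənim adım Əli Hüseynovdu."
sample_entities = [
    {"label": "GIVENNAME", "start": 18, "end": 21},
    {"label": "SURNAME", "start": 22, "end": 30},
]

print(redact_with_labels(sample_text, sample_entities))
```

Keeping the label in the placeholder preserves more context for downstream use than masking every character, at the cost of changing the text length.
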
## CC BY 4.0 License: What It Allows

The **Creative Commons Attribution 4.0 International (CC BY 4.0)** license sets the following terms:

### ✅ You Can:
- **Use** the model for any purpose, including commercial use.
- **Share** it: copy and redistribute it in any medium or format.
- **Adapt** it: remix, transform, and build upon it for any purpose, even commercially.

### 📝 You Must:
- **Give appropriate credit**: attribute the original creator, provide a link to the license, and indicate if changes were made.
- **Not imply endorsement**: do not suggest that the original author endorses you or your use.

### ❌ You Cannot:
- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).

### Summary:
You are free to use, modify, and distribute the model, including for commercial purposes, as long as you give proper credit to the original creator.

For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY 4.0 license</a>.

## Contact

For more information, questions, or issues, please contact LocalDoc at v.resad.89@gmail.com.