	---
	license: cc-by-4.0
	language:
	- az
	base_model:
	- FacebookAI/xlm-roberta-base
	pipeline_tag: token-classification
	tags:
	- personally identifiable information
	- pii
	- ner
	- azerbaijan
	datasets:
	- LocalDoc/pii_ner_azerbaijani
	---


	# PII NER Azerbaijani v2

PII NER Azerbaijani v2 is the second version of a fine-tuned Named Entity Recognition (NER) model based on XLM-RoBERTa (first version: <a target="_blank" href="https://huggingface.co/LocalDoc/private_ner_azerbaijani">PII NER Azerbaijani</a>).
It is trained on Azerbaijani PII data to detect personally identifiable information in text, such as names, dates of birth, cities, addresses, and phone numbers.

	## Model Details

	- Base Model: XLM-RoBERTa
- Training Metrics:

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       |
|-------|---------------|-----------------|-----------|----------|----------|
| 1     | 0.029100      | 0.025319        | 0.963367  | 0.962449 | 0.962907 |
| 2     | 0.019900      | 0.023291        | 0.964567  | 0.968474 | 0.966517 |
| 3     | 0.015400      | 0.018993        | 0.969536  | 0.967555 | 0.968544 |
| 4     | 0.012700      | 0.017730        | 0.971919  | 0.969768 | 0.970842 |
| 5     | 0.011100      | 0.018095        | 0.973056  | 0.970075 | 0.971563 |

- Test Metrics:
  - Precision: 0.9760
  - Recall: 0.9732
  - F1 Score: 0.9746
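As a quick sanity check, the reported test F1 score should be the harmonic mean of the reported precision and recall. The snippet below is a minimal sketch of that arithmetic; the helper name `f1_score` is illustrative and is not part of the model's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test precision and recall as reported in this model card.
test_f1 = f1_score(0.9760, 0.9732)
print(f"{test_f1:.4f}")  # 0.9746
```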


	## Detailed Test Classification Report

| Entity           | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| AGE              | 0.98      | 0.98   | 0.98     | 509     |
| BUILDINGNUM      | 0.97      | 0.75   | 0.85     | 1285    |
| CITY             | 1.00      | 1.00   | 1.00     | 2100    |
| CREDITCARDNUMBER | 0.99      | 0.98   | 0.99     | 249     |
| DATE             | 0.85      | 0.92   | 0.88     | 1576    |
| DRIVERLICENSENUM | 0.98      | 0.98   | 0.98     | 258     |
| EMAIL            | 0.98      | 1.00   | 0.99     | 1485    |
| GIVENNAME        | 0.99      | 1.00   | 0.99     | 9926    |
| IDCARDNUM        | 0.99      | 0.99   | 0.99     | 1174    |
| PASSPORTNUM      | 0.99      | 0.99   | 0.99     | 426     |
| STREET           | 0.94      | 0.98   | 0.96     | 1480    |
| SURNAME          | 1.00      | 1.00   | 1.00     | 3357    |
| TAXNUM           | 0.99      | 1.00   | 0.99     | 240     |
| TELEPHONENUM     | 0.97      | 0.95   | 0.96     | 2175    |
| TIME             | 0.96      | 0.96   | 0.96     | 2216    |
| ZIPCODE          | 0.97      | 0.97   | 0.97     | 520     |


	### Averages

| Metric       | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| Micro avg    | 0.98      | 0.97   | 0.97     | 28976   |
| Macro avg    | 0.97      | 0.96   | 0.97     | 28976   |
| Weighted avg | 0.98      | 0.97   | 0.97     | 28976   |
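To make the difference between the macro and weighted rows concrete, the sketch below recomputes both from the per-entity rows of the classification report. It assumes only the two-decimal values printed in this card, so the recomputed averages can differ from the reported ones in the last digit. The names `report`, `macro_avg`, and `weighted_avg` are illustrative.

```python
# (precision, recall, f1, support) copied from the per-entity report in this card.
report = {
    "AGE": (0.98, 0.98, 0.98, 509),
    "BUILDINGNUM": (0.97, 0.75, 0.85, 1285),
    "CITY": (1.00, 1.00, 1.00, 2100),
    "CREDITCARDNUMBER": (0.99, 0.98, 0.99, 249),
    "DATE": (0.85, 0.92, 0.88, 1576),
    "DRIVERLICENSENUM": (0.98, 0.98, 0.98, 258),
    "EMAIL": (0.98, 1.00, 0.99, 1485),
    "GIVENNAME": (0.99, 1.00, 0.99, 9926),
    "IDCARDNUM": (0.99, 0.99, 0.99, 1174),
    "PASSPORTNUM": (0.99, 0.99, 0.99, 426),
    "STREET": (0.94, 0.98, 0.96, 1480),
    "SURNAME": (1.00, 1.00, 1.00, 3357),
    "TAXNUM": (0.99, 1.00, 0.99, 240),
    "TELEPHONENUM": (0.97, 0.95, 0.96, 2175),
    "TIME": (0.96, 0.96, 0.96, 2216),
    "ZIPCODE": (0.97, 0.97, 0.97, 520),
}

total_support = sum(row[3] for row in report.values())


def macro_avg(col: int) -> float:
    """Unweighted mean over entity types: every type counts equally."""
    return sum(row[col] for row in report.values()) / len(report)


def weighted_avg(col: int) -> float:
    """Mean weighted by support: frequent types such as GIVENNAME dominate."""
    return sum(row[col] * row[3] for row in report.values()) / total_support


print("total support:", total_support)  # total support: 28976
for name, col in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: macro={macro_avg(col):.3f} weighted={weighted_avg(col):.3f}")
```

The low BUILDINGNUM recall (0.75) pulls the macro recall down more than the weighted recall, because the macro average gives that entity type the same weight as GIVENNAME despite its much smaller support.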


## Entities the Model Can Recognize

	```python
	[
	"AGE",
	"BUILDINGNUM",
	"CITY",
	"CREDITCARDNUMBER",
	"DATE",
	"DRIVERLICENSENUM",
	"EMAIL",
	"GIVENNAME",
	"IDCARDNUM",
	"PASSPORTNUM",
	"STREET",
	"SURNAME",
	"TAXNUM",
	"TELEPHONENUM",
	"TIME",
	"ZIPCODE"
	]

	```
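The usage code in this README maps 33 class ids to BIO tags: `O` first, then a `B-` tag for each entity type in the order listed above, then the matching `I-` tags. The sketch below rebuilds that mapping from the entity list. The helper name `build_id_to_label` is illustrative, and the mapping stored in the model's own configuration should be treated as authoritative if it differs.

```python
from typing import Dict, List

ENTITY_TYPES: List[str] = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER",
    "DATE", "DRIVERLICENSENUM", "EMAIL", "GIVENNAME",
    "IDCARDNUM", "PASSPORTNUM", "STREET", "SURNAME",
    "TAXNUM", "TELEPHONENUM", "TIME", "ZIPCODE",
]


def build_id_to_label(entity_types: List[str]) -> Dict[int, str]:
    """Return {class id: BIO tag}: 'O', then all B- tags, then all I- tags."""
    labels = (
        ["O"]
        + [f"B-{entity}" for entity in entity_types]
        + [f"I-{entity}" for entity in entity_types]
    )
    return dict(enumerate(labels))


id_to_label = build_id_to_label(ENTITY_TYPES)
print(len(id_to_label))                 # 33
print(id_to_label[1], id_to_label[17])  # B-AGE I-AGE
```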

	## Usage

To use the model for PII entity recognition:

The model is trained on lowercase text. The example class below (`AzerbaijaniNER`) lowercases the input automatically. If you write your own inference code, lowercase the text before tokenization.
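Because the text is lowercased before tokenization, entity offsets refer to the lowercased string. Python's `str.lower()` turns the dotted capital `İ` (U+0130), which is common in Azerbaijani, into two code points, so offsets can shift relative to the original text. The check below is a minimal sketch for detecting such inputs. The helper name is illustrative, and how to handle a failing check (for example, slicing the lowercased text instead of the original) is left to the caller.

```python
def lowercase_preserves_length(text: str) -> bool:
    """True if str.lower() leaves every character position unchanged."""
    return len(text.lower()) == len(text)


print(lowercase_preserves_length("Bakı şəhəri"))       # True
# "\u0130" is the dotted capital I; lowercasing yields "i" plus U+0307.
print(lowercase_preserves_length("\u0130smayıllı"))    # False
```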

```python
import torch
from transformers import AutoModelForTokenClassification, XLMRobertaTokenizerFast
from typing import List, Dict, Tuple


class AzerbaijaniNER:
    def __init__(self, model_name_or_path="LocalDoc/private_ner_azerbaijani_v2"):
        self.model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)
        self.tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")

        self.model.eval()

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

        # Class id -> BIO tag: "O", then B- tags, then I- tags.
        self.id_to_label = {
            0: "O",
            1: "B-AGE", 2: "B-BUILDINGNUM", 3: "B-CITY", 4: "B-CREDITCARDNUMBER",
            5: "B-DATE", 6: "B-DRIVERLICENSENUM", 7: "B-EMAIL", 8: "B-GIVENNAME",
            9: "B-IDCARDNUM", 10: "B-PASSPORTNUM", 11: "B-STREET", 12: "B-SURNAME",
            13: "B-TAXNUM", 14: "B-TELEPHONENUM", 15: "B-TIME", 16: "B-ZIPCODE",
            17: "I-AGE", 18: "I-BUILDINGNUM", 19: "I-CITY", 20: "I-CREDITCARDNUMBER",
            21: "I-DATE", 22: "I-DRIVERLICENSENUM", 23: "I-EMAIL", 24: "I-GIVENNAME",
            25: "I-IDCARDNUM", 26: "I-PASSPORTNUM", 27: "I-STREET", 28: "I-SURNAME",
            29: "I-TAXNUM", 30: "I-TELEPHONENUM", 31: "I-TIME", 32: "I-ZIPCODE"
        }

        # Entity label -> human-readable name used in the printed output.
        self.entity_types = {
            "AGE": "Age",
            "BUILDINGNUM": "Building Number",
            "CITY": "City",
            "CREDITCARDNUMBER": "Credit Card Number",
            "DATE": "Date",
            "DRIVERLICENSENUM": "Driver License Number",
            "EMAIL": "Email",
            "GIVENNAME": "Given Name",
            "IDCARDNUM": "ID Card Number",
            "PASSPORTNUM": "Passport Number",
            "STREET": "Street",
            "SURNAME": "Surname",
            "TAXNUM": "Tax ID Number",
            "TELEPHONENUM": "Phone Number",
            "TIME": "Time",
            "ZIPCODE": "Zip Code"
        }

    def predict(self, text: str, max_length: int = 512) -> List[Dict]:
        # The model expects lowercase input.
        text = text.lower()

        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_offsets_mapping=True
        )

        # Character offsets of each token; not a model input, so remove it.
        offset_mapping = inputs.pop("offset_mapping").numpy()[0]

        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = outputs.logits.argmax(dim=2)

        predictions = predictions[0].cpu().numpy()

        entities = []
        current_entity = None

        for offset, pred_id in zip(offset_mapping, predictions):
            start, end = int(offset[0]), int(offset[1])

            # Special and padding tokens have the offset (0, 0); skip them.
            if start == 0 and end == 0:
                continue

            pred_label = self.id_to_label[int(pred_id)]

            if pred_label.startswith("B-"):
                # A B- tag closes any open entity and starts a new one.
                if current_entity:
                    entities.append(current_entity)

                entity_type = pred_label[2:]
                current_entity = {
                    "label": entity_type,
                    "name": self.entity_types.get(entity_type, entity_type),
                    "start": start,
                    "end": end,
                    "value": text[start:end]
                }

            elif pred_label.startswith("I-") and current_entity is not None:
                entity_type = pred_label[2:]

                if entity_type == current_entity["label"]:
                    # Extend the open entity to cover this token.
                    current_entity["end"] = end
                    current_entity["value"] = text[current_entity["start"]:current_entity["end"]]
                else:
                    # An I- tag of a different type closes the open entity.
                    entities.append(current_entity)
                    current_entity = None

            elif pred_label == "O" and current_entity is not None:
                entities.append(current_entity)
                current_entity = None

        if current_entity:
            entities.append(current_entity)

        return entities

    def anonymize_text(self, text: str, replacement_char: str = "X") -> Tuple[str, List[Dict]]:
        entities = self.predict(text)

        if not entities:
            return text, []

        # Replace from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)

        anonymized_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            length = end - start
            anonymized_text = anonymized_text[:start] + replacement_char * length + anonymized_text[end:]

        entities.sort(key=lambda x: x["start"])

        return anonymized_text, entities

    def highlight_entities(self, text: str) -> str:
        entities = self.predict(text)

        if not entities:
            return text

        # Insert markers from the end of the text so earlier offsets stay valid.
        entities.sort(key=lambda x: x["start"], reverse=True)

        highlighted_text = text
        for entity in entities:
            start = entity["start"]
            end = entity["end"]
            entity_value = entity["value"]
            entity_type = entity["name"]

            highlighted_text = (
                highlighted_text[:start] +
                f"[{entity_type}: {entity_value}]" +
                highlighted_text[end:]
            )

        return highlighted_text


if __name__ == "__main__":
    ner = AzerbaijaniNER()

    test_text = """Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?"""

    print("=== Original Text ===")
    print(test_text)
    print("\n=== Found Entities ===")

    entities = ner.predict(test_text)
    for entity in entities:
        print(f"{entity['name']}: {entity['value']} (positions {entity['start']}-{entity['end']})")

    print("\n=== Text with Highlighted Entities ===")
    highlighted_text = ner.highlight_entities(test_text)
    print(highlighted_text)

    print("\n=== Anonymized Text ===")
    anonymized_text, _ = ner.anonymize_text(test_text)
    print(anonymized_text)
```

	```
	=== Original Text ===
	Salam, mənim adım Əli Hüseynovdu. Doğum tarixim 15.05.1990-dır. Bakı şəhərində, 28 may küçəsi 4 ünvanında yaşayıram. Telefon nömrəm +994552345678-dir. Mən 4169741358254152 nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

	=== Found Entities ===
	Given Name: əli (positions 18-21)
	Surname: hüseynov (positions 22-30)
	Date: 15.05.1990 (positions 48-58)
	City: bakı (positions 64-68)
	Street: 28 may küçəsi (positions 80-93)
	Building Number: 4 (positions 94-95)
	Phone Number: +994552345678 (positions 132-145)
	Credit Card Number: 4169741358254152 (positions 155-171)

	=== Text with Highlighted Entities ===
	Salam, mənim adım [Given Name: əli] [Surname: hüseynov]du. Doğum tarixim [Date: 15.05.1990]-dır. [City: bakı] şəhərində, [Street: 28 may küçəsi] [Building Number: 4] ünvanında yaşayıram. Telefon nömrəm [Phone Number: +994552345678]-dir. Mən [Credit Card Number: 4169741358254152] nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?

	=== Anonymized Text ===
	Salam, mənim adım XXX XXXXXXXXdu. Doğum tarixim XXXXXXXXXX-dır. XXXX şəhərində, XXXXXXXXXXXXX X ünvanında yaşayıram. Telefon nömrəm XXXXXXXXXXXXX-dir. Mən XXXXXXXXXXXXXXXX nömrəli kartdan ödəniş etmişəm. Sifarişim nə vaxt çatdırılcaq ?
	```
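Masking with a fixed character hides what kind of information was removed. As a possible alternative, the sketch below replaces each span with a placeholder such as `<SURNAME>`. It assumes entity dictionaries with the `label`, `start`, and `end` keys that `predict` returns. The function name `redact_with_labels` is illustrative, and the sample entities are copied from the output above so the snippet runs without loading the model.

```python
from typing import Dict, List


def redact_with_labels(text: str, entities: List[Dict]) -> str:
    """Replace each entity span with a <LABEL> placeholder."""
    redacted = text
    # Work from the end of the text so earlier offsets stay valid
    # even though placeholders change the string length.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{entity['label']}>"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


sample_text = "Salam, mənim adım Əli Hüseynovdu."
sample_entities = [
    {"label": "GIVENNAME", "start": 18, "end": 21},
    {"label": "SURNAME", "start": 22, "end": 30},
]

print(redact_with_labels(sample_text, sample_entities))
# Salam, mənim adım <GIVENNAME> <SURNAME>du.
```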


	## CC BY 4.0 License — What It Allows

	The Creative Commons Attribution 4.0 International (CC BY 4.0) license allows:

	### ✅ You Can:
	- Use the model for any purpose, including commercial use.
	- Share it — copy and redistribute in any medium or format.
	- Adapt it — remix, transform, and build upon it for any purpose, even commercially.

	### 📝 You Must:
	- Give appropriate credit — Attribute the original creator (e.g., name, link to the license, and indicate if changes were made).
	- Not imply endorsement — Do not suggest the original author endorses you or your use.

	### ❌ You Cannot:
	- Apply legal terms or technological measures that legally restrict others from doing anything the license permits (no DRM or additional restrictions).


	### Summary:
	You are free to use, modify, and distribute the model — even for commercial purposes — as long as you give proper credit to the original creator.


	For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by/4.0/deed.en">CC BY 4.0 license</a>.


	## Contact

	For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].