Azerbaijani Named Entity Recognition (NER) with XLM-RoBERTa
This project fine-tunes the multilingual XLM-RoBERTa model for named entity recognition (NER) in Azerbaijani. The notebook and its supporting files let you extract named entities such as persons, locations, organizations, and dates from Azerbaijani text.
Notebook Source
This notebook was created in Google Colab and can be accessed here.
Setup Instructions
Install Required Libraries: The following packages are necessary for running this notebook:
pip install transformers datasets seqeval huggingface_hub
Hugging Face Hub Authentication: Set up Hugging Face Hub authentication to save and manage your trained models:
from huggingface_hub import login
login(token="YOUR_HUGGINGFACE_TOKEN")
Replace YOUR_HUGGINGFACE_TOKEN with your Hugging Face token.
Disable Unnecessary Warnings: For cleaner output, some warnings are disabled:
import os
import warnings
os.environ["WANDB_DISABLED"] = "true"
warnings.filterwarnings("ignore")
Detailed Code Walkthrough
1. Data Loading and Preprocessing
Loading the Azerbaijani NER Dataset
The dataset for Azerbaijani NER is loaded from the Hugging Face Hub:
from datasets import load_dataset
dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
print(dataset)
This dataset contains Azerbaijani texts labeled with NER tags.
Preprocessing Tokens and NER Tags
To ensure compatibility, the tokens and NER tags are processed using the ast module:
import ast
def preprocess_example(example):
try:
example["tokens"] = ast.literal_eval(example["tokens"])
example["ner_tags"] = list(map(int, ast.literal_eval(example["ner_tags"])))
except (ValueError, SyntaxError) as e:
print(f"Skipping malformed example: {example['index']} due to error: {e}")
example["tokens"] = []
example["ner_tags"] = []
return example
dataset = dataset.map(preprocess_example)
This function parses each example's string-encoded tokens and tags into actual Python lists; malformed examples are given empty lists rather than dropped.
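For intuition, here is a minimal sketch of what the preprocessing does to a single example. The raw values below are hypothetical (the exact tag indices depend on the dataset's label scheme); the point is that tokens and tags arrive as string-encoded lists and must be evaluated back into Python objects:

```python
import ast

# Hypothetical raw example: tokens and tags are stored as strings that
# look like Python lists, so ast.literal_eval is needed to recover them.
raw = {
    "index": 0,
    "tokens": "['Bakı', 'şəhərində', 'yağış', 'yağır']",
    "ner_tags": "['3', '4', '0', '0']",  # illustrative indices, not from the real dataset
}

tokens = ast.literal_eval(raw["tokens"])
ner_tags = list(map(int, ast.literal_eval(raw["ner_tags"])))

print(tokens)    # ['Bakı', 'şəhərində', 'yağış', 'yağır']
print(ner_tags)  # [3, 4, 0, 0]
```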
2. Tokenization and Label Alignment
Initializing the Tokenizer
The AutoTokenizer class is used to initialize the XLM-RoBERTa tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
Tokenization and Label Alignment
Each token is aligned with its label using a custom function:
def tokenize_and_align_labels(example):
tokenized_inputs = tokenizer(
example["tokens"],
truncation=True,
is_split_into_words=True,
padding="max_length",
max_length=128,
)
labels = []
word_ids = tokenized_inputs.word_ids()
previous_word_idx = None
for word_idx in word_ids:
if word_idx is None:
labels.append(-100)
elif word_idx != previous_word_idx:
labels.append(example["ner_tags"][word_idx] if word_idx < len(example["ner_tags"]) else -100)
else:
labels.append(-100)
previous_word_idx = word_idx
tokenized_inputs["labels"] = labels
return tokenized_inputs
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=False)
Tokens and labels are aligned, with -100 used to ignore sub-tokens created during tokenization.
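The alignment rule can be seen in isolation with a hand-written word_ids list of the kind tokenizer(...).word_ids() returns (the IDs and tags below are made up for illustration; no model download is needed):

```python
# word 1 is split into two sub-tokens; None marks special tokens like <s> and </s>
word_ids = [None, 0, 1, 1, 2, None]
ner_tags = [1, 0, 3]  # one tag per original word

labels = []
previous_word_idx = None
for word_idx in word_ids:
    if word_idx is None:
        labels.append(-100)                # special tokens are ignored by the loss
    elif word_idx != previous_word_idx:
        labels.append(ner_tags[word_idx])  # first sub-token carries the word's tag
    else:
        labels.append(-100)                # later sub-tokens are masked out
    previous_word_idx = word_idx

print(labels)  # [-100, 1, 0, -100, 3, -100]
```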
3. Dataset Split for Training and Validation
The dataset is split into training and validation sets:
tokenized_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1)
This produces a 90/10 train-validation split; the validation portion is stored under the "test" key of the resulting dataset.
4. Define Labels and Model Components
Define Label List
The NER tags are set up as BIO-tagging (Begin, Inside, Outside):
label_list = [
"O", "B-PERSON", "I-PERSON", "B-LOCATION", "I-LOCATION",
"B-ORGANISATION", "I-ORGANISATION", "B-DATE", "I-DATE",
"B-TIME", "I-TIME", "B-MONEY", "I-MONEY", "B-PERCENTAGE",
"I-PERCENTAGE", "B-FACILITY", "I-FACILITY", "B-PRODUCT",
"I-PRODUCT", "B-EVENT", "I-EVENT", "B-ART", "I-ART",
"B-LAW", "I-LAW", "B-LANGUAGE", "I-LANGUAGE", "B-GPE",
"I-GPE", "B-NORP", "I-NORP", "B-ORDINAL", "I-ORDINAL",
"B-CARDINAL", "I-CARDINAL", "B-DISEASE", "I-DISEASE",
"B-CONTACT", "I-CONTACT", "B-ADAGE", "I-ADAGE",
"B-QUANTITY", "I-QUANTITY", "B-MISCELLANEOUS", "I-MISCELLANEOUS",
"B-POSITION", "I-POSITION", "B-PROJECT", "I-PROJECT"
]
Initialize Model and Data Collator
The model and data collator are set up for token classification:
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
"xlm-roberta-base",
num_labels=len(label_list)
)
data_collator = DataCollatorForTokenClassification(tokenizer)
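As an optional variation (not part of the walkthrough above), id2label and label2id mappings built from label_list can be passed to from_pretrained, so the saved config and the inference pipeline report readable tag names like B-PERSON instead of generic LABEL_1 ids:

```python
# Truncated here for brevity; use the full label_list defined above.
label_list = ["O", "B-PERSON", "I-PERSON"]

id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

# model = AutoModelForTokenClassification.from_pretrained(
#     "xlm-roberta-base", id2label=id2label, label2id=label2id
# )
print(id2label[1])    # B-PERSON
print(label2id["O"])  # 0
```

With these mappings in the config, the manual LABEL_i-to-name mapping used during inference (see step 8) becomes unnecessary.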
5. Define Evaluation Metrics
The model’s performance is evaluated based on precision, recall, and F1 score:
import numpy as np
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
def compute_metrics(p):
predictions, labels = p
predictions = np.argmax(predictions, axis=2)
true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
true_predictions = [
[label_list[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
return {
"precision": precision_score(true_labels, true_predictions),
"recall": recall_score(true_labels, true_predictions),
"f1": f1_score(true_labels, true_predictions),
}
6. Training Setup and Execution
Set Training Parameters
The TrainingArguments define configurations for model training:
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
evaluation_strategy="epoch",
save_strategy="epoch",
learning_rate=1e-5,
per_device_train_batch_size=64,
per_device_eval_batch_size=64,
num_train_epochs=8,
weight_decay=0.01,
fp16=True,
logging_dir='./logs',
save_total_limit=2,
load_best_model_at_end=True,
metric_for_best_model="f1",
report_to="none"
)
Initialize Trainer and Train the Model
The Trainer class handles training and evaluation:
from transformers import Trainer, EarlyStoppingCallback
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
training_metrics = trainer.train()
eval_results = trainer.evaluate()
print(eval_results)
7. Save the Trained Model
After training, save the model and tokenizer for later use:
save_directory = "./XLM-RoBERTa"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
8. Inference with the NER Pipeline
Initialize the NER Pipeline
The pipeline provides a high-level API for NER:
from transformers import pipeline
import torch
device = 0 if torch.cuda.is_available() else -1
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)
Custom Evaluation Function
The evaluate_model function allows testing on custom sentences:
label_mapping = {f"LABEL_{i}": label for i, label in enumerate(label_list) if label != "O"}
def evaluate_model(test_texts, true_labels):
predictions = []
for i, text in enumerate(test_texts):
pred_entities = nlp_ner(text)
pred_labels = [label_mapping.get(entity["entity_group"], "O") for entity in pred_entities if entity["entity_group"] in label_mapping]
if len(pred_labels) != len(true_labels[i]):
print(f"Warning: Inconsistent number of entities in sample {i+1}. Adjusting predicted entities.")
pred_labels = pred_labels[:len(true_labels[i])]
predictions.append(pred_labels)
if all(len(true) == len(pred) for true, pred in zip(true_labels, predictions)):
precision = precision_score(true_labels, predictions)
recall = recall_score(true_labels, predictions)
f1 = f1_score(true_labels, predictions)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print(classification_report(true_labels, predictions))
else:
print("Error: Could not align all samples correctly for evaluation.")
Test on a Sample Sentence
An example test with expected output labels:
test_texts = ["Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."]
true_labels = [["B-PERSON", "B-ORGANISATION"]]
evaluate_model(test_texts, true_labels)