# Azerbaijani Named Entity Recognition (NER) with XLM-RoBERTa

This project fine-tunes a custom NER model for Azerbaijani text using the multilingual XLM-RoBERTa model. This notebook and its supporting files enable extracting named entities like **persons**, **locations**, **organizations**, and **dates** from Azerbaijani text.

### Notebook Source

This notebook was created in Google Colab and can be accessed [here](https://github.com/Ismat-Samadov/Named_Entity_Recognition/blob/main/models/XLM-RoBERTa.ipynb).

## Setup Instructions

1. **Install Required Libraries**: The following packages are necessary for running this notebook:

   ```bash
   pip install transformers datasets seqeval huggingface_hub
   ```

2. **Hugging Face Hub Authentication**: Set up Hugging Face Hub authentication to save and manage your trained models:

   ```python
   from huggingface_hub import login

   login(token="YOUR_HUGGINGFACE_TOKEN")
   ```

   Replace `YOUR_HUGGINGFACE_TOKEN` with your Hugging Face token.

3. **Disable Unnecessary Warnings**: For a cleaner output, some warnings are disabled:

   ```python
   import os
   import warnings

   os.environ["WANDB_DISABLED"] = "true"
   warnings.filterwarnings("ignore")
   ```

## Detailed Code Walkthrough

### 1. **Data Loading and Preprocessing**

#### Loading the Azerbaijani NER Dataset

The dataset for Azerbaijani NER is loaded from the Hugging Face Hub:

```python
from datasets import load_dataset

dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
print(dataset)
```

This dataset contains Azerbaijani texts labeled with NER tags.
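
Before any processing, it can help to check that a row's tokens and tags line up. The sketch below is illustrative only: it assumes each row stores `tokens` and `ner_tags` as stringified Python lists (the format the preprocessing step in this walkthrough parses), and the sample rows and the helper name `is_aligned` are hypothetical, not part of the notebook.

```python
import ast


def is_aligned(row):
    """Return True when the parsed token and tag lists have equal length."""
    try:
        tokens = ast.literal_eval(row["tokens"])
        tags = ast.literal_eval(row["ner_tags"])
    except (ValueError, SyntaxError):
        return False  # the string could not be parsed as a Python literal
    return len(tokens) == len(tags)


# Hypothetical rows for illustration; they are not taken from the real dataset.
good_row = {"tokens": "['Bakı', 'gözəl', 'şəhərdir']", "ner_tags": "['3', '0', '0']"}
bad_row = {"tokens": "['Bakı', 'gözəl'", "ner_tags": "['3', '0']"}

print(is_aligned(good_row), is_aligned(bad_row))
```

A check like this could be run over a handful of rows to spot malformed entries early, before they reach the tokenizer.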
#### Preprocessing Tokens and NER Tags

The `tokens` and `ner_tags` columns are stored as strings, so each is parsed into a Python list with `ast.literal_eval`:

```python
import ast

def preprocess_example(example):
    try:
        # Parse the stringified lists; tags are converted to integers.
        example["tokens"] = ast.literal_eval(example["tokens"])
        example["ner_tags"] = list(map(int, ast.literal_eval(example["ner_tags"])))
    except (ValueError, SyntaxError) as e:
        # Keep the row but empty it, so it contributes no labels.
        print(f"Skipping malformed example: {example['index']} due to error: {e}")
        example["tokens"] = []
        example["ner_tags"] = []
    return example

dataset = dataset.map(preprocess_example)
```

The function converts each string into a list of tokens or integer tags. Malformed examples are reported and replaced with empty lists rather than removed from the dataset.

### 2. **Tokenization and Label Alignment**

#### Initializing the Tokenizer

The `AutoTokenizer` class is used to initialize the XLM-RoBERTa tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
```

#### Tokenization and Label Alignment

Each token is aligned with its label using a custom function:

```python
def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(
        example["tokens"],
        truncation=True,
        is_split_into_words=True,
        padding="max_length",
        max_length=128,
    )
    labels = []
    word_ids = tokenized_inputs.word_ids()
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens and padding
            labels.append(-100)
        elif word_idx != previous_word_idx:
            # First sub-token of a word carries the word's label
            labels.append(example["ner_tags"][word_idx] if word_idx < len(example["ner_tags"]) else -100)
        else:
            # Remaining sub-tokens of the same word
            labels.append(-100)
        previous_word_idx = word_idx
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=False)
```

Only the first sub-token of each word keeps the word's label. Special tokens, padding, and the remaining sub-tokens receive `-100`, which the loss function ignores.

### 3. **Dataset Split for Training and Validation**

The dataset is split into training and validation sets:

```python
tokenized_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1)
```

This produces a 90/10 split. The held-out 10% is stored under the `test` key and serves as the validation set during training. Because no `seed` is passed, the split differs between runs.

### 4. **Define Labels and Model Components**

#### Define Label List

The NER tags follow the BIO scheme (Begin, Inside, Outside):

```python
label_list = [
    "O",
    "B-PERSON", "I-PERSON", "B-LOCATION", "I-LOCATION",
    "B-ORGANISATION", "I-ORGANISATION", "B-DATE", "I-DATE",
    "B-TIME", "I-TIME", "B-MONEY", "I-MONEY",
    "B-PERCENTAGE", "I-PERCENTAGE", "B-FACILITY", "I-FACILITY",
    "B-PRODUCT", "I-PRODUCT", "B-EVENT", "I-EVENT",
    "B-ART", "I-ART", "B-LAW", "I-LAW",
    "B-LANGUAGE", "I-LANGUAGE", "B-GPE", "I-GPE",
    "B-NORP", "I-NORP", "B-ORDINAL", "I-ORDINAL",
    "B-CARDINAL", "I-CARDINAL", "B-DISEASE", "I-DISEASE",
    "B-CONTACT", "I-CONTACT", "B-ADAGE", "I-ADAGE",
    "B-QUANTITY", "I-QUANTITY", "B-MISCELLANEOUS", "I-MISCELLANEOUS",
    "B-POSITION", "I-POSITION", "B-PROJECT", "I-PROJECT"
]
```

#### Initialize Model and Data Collator

The model and data collator are set up for token classification:

```python
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(label_list)
)
data_collator = DataCollatorForTokenClassification(tokenizer)
```

### 5. **Define Evaluation Metrics**

The model's performance is evaluated at the entity level with `seqeval`, using precision, recall, and F1 score:

```python
import numpy as np
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    # Drop positions labeled -100 and convert ids to tag names.
    true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }
```

### 6. **Training Setup and Execution**

#### Set Training Parameters

The `TrainingArguments` define configurations for model training:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=8,
    weight_decay=0.01,
    fp16=True,  # mixed precision; requires a CUDA GPU
    logging_dir='./logs',
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="none"
)
```

#### Initialize Trainer and Train the Model

The `Trainer` class handles training and evaluation. Training stops early if the validation F1 score does not improve for two consecutive evaluations:

```python
from transformers import Trainer, EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

training_metrics = trainer.train()
eval_results = trainer.evaluate()
print(eval_results)
```

### 7. **Save the Trained Model**

After training, save the model and tokenizer for later use:

```python
save_directory = "./XLM-RoBERTa"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
```

### 8. **Inference with the NER Pipeline**

#### Initialize the NER Pipeline

The pipeline provides a high-level API for NER:

```python
from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)
```

#### Custom Evaluation Function

The `evaluate_model` function allows testing on custom sentences. Because the model was created without an `id2label` mapping, the pipeline reports labels as `LABEL_<index>`, and `label_mapping` translates them back to tag names. Predictions longer than the reference are truncated. If a sample yields fewer predicted entities than expected, the lengths cannot be matched and the evaluation is skipped with an error message:

```python
# precision_score, recall_score, f1_score and classification_report
# come from the seqeval import in section 5.
label_mapping = {f"LABEL_{i}": label for i, label in enumerate(label_list) if label != "O"}

def evaluate_model(test_texts, true_labels):
    predictions = []
    for i, text in enumerate(test_texts):
        pred_entities = nlp_ner(text)
        # Keep only entity groups that map to a non-"O" tag.
        pred_labels = [label_mapping[entity["entity_group"]]
                       for entity in pred_entities if entity["entity_group"] in label_mapping]
        if len(pred_labels) != len(true_labels[i]):
            print(f"Warning: Inconsistent number of entities in sample {i+1}. Adjusting predicted entities.")
            pred_labels = pred_labels[:len(true_labels[i])]
        predictions.append(pred_labels)

    if all(len(true) == len(pred) for true, pred in zip(true_labels, predictions)):
        precision = precision_score(true_labels, predictions)
        recall = recall_score(true_labels, predictions)
        f1 = f1_score(true_labels, predictions)
        print("Precision:", precision)
        print("Recall:", recall)
        print("F1-Score:", f1)
        print(classification_report(true_labels, predictions))
    else:
        print("Error: Could not align all samples correctly for evaluation.")
```

#### Test on a Sample Sentence

An example test with expected output labels:

```python
test_texts = ["Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."]
true_labels = [["B-PERSON", "B-ORGANISATION"]]
evaluate_model(test_texts, true_labels)
```
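
To make the label decoding used by `evaluate_model` concrete without loading a model, here is a minimal, self-contained sketch. It assumes the pipeline reports entity groups as `LABEL_<index>`, uses a shortened label list, and substitutes a hand-written `fake_entities` list for real pipeline output. The names `decode_entities` and `fake_entities` are hypothetical and do not appear in the notebook.

```python
# Shortened label list for illustration; the walkthrough uses the full list.
label_list = ["O", "B-PERSON", "I-PERSON", "B-LOCATION", "I-LOCATION", "B-ORGANISATION"]

# Map default pipeline names ("LABEL_<i>") to tag names, skipping "O".
label_mapping = {f"LABEL_{i}": label for i, label in enumerate(label_list) if label != "O"}


def decode_entities(entities):
    """Return (text, tag) pairs for entities whose group maps to a non-"O" tag."""
    return [
        (ent["word"], label_mapping[ent["entity_group"]])
        for ent in entities
        if ent["entity_group"] in label_mapping
    ]


# Hand-written stand-in for pipeline output; these are not real model predictions.
fake_entities = [
    {"entity_group": "LABEL_1", "word": "Shahla Khuduyeva"},
    {"entity_group": "LABEL_0", "word": "və"},
    {"entity_group": "LABEL_5", "word": "Pasha Sığorta"},
]

decoded = decode_entities(fake_entities)
print(decoded)
```

The `LABEL_0` entry is dropped because `"O"` is excluded from `label_mapping`, which mirrors the filter inside `evaluate_model`.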