Armenian NER Model (XLM-RoBERTa based)
This repository contains a Named Entity Recognition (NER) model for the Armenian language, fine-tuned from the xlm-roberta-base checkpoint.
The model identifies the following entity types based on the pioNER dataset tags: PER (Person), LOC (Location), ORG (Organization), EVT (Event), PRO (Product), FAC (Facility), ANG (Animal), DUC (Document), WRK (Work of Art), CMP (Chemical Compound/Drug), MSR (Measure/Quantity), DTM (Date/Time), MNY (Money), PCT (Percent), LAG (Language), LAW (Law), NOR (Nationality/Religious/Political Group).
This checkpoint (daviddallakyan2005/armenian-ner) corresponds to training run run_16 from the associated project. It was selected because it achieved the best F1 score on the pioNER validation set in a hyperparameter search over 36 combinations.
Associated GitHub Repository: https://github.com/daviddallakyan2005/armenian-ner-network.git (Contains training, inference, and network analysis scripts)
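The entity types listed above are usually expanded into token-level tags before training. The following is a minimal sketch of how id2label and label2id mappings could be built, assuming a BIO scheme with "O" at index 0 and tags in list order. The function name build_label_maps and this ordering are illustrative assumptions; the authoritative mapping is whatever is stored in the checkpoint's config (model.config.id2label).

```python
# Entity types as listed in this model card.
ENTITY_TYPES = [
    "PER", "LOC", "ORG", "EVT", "PRO", "FAC", "ANG", "DUC", "WRK",
    "CMP", "MSR", "DTM", "MNY", "PCT", "LAG", "LAW", "NOR",
]


def build_label_maps(entity_types):
    """Return (id2label, label2id) for a BIO tagging scheme.

    Assumption: "O" gets id 0 and each type contributes a B- and an I- tag
    in list order. The real checkpoint's ordering may differ.
    """
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


id2label, label2id = build_label_maps(ENTITY_TYPES)
print(len(id2label))      # 35: "O" plus B-/I- tags for 17 types
print(label2id["B-PER"])  # 1
```

When working with the published checkpoint, prefer reading the mapping from the loaded model rather than rebuilding it, since a mismatch in label order silently corrupts predictions.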
Model Details
- Base Model: xlm-roberta-base (originating from research associated with Facebook AI Research's fairseq library)
- Training Data: pioNER Corpus (specifically, pioner-silver for training/validation and pioner-gold for testing, loaded via conll2003 dataset script)
- Fine-tuning Framework: transformers, pytorch
- Hyperparameters (run_16):
  - Learning Rate: 2e-5
  - Weight Decay: 0.01
  - Batch Size (per device): 8
  - Gradient Accumulation Steps: 1
  - Epochs: 7
Intended Uses & Limitations
This model is designed for general-purpose Named Entity Recognition in Armenian text. The inference scripts in the associated project (armenian-ner-network) pre-tokenize text with the ArmTokenizer library before running the model, whereas the transformers pipeline example below passes raw text straight to the tokenizer bundled with the checkpoint.
- Primary Use: NER / Token Classification for Armenian.
- Limitations: Performance might degrade on domains significantly different from the pioNER dataset. The aggregation strategy in the example below is simple; more complex strategies might be needed for optimal entity boundary detection in all cases.
How to Use
You can easily use this model with the transformers library pipeline:
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
# Load the tokenizer and model from Hugging Face Hub
model_name = "daviddallakyan2005/armenian-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Create NER pipeline
# Use "simple" aggregation for basic entity grouping
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example text
text = "Գրիգոր Նարեկացին հայ միջնադարյան հոգևորական էր, աստվածաբան և բանաստեղծ։ Նա ծնվել է Նարեկ գյուղում։"
# Get predictions
entities = ner_pipeline(text)
print(entities)
# Example Output:
# [
# {'entity_group': 'PER', 'score': 0.99..., 'word': 'Գրիգոր Նարեկացին', 'start': 0, 'end': 16},
# {'entity_group': 'LOC', 'score': 0.98..., 'word': 'Նարեկ', 'start': 83, 'end': 88}
# ]
(See scripts/03_ner/run_ner_inference_segmented.py in the GitHub repo for an example integrating ArmTokenizer before passing words to the Hugging Face tokenizer)
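Because the "simple" aggregation strategy can return low-confidence or fragmentary spans, some light post-processing is often useful. The sketch below filters entities by score and groups the surviving surface forms by type. The helper name filter_and_group, the 0.9 threshold, and the sample scores are illustrative assumptions, not part of the project. The sample list keeps only the keys the helper reads.

```python
from collections import defaultdict


def filter_and_group(entities, min_score=0.9):
    """Drop low-confidence entities and group the remaining words by type."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hypothetical pipeline output; the scores are made up for illustration.
sample_entities = [
    {"entity_group": "PER", "score": 0.99, "word": "Գրիգոր Նարեկացին"},
    {"entity_group": "LOC", "score": 0.98, "word": "Նարեկ"},
    {"entity_group": "ORG", "score": 0.41, "word": "հայ"},
]

print(filter_and_group(sample_entities))
# {'PER': ['Գրիգոր Նարեկացին'], 'LOC': ['Նարեկ']}
```

A suitable threshold depends on the domain and on how costly false positives are, so it is worth tuning on a small labeled sample rather than fixing it in advance.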
Training Procedure
The xlm-roberta-base model was fine-tuned on the Armenian pioNER dataset using the transformers Trainer API. The training involved:
- Loading the conll2003-format pioNER data.
- Tokenizing the text using the xlm-roberta-base tokenizer and aligning NER tags to subword tokens (labeling only the first subword of each word).
- Setting up TrainingArguments with varying hyperparameters (learning rate, weight decay, epochs, gradient accumulation).
- Instantiating AutoModelForTokenClassification with the correct number of labels and mappings (id2label, label2id) derived from the dataset.
- Using DataCollatorForTokenClassification for batching.
- Implementing a compute_metrics function using seqeval (precision, recall, F1) for evaluation during training.
- Running a hyperparameter search over 36 combinations, saving checkpoints and logs for each run.
- Selecting the best model based on the highest F1 score achieved on the validation set (pioner-silver/dev.conll03).
- Evaluating the best model on the test set (pioner-gold/test.conll03).
(See scripts/03_ner/ner_roberta.py in the GitHub repo for the full training script.)
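The alignment step, where only the first subword of each word is labeled, can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: it assumes the conventional ignore index of -100 (the default ignore_index of PyTorch's cross-entropy loss) for masked positions, and it takes as input the kind of list a fast tokenizer's word_ids() method returns. The function name align_labels and the sample inputs are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label id are skipped by the loss


def align_labels(word_ids, word_labels):
    """Label only the first subword of each word and mask everything else.

    word_ids: one entry per subword token giving the index of its source
              word, or None for special tokens.
    word_labels: one integer label id per word.
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])  # first subword keeps the label
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword is masked
        previous_word_id = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 2, 0]
print(align_labels(word_ids, word_labels))  # [-100, 1, 2, -100, 0, -100]
```

Masking continuation subwords means each word contributes exactly one prediction to the loss and to the seqeval metrics, which keeps scores comparable to word-level annotations.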
Evaluation
This model (run_16) achieved the best F1 score on the pioNER validation set during the hyperparameter search. Final evaluation metrics on the pioNER gold test set are logged in the training artifacts within the associated GitHub project.
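The search-and-select procedure can be sketched as a grid expansion followed by picking the run with the highest validation F1. The search space below is hypothetical: its values were chosen only so that the grid contains 36 combinations, and they are not the values used in the project. The run names and F1 scores are placeholders as well, not reported results.

```python
from itertools import product

# Hypothetical search space (3 * 2 * 3 * 2 = 36 combinations).
search_space = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "weight_decay": [0.0, 0.01],
    "num_train_epochs": [3, 5, 7],
    "gradient_accumulation_steps": [1, 2],
}


def expand_grid(space):
    """Return one dict per hyperparameter combination."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


def select_best(results):
    """Return the name of the run with the highest validation F1."""
    return max(results, key=lambda name: results[name]["f1"])


runs = expand_grid(search_space)
print(len(runs))  # 36

# Placeholder validation scores, not real results.
results = {"run_a": {"f1": 0.71}, "run_b": {"f1": 0.78}, "run_c": {"f1": 0.75}}
print(select_best(results))  # run_b
```

Selecting on the validation split and reporting on the separate gold test split, as described above, avoids tuning hyperparameters against the data used for the final evaluation.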
Citation
If you use this model or the associated code, please consider citing the GitHub repository:
@misc{armenian_ner_network_2025,
author = {David Dallakyan},
title = {Armenian NER and Network Analysis Project},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/daviddallakyan2005/armenian-ner-network}}
}
Please also cite the original XLM-RoBERTa paper and the pioNER dataset creators if applicable.