---
license: apache-2.0
base_model: chuuhtetnaing/myanmar-pos-model
tags:
- token-classification
- myanmar
- ner-tagging
language:
- my
datasets:
- chuuhtetnaing/myanmar-ner-dataset
metrics:
- f1
---

# Myanmar NER Tagging Model

A fine-tuned version of [myanmar-pos-model](https://huggingface.co/chuuhtetnaing/myanmar-pos-model) for Myanmar named-entity recognition (NER) tagging.

## Training Results

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|------|----------|
| 1 | 1.5385 | 0.3730 | 0.5397 | 0.5068 | 0.5227 | 0.9175 |
| 2 | 0.2673 | 0.1809 | 0.7271 | 0.7958 | 0.7599 | 0.9481 |
| 3 | 0.1623 | 0.1295 | 0.7815 | 0.8408 | 0.8101 | 0.9637 |
| 4 | 0.1291 | 0.1015 | 0.7836 | 0.8602 | 0.8201 | 0.9710 |
| 5 | 0.0992 | 0.0965 | 0.8200 | 0.8943 | 0.8555 | 0.9719 |
| 6 | 0.0801 | 0.0879 | 0.8299 | 0.9019 | 0.8644 | 0.9738 |
| 7 | 0.0706 | 0.0819 | 0.8580 | 0.9137 | 0.8849 | 0.9765 |
| 8 | 0.0636 | 0.0768 | 0.8660 | 0.9148 | 0.8897 | 0.9780 |
| 9 | 0.0577 | 0.0757 | 0.8784 | 0.9202 | 0.8988 | 0.9784 |
| 10 | 0.0527 | 0.0760 | 0.8737 | 0.9125 | 0.8927 | 0.9791 |
| 11 | 0.0506 | 0.0785 | 0.8710 | 0.9236 | 0.8965 | 0.9775 |
| 12 | 0.0470 | 0.0754 | 0.8830 | 0.9225 | 0.9023 | 0.9794 |
| 13 | 0.0459 | 0.0754 | 0.8896 | 0.9231 | 0.9061 | 0.9802 |
| 14 | 0.0441 | 0.0813 | 0.8742 | 0.9274 | 0.9000 | 0.9779 |
| 15 | 0.0398 | 0.0763 | 0.8952 | 0.9247 | 0.9097 | 0.9812 |
| 16 | 0.0387 | 0.0841 | 0.8713 | 0.9252 | 0.8974 | 0.9779 |
| 17 | 0.0344 | 0.0805 | 0.8924 | 0.9258 | 0.9088 | 0.9805 |
| 18 | 0.0356 | 0.0790 | 0.8854 | 0.9279 | 0.9061 | 0.9802 |
| 19 | 0.0333 | 0.0801 | 0.8864 | 0.9249 | 0.9052 | 0.9806 |
| 20 | 0.0326 | 0.0788 | 0.8939 | 0.9254 | 0.9094 | 0.9817 |
| 21 | 0.0314 | 0.0801 | 0.8863 | 0.9263 | 0.9059 | 0.9808 |
| 22 | 0.0309 | 0.0815 | 0.8866 | 0.9267 | 0.9062 | 0.9806 |
| 23 | 0.0310 | 0.0825 | 0.8854 | 0.9281 | 0.9062 | 0.9804 |
| 24 | 0.0280 | 0.0828 | 0.8874 | 0.9272 | 0.9068 | 0.9807 |
| 25 | 0.0271 | 0.0826 | 0.8884 | 0.9276 | 0.9076 | 0.9809 |
| 26 | 0.0290 | 0.0828 | 0.8887 | 0.9272 | 0.9075 | 0.9807 |
| 27 | 0.0318 | 0.0835 | 0.8855 | 0.9256 | 0.9051 | 0.9803 |
| 28 | 0.0287 | 0.0837 | 0.8871 | 0.9267 | 0.9065 | 0.9805 |
| 29 | 0.0274 | 0.0837 | 0.8855 | 0.9272 | 0.9058 | 0.9804 |
| 30 | 0.0271 | 0.0832 | 0.8875 | 0.9267 | 0.9067 | 0.9806 |

## Test Set Evaluation

Evaluated on [myanmar-ner-dataset](https://huggingface.co/datasets/chuuhtetnaing/myanmar-ner-dataset) test split using seqeval metrics:

| Entity | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| DATE | 0.80 | 0.86 | 0.83 | 251 |
| LOC | 0.93 | 0.96 | 0.95 | 2712 |
| NUM | 0.89 | 0.92 | 0.90 | 789 |
| ORG | 0.44 | 0.62 | 0.52 | 94 |
| PER | 0.84 | 0.88 | 0.86 | 533 |
| TIME | 0.62 | 0.70 | 0.66 | 57 |
| **micro avg** | **0.89** | **0.93** | **0.91** | 4436 |
| **macro avg** | 0.75 | 0.82 | 0.78 | 4436 |
| **weighted avg** | **0.89** | **0.93** | **0.91** | 4436 |

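To make the difference between the macro and weighted rows concrete, here is a minimal sketch that recomputes precision and recall averages from the per-entity rows of the test-set table. The `rows` dictionary and variable names are illustrative, not part of the evaluation code. The inputs are the two-decimal values from the table, so recomputed numbers can differ from seqeval's own output in the last digit (F1 in particular).

```python
# Per-entity rows copied from the test-set table: (precision, recall, support)
rows = {
    "DATE": (0.80, 0.86, 251),
    "LOC": (0.93, 0.96, 2712),
    "NUM": (0.89, 0.92, 789),
    "ORG": (0.44, 0.62, 94),
    "PER": (0.84, 0.88, 533),
    "TIME": (0.62, 0.70, 57),
}

total_support = sum(s for _, _, s in rows.values())

# Macro average: every entity type counts equally, regardless of frequency
macro_precision = sum(p for p, _, _ in rows.values()) / len(rows)
macro_recall = sum(r for _, r, _ in rows.values()) / len(rows)

# Weighted average: each entity type counts in proportion to its support
weighted_precision = sum(p * s for p, _, s in rows.values()) / total_support
weighted_recall = sum(r * s for _, r, s in rows.values()) / total_support

print("support:", total_support)
print("macro:", round(macro_precision, 2), round(macro_recall, 2))
print("weighted:", round(weighted_precision, 2), round(weighted_recall, 2))
```

The macro scores sit well below the weighted ones because the rare ORG and TIME types score much lower than LOC, which accounts for more than half of the support.
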
## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | chuuhtetnaing/myanmar-pos-model |
| Total Epochs | 30 |
| Total Steps | 510 |
| Best Checkpoint | checkpoint-255 |
| Best F1 | 0.9097 |

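The checkpoint name can be related to the epoch table with a little arithmetic. The sketch below assumes the checkpoint is named after its global step (`checkpoint-<step>`) and that every epoch has the same number of steps; both are assumptions about how training was run, not facts stated in this card.

```python
total_epochs = 30
total_steps = 510
best_checkpoint_step = 255  # taken from the name "checkpoint-255"

# Assumption: every epoch contributes the same number of optimizer steps
steps_per_epoch = total_steps // total_epochs

# Assumption: the checkpoint suffix is the global step at which it was saved
best_epoch = best_checkpoint_step // steps_per_epoch

print(steps_per_epoch, best_epoch)
```

Under these assumptions the best checkpoint falls at the end of epoch 15, whose F1 in the training results is 0.9097, the same value listed as Best F1.
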
## Usage

```python
from transformers import pipeline

# aggregation_strategy="simple" groups sub-word tokens into whole entities
# (it replaces the deprecated grouped_entities=True argument)
ner = pipeline("token-classification", model="chuuhtetnaing/myanmar-ner-model", aggregation_strategy="simple")
result = ner("ကိုမောင်သည်ရန်ကုန်မြို့သို့သွားသည်။")  # Ko Maung went to Yangon city
print(result)
```

## Evaluation Code

```python
# Requires seqeval: pip install seqeval (prefix with "!" inside a notebook)

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm
from seqeval.metrics import classification_report

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("chuuhtetnaing/myanmar-ner-model")
tokenizer = AutoTokenizer.from_pretrained("chuuhtetnaing/myanmar-ner-model")

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens are excluded from scoring
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first sub-token of a word carries the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-tokens of the same word are excluded
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Load and tokenize dataset
ner = pipeline("token-classification", model="chuuhtetnaing/myanmar-ner-model", aggregation_strategy=None)
ds = load_dataset("chuuhtetnaing/myanmar-ner-dataset")
tokenized_ds = ds.map(tokenize_and_align_labels, batched=True)
test_ds = tokenized_ds["test"]

# Get label mapping (id -> tag name)
label_list = model.config.id2label

y_true = []
y_pred = []

for example in tqdm(test_ds):
    # Rebuild the text from the token ids and run the pipeline on it
    text = tokenizer.decode(example["input_ids"], skip_special_tokens=True)
    preds = ner(text)

    # Positions the pipeline does not report stay "O"
    pred_labels = ["O"] * len(example["labels"])
    for pred in preds:
        idx = pred["index"]
        if idx < len(pred_labels):
            pred_labels[idx] = pred["entity"]

    # Score only positions with a real label (skip -100)
    y_true.append([label_list[l] for l in example["labels"] if l != -100])
    y_pred.append([p for p, l in zip(pred_labels, example["labels"]) if l != -100])

print(classification_report(y_true, y_pred))
```

## NER Labels

| Tag | Description |
|-----|-------------|
| B-DATE | Beginning of Date |
| I-DATE | Inside Date |
| B-LOC | Beginning of Location |
| I-LOC | Inside Location |
| B-NUM | Beginning of Number |
| I-NUM | Inside Number |
| B-ORG | Beginning of Organization |
| I-ORG | Inside Organization |
| B-PER | Beginning of Person |
| I-PER | Inside Person |
| B-TIME | Beginning of Time |
| I-TIME | Inside Time |
| O | Outside (Not an entity) |

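As an illustration of how the B-/I-/O tags combine into entities, here is a minimal sketch that groups a tag sequence into spans. The `bio_to_spans` helper and the example tag sequence are hypothetical and written only for this card; the pipeline and seqeval use their own grouping logic, which may treat edge cases differently.

```python
def bio_to_spans(tags):
    """Group a BIO tag sequence into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag continues the open span only if the entity type matches
        continues = prefix == "I" and entity == current
        if not continues:
            if current is not None:
                spans.append((current, start, i))  # close the open span
            if prefix in ("B", "I"):
                # Lenient choice: a stray I- tag opens a new span
                start, current = i, entity
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical tag sequence, one tag per word
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]
spans = bio_to_spans(tags)
print(spans)
```

Each span is reported with an exclusive end index, so slicing the word list with `words[start:end]` recovers the words of that entity.
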