---
license: apache-2.0
base_model: chuuhtetnaing/myanmar-pos-model
tags:
- token-classification
- myanmar
- ner-tagging
language:
- my
datasets:
- chuuhtetnaing/myanmar-ner-dataset
metrics:
- f1
---
# Myanmar NER Tagging Model

Fine-tuned from `chuuhtetnaing/myanmar-pos-model` for Myanmar named entity recognition (NER) tagging.
## Training Results

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|--------|----------|
| 1 | 1.5385 | 0.3730 | 0.5397 | 0.5068 | 0.5227 | 0.9175 |
| 2 | 0.2673 | 0.1809 | 0.7271 | 0.7958 | 0.7599 | 0.9481 |
| 3 | 0.1623 | 0.1295 | 0.7815 | 0.8408 | 0.8101 | 0.9637 |
| 4 | 0.1291 | 0.1015 | 0.7836 | 0.8602 | 0.8201 | 0.9710 |
| 5 | 0.0992 | 0.0965 | 0.8200 | 0.8943 | 0.8555 | 0.9719 |
| 6 | 0.0801 | 0.0879 | 0.8299 | 0.9019 | 0.8644 | 0.9738 |
| 7 | 0.0706 | 0.0819 | 0.8580 | 0.9137 | 0.8849 | 0.9765 |
| 8 | 0.0636 | 0.0768 | 0.8660 | 0.9148 | 0.8897 | 0.9780 |
| 9 | 0.0577 | 0.0757 | 0.8784 | 0.9202 | 0.8988 | 0.9784 |
| 10 | 0.0527 | 0.0760 | 0.8737 | 0.9125 | 0.8927 | 0.9791 |
| 11 | 0.0506 | 0.0785 | 0.8710 | 0.9236 | 0.8965 | 0.9775 |
| 12 | 0.0470 | 0.0754 | 0.8830 | 0.9225 | 0.9023 | 0.9794 |
| 13 | 0.0459 | 0.0754 | 0.8896 | 0.9231 | 0.9061 | 0.9802 |
| 14 | 0.0441 | 0.0813 | 0.8742 | 0.9274 | 0.9000 | 0.9779 |
| 15 | 0.0398 | 0.0763 | 0.8952 | 0.9247 | 0.9097 | 0.9812 |
| 16 | 0.0387 | 0.0841 | 0.8713 | 0.9252 | 0.8974 | 0.9779 |
| 17 | 0.0344 | 0.0805 | 0.8924 | 0.9258 | 0.9088 | 0.9805 |
| 18 | 0.0356 | 0.0790 | 0.8854 | 0.9279 | 0.9061 | 0.9802 |
| 19 | 0.0333 | 0.0801 | 0.8864 | 0.9249 | 0.9052 | 0.9806 |
| 20 | 0.0326 | 0.0788 | 0.8939 | 0.9254 | 0.9094 | 0.9817 |
| 21 | 0.0314 | 0.0801 | 0.8863 | 0.9263 | 0.9059 | 0.9808 |
| 22 | 0.0309 | 0.0815 | 0.8866 | 0.9267 | 0.9062 | 0.9806 |
| 23 | 0.0310 | 0.0825 | 0.8854 | 0.9281 | 0.9062 | 0.9804 |
| 24 | 0.0280 | 0.0828 | 0.8874 | 0.9272 | 0.9068 | 0.9807 |
| 25 | 0.0271 | 0.0826 | 0.8884 | 0.9276 | 0.9076 | 0.9809 |
| 26 | 0.0290 | 0.0828 | 0.8887 | 0.9272 | 0.9075 | 0.9807 |
| 27 | 0.0318 | 0.0835 | 0.8855 | 0.9256 | 0.9051 | 0.9803 |
| 28 | 0.0287 | 0.0837 | 0.8871 | 0.9267 | 0.9065 | 0.9805 |
| 29 | 0.0274 | 0.0837 | 0.8855 | 0.9272 | 0.9058 | 0.9804 |
| 30 | 0.0271 | 0.0832 | 0.8875 | 0.9267 | 0.9067 | 0.9806 |
## Test Set Evaluation

Evaluated on the myanmar-ner-dataset test split using seqeval metrics:

| Entity | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| DATE | 0.80 | 0.86 | 0.83 | 251 |
| LOC | 0.93 | 0.96 | 0.95 | 2712 |
| NUM | 0.89 | 0.92 | 0.90 | 789 |
| ORG | 0.44 | 0.62 | 0.52 | 94 |
| PER | 0.84 | 0.88 | 0.86 | 533 |
| TIME | 0.62 | 0.70 | 0.66 | 57 |
| micro avg | 0.89 | 0.93 | 0.91 | 4436 |
| macro avg | 0.75 | 0.82 | 0.78 | 4436 |
| weighted avg | 0.89 | 0.93 | 0.91 | 4436 |
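As a quick sanity check (not part of the original evaluation), the weighted-average row can be reproduced, up to rounding, from the per-entity F1 scores and supports in the table above. The weighted average weighs each entity type by its support, which is why it tracks the dominant LOC class far more than the rare ORG and TIME classes:

```python
# Per-entity F1 and support, copied from the test-set table above
f1 = {"DATE": 0.83, "LOC": 0.95, "NUM": 0.90, "ORG": 0.52, "PER": 0.86, "TIME": 0.66}
support = {"DATE": 251, "LOC": 2712, "NUM": 789, "ORG": 94, "PER": 533, "TIME": 57}

total = sum(support.values())  # 4436 entities in the test split

# Weighted average: each entity type contributes proportionally to its support
weighted_f1 = sum(f1[e] * support[e] for e in f1) / total

print(f"weighted F1 ≈ {weighted_f1:.2f}")  # → weighted F1 ≈ 0.91
```

The macro average, by contrast, is the unweighted mean over entity types; recomputing it from the two-decimal values above lands near the table's 0.78, with small drift because the table entries are themselves rounded.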
## Training Details

| Parameter | Value |
|-----------------|---------------------------------|
| Base Model | chuuhtetnaing/myanmar-pos-model |
| Total Epochs | 30 |
| Total Steps | 510 |
| Best Checkpoint | checkpoint-255 |
| Best F1 | 0.9097 |

checkpoint-255 corresponds to epoch 15 (510 steps / 30 epochs = 17 steps per epoch), the epoch with the best validation F1 of 0.9097 in the table above.
## Usage

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="chuuhtetnaing/myanmar-ner-model",
    aggregation_strategy="simple",  # replaces the deprecated grouped_entities=True
)
result = ner("ကိုမောင်သည်ရန်ကုန်မြို့သို့သွားသည်။")
print(result)
```
## Evaluation Code

```python
# pip install seqeval
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm
from seqeval.metrics import classification_report

model = AutoModelForTokenClassification.from_pretrained("chuuhtetnaing/myanmar-ner-model")
tokenizer = AutoTokenizer.from_pretrained("chuuhtetnaing/myanmar-ner-model")

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens: mark with -100 so they are excluded from metrics
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # Label only the first sub-token of each word
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

ner = pipeline("token-classification", model="chuuhtetnaing/myanmar-ner-model", aggregation_strategy=None)

ds = load_dataset("chuuhtetnaing/myanmar-ner-dataset")
tokenized_ds = ds.map(tokenize_and_align_labels, batched=True)
test_ds = tokenized_ds["test"]

label_list = model.config.id2label

y_true = []
y_pred = []
for example in tqdm(test_ds):
    true_labels = [label_list[l] if l != -100 else "O" for l in example["labels"]]
    text = tokenizer.decode(example["input_ids"], skip_special_tokens=True)
    preds = ner(text)
    pred_labels = ["O"] * len(true_labels)
    for pred in preds:
        idx = pred["index"]
        if idx < len(pred_labels):
            pred_labels[idx] = pred["entity"]
    # Keep only positions with real word labels (drop the -100 sub-token padding)
    y_true.append([label_list[l] for l in example["labels"] if l != -100])
    y_pred.append([p for p, l in zip(pred_labels, example["labels"]) if l != -100])

print(classification_report(y_true, y_pred))
```
## NER Labels

| Tag | Description |
|--------|---------------------------|
| B-DATE | Beginning of Date |
| I-DATE | Inside Date |
| B-LOC | Beginning of Location |
| I-LOC | Inside Location |
| B-NUM | Beginning of Number |
| I-NUM | Inside Number |
| B-ORG | Beginning of Organization |
| I-ORG | Inside Organization |
| B-PER | Beginning of Person |
| I-PER | Inside Person |
| B-TIME | Beginning of Time |
| I-TIME | Inside Time |
| O | Outside (Not an entity) |
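The tag set above follows the standard BIO scheme. As a minimal sketch (not part of the original card, and simpler than what seqeval or the pipeline's aggregation does internally), token-level BIO tags can be grouped into entity spans like this:

```python
def bio_to_spans(tags):
    """Group a BIO tag sequence into (entity_type, start, end) spans, end exclusive.

    Stray I- tags that do not continue a matching entity are treated as O.
    """
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((ent_type, start, i))  # close the previous entity
            start, ent_type = i, tag[2:]
        elif tag.startswith("I-") and ent_type == tag[2:]:
            continue  # continuation of the current entity
        else:
            if start is not None:
                spans.append((ent_type, start, i))
            start, ent_type = None, None
    if start is not None:
        spans.append((ent_type, start, len(tags)))
    return spans

tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE", "I-DATE", "I-DATE"]
print(bio_to_spans(tags))  # → [('PER', 0, 2), ('LOC', 3, 4), ('DATE', 5, 8)]
```

The span indices refer to token positions, so they can be mapped back onto the word-segmented input to recover entity surface strings.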