---
license: apache-2.0
base_model: chuuhtetnaing/myanmar-pos-model
tags:
- token-classification
- myanmar
- ner-tagging
language:
- my
datasets:
- chuuhtetnaing/myanmar-ner-dataset
metrics:
- f1
---

# Myanmar NER Tagging Model

A fine-tuned version of [myanmar-pos-model](https://huggingface.co/chuuhtetnaing/myanmar-pos-model) for Myanmar named entity recognition (NER) tagging.

## Training Results

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|------|----------|
| 1 | 1.5385 | 0.3730 | 0.5397 | 0.5068 | 0.5227 | 0.9175 |
| 2 | 0.2673 | 0.1809 | 0.7271 | 0.7958 | 0.7599 | 0.9481 |
| 3 | 0.1623 | 0.1295 | 0.7815 | 0.8408 | 0.8101 | 0.9637 |
| 4 | 0.1291 | 0.1015 | 0.7836 | 0.8602 | 0.8201 | 0.9710 |
| 5 | 0.0992 | 0.0965 | 0.8200 | 0.8943 | 0.8555 | 0.9719 |
| 6 | 0.0801 | 0.0879 | 0.8299 | 0.9019 | 0.8644 | 0.9738 |
| 7 | 0.0706 | 0.0819 | 0.8580 | 0.9137 | 0.8849 | 0.9765 |
| 8 | 0.0636 | 0.0768 | 0.8660 | 0.9148 | 0.8897 | 0.9780 |
| 9 | 0.0577 | 0.0757 | 0.8784 | 0.9202 | 0.8988 | 0.9784 |
| 10 | 0.0527 | 0.0760 | 0.8737 | 0.9125 | 0.8927 | 0.9791 |
| 11 | 0.0506 | 0.0785 | 0.8710 | 0.9236 | 0.8965 | 0.9775 |
| 12 | 0.0470 | 0.0754 | 0.8830 | 0.9225 | 0.9023 | 0.9794 |
| 13 | 0.0459 | 0.0754 | 0.8896 | 0.9231 | 0.9061 | 0.9802 |
| 14 | 0.0441 | 0.0813 | 0.8742 | 0.9274 | 0.9000 | 0.9779 |
| 15 | 0.0398 | 0.0763 | 0.8952 | 0.9247 | 0.9097 | 0.9812 |
| 16 | 0.0387 | 0.0841 | 0.8713 | 0.9252 | 0.8974 | 0.9779 |
| 17 | 0.0344 | 0.0805 | 0.8924 | 0.9258 | 0.9088 | 0.9805 |
| 18 | 0.0356 | 0.0790 | 0.8854 | 0.9279 | 0.9061 | 0.9802 |
| 19 | 0.0333 | 0.0801 | 0.8864 | 0.9249 | 0.9052 | 0.9806 |
| 20 | 0.0326 | 0.0788 | 0.8939 | 0.9254 | 0.9094 | 0.9817 |
| 21 | 0.0314 | 0.0801 | 0.8863 | 0.9263 | 0.9059 | 0.9808 |
| 22 | 0.0309 | 0.0815 | 0.8866 | 0.9267 | 0.9062 | 0.9806 |
| 23 | 0.0310 | 0.0825 | 0.8854 | 0.9281 | 0.9062 | 0.9804 |
| 24 | 0.0280 | 0.0828 | 0.8874 | 0.9272 | 0.9068 | 0.9807 |
| 25 | 0.0271 | 0.0826 | 0.8884 | 0.9276 | 0.9076 | 0.9809 |
| 26 | 0.0290 | 0.0828 | 0.8887 | 0.9272 | 0.9075 | 0.9807 |
| 27 | 0.0318 | 0.0835 | 0.8855 | 0.9256 | 0.9051 | 0.9803 |
| 28 | 0.0287 | 0.0837 | 0.8871 | 0.9267 | 0.9065 | 0.9805 |
| 29 | 0.0274 | 0.0837 | 0.8855 | 0.9272 | 0.9058 | 0.9804 |
| 30 | 0.0271 | 0.0832 | 0.8875 | 0.9267 | 0.9067 | 0.9806 |

## Test Set Evaluation

Evaluated on the test split of [myanmar-ner-dataset](https://huggingface.co/datasets/chuuhtetnaing/myanmar-ner-dataset) using seqeval metrics:

| Entity | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| DATE | 0.80 | 0.86 | 0.83 | 251 |
| LOC | 0.93 | 0.96 | 0.95 | 2712 |
| NUM | 0.89 | 0.92 | 0.90 | 789 |
| ORG | 0.44 | 0.62 | 0.52 | 94 |
| PER | 0.84 | 0.88 | 0.86 | 533 |
| TIME | 0.62 | 0.70 | 0.66 | 57 |
| **micro avg** | **0.89** | **0.93** | **0.91** | 4436 |
| **macro avg** | 0.75 | 0.82 | 0.78 | 4436 |
| **weighted avg** | **0.89** | **0.93** | **0.91** | 4436 |

## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | chuuhtetnaing/myanmar-pos-model |
| Total Epochs | 30 |
| Total Steps | 510 |
| Best Checkpoint | checkpoint-255 (epoch 15) |
| Best F1 | 0.9097 |

## Usage

```python
from transformers import pipeline

# aggregation_strategy="simple" merges sub-tokens into whole entities
# (it replaces the deprecated grouped_entities=True argument)
ner = pipeline(
    "token-classification",
    model="chuuhtetnaing/myanmar-ner-model",
    aggregation_strategy="simple",
)

result = ner("ကိုမောင်သည်ရန်ကုန်မြို့သို့သွားသည်။")  # Ko Maung went to Yangon city
print(result)
```

## Evaluation Code

```python
# Requires seqeval: pip install seqeval (in a notebook: !pip install seqeval)
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm
from seqeval.metrics import classification_report

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("chuuhtetnaing/myanmar-ner-model")
tokenizer = AutoTokenizer.from_pretrained("chuuhtetnaing/myanmar-ner-model")

def tokenize_and_align_labels(examples):
    """Tokenize pre-split words and label only the first sub-token of each word."""
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens are ignored by the metric
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First sub-token of a word carries the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-tokens of the same word are ignored
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Build a pipeline that returns one prediction per token (no aggregation)
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="none")

# Load and tokenize dataset
ds = load_dataset("chuuhtetnaing/myanmar-ner-dataset")
tokenized_ds = ds.map(tokenize_and_align_labels, batched=True)
test_ds = tokenized_ds["test"]

# Get label mapping (id -> tag string)
label_list = model.config.id2label

y_true = []
y_pred = []

for example in tqdm(test_ds):
    # Rebuild the text and run the pipeline on it; this assumes re-tokenizing
    # the decoded text yields the same token positions as example["input_ids"]
    text = tokenizer.decode(example["input_ids"], skip_special_tokens=True)
    preds = ner(text)

    # Start from "O" everywhere, then fill in predicted tags by token index
    pred_labels = ["O"] * len(example["labels"])
    for pred in preds:
        idx = pred["index"]
        if idx < len(pred_labels):
            pred_labels[idx] = pred["entity"]

    # Keep only positions that carry a real label
    y_true.append([label_list[l] for l in example["labels"] if l != -100])
    y_pred.append([p for p, l in zip(pred_labels, example["labels"]) if l != -100])

print(classification_report(y_true, y_pred))
```

## NER Labels

| Tag | Description |
|-----|-------------|
| B-DATE | Beginning of Date |
| I-DATE | Inside Date |
| B-LOC | Beginning of Location |
| I-LOC | Inside Location |
| B-NUM | Beginning of Number |
| I-NUM | Inside Number |
| B-ORG | Beginning of Organization |
| I-ORG | Inside Organization |
| B-PER | Beginning of Person |
| I-PER | Inside Person |
| B-TIME | Beginning of Time |
| I-TIME | Inside Time |
| O | Outside (Not an entity) |
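
The B-/I-/O tags listed above mark where each entity begins and continues. As a minimal sketch of how such a tag sequence can be turned into entity spans, the helper below walks the tags once and emits `(entity_type, start, end)` tuples. The function name `bio_to_spans` and the sample tag sequence are hypothetical illustrations and not part of this model's code or output; the lenient handling of a stray `I-` tag is an assumption, not necessarily how seqeval or the pipeline treats it.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        # Anything other than a matching "I-" tag closes the open span
        if not continues and current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # Assumption: a stray "I-" with no open span leniently starts a new one
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    # Close a span that runs to the end of the sequence
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical tag sequence for illustration only (not real model output)
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
print(bio_to_spans(example_tags))
```

Span indices refer to positions in the tag list, so they can be used to slice the matching token list directly.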