Myanmar Text Segmentation Model
A fine-tuned version of FacebookAI/xlm-roberta-base for Myanmar text segmentation (word boundary detection), framed as a token classification task.
Training Results
| Step | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1000 | 0.1133 | 0.1145 | 0.7704 | 0.7464 | 0.7582 | 0.9564 |
| 2000 | 0.1115 | 0.1151 | 0.7620 | 0.7644 | 0.7632 | 0.9566 |
| 3000 | 0.1085 | 0.1163 | 0.7514 | 0.7724 | 0.7618 | 0.9558 |
| 4000 | 0.1076 | 0.1168 | 0.7791 | 0.7360 | 0.7569 | 0.9568 |
| 5000 | 0.1056 | 0.1118 | 0.7720 | 0.7604 | 0.7661 | 0.9576 |
| 6000 | 0.1141 | 0.1120 | 0.7751 | 0.7613 | 0.7682 | 0.9581 |
| 7000 | 0.1154 | 0.1108 | 0.7656 | 0.7717 | 0.7686 | 0.9576 |
| 8000 | 0.1128 | 0.1092 | 0.7744 | 0.7662 | 0.7703 | 0.9584 |
| 9000 | 0.1082 | 0.1095 | 0.7810 | 0.7620 | 0.7714 | 0.9589 |
| 10000 | 0.1101 | 0.1074 | 0.7817 | 0.7628 | 0.7721 | 0.9591 |
| 11000 | 0.1101 | 0.1079 | 0.7769 | 0.7721 | 0.7745 | 0.9591 |
| 12000 | 0.1058 | 0.1093 | 0.7672 | 0.7800 | 0.7735 | 0.9585 |
| 13000 | 0.1070 | 0.1075 | 0.7856 | 0.7622 | 0.7738 | 0.9596 |
| 14000 | 0.1055 | 0.1066 | 0.7824 | 0.7714 | 0.7769 | 0.9599 |
| 15000 | 0.1054 | 0.1065 | 0.7778 | 0.7724 | 0.7751 | 0.9593 |
| 16000 | 0.1038 | 0.1062 | 0.7775 | 0.7749 | 0.7762 | 0.9595 |
| 17000 | 0.1052 | 0.1053 | 0.7884 | 0.7635 | 0.7758 | 0.9600 |
| 18000 | 0.1061 | 0.1062 | 0.7796 | 0.7767 | 0.7781 | 0.9599 |
| 19000 | 0.1038 | 0.1063 | 0.7753 | 0.7811 | 0.7782 | 0.9597 |
| 20000 | 0.1017 | 0.1048 | 0.7848 | 0.7734 | 0.7791 | 0.9603 |
| 21000 | 0.1029 | 0.1046 | 0.7827 | 0.7774 | 0.7800 | 0.9603 |
| 22000 | 0.1018 | 0.1043 | 0.7844 | 0.7753 | 0.7798 | 0.9604 |
| 23000 | 0.1007 | 0.1043 | 0.7889 | 0.7689 | 0.7787 | 0.9605 |
| 24000 | 0.1008 | 0.1048 | 0.7859 | 0.7747 | 0.7803 | 0.9605 |
| 25000 | 0.1015 | 0.1039 | 0.7862 | 0.7759 | 0.7810 | 0.9607 |
| 26000 | 0.1008 | 0.1063 | 0.7832 | 0.7752 | 0.7792 | 0.9602 |
| 27000 | 0.1003 | 0.1039 | 0.7889 | 0.7728 | 0.7808 | 0.9608 |
| 28000 | 0.0998 | 0.1030 | 0.7830 | 0.7839 | 0.7834 | 0.9608 |
| 29000 | 0.0991 | 0.1038 | 0.7844 | 0.7817 | 0.7831 | 0.9608 |
| 30000 | 0.0971 | 0.1045 | 0.7891 | 0.7761 | 0.7825 | 0.9610 |
| 31000 | 0.0961 | 0.1047 | 0.7882 | 0.7761 | 0.7821 | 0.9609 |
| 32000 | 0.0958 | 0.1030 | 0.7869 | 0.7804 | 0.7837 | 0.9611 |
| 33000 | 0.0963 | 0.1037 | 0.7857 | 0.7801 | 0.7829 | 0.9609 |
| 34000 | 0.0972 | 0.1036 | 0.7910 | 0.7759 | 0.7834 | 0.9612 |
| 35000 | 0.0966 | 0.1030 | 0.7880 | 0.7804 | 0.7842 | 0.9612 |
| 36000 | 0.0972 | 0.1024 | 0.7900 | 0.7785 | 0.7842 | 0.9613 |
| 37000 | 0.0961 | 0.1036 | 0.7898 | 0.7780 | 0.7838 | 0.9613 |
| 38000 | 0.0968 | 0.1024 | 0.7892 | 0.7808 | 0.7850 | 0.9614 |
| 39000 | 0.0956 | 0.1032 | 0.7892 | 0.7810 | 0.7850 | 0.9614 |
| 40000 | 0.0946 | 0.1032 | 0.7866 | 0.7834 | 0.7850 | 0.9613 |
| 41000 | 0.0933 | 0.1028 | 0.7904 | 0.7806 | 0.7854 | 0.9615 |
| 42000 | 0.0928 | 0.1031 | 0.7923 | 0.7768 | 0.7845 | 0.9615 |
| 43000 | 0.0924 | 0.1034 | 0.7847 | 0.7874 | 0.7860 | 0.9613 |
| 44000 | 0.0939 | 0.1041 | 0.7830 | 0.7897 | 0.7863 | 0.9612 |
| 45000 | 0.0946 | 0.1029 | 0.7903 | 0.7828 | 0.7865 | 0.9617 |
| 46000 | 0.0939 | 0.1022 | 0.7901 | 0.7820 | 0.7860 | 0.9616 |
| 47000 | 0.0931 | 0.1026 | 0.7907 | 0.7799 | 0.7853 | 0.9615 |
| 48000 | 0.0927 | 0.1024 | 0.7916 | 0.7804 | 0.7860 | 0.9617 |
| 49000 | 0.0942 | 0.1022 | 0.7909 | 0.7831 | 0.7869 | 0.9618 |
| 50000 | 0.0930 | 0.1021 | 0.7910 | 0.7823 | 0.7866 | 0.9617 |
| 51000 | 0.0902 | 0.1035 | 0.7887 | 0.7854 | 0.7870 | 0.9617 |
| 52000 | 0.0907 | 0.1032 | 0.7917 | 0.7814 | 0.7865 | 0.9617 |
| 53000 | 0.0903 | 0.1034 | 0.7905 | 0.7830 | 0.7867 | 0.9617 |
| 54000 | 0.0908 | 0.1029 | 0.7904 | 0.7828 | 0.7866 | 0.9617 |
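As a sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it for a few rows. The helper name `f1_score` is illustrative and not part of the original training code, and small differences from the reported values are expected because the table is rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (step, precision, recall, reported F1) copied from the table above
rows = [
    (1000, 0.7704, 0.7464, 0.7582),
    (50000, 0.7910, 0.7823, 0.7866),
    (54000, 0.7904, 0.7828, 0.7866),
]

for step, p, r, reported in rows:
    # Print the recomputed value next to the reported one for comparison
    print(step, round(f1_score(p, r), 4), reported)
```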
Training Details
| Parameter | Value |
|---|---|
| Base Model | FacebookAI/xlm-roberta-base |
| Total Steps | 54000 |
| Epochs | 4.96 |
| Final Training Loss | 0.0908 |
| Best Checkpoint | myanmar_text_segmentation_model/checkpoint-50000 |
| Best Validation Loss | 0.1021 |
| F1 at Best Checkpoint | 0.7866 |
| Learning Rate | 2e-5 |
| Batch Size | 25 |
| Weight Decay | 0.01 |
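The best checkpoint listed above coincides with the lowest validation loss in the results table (0.1021 at step 50000) rather than the highest F1 (0.7870 at step 51000). Whether validation loss was the configured selection metric is an inference from the table, not something the source states. A minimal sketch of that comparison over the final evaluation rows, with an illustrative `rows` list:

```python
# (step, validation_loss, f1) for the final evaluation rows of the results table
rows = [
    (46000, 0.1022, 0.7860),
    (47000, 0.1026, 0.7853),
    (48000, 0.1024, 0.7860),
    (49000, 0.1022, 0.7869),
    (50000, 0.1021, 0.7866),
    (51000, 0.1035, 0.7870),
    (52000, 0.1032, 0.7865),
    (53000, 0.1034, 0.7867),
    (54000, 0.1029, 0.7866),
]

# Pick the row with the smallest validation loss and the row with the largest F1
best_by_loss = min(rows, key=lambda row: row[1])
best_by_f1 = max(rows, key=lambda row: row[2])

print("lowest validation loss:", best_by_loss)
print("highest F1:", best_by_f1)
```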
Usage
Using Pipeline
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="chuuhtetnaing/myanmar-text-segmentation-model",
    aggregation_strategy="simple",  # replaces the deprecated grouped_entities=True
)
segments = nlp("အချစ်ဆိုတာလူတွေရှင်သန်ဖို့သဘာဝကပေးတဲ့လက်နက်လား၊ဒါမှမဟုတ်ယဉ်ကျေးမှုအရတီထွင်ထားတဲ့စိတ်ကူးယဉ်မှုသက်သက်လား။")

segmented_text = []
for segment in segments:
    if segment["entity_group"] == "B" or not segmented_text:
        # 'B' starts a new word; a leading 'I' is also treated as a start to avoid indexing an empty list
        segmented_text.append(segment["word"])
    else:  # 'I' - append to previous word
        segmented_text[-1] += segment["word"]
segmented_text = " ".join(segmented_text)
print(segmented_text)
# အချစ်ဆိုတာ လူတွေရှင်သန်ဖို့ သဘာဝကပေးတဲ့လက်နက်လား၊ ဒါမှမဟုတ် ယဉ်ကျေးမှုအရ တီထွင်ထားတဲ့ စိတ်ကူးယဉ်မှုသက်သက်လား။
Label Mapping
- B: Beginning of a word/segment
- I: Inside a word/segment (continuation)
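The B/I scheme can be decoded without the model in the loop. The sketch below merges already-labelled pieces into segments. The function name `merge_bi_segments` and the way the sample text is split into pieces are hypothetical, chosen only to illustrate the labels, and do not reflect the model's actual tokenization.

```python
from typing import List, Tuple


def merge_bi_segments(pieces: List[Tuple[str, str]]) -> List[str]:
    """Merge (text, label) pairs into segments using the B/I scheme."""
    segments: List[str] = []
    for text, label in pieces:
        if label == "B" or not segments:
            # "B" opens a new segment; a leading "I" is also treated as a start
            segments.append(text)
        else:
            # "I" continues the most recent segment
            segments[-1] += text
    return segments


# Hypothetical labelled pieces (not real model output)
pieces = [("အချစ်", "B"), ("ဆိုတာ", "I"), ("လူတွေ", "B"), ("ရှင်သန်ဖို့", "I")]
print(" ".join(merge_bi_segments(pieces)))
```

Treating a leading "I" as a segment start is a design choice made here so that malformed label sequences do not raise an error.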
Dataset
Trained on the chuuhtetnaing/myanmar-text-segmentation-dataset dataset.