Myanmar Text Segmentation Model

A fine-tuned FacebookAI/xlm-roberta-base model for Myanmar text segmentation (word boundary detection), framed as a token classification task.

Training Results

Step   Training Loss  Validation Loss  Precision  Recall  F1      Accuracy
1000   0.1133         0.1145           0.7704     0.7464  0.7582  0.9564
2000   0.1115         0.1151           0.7620     0.7644  0.7632  0.9566
3000   0.1085         0.1163           0.7514     0.7724  0.7618  0.9558
4000   0.1076         0.1168           0.7791     0.7360  0.7569  0.9568
5000   0.1056         0.1118           0.7720     0.7604  0.7661  0.9576
6000   0.1141         0.1120           0.7751     0.7613  0.7682  0.9581
7000   0.1154         0.1108           0.7656     0.7717  0.7686  0.9576
8000   0.1128         0.1092           0.7744     0.7662  0.7703  0.9584
9000   0.1082         0.1095           0.7810     0.7620  0.7714  0.9589
10000  0.1101         0.1074           0.7817     0.7628  0.7721  0.9591
11000  0.1101         0.1079           0.7769     0.7721  0.7745  0.9591
12000  0.1058         0.1093           0.7672     0.7800  0.7735  0.9585
13000  0.1070         0.1075           0.7856     0.7622  0.7738  0.9596
14000  0.1055         0.1066           0.7824     0.7714  0.7769  0.9599
15000  0.1054         0.1065           0.7778     0.7724  0.7751  0.9593
16000  0.1038         0.1062           0.7775     0.7749  0.7762  0.9595
17000  0.1052         0.1053           0.7884     0.7635  0.7758  0.9600
18000  0.1061         0.1062           0.7796     0.7767  0.7781  0.9599
19000  0.1038         0.1063           0.7753     0.7811  0.7782  0.9597
20000  0.1017         0.1048           0.7848     0.7734  0.7791  0.9603
21000  0.1029         0.1046           0.7827     0.7774  0.7800  0.9603
22000  0.1018         0.1043           0.7844     0.7753  0.7798  0.9604
23000  0.1007         0.1043           0.7889     0.7689  0.7787  0.9605
24000  0.1008         0.1048           0.7859     0.7747  0.7803  0.9605
25000  0.1015         0.1039           0.7862     0.7759  0.7810  0.9607
26000  0.1008         0.1063           0.7832     0.7752  0.7792  0.9602
27000  0.1003         0.1039           0.7889     0.7728  0.7808  0.9608
28000  0.0998         0.1030           0.7830     0.7839  0.7834  0.9608
29000  0.0991         0.1038           0.7844     0.7817  0.7831  0.9608
30000  0.0971         0.1045           0.7891     0.7761  0.7825  0.9610
31000  0.0961         0.1047           0.7882     0.7761  0.7821  0.9609
32000  0.0958         0.1030           0.7869     0.7804  0.7837  0.9611
33000  0.0963         0.1037           0.7857     0.7801  0.7829  0.9609
34000  0.0972         0.1036           0.7910     0.7759  0.7834  0.9612
35000  0.0966         0.1030           0.7880     0.7804  0.7842  0.9612
36000  0.0972         0.1024           0.7900     0.7785  0.7842  0.9613
37000  0.0961         0.1036           0.7898     0.7780  0.7838  0.9613
38000  0.0968         0.1024           0.7892     0.7808  0.7850  0.9614
39000  0.0956         0.1032           0.7892     0.7810  0.7850  0.9614
40000  0.0946         0.1032           0.7866     0.7834  0.7850  0.9613
41000  0.0933         0.1028           0.7904     0.7806  0.7854  0.9615
42000  0.0928         0.1031           0.7923     0.7768  0.7845  0.9615
43000  0.0924         0.1034           0.7847     0.7874  0.7860  0.9613
44000  0.0939         0.1041           0.7830     0.7897  0.7863  0.9612
45000  0.0946         0.1029           0.7903     0.7828  0.7865  0.9617
46000  0.0939         0.1022           0.7901     0.7820  0.7860  0.9616
47000  0.0931         0.1026           0.7907     0.7799  0.7853  0.9615
48000  0.0927         0.1024           0.7916     0.7804  0.7860  0.9617
49000  0.0942         0.1022           0.7909     0.7831  0.7869  0.9618
50000  0.0930         0.1021           0.7910     0.7823  0.7866  0.9617
51000  0.0902         0.1035           0.7887     0.7854  0.7870  0.9617
52000  0.0907         0.1032           0.7917     0.7814  0.7865  0.9617
53000  0.0903         0.1034           0.7905     0.7830  0.7867  0.9617
54000  0.0908         0.1029           0.7904     0.7828  0.7866  0.9617
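
The F1 column is the harmonic mean of the Precision and Recall columns. As a quick sanity check, the sketch below recomputes it for the step-50000 row. `f1_score` is an illustrative helper, not part of the training code, and because the table values are rounded to four decimals, the last digit may differ by one on other rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Step 50000 row from the table above: Precision 0.7910, Recall 0.7823.
print(round(f1_score(0.7910, 0.7823), 4))  # 0.7866, matching the F1 column
```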

Training Details

Parameter               Value
Base Model              FacebookAI/xlm-roberta-base
Total Steps             54000
Epochs                  4.96
Final Training Loss     0.0908
Best Checkpoint         myanmar_text_segmentation_model/checkpoint-50000
Best Validation Loss    0.1021 (F1 0.7866 at this checkpoint)
Learning Rate           2e-5
Batch Size              25
Weight Decay            0.01
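
The reported best checkpoint coincides with the lowest validation loss in the results table (0.1021 at step 50000), whereas the highest F1 (0.7870) occurs at step 51000. The sketch below reproduces both selections on the last six rows of the table. The tuple layout and variable names are illustrative assumptions, not the author's training code.

```python
# (step, validation_loss, f1) for the last six evaluation steps in the results table.
rows = [
    (49000, 0.1022, 0.7869),
    (50000, 0.1021, 0.7866),
    (51000, 0.1035, 0.7870),
    (52000, 0.1032, 0.7865),
    (53000, 0.1034, 0.7867),
    (54000, 0.1029, 0.7866),
]

best_by_loss = min(rows, key=lambda row: row[1])  # lowest validation loss
best_by_f1 = max(rows, key=lambda row: row[2])    # highest F1

print(best_by_loss[0])  # 50000
print(best_by_f1[0])    # 51000
```

With the Hugging Face Trainer, this kind of selection is usually configured through `load_best_model_at_end` and `metric_for_best_model` in `TrainingArguments`. Whether those settings were used for this model is not stated here.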

Usage

Using Pipeline

from transformers import pipeline

# aggregation_strategy="simple" supersedes the deprecated grouped_entities=True
nlp = pipeline("token-classification", model="chuuhtetnaing/myanmar-text-segmentation-model", aggregation_strategy="simple")
segments = nlp("အချစ်ဆိုတာလူတွေရှင်သန်ဖို့သဘာဝကပေးတဲ့လက်နက်လား၊ဒါမှမဟုတ်ယဉ်ကျေးမှုအရတီထွင်ထားတဲ့စိတ်ကူးယဉ်မှုသက်သက်လား။")

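# A "B" group starts a new word; an "I" group continues the previous one.
words = []
for segment in segments:
    # "not words" guards against an IndexError if the first group is labeled "I"
    if segment["entity_group"] == "B" or not words:
        words.append(segment["word"])
    else:  # "I": append to the previous word
        words[-1] += segment["word"]
segmented_text = " ".join(words)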

print(segmented_text)

# အချစ်ဆိုတာ လူတွေရှင်သန်ဖို့ သဘာဝကပေးတဲ့လက်နက်လား၊ ဒါမှမဟုတ် ယဉ်ကျေးမှုအရ တီထွင်ထားတဲ့ စိတ်ကူးယဉ်မှုသက်သက်လား။

Label Mapping

  • B: Beginning of a word/segment
  • I: Inside a word/segment (continuation)
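
To make the two labels concrete, the sketch below rebuilds words from a sequence of text pieces and their B/I labels. The `merge_pieces` helper and the example split are hypothetical. The model's actual tokenization into subword pieces will differ.

```python
def merge_pieces(pieces: list[str], labels: list[str]) -> list[str]:
    """Join text pieces into words: "B" opens a new word, "I" extends the last one."""
    if len(pieces) != len(labels):
        raise ValueError("pieces and labels must have the same length")
    words: list[str] = []
    for piece, label in zip(pieces, labels):
        # Treat a leading "I" as a word start so the first piece is never dropped.
        if label == "B" or not words:
            words.append(piece)
        else:
            words[-1] += piece
    return words


# Hypothetical split of the start of the example sentence.
pieces = ["အချစ်", "ဆိုတာ", "လူတွေ"]
labels = ["B", "I", "B"]
print(" ".join(merge_pieces(pieces, labels)))
```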

Dataset

Trained on the chuuhtetnaing/myanmar-text-segmentation-dataset dataset.
