Myanmar Text Segmentation Model

A fine-tuned version of FacebookAI/xlm-roberta-base for Myanmar text segmentation (word boundary detection), framed as a token classification task.

Training Results

Step	Training Loss	Validation Loss	Precision	Recall	F1	Accuracy
1000	0.1133	0.1145	0.7704	0.7464	0.7582	0.9564
2000	0.1115	0.1151	0.7620	0.7644	0.7632	0.9566
3000	0.1085	0.1163	0.7514	0.7724	0.7618	0.9558
4000	0.1076	0.1168	0.7791	0.7360	0.7569	0.9568
5000	0.1056	0.1118	0.7720	0.7604	0.7661	0.9576
6000	0.1141	0.1120	0.7751	0.7613	0.7682	0.9581
7000	0.1154	0.1108	0.7656	0.7717	0.7686	0.9576
8000	0.1128	0.1092	0.7744	0.7662	0.7703	0.9584
9000	0.1082	0.1095	0.7810	0.7620	0.7714	0.9589
10000	0.1101	0.1074	0.7817	0.7628	0.7721	0.9591
11000	0.1101	0.1079	0.7769	0.7721	0.7745	0.9591
12000	0.1058	0.1093	0.7672	0.7800	0.7735	0.9585
13000	0.1070	0.1075	0.7856	0.7622	0.7738	0.9596
14000	0.1055	0.1066	0.7824	0.7714	0.7769	0.9599
15000	0.1054	0.1065	0.7778	0.7724	0.7751	0.9593
16000	0.1038	0.1062	0.7775	0.7749	0.7762	0.9595
17000	0.1052	0.1053	0.7884	0.7635	0.7758	0.9600
18000	0.1061	0.1062	0.7796	0.7767	0.7781	0.9599
19000	0.1038	0.1063	0.7753	0.7811	0.7782	0.9597
20000	0.1017	0.1048	0.7848	0.7734	0.7791	0.9603
21000	0.1029	0.1046	0.7827	0.7774	0.7800	0.9603
22000	0.1018	0.1043	0.7844	0.7753	0.7798	0.9604
23000	0.1007	0.1043	0.7889	0.7689	0.7787	0.9605
24000	0.1008	0.1048	0.7859	0.7747	0.7803	0.9605
25000	0.1015	0.1039	0.7862	0.7759	0.7810	0.9607
26000	0.1008	0.1063	0.7832	0.7752	0.7792	0.9602
27000	0.1003	0.1039	0.7889	0.7728	0.7808	0.9608
28000	0.0998	0.1030	0.7830	0.7839	0.7834	0.9608
29000	0.0991	0.1038	0.7844	0.7817	0.7831	0.9608
30000	0.0971	0.1045	0.7891	0.7761	0.7825	0.9610
31000	0.0961	0.1047	0.7882	0.7761	0.7821	0.9609
32000	0.0958	0.1030	0.7869	0.7804	0.7837	0.9611
33000	0.0963	0.1037	0.7857	0.7801	0.7829	0.9609
34000	0.0972	0.1036	0.7910	0.7759	0.7834	0.9612
35000	0.0966	0.1030	0.7880	0.7804	0.7842	0.9612
36000	0.0972	0.1024	0.7900	0.7785	0.7842	0.9613
37000	0.0961	0.1036	0.7898	0.7780	0.7838	0.9613
38000	0.0968	0.1024	0.7892	0.7808	0.7850	0.9614
39000	0.0956	0.1032	0.7892	0.7810	0.7850	0.9614
40000	0.0946	0.1032	0.7866	0.7834	0.7850	0.9613
41000	0.0933	0.1028	0.7904	0.7806	0.7854	0.9615
42000	0.0928	0.1031	0.7923	0.7768	0.7845	0.9615
43000	0.0924	0.1034	0.7847	0.7874	0.7860	0.9613
44000	0.0939	0.1041	0.7830	0.7897	0.7863	0.9612
45000	0.0946	0.1029	0.7903	0.7828	0.7865	0.9617
46000	0.0939	0.1022	0.7901	0.7820	0.7860	0.9616
47000	0.0931	0.1026	0.7907	0.7799	0.7853	0.9615
48000	0.0927	0.1024	0.7916	0.7804	0.7860	0.9617
49000	0.0942	0.1022	0.7909	0.7831	0.7869	0.9618
50000	0.0930	0.1021	0.7910	0.7823	0.7866	0.9617
51000	0.0902	0.1035	0.7887	0.7854	0.7870	0.9617
52000	0.0907	0.1032	0.7917	0.7814	0.7865	0.9617
53000	0.0903	0.1034	0.7905	0.7830	0.7867	0.9617
54000	0.0908	0.1029	0.7904	0.7828	0.7866	0.9617
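
The precision, recall, and F1 columns can be cross-checked against each other. The sketch below is an illustration only: it assumes F1 is the usual harmonic mean of precision and recall (the card does not say which metric library was used), and it copies the last six rows of the table by hand. Because the table values are rounded to four decimals, the recomputed F1 is compared with a small tolerance rather than exactly.

```python
# (step, validation_loss, precision, recall, f1), copied from the last six table rows
rows = [
    (49000, 0.1022, 0.7909, 0.7831, 0.7869),
    (50000, 0.1021, 0.7910, 0.7823, 0.7866),
    (51000, 0.1035, 0.7887, 0.7854, 0.7870),
    (52000, 0.1032, 0.7917, 0.7814, 0.7865),
    (53000, 0.1034, 0.7905, 0.7830, 0.7867),
    (54000, 0.1029, 0.7904, 0.7828, 0.7866),
]


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between the recomputed F1 and the reported F1
max_gap = max(abs(f1_from(p, r) - f1) for _, _, p, r, f1 in rows)

# Steps with the lowest validation loss and the highest F1 among these rows
lowest_loss = min(rows, key=lambda row: row[1])
highest_f1 = max(rows, key=lambda row: row[4])

print(f"max F1 gap: {max_gap:.5f}")
print("lowest validation loss at step", lowest_loss[0])  # 50000
print("highest F1 at step", highest_f1[0])               # 51000
```

In these rows the lowest validation loss and the highest F1 fall on different steps, which is worth keeping in mind when choosing a checkpoint.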

Training Details

Parameter	Value
Base Model	FacebookAI/xlm-roberta-base
Total Steps	54000
Epochs	4.961867132224571
Final Training Loss	0.0908
Best Checkpoint	myanmar_text_segmentation_model/checkpoint-50000
Best Validation Loss	0.1021
F1 at Best Checkpoint	0.7866
Learning Rate	2e-5
Batch Size	25
Weight Decay	0.01
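
As a rough sanity check, the step count, epoch count, and batch size above can be combined to estimate how many training examples one epoch covers. This is a back-of-the-envelope sketch, not a figure reported by the author: it assumes that the batch size of 25 is the effective batch per optimizer step (single device, no gradient accumulation) and that the last partial batch of an epoch is kept.

```python
total_steps = 54000
epochs = 4.961867132224571  # value reported in the table above
batch_size = 25

# Optimizer steps needed for one full pass over the training set
steps_per_epoch = round(total_steps / epochs)

# With a possibly partial final batch, the training set size lies in this range
max_examples = steps_per_epoch * batch_size
min_examples = (steps_per_epoch - 1) * batch_size + 1

print(steps_per_epoch)             # 10883
print(min_examples, max_examples)  # 272051 272075
```

Under those assumptions the training split holds roughly 272,000 examples; a different device count or gradient accumulation setting would scale this estimate accordingly.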

Usage

Using Pipeline

from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
nlp = pipeline("token-classification", model="chuuhtetnaing/myanmar-text-segmentation-model", aggregation_strategy="simple")
segments = nlp("အချစ်ဆိုတာလူတွေရှင်သန်ဖို့သဘာဝကပေးတဲ့လက်နက်လား၊ဒါမှမဟုတ်ယဉ်ကျေးမှုအရတီထွင်ထားတဲ့စိတ်ကူးယဉ်မှုသက်သက်လား။")

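segmented_text = []
for segment in segments:
    # "B" starts a new word; "I" continues the previous one.
    # A leading "I" group also starts a new word, so [-1] is never used on an empty list.
    if segment["entity_group"] == "B" or not segmented_text:
        segmented_text.append(segment["word"])
    else:
        segmented_text[-1] += segment["word"]
segmented_text = " ".join(segmented_text)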

print(segmented_text)

# အချစ်ဆိုတာ လူတွေရှင်သန်ဖို့ သဘာဝကပေးတဲ့လက်နက်လား၊ ဒါမှမဟုတ် ယဉ်ကျေးမှုအရ တီထွင်ထားတဲ့ စိတ်ကူးယဉ်မှုသက်သက်လား။
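
The merging step can also be factored into a small function and exercised without downloading the model. The sketch below is one possible way to do that; `join_segments` and `mock_groups` are hypothetical names, and the mock data is a hand-written stand-in that only mimics the `entity_group` and `word` keys of the pipeline output used above.

```python
def join_segments(groups):
    """Merge pipeline groups labelled "B"/"I" into a list of segments."""
    words = []
    for group in groups:
        # "B" opens a new segment; a leading "I" is also treated as a start
        if group["entity_group"] == "B" or not words:
            words.append(group["word"])
        else:
            # "I" continues the most recent segment
            words[-1] += group["word"]
    return words


# Hand-written stand-in for pipeline output (not real model predictions)
mock_groups = [
    {"entity_group": "B", "word": "ab"},
    {"entity_group": "I", "word": "c"},
    {"entity_group": "B", "word": "de"},
]

print(" ".join(join_segments(mock_groups)))  # abc de
```

Treating a leading "I" as the start of a segment is a design choice made here so that unexpected predictions do not raise an IndexError.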

Label Mapping

B: Beginning of a word/segment
I: Inside a word/segment (continuation)
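
To make the scheme concrete, the sketch below assigns B/I labels to segments that have already been split into token pieces. It only illustrates the labeling convention; it is not necessarily how the training data was prepared. The function name `bi_labels` and the placeholder pieces `t1` to `t5` are hypothetical.

```python
def bi_labels(segments):
    """Flatten pre-tokenized segments into parallel token and label lists."""
    tokens, labels = [], []
    for pieces in segments:
        for i, piece in enumerate(pieces):
            tokens.append(piece)
            # The first piece of each segment is "B", every later piece is "I"
            labels.append("B" if i == 0 else "I")
    return tokens, labels


# Two segments, split into 2 and 3 placeholder pieces respectively
tokens, labels = bi_labels([["t1", "t2"], ["t3", "t4", "t5"]])
print(list(zip(tokens, labels)))
# [('t1', 'B'), ('t2', 'I'), ('t3', 'B'), ('t4', 'I'), ('t5', 'I')]
```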

Dataset

Trained on the chuuhtetnaing/myanmar-text-segmentation-dataset dataset.

Model size: 0.3B params (Safetensors, tensor type F32)
