Khmer Orthographic Correction System

This model is a fine-tuned version of facebook/mbart-large-50 on the khmer-orthography-correction-dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1013
  • CER: 0.0321
  • WER: 0.2359
  • BLEU: 72.67 (1–4-gram precisions 79.91 / 82.53 / 72.70 / 61.03; brevity penalty 0.988; sys_len 7625, ref_len 7717)
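The reported BLEU breakdown is internally consistent: the corpus score is the brevity penalty times the geometric mean of the four modified n-gram precisions. A quick arithmetic check, using the counts and totals from the breakdown above:

```python
import math

# n-gram match counts and totals from the BLEU breakdown above
counts = [6093, 737, 253, 83]
totals = [7625, 893, 348, 136]
sys_len, ref_len = 7625, 7717

# Modified n-gram precisions, as fractions
precisions = [c / t for c, t in zip(counts, totals)]

# Brevity penalty: exp(1 - ref_len/sys_len) when the system output is
# shorter than the reference, otherwise 1.0
bp = math.exp(1 - ref_len / sys_len) if sys_len < ref_len else 1.0

# Corpus BLEU = bp * geometric mean of the four precisions, scaled to 0-100
score = bp * math.exp(sum(math.log(p) for p in precisions) / 4) * 100
print(round(score, 2))  # 72.67
```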

Model description

A sequence-to-sequence model for correcting orthographic (spelling) errors in Khmer text, fine-tuned from facebook/mbart-large-50.

Intended uses & limitations

Intended for automatically correcting misspelled Khmer text of the kind represented in the khmer-orthography-correction-dataset. Behavior on other domains and error types has not been evaluated here.
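The card itself documents no usage snippet, so the following is a minimal sketch of how a checkpoint like this would typically be loaded with Transformers. The repo id is taken from the model tree below; the mBART-50 Khmer language code `km_KH` and the generation settings (`max_length`, `num_beams`) are assumptions, not documented behavior of this checkpoint.

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

MODEL_ID = "S-Sethisak/khmer_orthography_correction"  # repo id from the model tree

def correct_khmer(text: str) -> str:
    """Run one noisy Khmer sentence through the seq2seq model (sketch)."""
    tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_ID)
    model = MBartForConditionalGeneration.from_pretrained(MODEL_ID)
    # km_KH is mBART-50's Khmer code; assumed to be both source and target here.
    tokenizer.src_lang = "km_KH"
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("km_KH"),
        max_length=128,
        num_beams=4,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

if __name__ == "__main__":
    print(correct_khmer("សួស្ដី"))  # replace with a misspelled Khmer sentence
```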

Training and evaluation data

The model was trained and evaluated on the khmer-orthography-correction-dataset; see that dataset's card for collection and preprocessing details.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 128
  • optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • num_epochs: 10
  • mixed_precision_training: Native AMP
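The hyperparameters above map onto Transformers' Seq2SeqTrainingArguments roughly as sketched below (the original training script is not published, so the argument names are assumptions based on the Transformers API). Note that the total train batch size of 128 is simply the per-device batch size times the gradient-accumulation steps:

```python
# Hyperparameters from the list above, expressed as the keyword arguments
# one would pass to transformers.Seq2SeqTrainingArguments (a sketch; the
# authors' actual training script is not published).
training_args = dict(
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    gradient_accumulation_steps=4,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    num_train_epochs=10,
    fp16=True,  # "Native AMP" mixed precision (could also have been bf16)
)

# Effective (total) train batch size = per-device batch size x accumulation steps
effective_batch = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
)
print(effective_batch)  # 128
```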

Training results

Training Loss  Epoch  Step  Validation Loss  CER     WER     BLEU
No log         1.0     421  0.4522           0.1002  0.6031  38.59
1.0158         2.0     842  0.2749           0.0659  0.4455  54.88
0.3159         3.0    1263  0.2064           0.0568  0.3696  61.64
0.1777         4.0    1684  0.1643           0.0464  0.3146  67.48
0.112          5.0    2105  0.1395           0.0418  0.2939  67.88
0.0773         6.0    2526  0.1231           0.0396  0.2595  70.16
0.0773         7.0    2947  0.1134           0.0366  0.2529  70.42
0.0545         8.0    3368  0.1062           0.0340  0.2381  70.69
0.0439         9.0    3789  0.1031           0.0334  0.2356  73.64
0.0366        10.0    4210  0.1013           0.0321  0.2359  72.67

(BLEU is the corpus-level score; the per-n-gram breakdowns for each epoch are omitted here for readability.)
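For reference, CER and WER in the results above are edit-distance metrics: the Levenshtein distance between hypothesis and reference, normalized by reference length, at the character and word level respectively. A minimal illustration follows, using toy Latin-script strings; computing a word-level metric on real Khmer additionally requires a word segmenter, since Khmer is written without spaces between words.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / len(r)

print(cer("kitten", "sitting"))           # 0.5  (3 edits / 6 reference chars)
print(wer("the cat sat", "the cat sit"))  # 1 substitution / 3 words, about 0.333
```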

Framework versions

  • Transformers 4.57.2
  • PyTorch 2.9.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1
Model details

  • Format: Safetensors
  • Model size: 0.6B params
  • Tensor type: F32

Model tree for S-Sethisak/khmer_orthography_correction

Finetuned from facebook/mbart-large-50 (one of 289 fine-tunes of that base model).

Evaluation results

  • WER on khmer-orthography-correction-dataset (self-reported): 0.236
  • BLEU on khmer-orthography-correction-dataset (self-reported): 72.666