KOCS - Khmer Orthographic Correction System

This model is a fine-tuned version of facebook/mbart-large-50 on the khmer-orthography-correction-dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0810
  • Cer: 0.0214
  • Wer: 0.0518
  • Bleu: 91.2376 (sacreBLEU; n-gram precisions 1–4: 95.99 / 93.08 / 90.35 / 87.73; brevity penalty: 0.9946; sys_len: 67099; ref_len: 67463)

Model description

More information needed

Intended uses & limitations

More information needed
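
The author has not published usage notes, so the following is only a minimal inference sketch. It assumes the checkpoint loads with the standard mBART-50 classes and that Khmer (`km_KH`) is used as both source and target language code; the beam size, maximum length, and example input are illustrative choices, not values from the original training setup.

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_id = "S-Sethisak/khmer_orthography_correction_tokenized"

# Assumption: the checkpoint uses the standard mBART-50 tokenizer with Khmer language codes.
tokenizer = MBart50TokenizerFast.from_pretrained(model_id, src_lang="km_KH", tgt_lang="km_KH")
model = MBartForConditionalGeneration.from_pretrained(model_id)

def correct(text: str, max_length: int = 256) -> str:
    """Generate an orthographically corrected version of a Khmer sentence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["km_KH"],  # decode into Khmer
        max_length=max_length,
        num_beams=4,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Replace the placeholder with the Khmer text you want to correct.
print(correct("..."))
```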

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 10
  • mixed_precision_training: Native AMP
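
For reference, these settings roughly correspond to the Seq2SeqTrainingArguments sketched below. This is not the original training script; the output directory is a placeholder, and data loading, logging, and evaluation wiring are omitted.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the reported hyperparameters; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="khmer_orthography_correction",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,   # effective train batch size: 64
    num_train_epochs=10,
    lr_scheduler_type="linear",
    optim="adamw_torch_fused",
    seed=42,
    fp16=True,                       # native AMP mixed precision
    predict_with_generate=True,      # assumption: generation needed for CER/WER/BLEU during eval
)
```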

Training results

| Training Loss | Epoch | Step | Validation Loss | Cer    | Wer    | Bleu (sacreBLEU score) |
|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|:----------------------:|
| 1.0649        | 1.0   | 842  | 0.3280          | 0.0732 | 0.1732 | 71.9222                |
| 0.2503        | 2.0   | 1684 | 0.1941          | 0.0442 | 0.1100 | 81.8358                |
| 0.1299        | 3.0   | 2526 | 0.1461          | 0.0393 | 0.0906 | 85.3840                |
| 0.0876        | 4.0   | 3368 | 0.1189          | 0.0333 | 0.0754 | 87.6199                |
| 0.0531        | 5.0   | 4210 | 0.1024          | 0.0291 | 0.0667 | 89.0114                |
| 0.0357        | 6.0   | 5052 | 0.0933          | 0.0248 | 0.0596 | 90.1199                |
| 0.0262        | 7.0   | 5894 | 0.0874          | 0.0247 | 0.0575 | 90.4730                |
| 0.0191        | 8.0   | 6736 | 0.0838          | 0.0225 | 0.0533 | 91.0459                |
| 0.0150        | 9.0   | 7578 | 0.0817          | 0.0215 | 0.0529 | 91.0562                |
| 0.0135        | 10.0  | 8420 | 0.0810          | 0.0214 | 0.0518 | 91.2376                |

Full sacreBLEU statistics (n-gram precisions, brevity penalty, lengths) for the final checkpoint are given in the evaluation summary above.
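
The CER, WER, and BLEU columns follow the conventions of the Hugging Face `evaluate` metrics (BLEU via sacreBLEU, which produces the score, precision, and brevity-penalty fields reported above). A hedged sketch of how such numbers can be computed from model predictions and references, with placeholder data rather than the actual evaluation set:

```python
import evaluate

cer = evaluate.load("cer")
wer = evaluate.load("wer")
sacrebleu = evaluate.load("sacrebleu")

# Illustrative placeholders; in practice these come from model generations
# and the gold corrections in the evaluation split.
predictions = ["corrected sentence"]
references = ["corrected sentence"]

print(cer.compute(predictions=predictions, references=references))
print(wer.compute(predictions=predictions, references=references))
# sacreBLEU expects a list of reference lists, one list per prediction.
print(sacrebleu.compute(predictions=predictions, references=[[r] for r in references]))
```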

Framework versions

  • Transformers 4.57.2
  • PyTorch 2.9.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1
Model size

  • 0.6B parameters
  • Tensor type: F32 (safetensors)

Model tree for S-Sethisak/khmer_orthography_correction_tokenized

  • Fine-tuned from facebook/mbart-large-50

Evaluation results

  • Wer on khmer-orthography-correction-dataset (self-reported): 0.052
  • Bleu on khmer-orthography-correction-dataset (self-reported): 91.238