KOCS - Khmer Orthographic Correction System

This model is a fine-tuned version of facebook/mbart-large-50 on the khmer-orthography-correction-dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0810
  • Cer: 0.0214
  • Wer: 0.0518
  • Bleu: 91.2376 (sacreBLEU; n-gram precisions 1–4: 95.99 / 93.08 / 90.35 / 87.73; brevity penalty: 0.9946; sys_len: 67099; ref_len: 67463)

Model description

More information needed

Intended uses & limitations

More information needed
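
The author has not published usage notes, so the following is only a minimal inference sketch. It assumes the checkpoint loads with the standard mBART-50 classes and that Khmer (`km_KH`) is used as both source and target language code; the beam size, maximum length, and example input are illustrative choices, not values from the original training setup.

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_id = "S-Sethisak/khmer_orthography_correction_tokenized"

# Assumption: the checkpoint uses the standard mBART-50 tokenizer with Khmer language codes.
tokenizer = MBart50TokenizerFast.from_pretrained(model_id, src_lang="km_KH", tgt_lang="km_KH")
model = MBartForConditionalGeneration.from_pretrained(model_id)

def correct(text: str, max_length: int = 256) -> str:
    """Generate an orthographically corrected version of a Khmer sentence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["km_KH"],  # decode into Khmer
        max_length=max_length,
        num_beams=4,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Replace the placeholder with the Khmer text you want to correct.
print(correct("..."))
```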

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 10
  • mixed_precision_training: Native AMP
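
For reference, these settings roughly correspond to the Seq2SeqTrainingArguments sketched below. This is not the original training script; the output directory is a placeholder, and data loading, logging, and evaluation wiring are omitted.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the reported hyperparameters; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="khmer_orthography_correction",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,   # effective train batch size: 64
    num_train_epochs=10,
    lr_scheduler_type="linear",
    optim="adamw_torch_fused",
    seed=42,
    fp16=True,                       # native AMP mixed precision
    predict_with_generate=True,      # assumption: generation needed for CER/WER/BLEU during eval
)
```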

Training results

| Training Loss | Epoch | Step | Validation Loss | Cer    | Wer    | Bleu (sacreBLEU score) |
|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|:----------------------:|
| 1.0649        | 1.0   | 842  | 0.3280          | 0.0732 | 0.1732 | 71.9222                |
| 0.2503        | 2.0   | 1684 | 0.1941          | 0.0442 | 0.1100 | 81.8358                |
| 0.1299        | 3.0   | 2526 | 0.1461          | 0.0393 | 0.0906 | 85.3840                |
| 0.0876        | 4.0   | 3368 | 0.1189          | 0.0333 | 0.0754 | 87.6199                |
| 0.0531        | 5.0   | 4210 | 0.1024          | 0.0291 | 0.0667 | 89.0114                |
| 0.0357        | 6.0   | 5052 | 0.0933          | 0.0248 | 0.0596 | 90.1199                |
| 0.0262        | 7.0   | 5894 | 0.0874          | 0.0247 | 0.0575 | 90.4730                |
| 0.0191        | 8.0   | 6736 | 0.0838          | 0.0225 | 0.0533 | 91.0459                |
| 0.0150        | 9.0   | 7578 | 0.0817          | 0.0215 | 0.0529 | 91.0562                |
| 0.0135        | 10.0  | 8420 | 0.0810          | 0.0214 | 0.0518 | 91.2376                |

Full sacreBLEU statistics (n-gram precisions, brevity penalty, lengths) for the final checkpoint are given in the evaluation summary above.
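
The CER, WER, and BLEU columns follow the conventions of the Hugging Face `evaluate` metrics (BLEU via sacreBLEU, which produces the score, precision, and brevity-penalty fields reported above). A hedged sketch of how such numbers can be computed from model predictions and references, with placeholder data rather than the actual evaluation set:

```python
import evaluate

cer = evaluate.load("cer")
wer = evaluate.load("wer")
sacrebleu = evaluate.load("sacrebleu")

# Illustrative placeholders; in practice these come from model generations
# and the gold corrections in the evaluation split.
predictions = ["corrected sentence"]
references = ["corrected sentence"]

print(cer.compute(predictions=predictions, references=references))
print(wer.compute(predictions=predictions, references=references))
# sacreBLEU expects a list of reference lists, one list per prediction.
print(sacrebleu.compute(predictions=predictions, references=[[r] for r in references]))
```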

Framework versions

  • Transformers 4.57.2
  • PyTorch 2.9.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1
Model size

  • 0.6B parameters
  • Tensor type: F32 (safetensors)

Model tree for S-Sethisak/khmer_orthography_correction_tokenized

  • Fine-tuned from facebook/mbart-large-50

Evaluation results

  • Wer on khmer-orthography-correction-dataset (self-reported): 0.052
  • Bleu on khmer-orthography-correction-dataset (self-reported): 91.238