Khmer Orthographic Correction System

This model is a fine-tuned version of facebook/mbart-large-50 on the khmer-orthography-correction-dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1013
  • CER: 0.0321
  • WER: 0.2359
  • BLEU: 72.67 (1–4-gram precisions 79.91 / 82.53 / 72.70 / 61.03; brevity penalty 0.988; sys_len 7625, ref_len 7717)
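The reported BLEU breakdown is internally consistent: the corpus score is the brevity penalty times the geometric mean of the four modified n-gram precisions. A quick arithmetic check, using the counts and totals from the breakdown above:

```python
import math

# n-gram match counts and totals from the BLEU breakdown above
counts = [6093, 737, 253, 83]
totals = [7625, 893, 348, 136]
sys_len, ref_len = 7625, 7717

# Modified n-gram precisions, as fractions
precisions = [c / t for c, t in zip(counts, totals)]

# Brevity penalty: exp(1 - ref_len/sys_len) when the system output is
# shorter than the reference, otherwise 1.0
bp = math.exp(1 - ref_len / sys_len) if sys_len < ref_len else 1.0

# Corpus BLEU = bp * geometric mean of the four precisions, scaled to 0-100
score = bp * math.exp(sum(math.log(p) for p in precisions) / 4) * 100
print(round(score, 2))  # 72.67
```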

Model description

A sequence-to-sequence model for correcting orthographic (spelling) errors in Khmer text, fine-tuned from facebook/mbart-large-50.

Intended uses & limitations

Intended for automatically correcting misspelled Khmer text of the kind represented in the khmer-orthography-correction-dataset. Behavior on other domains and error types has not been evaluated here.
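The card itself documents no usage snippet, so the following is a minimal sketch of how a checkpoint like this would typically be loaded with Transformers. The repo id is taken from the model tree below; the mBART-50 Khmer language code `km_KH` and the generation settings (`max_length`, `num_beams`) are assumptions, not documented behavior of this checkpoint.

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

MODEL_ID = "S-Sethisak/khmer_orthography_correction"  # repo id from the model tree

def correct_khmer(text: str) -> str:
    """Run one noisy Khmer sentence through the seq2seq model (sketch)."""
    tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_ID)
    model = MBartForConditionalGeneration.from_pretrained(MODEL_ID)
    # km_KH is mBART-50's Khmer code; assumed to be both source and target here.
    tokenizer.src_lang = "km_KH"
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("km_KH"),
        max_length=128,
        num_beams=4,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

if __name__ == "__main__":
    print(correct_khmer("សួស្ដី"))  # replace with a misspelled Khmer sentence
```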

Training and evaluation data

The model was trained and evaluated on the khmer-orthography-correction-dataset; see that dataset's card for collection and preprocessing details.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 128
  • optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • num_epochs: 10
  • mixed_precision_training: Native AMP
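The hyperparameters above map onto Transformers' Seq2SeqTrainingArguments roughly as sketched below (the original training script is not published, so the argument names are assumptions based on the Transformers API). Note that the total train batch size of 128 is simply the per-device batch size times the gradient-accumulation steps:

```python
# Hyperparameters from the list above, expressed as the keyword arguments
# one would pass to transformers.Seq2SeqTrainingArguments (a sketch; the
# authors' actual training script is not published).
training_args = dict(
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    gradient_accumulation_steps=4,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    num_train_epochs=10,
    fp16=True,  # "Native AMP" mixed precision (could also have been bf16)
)

# Effective (total) train batch size = per-device batch size x accumulation steps
effective_batch = (
    training_args["per_device_train_batch_size"]
    * training_args["gradient_accumulation_steps"]
)
print(effective_batch)  # 128
```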

Training results

Training Loss  Epoch  Step  Validation Loss  CER     WER     BLEU
No log         1.0     421  0.4522           0.1002  0.6031  38.59
1.0158         2.0     842  0.2749           0.0659  0.4455  54.88
0.3159         3.0    1263  0.2064           0.0568  0.3696  61.64
0.1777         4.0    1684  0.1643           0.0464  0.3146  67.48
0.112          5.0    2105  0.1395           0.0418  0.2939  67.88
0.0773         6.0    2526  0.1231           0.0396  0.2595  70.16
0.0773         7.0    2947  0.1134           0.0366  0.2529  70.42
0.0545         8.0    3368  0.1062           0.0340  0.2381  70.69
0.0439         9.0    3789  0.1031           0.0334  0.2356  73.64
0.0366        10.0    4210  0.1013           0.0321  0.2359  72.67

(BLEU is the corpus-level score; the per-n-gram breakdowns for each epoch are omitted here for readability.)
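For reference, CER and WER in the results above are edit-distance metrics: the Levenshtein distance between hypothesis and reference, normalized by reference length, at the character and word level respectively. A minimal illustration follows, using toy Latin-script strings; computing a word-level metric on real Khmer additionally requires a word segmenter, since Khmer is written without spaces between words.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / len(r)

print(cer("kitten", "sitting"))           # 0.5  (3 edits / 6 reference chars)
print(wer("the cat sat", "the cat sit"))  # 1 substitution / 3 words, about 0.333
```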

Framework versions

  • Transformers 4.57.2
  • PyTorch 2.9.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1
Model details

  • Format: Safetensors
  • Model size: 0.6B params
  • Tensor type: F32

Model tree for S-Sethisak/khmer_orthography_correction

Finetuned from facebook/mbart-large-50 (one of 289 fine-tunes of that base model).

Evaluation results

  • WER on khmer-orthography-correction-dataset (self-reported): 0.236
  • BLEU on khmer-orthography-correction-dataset (self-reported): 72.666