Khmer Orthographic Correction System using NLLB

This model is a fine-tuned version of facebook/mbart-large-50 on the khmer-orthography-correction-dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2433
  • Cer: 0.0612
  • Wer: 0.4716
  • Bleu: 54.5823 (1-to-4-gram precisions: 58.92 / 63.13 / 54.58 / 47.69; brevity penalty: 0.9785; sys_len: 7551; ref_len: 7715)
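
A minimal inference sketch using the Transformers library. The model ID is taken from this card; the generation settings (beam search, max_new_tokens) are illustrative assumptions, not the settings used for the reported scores:

```python
from functools import lru_cache

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "S-Sethisak/khmer_orthography_correction_NLLB"


@lru_cache(maxsize=1)
def _load():
    # Load lazily so the checkpoint is only downloaded on first use.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    return tokenizer, model


def correct(text: str) -> str:
    """Return the model's orthographic correction of a Khmer sentence."""
    tokenizer, model = _load()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```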

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 128
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 10
  • mixed_precision_training: Native AMP

Training results

| Training Loss | Epoch | Step | Validation Loss | Cer    | Wer    | Bleu    |
|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|:-------:|
| No log        | 1.0   | 421  | 0.5867          | 0.1395 | 0.7613 | 31.1633 |
| 0.8766        | 2.0   | 842  | 0.4476          | 0.1080 | 0.6706 | 35.8721 |
| 0.5573        | 3.0   | 1263 | 0.3727          | 0.0906 | 0.6090 | 42.6017 |
| 0.4428        | 4.0   | 1684 | 0.3265          | 0.0819 | 0.5647 | 47.7576 |
| 0.3718        | 5.0   | 2105 | 0.2955          | 0.0732 | 0.5332 | 50.2839 |
| 0.3321        | 6.0   | 2526 | 0.2745          | 0.0681 | 0.5079 | 53.1877 |
| 0.3321        | 7.0   | 2947 | 0.2600          | 0.0653 | 0.4901 | 53.4267 |
| 0.3023        | 8.0   | 3368 | 0.2503          | 0.0636 | 0.4763 | 54.4015 |
| 0.2841        | 9.0   | 3789 | 0.2452          | 0.0621 | 0.4727 | 54.8686 |
| 0.2744        | 10.0  | 4210 | 0.2433          | 0.0612 | 0.4716 | 54.5823 |
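
For reference, the CER and WER reported above are edit distances normalized by reference length, at the character and word level respectively. A self-contained illustration of the definitions (the card does not state which evaluation library was used):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                      # deletion
                cur[j - 1] + 1,                   # insertion
                prev[j - 1] + (item_a != item_b)  # substitution (or match)
            ))
        prev = cur
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    # character error rate: char-level edit distance / reference length
    return levenshtein(list(prediction), list(reference)) / len(reference)


def wer(prediction: str, reference: str) -> float:
    # word error rate: token-level edit distance / reference token count
    ref_tokens = reference.split()
    return levenshtein(prediction.split(), ref_tokens) / len(ref_tokens)
```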

Framework versions

  • Transformers 4.57.2
  • Pytorch 2.9.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1