# KOCS - Khmer Orthographic Correction System
This model is a fine-tuned version of facebook/mbart-large-50 on the khmer-orthography-correction-dataset. It achieves the following results on the evaluation set:
- Loss: 0.0810
- Cer: 0.0214
- Wer: 0.0518
- Bleu: 91.2376 (1-4-gram precisions: 95.985 / 93.077 / 90.353 / 87.726; brevity penalty: 0.9946; sys_len: 67099; ref_len: 67463)
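CER and WER are normalized edit distances at the character and word level. A minimal stdlib-only sketch of the definitions (the numbers above were most likely produced by a metrics library such as `evaluate` or `jiwer`; this version only illustrates what the scores mean):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (strings for CER, word lists for WER)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(ref, hyp) / max(1, len(ref))

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / max(1, len(ref.split()))
```

A CER of 0.0214 therefore means roughly 2 character-level edits per 100 reference characters.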
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 10
- mixed_precision_training: Native AMP
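The effective batch size is train_batch_size × gradient_accumulation_steps = 32 × 2 = 64, as listed. A hypothetical sketch of the learning-rate schedule these settings imply, assuming zero warmup steps (none are listed):

```python
BASE_LR = 2e-05
TOTAL_STEPS = 8420        # 842 optimizer steps/epoch x 10 epochs (from the results table)
EFFECTIVE_BATCH = 32 * 2  # train_batch_size x gradient_accumulation_steps = 64

def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Linear schedule: ramp up over warmup, then decay linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

With these values the learning rate starts at 2e-05, halves by epoch 5 (step 4210), and reaches zero at the final step.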
### Training results
| Training Loss | Epoch | Step | Validation Loss | Cer | Wer | Bleu |
|---|---|---|---|---|---|---|
| 1.0649 | 1.0 | 842 | 0.3280 | 0.0732 | 0.1732 | 71.92 |
| 0.2503 | 2.0 | 1684 | 0.1941 | 0.0442 | 0.1100 | 81.84 |
| 0.1299 | 3.0 | 2526 | 0.1461 | 0.0393 | 0.0906 | 85.38 |
| 0.0876 | 4.0 | 3368 | 0.1189 | 0.0333 | 0.0754 | 87.62 |
| 0.0531 | 5.0 | 4210 | 0.1024 | 0.0291 | 0.0667 | 89.01 |
| 0.0357 | 6.0 | 5052 | 0.0933 | 0.0248 | 0.0596 | 90.12 |
| 0.0262 | 7.0 | 5894 | 0.0874 | 0.0247 | 0.0575 | 90.47 |
| 0.0191 | 8.0 | 6736 | 0.0838 | 0.0225 | 0.0533 | 91.05 |
| 0.0150 | 9.0 | 7578 | 0.0817 | 0.0215 | 0.0529 | 91.06 |
| 0.0135 | 10.0 | 8420 | 0.0810 | 0.0214 | 0.0518 | 91.24 |
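The headline BLEU score can be sanity-checked from the raw sacrebleu output: it is the brevity penalty times the geometric mean of the four n-gram precisions. Using the final epoch's reported counts and lengths:

```python
import math

# Final-epoch values from the raw sacrebleu metric dump
counts = [64405, 56188, 48516, 41469]
totals = [67099, 60367, 53696, 47271]
sys_len, ref_len = 67099, 67463

# Per-order n-gram precisions, as percentages
precisions = [100 * c / t for c, t in zip(counts, totals)]

# Brevity penalty: penalizes system output shorter than the reference
bp = math.exp(1 - ref_len / sys_len) if sys_len < ref_len else 1.0

# BLEU = bp * geometric mean of the n-gram precisions
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(round(bleu, 4))  # ~91.2376, matching the reported score
```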
### Framework versions
- Transformers 4.57.2
- Pytorch 2.9.0+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1
## Model tree

S-Sethisak/khmer_orthography_correction_tokenized is fine-tuned from the base model facebook/mbart-large-50.