# Khmer Orthographic Correction System using PrahokBART
This model is a fine-tuned version of socheatasokhachan/khmerhomophonecorrector on the khmer-orthography-correction-dataset. It achieves the following results on the evaluation set:

- Loss: 0.1540
- CER: 0.0454
- WER: 0.0764
- BLEU: 88.3786 (1- to 4-gram precisions: 93.94 / 90.25 / 86.90 / 83.72; brevity penalty: 0.9973; sys_len: 68677; ref_len: 68865)
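CER and WER are character- and word-level edit-distance rates: the Levenshtein distance from the system output to the reference, divided by the reference length. A minimal sketch of how such rates are typically computed (not this project's actual evaluation code; a naive codepoint-level CER also ignores Khmer grapheme-cluster segmentation):

```python
def levenshtein(ref, hyp):
    # Classic dynamic-programming edit distance over a sequence
    # (characters for CER, whitespace tokens for WER).
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def cer(ref, hyp):
    # Character error rate: edits per reference character.
    return levenshtein(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    # Word error rate: edits per reference token (whitespace split).
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())
```

In practice, libraries such as `jiwer` or `evaluate` are used for these metrics rather than a hand-rolled implementation.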
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 10
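These hyperparameters are consistent with the step counts in the training-results table: a quick sanity check, assuming one optimizer step per batch (no gradient accumulation) and that the last, possibly partial, batch is kept:

```python
train_batch_size = 64
steps_per_epoch = 842  # steps logged per epoch in the results table
num_epochs = 10

# Total optimizer steps over the full run.
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 8420, matching the final logged step

# steps_per_epoch = ceil(dataset_size / batch_size), so the training
# split must hold between these two sizes (inclusive):
lo = (steps_per_epoch - 1) * train_batch_size + 1
hi = steps_per_epoch * train_batch_size
print(lo, hi)  # 53825 53888
```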
### Training results
| Training Loss | Epoch | Step | Validation Loss | CER | WER | BLEU |
|---|---|---|---|---|---|---|
| 1.1543 | 1.0 | 842 | 0.5033 | 0.1914 | 0.2689 | {'score': 62.82546400460583, 'counts': [56912, 44643, 34848, 27026], 'totals': [73032, 66300, 59626, 53200], 'precisions': [77.92748384269909, 67.33484162895928, 58.44430282091705, 50.80075187969925], 'bp': 1.0, 'sys_len': 73032, 'ref_len': 68865} |
| 0.43 | 2.0 | 1684 | 0.3281 | 0.0972 | 0.1553 | {'score': 76.54462691979896, 'counts': [60236, 49617, 40577, 32912], 'totals': [68798, 62066, 55393, 48965], 'precisions': [87.55487078112736, 79.9423194663745, 73.25293809687145, 67.2153579087103], 'bp': 0.9990266085337779, 'sys_len': 68798, 'ref_len': 68865} |
| 0.2375 | 3.0 | 2526 | 0.2498 | 0.0750 | 0.1250 | {'score': 80.90214763360461, 'counts': [61693, 51872, 43277, 35789], 'totals': [68332, 61600, 54928, 48505], 'precisions': [90.2842006673301, 84.20779220779221, 78.78859598019226, 73.78414596433358], 'bp': 0.9922301900464362, 'sys_len': 68332, 'ref_len': 68865} |
| 0.1618 | 4.0 | 3368 | 0.2091 | 0.0813 | 0.1164 | {'score': 83.24476040050118, 'counts': [62799, 53450, 45139, 37780], 'totals': [69213, 62481, 55810, 49390], 'precisions': [90.73295479172988, 85.54600598581969, 80.87977065042107, 76.49321725045556], 'bp': 1.0, 'sys_len': 69213, 'ref_len': 68865} |
| 0.1005 | 5.0 | 4210 | 0.1871 | 0.0628 | 0.1038 | {'score': 85.18579647531065, 'counts': [63414, 54394, 46247, 38951], 'totals': [69067, 62335, 55663, 49238], 'precisions': [91.81519394211418, 87.26076842865164, 83.08391570702261, 79.10759982127625], 'bp': 1.0, 'sys_len': 69067, 'ref_len': 68865} |
| 0.0707 | 6.0 | 5052 | 0.1717 | 0.0530 | 0.0860 | {'score': 87.04061891243913, 'counts': [64029, 55189, 47185, 39963], 'totals': [68642, 61910, 55238, 48817], 'precisions': [93.27962471955945, 89.14391859150379, 85.42126796770339, 81.86287563758526], 'bp': 0.9967565316066235, 'sys_len': 68642, 'ref_len': 68865} |
| 0.0527 | 7.0 | 5894 | 0.1627 | 0.0484 | 0.0815 | {'score': 87.60749706030536, 'counts': [64257, 55500, 47532, 40335], 'totals': [68638, 61906, 55231, 48808], 'precisions': [93.61723826451819, 89.6520531127839, 86.06036465028697, 82.64014096049829], 'bp': 0.9966982568607425, 'sys_len': 68638, 'ref_len': 68865} |
| 0.0389 | 8.0 | 6736 | 0.1577 | 0.0477 | 0.0791 | {'score': 87.99224756503044, 'counts': [64384, 55703, 47784, 40613], 'totals': [68658, 61926, 55253, 48828], 'precisions': [93.77494246846689, 89.9509091496302, 86.48218196297033, 83.17563692963054], 'bp': 0.9969895967447535, 'sys_len': 68658, 'ref_len': 68865} |
| 0.0316 | 9.0 | 7578 | 0.1557 | 0.0447 | 0.0751 | {'score': 88.46513392198663, 'counts': [64559, 55949, 48062, 40918], 'totals': [68566, 61834, 55160, 48735], 'precisions': [94.15599568299157, 90.48258239803344, 87.13197969543147, 83.96019287986047], 'bp': 0.9956487324226654, 'sys_len': 68566, 'ref_len': 68865} |
| 0.0297 | 10.0 | 8420 | 0.1540 | 0.0454 | 0.0764 | {'score': 88.37864881306542, 'counts': [64517, 55903, 48031, 40893], 'totals': [68677, 61945, 55271, 48845], 'precisions': [93.94265911440512, 90.24618613285979, 86.90090644280002, 83.71993039205651], 'bp': 0.9972662912745179, 'sys_len': 68677, 'ref_len': 68865} |
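The BLEU dicts above follow the sacreBLEU result format: the score is the brevity penalty times the geometric mean of the four n-gram precisions. Recomputing the final-epoch (10.0) score from its reported fields as a consistency check:

```python
import math

# Fields copied from the epoch-10 BLEU dict in the table above.
precisions = [93.94265911440512, 90.24618613285979,
              86.90090644280002, 83.71993039205651]
sys_len, ref_len = 68677, 68865

# Brevity penalty: exp(1 - ref/sys) when the system output is
# shorter than the reference, else 1.
bp = math.exp(1 - ref_len / sys_len) if sys_len < ref_len else 1.0

# BLEU = bp * geometric mean of the 1- to 4-gram precisions.
score = bp * math.exp(sum(math.log(p) for p in precisions) / 4)
print(round(score, 4))  # 88.3786, the reported final score
```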
### Framework versions
- Transformers 4.57.2
- Pytorch 2.9.0+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1
## Model tree for S-Sethisak/KOCS_Prahok

Base model: nict-astrec-att/prahokbart_big