---
license: apache-2.0
base_model: google/mt5-large
tags:
- thai
- grammatical-error-correction
- mt5
- fine-tuned
- l2-learners
- generated_from_keras_callback
model-index:
- name: pakawadeep/ctfl-gec-th
  results:
  - task:
      name: Grammatical Error Correction
      type: text2text-generation
    dataset:
      name: CTFL-GEC
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: 0.47
    - name: Recall
      type: recall
      value: 0.47
    - name: F1
      type: f1
      value: 0.47
    - name: F0.5
      type: f0.5
      value: 0.47
    - name: BLEU
      type: bleu
      value: 0.69
    - name: GLEU
      type: gleu
      value: 0.68
    - name: chrF
      type: chrf
      value: 0.87
language:
- th
---
| |
| # pakawadeep/ctfl-gec-th |
|
|
This model is a fine-tuned version of [google/mt5-large](https://huggingface.co/google/mt5-large) for **Grammatical Error Correction (GEC)** of **Thai** written by **L2 learners**. It was developed for the thesis *"Grammatical Error Correction for L2 Learners of Thai Using Large Language Models"* and is the best-performing model in that study.
|
|
| ## Model description |
|
|
The model is based on the mT5-large architecture and was fine-tuned on the CTFL-GEC dataset, a human-annotated corpus of grammatical error corrections for L2 Thai learner writing. To improve generalization, the training data was augmented with the Self-Instruct method, adding synthetic error-correction pairs amounting to 200% of the original corpus.
|
|
The model corrects sentence-level grammatical errors typical of L2 Thai writing, including word-order mistakes, omissions, and incorrect particles.
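
A minimal inference sketch follows. It assumes the checkpoint loads with the standard seq2seq classes and that no task prefix is required; neither is confirmed by this card, so verify against your own outputs. The input sentence is a hypothetical learner error.

```python
# Minimal inference sketch (assumptions: standard seq2seq loading, no task prefix).
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

model_id = "pakawadeep/ctfl-gec-th"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_id)  # Keras-trained, TF weights

# Hypothetical learner sentence; replace with real L2 Thai input.
text = "ฉันชอบกินข้าวที่ร้านนี่มาก"
inputs = tokenizer(text, return_tensors="tf")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=128,
    num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```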
|
|
| ## Intended uses & limitations |
|
|
| ### Intended uses |
| - Grammatical error correction for Thai language learners |
| - Linguistic analysis of L2 learner errors |
| - Research in low-resource GEC methods |
|
|
| ### Limitations |
| - May not generalize to informal or dialectal Thai |
| - Performance may degrade on sentence types or domains not represented in the training data |
| - Designed for Thai GEC only; not optimized for multilingual correction tasks |
|
|
| ## Training and evaluation data |
|
|
| The model was fine-tuned on a combined dataset consisting of: |
| - **CTFL-GEC**: A manually annotated corpus of Thai learner writing (370 writing samples, 4,200+ sentences) |
- **Self-Instruct augmentation (200%)**: Synthetic GEC pairs generated by prompting an LLM (a rough sketch of this step follows below)
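
The augmentation prompts are not published on this card. The sketch below is a hypothetical illustration of Self-Instruct-style pair generation; `call_llm` is a stand-in for whatever instruction-tuned LLM API was actually used.

```python
# Hypothetical sketch of Self-Instruct-style GEC augmentation.
# `call_llm` is a placeholder for an instruction-tuned LLM API call;
# the actual prompts and model used in the thesis are not published here.

PROMPT = (
    "You are given a grammatically correct Thai sentence. Introduce one or two "
    "errors typical of L2 learners (word order, omission, wrong particle) and "
    "return only the erroneous sentence.\n\n"
    "Correct: {sentence}\nErroneous:"
)

def make_synthetic_pair(sentence: str, call_llm) -> tuple[str, str]:
    """Turn a clean sentence into an (erroneous, corrected) training pair."""
    noisy = call_llm(PROMPT.format(sentence=sentence))
    return noisy.strip(), sentence
```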
|
|
Evaluation was conducted on a held-out portion of the human-annotated dataset using standard GEC metrics: precision, recall, F1, F0.5, BLEU, GLEU, and chrF.
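
The exact evaluation scripts are not included with the card. As a rough illustration, the corpus-level BLEU and chrF scores above could be reproduced with sacrebleu along these lines (GLEU and the edit-based precision/recall/F-scores need dedicated GEC tooling such as ERRANT):

```python
# Sketch: corpus-level BLEU and chrF with sacrebleu (assumed, not the thesis script).
import sacrebleu

hypotheses = ["..."]      # model corrections, one string per sentence
references = [["..."]]    # gold corrections, one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```

Note that sacrebleu reports scores on a 0-100 scale, while the values above appear to be normalized to 0-1.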
|
|
| ## Training procedure |
|
|
| ### Training hyperparameters |
| - **Optimizer**: AdamWeightDecay |
| - **Learning rate**: 2e-5 |
| - **Beta1/Beta2**: 0.9 / 0.999 |
| - **Epsilon**: 1e-7 |
| - **Weight decay**: 0.01 |
- **Training precision**: float32
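
Below is a sketch of how these hyperparameters map onto the Keras-compatible `AdamWeightDecay` optimizer shipped with transformers; batch size, learning-rate schedule, and epoch count are not stated on this card, so treat it as illustrative only.

```python
# Illustrative optimizer setup matching the listed hyperparameters (TF backend).
from transformers import AdamWeightDecay, TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("google/mt5-large")

optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)  # no explicit loss: uses the model's built-in seq2seq loss
```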
|
|
| ### Framework versions |
| - Transformers 4.41.2 |
| - TensorFlow 2.15.0 |
| - Datasets 2.20.0 |
| - Tokenizers 0.19.1 |
|
|
| ## Citation |
|
|
| If you use this model, please cite the associated thesis: |
|
|
| ``` |
| Pakawadee P. Chookwan, "Grammatical Error Correction for L2 Learners of Thai Using Large Language Models", 2025. |
| ``` |
|
|