---
license: apache-2.0
base_model: google/mt5-large
tags:
- thai
- grammatical-error-correction
- mt5
- fine-tuned
- l2-learners
- generated_from_keras_callback
model-index:
- name: pakawadeep/ctfl-gec-th
  results:
  - task:
      name: Grammatical Error Correction
      type: text2text-generation
    dataset:
      name: CTFL-GEC
      type: custom
    metrics:
    - name: Precision
      type: precision
      value: 0.47
    - name: Recall
      type: recall
      value: 0.47
    - name: F1
      type: f1
      value: 0.47
    - name: F0.5
      type: f0.5
      value: 0.47
    - name: BLEU
      type: bleu
      value: 0.69
    - name: GLEU
      type: gleu
      value: 0.68
    - name: chrF
      type: chrf
      value: 0.87
language:
- th
---
| |
| # pakawadeep/ctfl-gec-th |
|
|
This model is a fine-tuned version of [google/mt5-large](https://huggingface.co/google/mt5-large) for **Grammatical Error Correction (GEC)** of **Thai** written by **L2 learners**. It was developed for the thesis *"Grammatical Error Correction for L2 Learners of Thai Using Large Language Models"* and is the best-performing model in that study.
|
|
| ## Model description |
|
|
The model is based on the mT5-large architecture and was fine-tuned on the CTFL-GEC dataset, a human-annotated corpus of grammatical error corrections for L2 Thai learner writing. To improve generalization, the training data was augmented with the Self-Instruct method, adding synthetic error-correction pairs amounting to 200% of the original corpus.
|
|
The model corrects sentence-level grammatical errors typical of L2 Thai writing, including word-order mistakes, omissions, and incorrect particles.
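
A minimal inference sketch follows. It assumes the checkpoint loads with the standard seq2seq classes and that no task prefix is required; neither is confirmed by this card, so verify against your own outputs. The input sentence is a hypothetical learner error.

```python
# Minimal inference sketch (assumptions: standard seq2seq loading, no task prefix).
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

model_id = "pakawadeep/ctfl-gec-th"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_id)  # Keras-trained, TF weights

# Hypothetical learner sentence; replace with real L2 Thai input.
text = "ฉันชอบกินข้าวที่ร้านนี่มาก"
inputs = tokenizer(text, return_tensors="tf")
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=128,
    num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```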
|
|
| ## Intended uses & limitations |
|
|
| ### Intended uses |
| - Grammatical error correction for Thai language learners |
| - Linguistic analysis of L2 learner errors |
| - Research in low-resource GEC methods |
|
|
| ### Limitations |
| - May not generalize to informal or dialectal Thai |
| - Performance may degrade on sentence types or domains not represented in the training data |
| - Designed for Thai GEC only; not optimized for multilingual correction tasks |
|
|
| ## Training and evaluation data |
|
|
| The model was fine-tuned on a combined dataset consisting of: |
| - **CTFL-GEC**: A manually annotated corpus of Thai learner writing (370 writing samples, 4,200+ sentences) |
- **Self-Instruct augmentation (200%)**: Synthetic GEC pairs generated by prompting an LLM (a rough sketch of this step follows below)
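
The augmentation prompts are not published on this card. The sketch below is a hypothetical illustration of Self-Instruct-style pair generation; `call_llm` is a stand-in for whatever instruction-tuned LLM API was actually used.

```python
# Hypothetical sketch of Self-Instruct-style GEC augmentation.
# `call_llm` is a placeholder for an instruction-tuned LLM API call;
# the actual prompts and model used in the thesis are not published here.

PROMPT = (
    "You are given a grammatically correct Thai sentence. Introduce one or two "
    "errors typical of L2 learners (word order, omission, wrong particle) and "
    "return only the erroneous sentence.\n\n"
    "Correct: {sentence}\nErroneous:"
)

def make_synthetic_pair(sentence: str, call_llm) -> tuple[str, str]:
    """Turn a clean sentence into an (erroneous, corrected) training pair."""
    noisy = call_llm(PROMPT.format(sentence=sentence))
    return noisy.strip(), sentence
```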
|
|
Evaluation was conducted on a held-out portion of the human-annotated dataset using standard GEC metrics: precision, recall, F1, F0.5, BLEU, GLEU, and chrF.
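
The exact evaluation scripts are not included with the card. As a rough illustration, the corpus-level BLEU and chrF scores above could be reproduced with sacrebleu along these lines (GLEU and the edit-based precision/recall/F-scores need dedicated GEC tooling such as ERRANT):

```python
# Sketch: corpus-level BLEU and chrF with sacrebleu (assumed, not the thesis script).
import sacrebleu

hypotheses = ["..."]      # model corrections, one string per sentence
references = [["..."]]    # gold corrections, one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```

Note that sacrebleu reports scores on a 0-100 scale, while the values above appear to be normalized to 0-1.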
|
|
| ## Training procedure |
|
|
| ### Training hyperparameters |
| - **Optimizer**: AdamWeightDecay |
| - **Learning rate**: 2e-5 |
| - **Beta1/Beta2**: 0.9 / 0.999 |
| - **Epsilon**: 1e-7 |
| - **Weight decay**: 0.01 |
- **Training precision**: float32
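
Below is a sketch of how these hyperparameters map onto the Keras-compatible `AdamWeightDecay` optimizer shipped with transformers; batch size, learning-rate schedule, and epoch count are not stated on this card, so treat it as illustrative only.

```python
# Illustrative optimizer setup matching the listed hyperparameters (TF backend).
from transformers import AdamWeightDecay, TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("google/mt5-large")

optimizer = AdamWeightDecay(
    learning_rate=2e-5,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)  # no explicit loss: uses the model's built-in seq2seq loss
```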
|
|
| ### Framework versions |
| - Transformers 4.41.2 |
| - TensorFlow 2.15.0 |
| - Datasets 2.20.0 |
| - Tokenizers 0.19.1 |
|
|
| ## Citation |
|
|
| If you use this model, please cite the associated thesis: |
|
|
| ``` |
| Pakawadee P. Chookwan, "Grammatical Error Correction for L2 Learners of Thai Using Large Language Models", 2025. |
| ``` |
|
|