---
license: apache-2.0
base_model: vennify/t5-base-grammar-correction
tags:
- grammar-correction
- t5
- explainable-ai
- text2text-generation
datasets:
- jfleg
widget:
- text: "gec: Hey how are you"
  example_title: "Basic Correction"
---

# Grammar Error Correction (GEC) with T5

This notebook demonstrates how to fine-tune and use a T5-based model for Grammar Error Correction (GEC). It leverages the `transformers` library from Hugging Face for model handling and training.

## Table of Contents

- [Introduction](#introduction)
- [Dataset](#dataset)
- [Model Training](#model-training)
- [Explainable AI Judge](#explainable-ai-judge)
- [Evaluation](#evaluation)
- [Save and Download Model](#save-and-download-model)

## Introduction

This project focuses on building a Grammar Error Correction system using a pre-trained T5 model. GEC is the task of identifying and correcting grammatical errors in text. The notebook covers:

1. **Data Preparation**: Extracting and parsing M2-format datasets.
2. **Model Loading**: Loading a T5 model and tokenizer from Hugging Face.
3. **Fine-tuning**: Training the T5 model on the prepared GEC dataset.
4. **Explainable AI**: Implementing a custom `ExplainableAIJudge` to provide rationales for corrections.
5. **Evaluation**: Setting up a basic evaluation framework for GEC and explanation quality.
6. **Model Export**: Saving and downloading the fine-tuned model.

## Dataset

The notebook uses data from two sources, both in M2 format:

- **CoNLL-14 Shared Task Data**: Used for training (`conll14st-test-data.tar.gz`).
- **WI+LOCNESS M2 Data (BEA-19)**: Also used for training (`wi+locness_v2.1.bea19.tar.gz`).

The `parse_m2` function extracts source sentences and their corresponding target corrections from these files. The data is then transformed into a format suitable for sequence-to-sequence models.

## Model Training

The core of the GEC system is a T5 model. Specifically, it uses `vennify/t5-base-grammar-correction` as the base model.

### `GECDataset` Class

A custom `GECDataset` class prepares the data for the T5 model, tokenizing source and target sentences and ensuring they are padded/truncated to a maximum length. Each source sentence is prefixed with `"gec: "` to prompt the T5 model for grammar correction.

### `Trainer` from Hugging Face

The `Trainer` API from `transformers` is used for fine-tuning the T5 model. It handles the training loop, logging, and model saving.

## Explainable AI Judge

An `ExplainableAIJudge` class is implemented to not only correct grammar but also provide human-readable explanations for the changes. It leverages `difflib.SequenceMatcher` to find differences between the original and corrected sentences and maps these differences to predefined error types with explanations.

### Error Mapping

The `error_map` dictionary translates internal error codes (e.g., `R:VERB:TENSE`, `R:PUNCT`) into descriptive explanations, as sketched below.
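For illustration, here is a hypothetical excerpt of what such a mapping might look like. The keys follow the ERRANT-style codes named above; the notebook's actual entries and phrasing may differ:

```python
# Hypothetical excerpt of the error_map described above.
error_map = {
    "R:VERB:TENSE": "A verb was replaced to use the correct tense.",
    "R:VERB:SVA":   "A verb was changed to agree with its subject.",
    "R:PUNCT":      "Punctuation was replaced with the correct mark.",
    "M:DET":        "A missing determiner (e.g., 'a', 'the') was added.",
    "U:PREP":       "An unnecessary preposition was removed.",
}
```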
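And a minimal, word-level sketch of the diff step the judge builds on, assuming a plain `difflib.SequenceMatcher` pass over whitespace tokens. The actual class additionally classifies each edit into codes like those in `error_map`; `explain_edits` is a hypothetical helper:

```python
import difflib

def explain_edits(source: str, corrected: str) -> list[str]:
    """Return a human-readable explanation for each word-level edit."""
    src, tgt = source.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, src, tgt)
    explanations = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        before, after = " ".join(src[i1:i2]), " ".join(tgt[j1:j2])
        if op == "replace":
            explanations.append(f"Replaced '{before}' with '{after}'.")
        elif op == "insert":
            explanations.append(f"Inserted '{after}'.")
        elif op == "delete":
            explanations.append(f"Removed '{before}'.")
    return explanations

print(explain_edits("He go to school yesterday", "He went to school yesterday"))
# -> ["Replaced 'go' with 'went'."]
```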
## Evaluation

The `evaluate_xgec` function provides metrics for both correction performance and explanation quality:

- **Correction Metrics**: Accuracy, Precision, Recall, and F0.5 score, comparing hypotheses (the model's corrections) against references (ground truth).
- **Explanation Quality**: BERTScore F1 for semantic similarity between generated explanations and reference explanations, and Error Type Accuracy for how well the model's identified error types match the ground truth.

## Save and Download Model

After training, the fine-tuned model and tokenizer are saved to a local directory. This directory is then compressed into a `gec_model.tar.gz` file, which can be downloaded using `google.colab.files.download` for deployment or further use.
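A minimal sketch of this export step, assuming a Colab session with `model` and `tokenizer` left over from the `Trainer` run (directory and file names are illustrative):

```python
import tarfile
from google.colab import files  # Colab-only; assumes the notebook runs in Colab

# Save the fine-tuned model and tokenizer to a local directory
# (assumes `model` and `tokenizer` from the training step above).
save_dir = "gec_model"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Compress the directory into gec_model.tar.gz.
with tarfile.open("gec_model.tar.gz", "w:gz") as tar:
    tar.add(save_dir)

# Trigger a browser download of the archive.
files.download("gec_model.tar.gz")
```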
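Once the archive is downloaded and unpacked, the model can be loaded for inference with the same `"gec: "` prefix used in `GECDataset`. A minimal sketch, assuming the files were extracted to `./gec_model`:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Path where the downloaded archive was extracted (illustrative).
model_dir = "./gec_model"
tokenizer = T5Tokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)

# The model expects the same "gec: " task prefix used during fine-tuning.
inputs = tokenizer("gec: Hey how are you", return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```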