| --- |
| license: apache-2.0 |
| base_model: vennify/t5-base-grammar-correction |
| tags: |
| - grammar-correction |
| - t5 |
| - explainable-ai |
| - text2text-generation |
| datasets: |
| - jfleg |
| widget: |
| - text: "gec: Hey how are you" |
| example_title: "Basic Correction" |
| --- |
| |
| # Grammar Error Correction (GEC) with T5 |
|
|
| This notebook demonstrates how to fine-tune and use a T5-based model for Grammar Error Correction (GEC). It leverages the `transformers` library from Hugging Face for model handling and training. |
|
|
| ## Table of Contents |
|
|
| - [Introduction](#introduction) |
| - [Dataset](#dataset) |
| - [Model Training](#model-training) |
| - [Explainable AI Judge](#explainable-ai-judge) |
| - [Evaluation](#evaluation) |
| - [Save and Download Model](#save-and-download-model) |
|
|
| ## Introduction |
|
|
| This project focuses on building a Grammar Error Correction system using a pre-trained T5 model. GEC is the task of identifying and correcting grammatical errors in text. The notebook covers: |
|
|
| 1. **Data Preparation**: Extracting and parsing M2 format datasets. |
| 2. **Model Loading**: Loading a T5 model and tokenizer from Hugging Face. |
| 3. **Fine-tuning**: Training the T5 model on the prepared GEC dataset. |
| 4. **Explainable AI**: Implementing a custom `ExplainableAIJudge` to provide rationales for corrections. |
| 5. **Evaluation**: Setting up a basic evaluation framework for GEC and explanation quality. |
| 6. **Model Export**: Saving and downloading the fine-tuned model. |
|
|
| ## Dataset |
|
|
The notebook uses data from two sources, both distributed primarily as M2-format files:
|
|
| - **CoNLL-14 Shared Task Data**: Used for training (`conll14st-test-data.tar.gz`). |
| - **WI+LOCNESS M2 Data (BEA-19)**: Also used for training (`wi+locness_v2.1.bea19.tar.gz`). |
|
|
| The `parse_m2` function extracts source sentences and their corresponding target corrections from these files. The data is then transformed into a format suitable for sequence-to-sequence models. |
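The notebook's actual `parse_m2` implementation is not shown here, but the idea can be sketched as follows. This is a simplified version that applies only annotator 0's edits and skips `noop` annotations; the real function may handle more of the M2 spec.

```python
def parse_m2(m2_text):
    """Parse M2-format annotations into (source, corrected) sentence pairs.

    Simplified sketch: uses annotator 0's edits only and ignores the
    `noop` type that marks error-free sentences.
    """
    pairs = []
    for block in m2_text.strip().split("\n\n"):
        lines = block.split("\n")
        tokens = lines[0][2:].split()  # strip the leading "S "
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            span, etype, correction, *rest = line[2:].split("|||")
            annotator = rest[-1]
            if annotator != "0" or etype == "noop":
                continue
            start, end = map(int, span.split())
            edits.append((start, end, correction))
        # Apply edits right-to-left so earlier token offsets stay valid.
        corrected = list(tokens)
        for start, end, correction in sorted(edits, reverse=True):
            corrected[start:end] = correction.split() if correction else []
        pairs.append((" ".join(tokens), " ".join(corrected)))
    return pairs
```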
|
|
| ## Model Training |
|
|
| The core of the GEC system is a T5 model. Specifically, it uses `vennify/t5-base-grammar-correction` as the base model. |
|
|
| ### `GECDataset` Class |
|
|
| A custom `GECDataset` class prepares the data for the T5 model, tokenizing source and target sentences and ensuring they are padded/truncated to a maximum length. Each source sentence is prefixed with `"gec: "` to prompt the T5 model for grammar correction. |
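A minimal sketch of such a dataset class is shown below. The tokenizer is assumed to follow the Hugging Face `__call__` API, and `max_length=128` is an illustrative default, not necessarily the notebook's value.

```python
import torch
from torch.utils.data import Dataset

class GECDataset(Dataset):
    """Wraps (source, target) sentence pairs for seq2seq fine-tuning."""

    def __init__(self, pairs, tokenizer, max_length=128):
        self.pairs = pairs
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        source, target = self.pairs[idx]
        # T5 is multi-task: the "gec: " prefix selects grammar correction.
        model_inputs = self.tokenizer(
            "gec: " + source,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        labels = self.tokenizer(
            target,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        label_ids = labels["input_ids"].squeeze(0)
        # Pad positions set to -100 are ignored by the seq2seq loss.
        label_ids[label_ids == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": model_inputs["input_ids"].squeeze(0),
            "attention_mask": model_inputs["attention_mask"].squeeze(0),
            "labels": label_ids,
        }
```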
|
|
| ### `Trainer` from Hugging Face |
|
|
| The `Trainer` API from `transformers` is used for fine-tuning the T5 model. It handles the training loop, logging, and model saving. |
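A configuration sketch of the fine-tuning setup is below. The hyperparameters are illustrative placeholders, not the notebook's exact values, and `train_dataset` is assumed to be a prepared `GECDataset` instance.

```python
from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "vennify/t5-base-grammar-correction"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Illustrative hyperparameters only.
training_args = TrainingArguments(
    output_dir="./gec_model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    logging_steps=100,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # a GECDataset built from the parsed M2 pairs
)
trainer.train()
```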
|
|
| ## Explainable AI Judge |
|
|
| An `ExplainableAIJudge` class is implemented to not only correct grammar but also provide human-readable explanations for the changes. It leverages `difflib.SequenceMatcher` to find differences between the original and corrected sentences and maps these differences to predefined error types with explanations. |
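The diffing step can be sketched as follows. This minimal version describes each `SequenceMatcher` opcode directly; the notebook's judge additionally maps diffs to typed error codes (e.g. `R:VERB:TENSE`), which requires linguistic analysis not shown here.

```python
import difflib

class ExplainableAIJudge:
    """Explains token-level differences between original and corrected text."""

    def explain(self, original, corrected):
        orig_tokens = original.split()
        corr_tokens = corrected.split()
        matcher = difflib.SequenceMatcher(None, orig_tokens, corr_tokens)
        explanations = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            before = " ".join(orig_tokens[i1:i2])
            after = " ".join(corr_tokens[j1:j2])
            if op == "replace":
                explanations.append(f"Replaced '{before}' with '{after}'.")
            elif op == "delete":
                explanations.append(f"Removed '{before}'.")
            elif op == "insert":
                explanations.append(f"Inserted '{after}'.")
        return explanations
```

For example, comparing `"He go to school"` against `"He goes to school"` yields a single replacement explanation for the verb.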
|
|
| ### Error Mapping |
|
|
| The `error_map` dictionary translates internal error codes (e.g., `R:VERB:TENSE`, `R:PUNCT`) into descriptive explanations. |
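The shape of that mapping can be sketched as below; these entries are illustrative, and the notebook's actual table may differ in wording and coverage.

```python
# Illustrative error_map entries (not the notebook's exact table).
error_map = {
    "R:VERB:TENSE": "The verb tense was corrected to match the context.",
    "R:VERB:SVA": "The verb was changed to agree with its subject.",
    "R:PUNCT": "Punctuation was corrected.",
    "M:DET": "A missing article or determiner was added.",
    "U:PREP": "An unnecessary preposition was removed.",
}

def describe(error_code):
    # Fall back to a generic message for unmapped codes.
    return error_map.get(
        error_code, f"A correction of type {error_code} was applied."
    )
```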
|
|
| ## Evaluation |
|
|
| The `evaluate_xgec` function provides metrics for both correction performance and explanation quality: |
|
|
| - **Correction Metrics**: Accuracy, Precision, Recall, and F0.5 score, comparing hypotheses (model's corrections) against references (ground truth). |
- **Explanation Quality**: BERTScore F1 for semantic similarity between generated and reference explanations, and Error Type Accuracy, which measures how often the model's identified error types match the ground truth.
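F0.5 is the standard GEC metric because spurious edits are costlier than missed ones, so precision is weighted more heavily than recall. A sketch of the computation over edit sets (the exact edit representation used by `evaluate_xgec` is an assumption here):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta=0.5 weights precision twice as heavily as recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def edit_prf(hyp_edits, ref_edits):
    """Precision, recall, and F0.5 over sets of hypothesised vs. reference edits."""
    hyp, ref = set(hyp_edits), set(ref_edits)
    tp = len(hyp & ref)
    precision = tp / len(hyp) if hyp else 1.0
    recall = tp / len(ref) if ref else 1.0
    return precision, recall, f_beta(precision, recall)
```

For instance, a system proposing two edits of which one matches the single reference edit scores P=0.5, R=1.0, and F0.5 ≈ 0.56.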
|
|
| ## Save and Download Model |
|
|
| After training, the fine-tuned model and tokenizer are saved to a local directory. This directory is then compressed into a `gec_model.tar.gz` file, which can be downloaded using `google.colab.files.download` for deployment or further use. |
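The export step can be sketched as follows, assuming standard `save_pretrained` objects and the archive name above:

```python
import tarfile

def save_and_package(model, tokenizer, out_dir="gec_model",
                     archive="gec_model.tar.gz"):
    """Save the fine-tuned model and tokenizer, then compress for download."""
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(out_dir)
    return archive

# In Colab, the archive can then be fetched with:
# from google.colab import files
# files.download("gec_model.tar.gz")
```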
|
|