Grammar Error Correction (GEC) with T5
This notebook demonstrates how to fine-tune and use a T5-based model for Grammar Error Correction (GEC). It leverages the transformers library from Hugging Face for model handling and training.
Introduction
This project focuses on building a Grammar Error Correction system using a pre-trained T5 model. GEC is the task of identifying and correcting grammatical errors in text. The notebook covers:
- Data Preparation: Extracting and parsing M2 format datasets.
- Model Loading: Loading a T5 model and tokenizer from Hugging Face.
- Fine-tuning: Training the T5 model on the prepared GEC dataset.
- Explainable AI: Implementing a custom `ExplainableAIJudge` to provide rationales for corrections.
- Evaluation: Setting up a basic evaluation framework for GEC and explanation quality.
- Model Export: Saving and downloading the fine-tuned model.
Dataset
The notebook uses data from two sources, primarily targeting M2 format files:
- CoNLL-14 Shared Task Data: Used for training (`conll14st-test-data.tar.gz`).
- WI+LOCNESS M2 Data (BEA-19): Also used for training (`wi+locness_v2.1.bea19.tar.gz`).
The `parse_m2` function extracts source sentences and their corresponding target corrections from these files. The data is then transformed into a format suitable for sequence-to-sequence models.
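The parsing step can be sketched as follows. This is a minimal, stdlib-only illustration of how an M2 block (an `S` source line followed by `A` annotation lines) can be turned into a (source, corrected) pair; the notebook's actual `parse_m2` may differ in its details.

```python
# Minimal M2 parser sketch. An M2 block looks like:
#   S This are a sentence .
#   A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
def parse_m2(m2_text):
    pairs = []
    for block in m2_text.strip().split("\n\n"):
        lines = block.strip().split("\n")
        tokens = lines[0][2:].split()  # drop the "S " prefix
        edits = []
        for line in lines[1:]:
            span, etype, correction = line[2:].split("|||")[:3]
            if etype == "noop":        # sentence is already correct
                continue
            start, end = map(int, span.split())
            edits.append((start, end, correction))
        corrected = list(tokens)
        # Apply edits right-to-left so earlier token indices stay valid.
        for start, end, correction in sorted(edits, reverse=True):
            corrected[start:end] = correction.split() if correction != "-NONE-" else []
        pairs.append((" ".join(tokens), " ".join(corrected)))
    return pairs
```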
Model Training
The core of the GEC system is a T5 model. Specifically, it uses `vennify/t5-base-grammar-correction` as the base model.
GECDataset Class
A custom `GECDataset` class prepares the data for the T5 model, tokenizing source and target sentences and padding/truncating them to a maximum length. Each source sentence is prefixed with `"gec: "` to prompt the T5 model for grammar correction.
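A dataset of this shape might look like the sketch below. It assumes (source, target) string pairs and a Hugging Face-style tokenizer; the masking of padding tokens with `-100` (so they are ignored by the loss) is standard practice for seq2seq training, though the notebook's exact implementation may differ.

```python
import torch
from torch.utils.data import Dataset

class GECDataset(Dataset):
    """Sketch of a GEC dataset for T5 fine-tuning (details assumed)."""

    def __init__(self, pairs, tokenizer, max_length=128):
        self.pairs = pairs
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        source, target = self.pairs[idx]
        # The "gec: " prefix tells T5 which task to perform.
        enc = self.tokenizer("gec: " + source, max_length=self.max_length,
                             padding="max_length", truncation=True,
                             return_tensors="pt")
        dec = self.tokenizer(target, max_length=self.max_length,
                             padding="max_length", truncation=True,
                             return_tensors="pt")
        labels = dec["input_ids"].squeeze(0).clone()
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": labels}
```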
Trainer from Hugging Face
The `Trainer` API from `transformers` is used for fine-tuning the T5 model. It handles the training loop, logging, and model saving.
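The fine-tuning setup follows the usual `Trainer` pattern. The hyperparameter values below are illustrative assumptions, not the notebook's exact settings, and `train_dataset` stands for the prepared `GECDataset`.

```python
# Hedged sketch of the fine-tuning configuration (values are assumptions).
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vennify/t5-base-grammar-correction")
model = T5ForConditionalGeneration.from_pretrained("vennify/t5-base-grammar-correction")

args = TrainingArguments(
    output_dir="gec_model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    logging_steps=100,
    save_strategy="epoch",
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset)  # train_dataset: the GECDataset built above
trainer.train()
```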
Explainable AI Judge
An `ExplainableAIJudge` class is implemented to not only correct grammar but also provide human-readable explanations for the changes. It leverages `difflib.SequenceMatcher` to find differences between the original and corrected sentences and maps these differences to predefined error types with explanations.
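The diffing step at the heart of such a judge can be sketched with the standard library alone. The helper below (a hypothetical `explain_edits`, not necessarily the notebook's method name) compares token lists and returns the changed spans, which the judge would then map to error types.

```python
import difflib

def explain_edits(original, corrected):
    """Return (op, original_span, corrected_span) tuples for each change."""
    src, tgt = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":  # unchanged spans carry no explanation
            continue
        edits.append((op, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits
```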
Error Mapping
The `error_map` dictionary translates internal error codes (e.g., `R:VERB:TENSE`, `R:PUNCT`) into descriptive explanations.
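A few illustrative entries (the notebook's actual mapping and wording may differ), together with a lookup that falls back to a generic message for unmapped codes:

```python
# Illustrative error_map entries; codes follow the R:/M:/U: (replace/
# missing/unnecessary) convention used by GEC error typing.
error_map = {
    "R:VERB:TENSE": "Replaced a verb with the correct tense.",
    "R:PUNCT": "Replaced incorrect punctuation.",
    "M:DET": "Added a missing determiner.",
    "U:PREP": "Removed an unnecessary preposition.",
}

def explain(code):
    # Fall back to a generic message for codes not in the map.
    return error_map.get(code, f"Corrected an error of type {code}.")
```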
Evaluation
The `evaluate_xgec` function provides metrics for both correction performance and explanation quality:
- Correction Metrics: Accuracy, Precision, Recall, and F0.5 score, comparing hypotheses (model's corrections) against references (ground truth).
- Explanation Quality: BERTScore F1 for semantic similarity between generated explanations and reference explanations, and Error Type Accuracy for how well the model's identified error types match the ground truth.
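The F0.5 computation can be sketched as below, assuming edit-level true positive, false positive, and false negative counts as in standard GEC scoring. F0.5 weights precision twice as heavily as recall, reflecting that an unnecessary "correction" is worse than a missed error.

```python
# Sketch of precision/recall/F-beta from edit-level counts (assumed inputs).
def f_beta(tp, fp, fn, beta=0.5):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0, precision, recall
    b2 = beta ** 2
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return score, precision, recall
```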
Save and Download Model
After training, the fine-tuned model and tokenizer are saved to a local directory. This directory is then compressed into a `gec_model.tar.gz` file, which can be downloaded using `google.colab.files.download` for deployment or further use.
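The export step can be sketched as follows. The model/tokenizer saving is shown as comments (it requires the trained `transformers` objects), while the archiving part is stdlib-only.

```python
import tarfile

# model.save_pretrained("gec_model")      # writes config + weights
# tokenizer.save_pretrained("gec_model")  # writes vocab + tokenizer config

def archive_model(model_dir, archive_path="gec_model.tar.gz"):
    """Compress the saved model directory into a .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname="gec_model")
    return archive_path

# In Colab, the archive can then be downloaded with:
#   from google.colab import files
#   files.download("gec_model.tar.gz")
```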
Model tree for ATG2222/expectedgec
Base model: `vennify/t5-base-grammar-correction`