Grammar Error Correction (GEC) with T5

This notebook demonstrates how to fine-tune and use a T5-based model for Grammar Error Correction (GEC). It leverages the transformers library from Hugging Face for model handling and training.

Table of Contents

  1. Introduction
  2. Dataset
  3. Model Training
  4. Explainable AI Judge
  5. Evaluation
  6. Save and Download Model

Introduction

This project focuses on building a Grammar Error Correction system using a pre-trained T5 model. GEC is the task of identifying and correcting grammatical errors in text. The notebook covers:

  1. Data Preparation: Extracting and parsing M2 format datasets.
  2. Model Loading: Loading a T5 model and tokenizer from Hugging Face.
  3. Fine-tuning: Training the T5 model on the prepared GEC dataset.
  4. Explainable AI: Implementing a custom ExplainableAIJudge to provide rationales for corrections.
  5. Evaluation: Setting up a basic evaluation framework for GEC and explanation quality.
  6. Model Export: Saving and downloading the fine-tuned model.

Dataset

The notebook uses training data from two sources, both distributed as M2 format files:

  • CoNLL-14 Shared Task Data: Used for training (conll14st-test-data.tar.gz).
  • WI+LOCNESS M2 Data (BEA-19): Also used for training (wi+locness_v2.1.bea19.tar.gz).

The parse_m2 function extracts source sentences and their corresponding target corrections from these files. The data is then transformed into a format suitable for sequence-to-sequence models.
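A minimal sketch of what such a parser might look like (the function name matches the notebook, but the body and the single-annotator assumption are illustrative): each M2 block starts with an `S` source line followed by `A` edit lines of the form `A start end|||type|||correction|||...|||annotator_id`.

```python
def parse_m2(m2_text, annotator="0"):
    """Parse M2-format blocks into (source, corrected) sentence pairs."""
    pairs = []
    for block in m2_text.strip().split("\n\n"):
        lines = block.strip().split("\n")
        if not lines or not lines[0].startswith("S "):
            continue
        tokens = lines[0][2:].split()
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            etype, correction = fields[1], fields[2]
            # Keep one annotator's edits; skip 'noop' placeholder annotations
            if fields[-1] != annotator or etype == "noop" or start == -1:
                continue
            edits.append((start, end, correction))
        corrected = list(tokens)
        # Apply edits right-to-left so earlier token indices stay valid
        for start, end, correction in sorted(edits, reverse=True):
            corrected[start:end] = correction.split() if correction != "-NONE-" else []
        pairs.append((" ".join(tokens), " ".join(corrected)))
    return pairs
```

Applying the edits in reverse order avoids re-indexing the token list after each insertion or deletion.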

Model Training

The core of the GEC system is a T5 model. Specifically, it uses vennify/t5-base-grammar-correction as the base model.

GECDataset Class

A custom GECDataset class prepares the data for the T5 model, tokenizing source and target sentences and ensuring they are padded/truncated to a maximum length. Each source sentence is prefixed with "gec: " to prompt the T5 model for grammar correction.
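A sketch of the idea behind such a class (the constructor signature and details are assumed, not the notebook's exact code). A plain class implementing `__len__` and `__getitem__` duck-types as a map-style dataset for the Hugging Face Trainer:

```python
class GECDataset:
    """Pairs of (source, target) sentences tokenized for T5 fine-tuning."""

    def __init__(self, sources, targets, tokenizer, max_length=128):
        self.sources, self.targets = sources, targets
        self.tokenizer, self.max_length = tokenizer, max_length

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        # Prefix the source so T5 knows which task to perform
        enc = self.tokenizer("gec: " + self.sources[idx],
                             max_length=self.max_length,
                             padding="max_length", truncation=True)
        tgt = self.tokenizer(self.targets[idx],
                             max_length=self.max_length,
                             padding="max_length", truncation=True)
        # Replace pad tokens in the labels with -100 so the loss ignores them
        labels = [t if t != self.tokenizer.pad_token_id else -100
                  for t in tgt["input_ids"]]
        return {"input_ids": enc["input_ids"],
                "attention_mask": enc["attention_mask"],
                "labels": labels}
```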

Trainer from Hugging Face

The Trainer API from transformers is used for fine-tuning the T5 model. It handles the training loop, logging, and model saving.
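A typical wiring of this setup might look as follows; this is a configuration sketch, and the hyperparameters shown are illustrative rather than the notebook's actual values:

```python
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          Trainer, TrainingArguments)

model_name = "vennify/t5-base-grammar-correction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="gec_model",          # checkpoints land here
    num_train_epochs=3,              # illustrative value
    per_device_train_batch_size=8,   # illustrative value
    logging_steps=100,
    save_strategy="epoch",
)

# train_dataset is the prepared GECDataset described above
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```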

Explainable AI Judge

An ExplainableAIJudge class is implemented to not only correct grammar but also provide human-readable explanations for the changes. It leverages difflib.SequenceMatcher to find differences between the original and corrected sentences and maps these differences to predefined error types with explanations.
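The diffing step can be sketched in a few lines; this is a minimal illustration of the SequenceMatcher approach, not the notebook's actual `ExplainableAIJudge` code:

```python
from difflib import SequenceMatcher

def explain_edits(original, corrected):
    """Describe token-level differences between two sentences."""
    orig_toks, corr_toks = original.split(), corrected.split()
    matcher = SequenceMatcher(None, orig_toks, corr_toks)
    explanations = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            explanations.append(
                f"Replaced '{' '.join(orig_toks[i1:i2])}' "
                f"with '{' '.join(corr_toks[j1:j2])}'")
        elif op == "delete":
            explanations.append(f"Removed '{' '.join(orig_toks[i1:i2])}'")
        elif op == "insert":
            explanations.append(f"Inserted '{' '.join(corr_toks[j1:j2])}'")
    return explanations
```

In the full judge, each diff span would additionally be classified into an error type before being mapped to an explanation.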

Error Mapping

The error_map dictionary translates internal error codes (e.g., R:VERB:TENSE, R:PUNCT) into descriptive explanations.
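A hypothetical subset of such a mapping (the codes follow the ERRANT-style scheme the section names; the wording of the explanations is illustrative), with a fallback for unmapped codes:

```python
# Illustrative subset of the notebook's error_map
error_map = {
    "R:VERB:TENSE": "The verb tense was changed to fit the context.",
    "R:VERB:SVA": "The verb was changed to agree with its subject.",
    "R:PUNCT": "Punctuation was corrected.",
    "M:DET": "A missing determiner (e.g. 'a' or 'the') was added.",
    "U:PREP": "An unnecessary preposition was removed.",
}

def explain(code):
    # Fall back to a generic message for codes not in the map
    return error_map.get(code, f"An edit of type '{code}' was applied.")
```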

Evaluation

The evaluate_xgec function provides metrics for both correction performance and explanation quality:

  • Correction Metrics: Accuracy, Precision, Recall, and F0.5 score (F0.5 weights precision more heavily than recall, the standard choice in GEC), comparing hypotheses (the model's corrections) against references (ground truth).
  • Explanation Quality: BERTScore F1 for semantic similarity between generated explanations and reference explanations, and Error Type Accuracy for how well the model's identified error types match the ground truth.
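The correction metrics above derive from true-positive, false-positive, and false-negative edit counts. A sketch of the F-beta computation (how the notebook counts matching edits is assumed, but the formula is standard):

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from edit counts; beta=0.5 favors precision over recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p + r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)
```

With beta=0.5, precision counts roughly twice as much as recall, reflecting that a GEC system should avoid introducing bad corrections even at the cost of missing some errors.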

Save and Download Model

After training, the fine-tuned model and tokenizer are saved to a local directory. This directory is then compressed into a gec_model.tar.gz file, which can be downloaded using google.colab.files.download for deployment or further use.
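The export step can be sketched as follows (the helper name `export_model` is hypothetical; `tarfile` is the standard-library way to produce the archive):

```python
import os
import tarfile

def export_model(model_dir, archive_path="gec_model.tar.gz"):
    """Compress a saved model directory into a .tar.gz archive."""
    # In the notebook the directory is produced first by:
    #   model.save_pretrained(model_dir); tokenizer.save_pretrained(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=os.path.basename(model_dir))
    return archive_path

# In Colab the archive can then be fetched with:
#   from google.colab import files; files.download("gec_model.tar.gz")
```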
