---
license: apache-2.0
base_model: vennify/t5-base-grammar-correction
tags:
- grammar-correction
- t5
- explainable-ai
- text2text-generation
datasets:
- jfleg
widget:
- text: "gec: Hey how are you"
  example_title: "Basic Correction"
---

# Grammar Error Correction (GEC) with T5

This notebook demonstrates how to fine-tune and use a T5-based model for Grammar Error Correction (GEC). It uses the Hugging Face `transformers` library to load, fine-tune, and run the model.

## Table of Contents

- [Introduction](#introduction)
- [Dataset](#dataset)
- [Model Training](#model-training)
- [Explainable AI Judge](#explainable-ai-judge)
- [Evaluation](#evaluation)
- [Save and Download Model](#save-and-download-model)

## Introduction

This project focuses on building a Grammar Error Correction system using a pre-trained T5 model. GEC is the task of identifying and correcting grammatical errors in text. The notebook covers:

1.  **Data Preparation**: Extracting and parsing M2 format datasets.
2.  **Model Loading**: Loading a T5 model and tokenizer from Hugging Face.
3.  **Fine-tuning**: Training the T5 model on the prepared GEC dataset.
4.  **Explainable AI**: Implementing a custom `ExplainableAIJudge` to provide rationales for corrections.
5.  **Evaluation**: Setting up a basic evaluation framework for GEC and explanation quality.
6.  **Model Export**: Saving and downloading the fine-tuned model.

## Dataset

The notebook uses data from two sources, primarily as M2-format files, in which an `S` line holds a tokenized source sentence and the following `A` lines encode its edits:

-   **CoNLL-14 Shared Task Data**: Used for training (`conll14st-test-data.tar.gz`).
-   **WI+LOCNESS M2 Data (BEA-19)**: Also used for training (`wi+locness_v2.1.bea19.tar.gz`).

The `parse_m2` function extracts source sentences and their corresponding target corrections from these files. The data is then transformed into a format suitable for sequence-to-sequence models.
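
To make the M2 parsing step concrete, here is a minimal sketch of how source/target pairs could be recovered from M2 text. It is not the notebook's `parse_m2` implementation: the function name `parse_m2_text`, the `annotator_id` parameter, and the choice to keep only one annotator's edits are assumptions made for illustration.

```python
import re


def parse_m2_text(text, annotator_id=0):
    """Return (source, target) sentence pairs from M2-formatted text.

    Hypothetical helper: keeps only the edits of one annotator (assumption).
    """
    pairs = []
    # Sentence blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().split("\n")
        if not lines or not lines[0].startswith("S "):
            continue
        source_tokens = lines[0][2:].split()
        target_tokens = list(source_tokens)
        offset = 0  # how far earlier edits have shifted token positions
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            # Fields: span ||| type ||| correction ||| ... ||| annotator
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            error_type, correction = fields[1], fields[2]
            if int(fields[-1]) != annotator_id:
                continue
            if error_type == "noop" or start < 0:
                continue  # "noop" marks a sentence with no corrections
            # An empty or "-NONE-" correction means the span is deleted.
            replacement = [] if correction in ("", "-NONE-") else correction.split()
            target_tokens[start + offset:end + offset] = replacement
            offset += len(replacement) - (end - start)
        pairs.append((" ".join(source_tokens), " ".join(target_tokens)))
    return pairs
```

The sketch assumes the `A` lines of each sentence are sorted by start position and do not overlap, which is how M2 files are normally written.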

## Model Training

The GEC system is built on T5, starting from the `vennify/t5-base-grammar-correction` checkpoint as the base model.

### `GECDataset` Class

A custom `GECDataset` class prepares the data for the T5 model, tokenizing source and target sentences and ensuring they are padded/truncated to a maximum length. Each source sentence is prefixed with `"gec: "` to prompt the T5 model for grammar correction.
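
The description above could look roughly like the following sketch. It is an illustration under stated assumptions, not the notebook's class: the name `GECDatasetSketch`, the default `max_length=128`, and the replacement of padding positions in the labels with `-100` (so the loss ignores them) are all assumptions. The class is kept as plain Python with `__len__` and `__getitem__`, whereas the notebook's version may subclass `torch.utils.data.Dataset` and return tensors.

```python
class GECDatasetSketch:
    """Map-style dataset of tokenized (source, target) pairs (illustrative)."""

    def __init__(self, sources, targets, tokenizer, max_length=128, prefix="gec: "):
        assert len(sources) == len(targets), "sources and targets must align"
        self.sources = sources
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length  # assumption: not stated in the README
        self.prefix = prefix          # task prefix described in the README

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        # Prefix the source so the model is prompted for grammar correction.
        enc = self.tokenizer(
            self.prefix + self.sources[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
        )
        tgt = self.tokenizer(
            self.targets[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
        )
        pad_id = self.tokenizer.pad_token_id
        # Assumption: mask padding in the labels with -100 so it is ignored by the loss.
        labels = [tok if tok != pad_id else -100 for tok in tgt["input_ids"]]
        return {
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": labels,
        }
```

Any Hugging Face tokenizer, for example one loaded with `AutoTokenizer.from_pretrained`, can be passed as `tokenizer`.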

### `Trainer` from Hugging Face

The `Trainer` API from `transformers` is used for fine-tuning the T5 model. It handles the training loop, logging, and model saving.
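
A fine-tuning setup along these lines might be wired up as sketched below. Every hyperparameter value, the `TRAINING_KWARGS` name, and the `build_trainer` helper are assumptions for illustration; the README does not state the settings the notebook actually uses.

```python
# Assumed hyperparameters; the notebook's real values are not given in this README.
TRAINING_KWARGS = {
    "output_dir": "gec_model",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "learning_rate": 3e-4,
    "logging_steps": 50,
    "save_strategy": "epoch",
    "report_to": "none",  # disable external experiment trackers
}


def build_trainer(model, train_dataset, training_kwargs=None):
    """Create a Trainer for a seq2seq model and a tokenized dataset (hypothetical helper)."""
    # Imported lazily so this module loads even where transformers is not installed.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(**(training_kwargs or TRAINING_KWARGS))
    return Trainer(model=model, args=args, train_dataset=train_dataset)
```

Calling `build_trainer(model, train_dataset).train()` would then run the training loop, with checkpoints written under `output_dir`.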

## Explainable AI Judge

The `ExplainableAIJudge` class corrects grammar and also produces a human-readable explanation for each change. It uses `difflib.SequenceMatcher` to find the differences between the original and corrected sentences, then maps each difference to a predefined error type with an explanation.

### Error Mapping

The `error_map` dictionary translates internal error codes (e.g., `R:VERB:TENSE`, `R:PUNCT`) into descriptive explanations.
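
The diff-and-map idea can be sketched as follows. This is not the `ExplainableAIJudge` implementation: the function `explain_edits`, the `ERROR_MAP` entries, and the simple heuristic that assigns error codes are assumptions. Only `R:PUNCT` comes from the examples above; the other codes are hypothetical placeholders in the same style.

```python
import difflib
import string

# Hypothetical mapping; the notebook's error_map may contain different entries.
ERROR_MAP = {
    "R:PUNCT": "Punctuation was replaced.",
    "R:OTHER": "A word or phrase was replaced.",
    "M:OTHER": "A missing word or phrase was inserted.",
    "U:OTHER": "An unnecessary word or phrase was removed.",
}


def _is_punct(tokens):
    """True if every character of every token is ASCII punctuation."""
    return bool(tokens) and all(ch in string.punctuation for tok in tokens for ch in tok)


def explain_edits(original, corrected):
    """List token-level edits between two sentences with an explanation each."""
    src, tgt = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    explanations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = src[i1:i2], tgt[j1:j2]
        if tag == "replace":
            # Assumed heuristic: punctuation-only swaps get their own code.
            code = "R:PUNCT" if _is_punct(old) and _is_punct(new) else "R:OTHER"
        elif tag == "insert":
            code = "M:OTHER"
        else:  # tag == "delete"
            code = "U:OTHER"
        explanations.append({
            "type": code,
            "original": " ".join(old),
            "corrected": " ".join(new),
            "explanation": ERROR_MAP[code],
        })
    return explanations
```

A finer-grained code such as `R:VERB:TENSE` would need linguistic information (for example part-of-speech tags) that plain token diffing does not provide, so this sketch falls back to generic codes.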

## Evaluation

The `evaluate_xgec` function provides metrics for both correction performance and explanation quality:

-   **Correction Metrics**: Accuracy, Precision, Recall, and F0.5 score, comparing hypotheses (model's corrections) against references (ground truth).
-   **Explanation Quality**: BERTScore F1 for semantic similarity between generated explanations and reference explanations, and Error Type Accuracy for how well the model's identified error types match the ground truth.
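
For the correction metrics, the following sketch shows how precision, recall, and F0.5 could be computed from sets of edits. It is not the `evaluate_xgec` function: the name `correction_scores`, the representation of an edit as a `(start, end, correction)` tuple, and the convention of scoring an undefined precision or recall as 1.0 are assumptions. F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall.

```python
def correction_scores(hyp_edits, ref_edits, beta=0.5):
    """Precision, recall, and F-beta over hypothesis vs. reference edit sets."""
    hyp_edits, ref_edits = set(hyp_edits), set(ref_edits)
    tp = len(hyp_edits & ref_edits)  # edits proposed and in the reference
    fp = len(hyp_edits - ref_edits)  # edits proposed but not in the reference
    fn = len(ref_edits - hyp_edits)  # reference edits the model missed
    # Assumed convention: an undefined ratio (empty denominator) counts as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    denom = beta ** 2 * precision + recall
    f_score = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_beta": f_score}
```

BERTScore and Error Type Accuracy are not covered here because they depend on reference explanations and an external scoring model.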

## Save and Download Model

After training, the fine-tuned model and tokenizer are saved to a local directory. This directory is then compressed into a `gec_model.tar.gz` file, which can be downloaded using `google.colab.files.download` for deployment or further use.
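
The export step can be sketched with the standard library's `tarfile` module. The helper name `archive_model_dir` and the directory name `gec_model` are assumptions; the commented lines show where `save_pretrained` and the Colab download call would typically go, assuming `model` and `tokenizer` are the fine-tuned objects.

```python
import tarfile
from pathlib import Path


def archive_model_dir(model_dir, archive_path="gec_model.tar.gz"):
    """Compress a saved model directory into a .tar.gz archive (hypothetical helper)."""
    model_dir = Path(model_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(str(model_dir), arcname=model_dir.name)
    return Path(archive_path)


# Typical use in a Colab notebook (assumes `model` and `tokenizer` exist):
# model.save_pretrained("gec_model")
# tokenizer.save_pretrained("gec_model")
# archive = archive_model_dir("gec_model")
# from google.colab import files
# files.download(str(archive))
```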