---
license: apache-2.0
base_model: vennify/t5-base-grammar-correction
tags:
- grammar-correction
- t5
- explainable-ai
- text2text-generation
datasets:
- jfleg
widget:
- text: "gec: Hey how are you"
example_title: "Basic Correction"
---
# Grammar Error Correction (GEC) with T5
This notebook demonstrates how to fine-tune and use a T5-based model for Grammar Error Correction (GEC). It leverages the `transformers` library from Hugging Face for model handling and training.
## Table of Contents
- [Introduction](#introduction)
- [Dataset](#dataset)
- [Model Training](#model-training)
- [Explainable AI Judge](#explainable-ai-judge)
- [Evaluation](#evaluation)
- [Save and Download Model](#save-and-download-model)
## Introduction
This project focuses on building a Grammar Error Correction system using a pre-trained T5 model. GEC is the task of identifying and correcting grammatical errors in text. The notebook covers:
1. **Data Preparation**: Extracting and parsing M2 format datasets.
2. **Model Loading**: Loading a T5 model and tokenizer from Hugging Face.
3. **Fine-tuning**: Training the T5 model on the prepared GEC dataset.
4. **Explainable AI**: Implementing a custom `ExplainableAIJudge` to provide rationales for corrections.
5. **Evaluation**: Setting up a basic evaluation framework for GEC and explanation quality.
6. **Model Export**: Saving and downloading the fine-tuned model.
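As a quick orientation before the details below, correction can be run through the `transformers` pipeline API. This is a minimal sketch: the model id `ATG2222/expectedgec` is an assumption based on this repository's name, and `max_length=128` is an illustrative setting, not a value confirmed by the notebook.

```python
from transformers import pipeline


def correct(text, model_name="ATG2222/expectedgec"):
    """Correct grammar with the fine-tuned T5 model.

    NOTE: model_name is a hypothetical repo id; the "gec: " task
    prefix is required, matching how the model was fine-tuned.
    """
    gec = pipeline("text2text-generation", model=model_name)
    return gec("gec: " + text, max_length=128)[0]["generated_text"]
```

The function loads the pipeline on each call for simplicity; in practice you would construct the pipeline once and reuse it.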
## Dataset
The notebook draws training data from two sources, both distributed as M2-format files:
- **CoNLL-14 Shared Task Data**: Used for training (`conll14st-test-data.tar.gz`).
- **WI+LOCNESS M2 Data (BEA-19)**: Also used for training (`wi+locness_v2.1.bea19.tar.gz`).
The `parse_m2` function extracts source sentences and their corresponding target corrections from these files. The data is then transformed into a format suitable for sequence-to-sequence models.
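In the M2 format, each sentence is an `S` line followed by `A` annotation lines of the form `start end|||type|||correction|||...`. The sketch below shows how such a parser might work; the notebook's actual `parse_m2` may handle multi-annotator files and edge cases differently.

```python
def parse_m2(m2_text):
    """Parse M2-format blocks into (source, corrected) sentence pairs.

    A hypothetical re-implementation sketch, not the notebook's exact code.
    """
    pairs = []
    for block in m2_text.strip().split("\n\n"):
        lines = block.splitlines()
        tokens = lines[0][2:].split()  # the "S " line holds the source
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            span, etype, repl = line[2:].split("|||")[:3]
            start, end = map(int, span.split())
            if etype == "noop" or start == -1:  # sentence already correct
                continue
            edits.append((start, end, repl))
        # apply edits right-to-left so earlier token offsets stay valid
        corrected = list(tokens)
        for start, end, repl in sorted(edits, reverse=True):
            corrected[start:end] = repl.split() if repl != "-NONE-" else []
        pairs.append((" ".join(tokens), " ".join(corrected)))
    return pairs
```

Applying the edits in reverse order is the key trick: replacing a later span first leaves the indices of earlier spans unchanged.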
## Model Training
The core of the GEC system is a T5 model. Specifically, it uses `vennify/t5-base-grammar-correction` as the base model.
### `GECDataset` Class
A custom `GECDataset` class prepares the data for the T5 model, tokenizing source and target sentences and ensuring they are padded/truncated to a maximum length. Each source sentence is prefixed with `"gec: "` to prompt the T5 model for grammar correction.
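The class might look roughly like the following. This is a duck-typed sketch: the notebook's version subclasses `torch.utils.data.Dataset` and returns tensors, and the keyword arguments assume a Hugging Face tokenizer interface.

```python
class GECDataset:
    """Tokenized (source, target) pairs for seq2seq training (sketch)."""

    PREFIX = "gec: "  # task prefix that prompts T5 for grammar correction

    def __init__(self, sources, targets, tokenizer, max_length=128):
        self.sources = sources
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.PREFIX + self.sources[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
        )
        dec = self.tokenizer(
            self.targets[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
        )
        return {
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": dec["input_ids"],  # target ids become the labels
        }
```

The `labels` key is what `Trainer` uses to compute the seq2seq loss.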
### `Trainer` from Hugging Face
The `Trainer` API from `transformers` is used for fine-tuning the T5 model. It handles the training loop, logging, and model saving.
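Wiring the pieces together might look like this. The hyperparameter values are purely illustrative, not the notebook's actual settings.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def build_trainer(train_dataset,
                  model_name="vennify/t5-base-grammar-correction",
                  output_dir="./gec_model"):
    """Assemble a Trainer for fine-tuning (illustrative hyperparameters)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,             # illustrative, not the
        per_device_train_batch_size=8,  # notebook's exact values
        learning_rate=3e-4,
        logging_steps=50,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    return trainer, tokenizer
```

Calling `trainer.train()` then runs the fine-tuning loop with logging and checkpointing handled by the library.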
## Explainable AI Judge
An `ExplainableAIJudge` class is implemented not only to correct grammar but also to provide human-readable explanations for the changes. It uses `difflib.SequenceMatcher` to find differences between the original and corrected sentences and maps these differences to predefined error types with explanations.
### Error Mapping
The `error_map` dictionary translates internal error codes (e.g., `R:VERB:TENSE`, `R:PUNCT`) into descriptive explanations.
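The diff-to-explanation step can be sketched as follows. For simplicity, this toy `ERROR_MAP` is keyed on `difflib` opcode names; the notebook's `error_map` is keyed on ERRANT-style codes such as `R:VERB:TENSE` and is considerably larger.

```python
import difflib

# Illustrative subset; the notebook keys on codes like R:VERB:TENSE.
ERROR_MAP = {
    "replace": "replaced '{old}' with '{new}'",
    "delete": "removed unnecessary '{old}'",
    "insert": "inserted missing '{new}'",
}


def explain_changes(original, corrected):
    """Map token-level diffs to human-readable explanations (sketch)."""
    src, tgt = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, src, tgt)
    explanations = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        explanations.append(ERROR_MAP[op].format(
            old=" ".join(src[i1:i2]), new=" ".join(tgt[j1:j2])))
    return explanations
```

Each non-`equal` opcode from `get_opcodes()` identifies one edited span, which is then rendered through the map.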
## Evaluation
The `evaluate_xgec` function provides metrics for both correction performance and explanation quality:
- **Correction Metrics**: Accuracy, Precision, Recall, and F0.5 score, comparing hypotheses (the model's corrections) against references (ground truth).
- **Explanation Quality**: BERTScore F1 for semantic similarity between generated and reference explanations, plus Error Type Accuracy, measuring how often the model's identified error types match the ground truth.
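The F0.5 score weights precision twice as heavily as recall, reflecting that in GEC a wrong "correction" is worse than a missed one. A minimal sketch of the computation (`f_beta` is a hypothetical helper name, not the notebook's API):

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Precision, recall and F-beta from edit counts; F0.5 favours precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

For example, with 8 true positives, 2 false positives, and 4 false negatives, precision is 0.8, recall is 2/3, and F0.5 is 10/13 ≈ 0.769, noticeably closer to precision than to recall.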
## Save and Download Model
After training, the fine-tuned model and tokenizer are saved to a local directory. This directory is then compressed into a `gec_model.tar.gz` file, which can be downloaded using `google.colab.files.download` for deployment or further use.
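The export step can be sketched with the standard library's `tarfile` module. This assumes the model and tokenizer have already been written to `model_dir` via `save_pretrained`; `export_model` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def export_model(model_dir, archive_path="gec_model.tar.gz"):
    """Compress a saved model directory into a tar.gz for download.

    In the notebook this follows model.save_pretrained(model_dir) and
    tokenizer.save_pretrained(model_dir); on Colab the archive is then
    fetched with google.colab.files.download(archive_path).
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=Path(model_dir).name)
    return archive_path
```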