---
license: apache-2.0
base_model: vennify/t5-base-grammar-correction
tags:
- grammar-correction
- t5
- explainable-ai
- text2text-generation
datasets:
- jfleg
widget:
- text: "gec: Hey how are you"
example_title: "Basic Correction"
---
# Grammar Error Correction (GEC) with T5
This notebook demonstrates how to fine-tune and use a T5-based model for Grammar Error Correction (GEC). It leverages the `transformers` library from Hugging Face for model handling and training.
## Table of Contents
- [Introduction](#introduction)
- [Dataset](#dataset)
- [Model Training](#model-training)
- [Explainable AI Judge](#explainable-ai-judge)
- [Evaluation](#evaluation)
- [Save and Download Model](#save-and-download-model)
## Introduction
This project focuses on building a Grammar Error Correction system using a pre-trained T5 model. GEC is the task of identifying and correcting grammatical errors in text. The notebook covers:
1. **Data Preparation**: Extracting and parsing M2 format datasets.
2. **Model Loading**: Loading a T5 model and tokenizer from Hugging Face.
3. **Fine-tuning**: Training the T5 model on the prepared GEC dataset.
4. **Explainable AI**: Implementing a custom `ExplainableAIJudge` to provide rationales for corrections.
5. **Evaluation**: Setting up a basic evaluation framework for GEC and explanation quality.
6. **Model Export**: Saving and downloading the fine-tuned model.
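As a quick orientation before the details below, correction can be run through the `transformers` pipeline API. This is a minimal sketch: the model id `ATG2222/expectedgec` is an assumption based on this repository's name, and `max_length=128` is an illustrative setting, not a value confirmed by the notebook.

```python
from transformers import pipeline


def correct(text, model_name="ATG2222/expectedgec"):
    """Correct grammar with the fine-tuned T5 model.

    NOTE: model_name is a hypothetical repo id; the "gec: " task
    prefix is required, matching how the model was fine-tuned.
    """
    gec = pipeline("text2text-generation", model=model_name)
    return gec("gec: " + text, max_length=128)[0]["generated_text"]
```

The function loads the pipeline on each call for simplicity; in practice you would construct the pipeline once and reuse it.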
## Dataset
The notebook draws training data from two sources, both distributed as M2-format files:
- **CoNLL-14 Shared Task Data**: Used for training (`conll14st-test-data.tar.gz`).
- **WI+LOCNESS M2 Data (BEA-19)**: Also used for training (`wi+locness_v2.1.bea19.tar.gz`).
The `parse_m2` function extracts source sentences and their corresponding target corrections from these files. The data is then transformed into a format suitable for sequence-to-sequence models.
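In the M2 format, each sentence is an `S` line followed by `A` annotation lines of the form `start end|||type|||correction|||...`. The sketch below shows how such a parser might work; the notebook's actual `parse_m2` may handle multi-annotator files and edge cases differently.

```python
def parse_m2(m2_text):
    """Parse M2-format blocks into (source, corrected) sentence pairs.

    A hypothetical re-implementation sketch, not the notebook's exact code.
    """
    pairs = []
    for block in m2_text.strip().split("\n\n"):
        lines = block.splitlines()
        tokens = lines[0][2:].split()  # the "S " line holds the source
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            span, etype, repl = line[2:].split("|||")[:3]
            start, end = map(int, span.split())
            if etype == "noop" or start == -1:  # sentence already correct
                continue
            edits.append((start, end, repl))
        # apply edits right-to-left so earlier token offsets stay valid
        corrected = list(tokens)
        for start, end, repl in sorted(edits, reverse=True):
            corrected[start:end] = repl.split() if repl != "-NONE-" else []
        pairs.append((" ".join(tokens), " ".join(corrected)))
    return pairs
```

Applying the edits in reverse order is the key trick: replacing a later span first leaves the indices of earlier spans unchanged.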
## Model Training
The core of the GEC system is a T5 model. Specifically, it uses `vennify/t5-base-grammar-correction` as the base model.
### `GECDataset` Class
A custom `GECDataset` class prepares the data for the T5 model, tokenizing source and target sentences and ensuring they are padded/truncated to a maximum length. Each source sentence is prefixed with `"gec: "` to prompt the T5 model for grammar correction.
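The class might look roughly like the following. This is a duck-typed sketch: the notebook's version subclasses `torch.utils.data.Dataset` and returns tensors, and the keyword arguments assume a Hugging Face tokenizer interface.

```python
class GECDataset:
    """Tokenized (source, target) pairs for seq2seq training (sketch)."""

    PREFIX = "gec: "  # task prefix that prompts T5 for grammar correction

    def __init__(self, sources, targets, tokenizer, max_length=128):
        self.sources = sources
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.PREFIX + self.sources[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
        )
        dec = self.tokenizer(
            self.targets[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
        )
        return {
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": dec["input_ids"],  # target ids become the labels
        }
```

The `labels` key is what `Trainer` uses to compute the seq2seq loss.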
### `Trainer` from Hugging Face
The `Trainer` API from `transformers` is used for fine-tuning the T5 model. It handles the training loop, logging, and model saving.
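Wiring the pieces together might look like this. The hyperparameter values are purely illustrative, not the notebook's actual settings.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def build_trainer(train_dataset,
                  model_name="vennify/t5-base-grammar-correction",
                  output_dir="./gec_model"):
    """Assemble a Trainer for fine-tuning (illustrative hyperparameters)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,             # illustrative, not the
        per_device_train_batch_size=8,  # notebook's exact values
        learning_rate=3e-4,
        logging_steps=50,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    return trainer, tokenizer
```

Calling `trainer.train()` then runs the fine-tuning loop with logging and checkpointing handled by the library.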
## Explainable AI Judge
An `ExplainableAIJudge` class is implemented not only to correct grammar but also to provide human-readable explanations for the changes. It uses `difflib.SequenceMatcher` to find differences between the original and corrected sentences and maps these differences to predefined error types with explanations.
### Error Mapping
The `error_map` dictionary translates internal error codes (e.g., `R:VERB:TENSE`, `R:PUNCT`) into descriptive explanations.
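The diff-to-explanation step can be sketched as follows. For simplicity, this toy `ERROR_MAP` is keyed on `difflib` opcode names; the notebook's `error_map` is keyed on ERRANT-style codes such as `R:VERB:TENSE` and is considerably larger.

```python
import difflib

# Illustrative subset; the notebook keys on codes like R:VERB:TENSE.
ERROR_MAP = {
    "replace": "replaced '{old}' with '{new}'",
    "delete": "removed unnecessary '{old}'",
    "insert": "inserted missing '{new}'",
}


def explain_changes(original, corrected):
    """Map token-level diffs to human-readable explanations (sketch)."""
    src, tgt = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(None, src, tgt)
    explanations = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        explanations.append(ERROR_MAP[op].format(
            old=" ".join(src[i1:i2]), new=" ".join(tgt[j1:j2])))
    return explanations
```

Each non-`equal` opcode from `get_opcodes()` identifies one edited span, which is then rendered through the map.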
## Evaluation
The `evaluate_xgec` function provides metrics for both correction performance and explanation quality:
- **Correction Metrics**: Accuracy, Precision, Recall, and F0.5 score, comparing hypotheses (the model's corrections) against references (ground truth).
- **Explanation Quality**: BERTScore F1 for semantic similarity between generated and reference explanations, plus Error Type Accuracy, measuring how often the model's identified error types match the ground truth.
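The F0.5 score weights precision twice as heavily as recall, reflecting that in GEC a wrong "correction" is worse than a missed one. A minimal sketch of the computation (`f_beta` is a hypothetical helper name, not the notebook's API):

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Precision, recall and F-beta from edit counts; F0.5 favours precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

For example, with 8 true positives, 2 false positives, and 4 false negatives, precision is 0.8, recall is 2/3, and F0.5 is 10/13 ≈ 0.769, noticeably closer to precision than to recall.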
## Save and Download Model
After training, the fine-tuned model and tokenizer are saved to a local directory. This directory is then compressed into a `gec_model.tar.gz` file, which can be downloaded using `google.colab.files.download` for deployment or further use.
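The export step can be sketched with the standard library's `tarfile` module. This assumes the model and tokenizer have already been written to `model_dir` via `save_pretrained`; `export_model` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def export_model(model_dir, archive_path="gec_model.tar.gz"):
    """Compress a saved model directory into a tar.gz for download.

    In the notebook this follows model.save_pretrained(model_dir) and
    tokenizer.save_pretrained(model_dir); on Colab the archive is then
    fetched with google.colab.files.download(archive_path).
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=Path(model_dir).name)
    return archive_path
```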