title: MGT Detection
emoji: 🐠
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: MGT-Detection

MGT-Detection

Overview

MGT-Detection (Machine-Generated Text Detection) classifies a given text as human-written or machine-generated. It fine-tunes pretrained language models with Hugging Face Transformers and includes tools for training, evaluating, and deploying text classification models.

Features

  • Text Classification: Detects whether a text is human-written or machine-generated.
  • Model Training Pipeline: Includes hyperparameter optimization, dataset preparation, and model training.
  • Evaluation: Provides metrics such as accuracy, precision, recall, and F1 score.
  • Dataset Management: Tools for preparing and tokenizing datasets.
  • Model Deployment: Save and load fine-tuned models for deployment.

Project Structure

MGT-Detection/
├── app.py                # Main application for text classification
├── pipeline/
│   ├── dataset.py        # Dataset preparation and management
│   ├── model_pipeline.py # Model training and evaluation pipeline
│   ├── main.py           # Entry point for running the training pipeline
├── samples.json          # Sample dataset for testing

Usage

Running the Application

To launch the text classification application:

python app.py

Training a Model

To train a model using the pipeline:

python pipeline/main.py \
  --file_path <path_to_dataset> \
  --out_path <output_directory> \
  --model_name <model_name> \
  --num_labels 2 \
  --sample_frac 1.0 \
  --num_trials 5 \
  --num_epochs 5
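
For orientation, the flags in the command above could be parsed along the following lines. This is a minimal sketch, not the contents of pipeline/main.py: the flag names come from the command, while the types, defaults, and help strings are assumptions based on the example values.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the training command."""
    parser = argparse.ArgumentParser(description="Train a text classifier (sketch).")
    # Types, defaults, and help text are assumptions, not taken from pipeline/main.py.
    parser.add_argument("--file_path", required=True, help="Path to the JSON dataset")
    parser.add_argument("--out_path", required=True, help="Output directory")
    parser.add_argument("--model_name", required=True, help="Name of the model to train")
    parser.add_argument("--num_labels", type=int, default=2, help="Number of classes")
    parser.add_argument("--sample_frac", type=float, default=1.0, help="Fraction of the dataset to use")
    parser.add_argument("--num_trials", type=int, default=5, help="Hyperparameter search trials")
    parser.add_argument("--num_epochs", type=int, default=5, help="Training epochs")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
# The file, directory, and model names here are placeholders.
args = build_parser().parse_args(
    ["--file_path", "data.json", "--out_path", "out", "--model_name", "some-model"]
)
print(args.num_labels, args.sample_frac, args.num_trials, args.num_epochs)
```

Passing an explicit list to `parse_args` keeps the example independent of `sys.argv`. A real entry point would call `parse_args()` with no arguments.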

Dataset Preparation

Ensure your dataset is in JSON format with the following structure:

[
  {
    "text": "<text_sample>",
    "label": "<label>"
  },
  ...
]
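
A quick structural check before training can catch malformed files early. The sketch below uses only the standard library. The function names are hypothetical and not part of pipeline/dataset.py, and the label value "human" in the example is a placeholder, since the accepted label values are not specified here.

```python
import json
from typing import Dict, List


def validate_records(records: object) -> List[Dict[str, object]]:
    """Check that records is a list of objects with "text" and "label" keys."""
    if not isinstance(records, list):
        raise ValueError("Dataset must be a JSON array of objects")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"Record {i} is not a JSON object")
        missing = {"text", "label"} - set(record)
        if missing:
            raise ValueError(f"Record {i} is missing keys: {sorted(missing)}")
        if not isinstance(record["text"], str):
            raise ValueError(f"Record {i}: 'text' must be a string")
    return records


def load_dataset_file(path: str) -> List[Dict[str, object]]:
    """Read a JSON dataset file and validate its structure."""
    with open(path, encoding="utf-8") as f:
        return validate_records(json.load(f))


# Example with an in-memory document instead of a file on disk.
sample = json.loads('[{"text": "An example sentence.", "label": "human"}]')
print(len(validate_records(sample)))  # 1
```

Note that strict JSON parsers reject trailing commas, so the last key in each object must not be followed by one.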

Key Components

app.py

  • Provides a user interface for classifying text as human-written or machine-generated.
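
As an illustration of the step between model output and displayed result, the sketch below turns two raw scores into per-label probabilities with a softmax. It is not taken from app.py: the label names, their order, and the function names are assumptions, and the model call and Gradio wiring are omitted.

```python
import math
from typing import Dict, List

# Assumed label order; the real mapping depends on how the model was trained.
LABELS = ["human-written", "machine-generated"]


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_label_scores(logits: List[float]) -> Dict[str, float]:
    """Pair each label name with its probability."""
    return dict(zip(LABELS, softmax(logits)))


scores = to_label_scores([0.0, 0.0])
print(scores)  # {'human-written': 0.5, 'machine-generated': 0.5}
```

A dictionary of label names to confidences is a shape that Gradio's Label component can display, which is one reason to return it from a prediction function.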

pipeline/model_pipeline.py

  • Contains functions for model training, hyperparameter optimization, and evaluation.
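
For reference, the evaluation metrics this project reports (accuracy, precision, recall, F1) can be computed for a binary task as sketched below. This is a plain-Python illustration rather than the code in model_pipeline.py, which may rely on scikit-learn instead. Treating label 1 as the positive (machine-generated) class is an assumption.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int], positive: int = 1) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Fall back to 0.0 when a denominator is zero.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Two true positives, one false positive, one false negative, one true negative.
metrics = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(metrics["accuracy"])  # 0.6
```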

pipeline/dataset.py

  • Handles dataset preparation, tokenization, and saving/loading datasets.
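
One plausible reading of the `--sample_frac` training flag is that dataset preparation keeps a random fraction of the records. The sketch below shows that idea with a seeded sampler. The function name, the seed parameter, and the rounding rule are assumptions, and the actual behavior of dataset.py may differ.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_fraction(records: Sequence[T], frac: float, seed: int = 0) -> List[T]:
    """Return a reproducible random subset holding about frac of the records."""
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in the interval (0, 1]")
    if not records:
        return []
    # Keep at least one record so a tiny fraction never yields an empty set.
    k = max(1, round(len(records) * frac))
    rng = random.Random(seed)  # A local generator avoids touching global state.
    return rng.sample(list(records), k)


subset = sample_fraction(list(range(10)), 0.5)
print(len(subset))  # 5
```

With `frac=1.0` the function returns every record in shuffled order, and the same seed always yields the same subset.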

samples.json

  • A sample dataset for testing the application.

Requirements

  • Python 3.8+
  • Transformers
  • Datasets
  • Optuna
  • Gradio
  • Scikit-learn

Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your changes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Hugging Face Transformers
  • Optuna for hyperparameter optimization
  • Gradio for building the user interface