---
title: MGT Detection
emoji: 🐠
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: MGT-Detection
---

# MGT-Detection

## Overview
MGT-Detection (Machine-Generated Text Detection) is a project designed to classify and detect whether a given text is human-written or machine-generated. The project leverages state-of-the-art natural language processing (NLP) models and pipelines to achieve accurate classification results. It includes tools for training, evaluating, and deploying models for text classification tasks.

## Features
- **Text Classification**: Detects whether a text is human-written or machine-generated.
- **Model Training Pipeline**: Includes hyperparameter optimization, dataset preparation, and model training.
- **Evaluation**: Provides metrics such as accuracy, precision, recall, and F1 score.
- **Dataset Management**: Tools for preparing and tokenizing datasets.
- **Model Deployment**: Save and load fine-tuned models for deployment.
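
The evaluation metrics listed above can be sketched in plain Python for the binary case (assuming label `1` means machine-generated; the real pipeline computes these with Scikit-learn):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = machine-generated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# One missed machine-generated sample: perfect precision, recall of 0.5
print(binary_metrics(y_true=[1, 0, 1, 0], y_pred=[1, 0, 0, 0]))
```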

## Project Structure
```
MGT-Detection/
β”œβ”€β”€ app.py                # Main application for text classification
β”œβ”€β”€ pipeline/
β”‚   β”œβ”€β”€ dataset.py        # Dataset preparation and management
β”‚   β”œβ”€β”€ model_pipeline.py # Model training and evaluation pipeline
β”‚   β”œβ”€β”€ main.py           # Entry point for running the training pipeline
β”œβ”€β”€ samples.json          # Sample dataset for testing
```
## Usage
### Running the Application
To launch the text classification application:
```bash
python app.py
```

### Training a Model
To train a model using the pipeline:
```bash
python pipeline/main.py \
  --file_path <path_to_dataset> \
  --out_path <output_directory> \
  --model_name <model_name> \
  --num_labels 2 \
  --sample_frac 1.0 \
  --num_trials 5 \
  --num_epochs 5
```
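
The `--num_trials` flag controls how many hyperparameter configurations are tried (the pipeline uses Optuna for this). A minimal stdlib sketch of the idea, with a hypothetical `objective` standing in for a full train-plus-validate run:

```python
import random

def objective(lr, batch_size):
    """Hypothetical stand-in for fine-tuning + validation; returns a score to maximize."""
    # The real pipeline would train the model and return a validation metric (e.g. F1).
    return 1.0 - abs(lr - 3e-5) * 1e4 - abs(batch_size - 16) / 64

def random_search(num_trials, seed=0):
    """Try num_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(num_trials):
        params = {
            "lr": rng.choice([1e-5, 2e-5, 3e-5, 5e-5]),
            "batch_size": rng.choice([8, 16, 32]),
        }
        score = objective(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

best_params, best_score = random_search(num_trials=5)
print(best_params, best_score)
```

Optuna improves on this plain random search by pruning bad trials early and sampling promising regions of the search space more often.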

### Dataset Preparation
Ensure your dataset is in JSON format with the following structure:
```json
[
  {
    "text": "<text_sample>",
    "label": "<label>"
  },
  ...
]
```
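
Before training, it is worth sanity-checking the file against this structure. A stdlib-only sketch (the label encoding is an assumption, commonly `0` for human-written and `1` for machine-generated):

```python
import json

def load_dataset(path):
    """Load a JSON dataset and verify every record has "text" and "label" keys."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for i, rec in enumerate(records):
        if "text" not in rec or "label" not in rec:
            raise ValueError(f"record {i} is missing 'text' or 'label'")
    return records

# Example:
# samples = load_dataset("samples.json")
# print(len(samples), samples[0]["label"])
```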

## Key Components
### `app.py`
- Provides a user interface for classifying text as human-written or machine-generated.

### `pipeline/model_pipeline.py`
- Contains functions for model training, hyperparameter optimization, and evaluation.

### `pipeline/dataset.py`
- Handles dataset preparation, tokenization, and saving/loading datasets.

### `samples.json`
- A sample dataset for testing the application.

## Requirements
- Python 3.8+
- Transformers
- Datasets
- Optuna
- Gradio
- Scikit-learn

## Contributing
Contributions are welcome! Please fork the repository and submit a pull request with your changes.

## License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.

## Acknowledgments
- Hugging Face Transformers
- Optuna for hyperparameter optimization
- Gradio for building the user interface