---
title: AITextDetector
app_file: gradio_app.py
sdk: gradio
sdk_version: 5.49.1
---
# AI Text Detector

A learning project for detecting AI-generated vs. human-written text with a modular Python package, YAML configs, GPU auto-detection, CLI, and a **Gradio web interface**.

## 🌐 Web Interface (Gradio)

**Try it now on Google Colab** (works perfectly on Mac M2!):

```python
!pip install -q transformers torch pandas gradio kagglehub
!git clone https://github.com/ChauHPham/AITextDetector.git
%cd AITextDetector
!python gradio_app.py
```

Get a **public shareable link** instantly! See [DEPLOY.md](DEPLOY.md) for deployment options.

### 🍎 Mac M2 Users

**Google Colab is recommended**: local training may fail due to PyTorch MPS mutex lock issues. The Gradio app works great in Colab with a free GPU!

## Quickstart (CLI)

```bash
# 1) Create & activate a virtualenv (recommended)
python -m venv .venv && source .venv/bin/activate

# 2) Install
pip install -r requirements.txt
pip install -e .

# 3) (Optional) Download Kaggle datasets into data/
python scripts/kaggle_downloader.py

# 4) Configure
cp configs/default.yaml configs/local.yaml
# edit local.yaml if desired (change data_path, hyperparams, etc.)

# 5) Train
ai-detector train --data data/dataset.csv --config configs/local.yaml

# 6) Evaluate
ai-detector eval --model-path models/ai_detector --data data/dataset.csv --config configs/local.yaml
```

## Datasets

* LLM Detect AI Generated Text Dataset (Kaggle)
* AI vs Human Text (Kaggle)

Use `scripts/kaggle_downloader.py` to fetch them. You may need to normalize/merge columns; the loader tries common names (`text`, `content`, `essay` and `label`, `class`, `target`).
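
A minimal sketch of that normalization step, assuming the data is loaded into a pandas DataFrame (the candidate column names below are the ones listed above; your CSV may use others):

```python
import pandas as pd

# Common column names the loader looks for (per the list above)
TEXT_CANDIDATES = ["text", "content", "essay"]
LABEL_CANDIDATES = ["label", "class", "target"]

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename the first matching text/label column to 'text'/'label'."""
    def find(candidates):
        for name in candidates:
            if name in df.columns:
                return name
        raise KeyError(f"none of {candidates} found in {list(df.columns)}")

    return df.rename(columns={
        find(TEXT_CANDIDATES): "text",
        find(LABEL_CANDIDATES): "label",
    })[["text", "label"]]
```

After this, every dataset exposes the same two columns, so merging multiple Kaggle sources is a plain `pd.concat`.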

## Config

See `configs/default.yaml`. Key fields:

* `base_model`: e.g., `roberta-base`
* `max_length`, `batch_size`, `num_epochs`, `lr`
* `fp16`: set `null` to auto-enable on CUDA
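
A `local.yaml` might look like the following sketch (the values here are illustrative assumptions, not the shipped defaults; check `configs/default.yaml` for the real ones):

```yaml
base_model: roberta-base
data_path: data/dataset.csv
max_length: 256
batch_size: 16
num_epochs: 3
lr: 2.0e-5
fp16: null   # null = auto-enable mixed precision when CUDA is available
```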

## Notes

* Labels standardized to `0=human`, `1=ai`.
* Mixed precision (fp16) auto-enables on CUDA.
* Evaluate with accuracy, macro-F1, and confusion matrix.
* **Mac M2 users**: Use Google Colab for training (see above) to avoid PyTorch MPS bugs.
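
The evaluation metrics above can be sketched with scikit-learn, assuming arrays of true and predicted labels (`0=human`, `1=ai`):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def evaluate(y_true, y_pred):
    """Compute the three metrics used for evaluation."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        # rows = true class, cols = predicted class, order [human, ai]
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
    }
```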

## Deployment

See [DEPLOY.md](DEPLOY.md) for:
- Google Colab setup (recommended for Mac M2)
- Hugging Face Spaces deployment (`gradio deploy`)
- Docker deployment
- Troubleshooting guide