# 🚀 Training Guide
## Problem
The mutex lock error `[mutex.cc : 452] RAW: Lock blocking...` happens because:
1. HuggingFace Trainer API tries to use multiprocessing
2. macOS doesn't handle multiprocessing from tokenizers well
3. Environment variables alone aren't enough to fix it completely
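To see why environment variables alone fall short: they only take effect if set before `transformers`/`tokenizers` is imported, because by then the tokenizer's Rust thread pool may already be running. A minimal sketch (variable names are standard, but this is not code from this repo):

```python
# Sketch: these variables only help if set BEFORE transformers/tokenizers
# is imported; afterwards the tokenizer's thread pool may already exist.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # disable Rust-side tokenizer threads
os.environ["OMP_NUM_THREADS"] = "1"             # cap OpenMP worker threads

# Only now is it safe to import the libraries that read these variables:
# from transformers import AutoTokenizer
```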
## Solution
### ✅ Recommended: Use the Simple Training Script
The simple training script avoids the Trainer API entirely:
```bash
python scripts/run_train_simple.py
```
**What it does:**
- ✅ No multiprocessing
- ✅ No threading issues
- ✅ Direct PyTorch training loop
- ✅ Works on macOS
- ✅ Same results as the Trainer API
**Output:**
- Trains for 2 epochs
- Shows progress with tqdm
- Saves the model to `models/ai_detector`
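A "direct PyTorch training loop" means plain forward/backward/step calls in a single process, with no Trainer machinery. The sketch below is a toy illustration of that pattern, not the contents of `run_train_simple.py` (the model and data here are placeholders):

```python
import torch
from torch import nn

# Toy stand-ins: the real script fine-tunes roberta-base on tokenized text.
model = nn.Linear(8, 2)                       # placeholder binary classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(64, 8)                        # fake features
y = torch.randint(0, 2, (64,))                # fake 0/1 labels

model.train()
for epoch in range(2):                        # the guide trains for 2 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)               # forward pass
    loss.backward()                           # backward pass, single process
    optimizer.step()                          # parameter update
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```

Everything runs in the main process, so there is no fork for the macOS mutex to trip over.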
### Alternative: Shell Script
```bash
bash train_macos.sh
```
This sets all environment variables and runs the simple script.
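The exact contents of `train_macos.sh` are not reproduced here, but the environment setup it performs presumably looks something like this (which variables it sets is an assumption) before it invokes `python scripts/run_train_simple.py`:

```bash
# Hypothetical sketch of the environment setup in train_macos.sh;
# the real script may set more or different variables.
export TOKENIZERS_PARALLELISM=false   # no Rust tokenizer thread pool
export OMP_NUM_THREADS=1              # single OpenMP thread
export MKL_NUM_THREADS=1              # single MKL thread
# ...followed by: python scripts/run_train_simple.py
```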
## If You Still Get Errors
### Option 1: Reduce to Tiny Dataset
```bash
python scripts/sample_dataset.py data/ai_vs_human_text.csv data/tiny.csv -n 100
# Then edit configs/default.yaml:
# data_path: data/tiny.csv
python scripts/run_train.py
```
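For reference, the sampling step amounts to a few lines of pandas. This is an illustrative sketch, not the actual `scripts/sample_dataset.py` (the function name and seed are made up):

```python
import pandas as pd

def sample_csv(src: str, dst: str, n: int = 100, seed: int = 42) -> int:
    """Write a random n-row subsample of src to dst; return rows written."""
    df = pd.read_csv(src)
    sampled = df.sample(n=min(n, len(df)), random_state=seed)  # reproducible draw
    sampled.to_csv(dst, index=False)
    return len(sampled)
```

Calling `sample_csv("data/ai_vs_human_text.csv", "data/tiny.csv", n=100)` would mirror the CLI invocation above.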
### Option 2: Run Outside venv
```bash
# Exit your virtualenv
deactivate
# Install system-wide
pip install --user -r requirements.txt
# Train
python scripts/run_train_simple.py
```
### Option 3: Use Colab/Cloud
If nothing works locally, train on Google Colab (free GPU):
- Upload your data to Google Drive
- Use the Colab notebook template
- Much faster training
## Key Differences
### Simple Script (`run_train_simple.py`)
- ✅ No Trainer API (no multiprocessing issues)
- ✅ Works on macOS
- ✅ Good for small-to-medium datasets
- ⚠️ Less efficient on large datasets
### Standard Script (`run_train.py`)
- Uses HuggingFace Trainer API
- ✅ Optimized for large datasets
- ⚠️ Multiprocessing issues on macOS
## Recommended Setup
1. **Dataset:** ✅ Downloaded (`data/ai_vs_human_text.csv`)
2. **Config:** ✅ Updated (`configs/default.yaml`)
3. **Training:** Use `run_train_simple.py`
## Start Training
```bash
python scripts/run_train_simple.py
```
You should see output like:
```
🚀 Starting training (simple mode - no multiprocessing)
============================================================
📖 Loading data from data/ai_vs_human_text.csv...
Loaded 1,000 samples
Distribution: {0: 493, 1: 507}
Train: 800 | Val: 200
🤖 Loading model: roberta-base...
📊 Creating datasets...
⚙️ Training for 2 epochs...
```
Good luck! 🎉