# 🚀 Training Guide

## Problem
The mutex lock error `[mutex.cc : 452] RAW: Lock blocking...` happens because:
1. The HuggingFace Trainer API tries to use multiprocessing.
2. macOS handles the multiprocessing started by the tokenizers poorly.
3. Setting environment variables alone does not fully fix it.

## Solution

### ✅ BEST: Use the Simple Training Script (Recommended)

The simple training script avoids the Trainer API entirely:

```bash
python scripts/run_train_simple.py
```

**What it does:**
- ✅ No multiprocessing
- ✅ No threading issues
- ✅ Direct PyTorch training loop
- ✅ Works on macOS
- ✅ Same results as the Trainer API

**Output:**
- Trains for 2 epochs
- Shows progress with tqdm
- Saves model to `models/ai_detector`

### Alternative: Shell Script

```bash
bash train_macos.sh
```

This sets the required environment variables, then runs the simple script.
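
For reference, the same idea can be sketched in Python: set the thread-related variables before any ML library is imported. The variable list here (`TOKENIZERS_PARALLELISM`, `OMP_NUM_THREADS`, `MKL_NUM_THREADS`) is an assumption about what `train_macos.sh` sets, not a copy of it, and `configure_single_threaded_env` is a hypothetical helper name.

```python
import os

# Assumed set of variables; train_macos.sh may use a different list.
SINGLE_THREAD_ENV = {
    "TOKENIZERS_PARALLELISM": "false",  # turn off parallelism in HuggingFace tokenizers
    "OMP_NUM_THREADS": "1",             # limit OpenMP to one thread
    "MKL_NUM_THREADS": "1",             # limit Intel MKL to one thread
}


def configure_single_threaded_env(overrides=None):
    """Apply the variables to os.environ and return the values that were set."""
    settings = {**SINGLE_THREAD_ENV, **(overrides or {})}
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    # Call this before importing torch or transformers so they see the values.
    for name, value in configure_single_threaded_env().items():
        print(f"{name}={value}")
```

As noted under Problem, these variables may not be enough on their own, which is why the wrapper still launches the simple script rather than the Trainer-based one.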

## If You Still Get Errors

### Option 1: Reduce to Tiny Dataset
```bash
python scripts/sample_dataset.py data/ai_vs_human_text.csv data/tiny.csv -n 100
# Then edit configs/default.yaml:
#   data_path: data/tiny.csv
python scripts/run_train.py
```
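
If you are curious what the sampling step amounts to, here is a minimal sketch of drawing `n` random rows from a CSV while keeping the header. It is not the source of `scripts/sample_dataset.py`; the function name `sample_csv` and the fixed seed are assumptions made for illustration.

```python
import csv
import random


def sample_csv(src_path, dst_path, n, seed=42):
    """Write a random sample of up to n rows from src_path to dst_path.

    The header row is preserved. Returns the number of rows written.
    """
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = rng.sample(rows, min(n, len(rows)))

    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sampled)
    return len(sampled)
```

For example, `sample_csv("data/ai_vs_human_text.csv", "data/tiny.csv", 100)` would mirror the command above under these assumptions.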

### Option 2: Run Outside venv
```bash
# Exit your virtualenv
deactivate

# Install into your user site-packages (outside the venv)
pip install --user -r requirements.txt

# Train
python scripts/run_train_simple.py
```

### Option 3: Use Colab/Cloud
If nothing works locally, train on Google Colab (free GPU):
- Upload your data to Google Drive
- Use the Colab notebook template
- Expect much faster training on the GPU

## Key Differences

### Simple Script (`run_train_simple.py`)
- ✅ No Trainer API (no multiprocessing issues)
- ✅ Works on macOS
- ✅ Good for small-medium datasets
- ⚠️ Less efficient on large datasets

### Standard Script (`run_train.py`)
- Uses HuggingFace Trainer API
- ✅ Optimized for large datasets
- ⚠️ Multiprocessing issues on macOS
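
The choice between the two scripts can be automated with a small launcher. This is only a sketch: `pick_training_script` is a hypothetical helper, and it decides purely on the operating system, using the two script paths named in this guide.

```python
import platform


def pick_training_script(system_name=None):
    """Return the script path suggested by this guide for the given OS.

    platform.system() reports "Darwin" on macOS.
    """
    system_name = system_name or platform.system()
    if system_name == "Darwin":
        # macOS: avoid the Trainer API and its multiprocessing issues.
        return "scripts/run_train_simple.py"
    # Elsewhere: the Trainer-based script.
    return "scripts/run_train.py"


if __name__ == "__main__":
    print(f"Suggested script: {pick_training_script()}")
```

Dataset size is not considered here; for a small dataset the simple script is a reasonable choice on any platform.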

## Recommended Setup

1. **Dataset:** ✅ Downloaded (`data/ai_vs_human_text.csv`)
2. **Config:** ✅ Updated (`configs/default.yaml`)
3. **Training:** Use `run_train_simple.py`

## Start Training

```bash
python scripts/run_train_simple.py
```

You should see output like:
```
🚀 Starting training (simple mode - no multiprocessing)
============================================================

📖 Loading data from data/ai_vs_human_text.csv...
   Loaded 1,000 samples
   Distribution: {0: 493, 1: 507}
   Train: 800 | Val: 200

🤖 Loading model: roberta-base...

📊 Creating datasets...

⚙️  Training for 2 epochs...
```
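
The `Train: 800 | Val: 200` line corresponds to an 80/20 split of the 1,000 loaded samples. The sketch below reproduces that arithmetic with a seeded shuffle on synthetic labels. The helper `split_train_val`, the 0.2 validation fraction, and the seed are assumptions; the actual script may split differently (for example, with stratification).

```python
import random
from collections import Counter


def split_train_val(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) after a seeded shuffle."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(indices) * val_fraction)
    return indices[n_val:], indices[:n_val]


if __name__ == "__main__":
    # Synthetic labels with the same class counts as the sample log.
    labels = [0] * 493 + [1] * 507
    train_idx, val_idx = split_train_val(labels)
    print(f"Loaded {len(labels):,} samples")
    print(f"Distribution: {dict(sorted(Counter(labels).items()))}")
    print(f"Train: {len(train_idx)} | Val: {len(val_idx)}")
```

If your own numbers differ from the log, check the sample count first: the split sizes follow directly from it.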

Good luck! 🎉