# Training Guide

## Problem

The mutex lock error `[mutex.cc : 452] RAW: Lock blocking...` happens because:

1. The HuggingFace Trainer API tries to use multiprocessing.
2. Multiprocessing started from tokenizers does not work well on macOS.
3. Setting environment variables alone is not enough to fix it completely.

## Solution

### ✅ BEST: Use the Simple Training Script (Recommended)

The simple training script avoids the Trainer API entirely:

```bash
python scripts/run_train_simple.py
```

**What it does:**

- ✅ No multiprocessing
- ✅ No threading issues
- ✅ Direct PyTorch training loop
- ✅ Works on macOS
- ✅ Same results as Trainer API
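
As a rough illustration of what a direct PyTorch training loop without multiprocessing looks like, here is a minimal sketch. It is not the code in `run_train_simple.py`: the `train` function, the synthetic tensors, and the `nn.Linear` placeholder model are assumptions made for the example. The relevant detail is `num_workers=0`, which keeps data loading in the main process.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train(model, dataset, epochs=2, batch_size=16, lr=1e-2):
    """Train in the main process only and return the mean loss per epoch."""
    # num_workers=0 keeps data loading in the main process (no worker subprocesses).
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    epoch_losses = []
    model.train()
    for _ in range(epochs):
        total = 0.0
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
        epoch_losses.append(total / len(loader))
    return epoch_losses


# Synthetic stand-in data (assumption): 64 samples, 8 features, binary labels.
torch.manual_seed(0)
features = torch.randn(64, 8)
labels = (features[:, 0] > 0).long()
model = nn.Linear(8, 2)  # placeholder for the real classifier
losses = train(model, TensorDataset(features, labels))
print(losses)
```

A real run would swap the placeholder model and tensors for the tokenized dataset and the classifier, but the loop structure stays the same.
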

**Output:**

- Trains for 2 epochs
- Shows progress with tqdm
- Saves model to `models/ai_detector`

### Alternative: Shell Script

```bash
bash train_macos.sh
```

This sets all the environment variables and then runs the simple script.
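
For reference, the same idea expressed in Python might look like the sketch below. `TOKENIZERS_PARALLELISM`, `OMP_NUM_THREADS`, and `MKL_NUM_THREADS` are real settings, but whether `train_macos.sh` sets exactly these is an assumption, and `build_env` and `run_training` are hypothetical helpers.

```python
import os
import subprocess
import sys

# Assumed settings: the variable names are real, but the exact set used by
# train_macos.sh is a guess.
SINGLE_THREAD_ENV = {
    "TOKENIZERS_PARALLELISM": "false",  # disable parallelism in the tokenizers library
    "OMP_NUM_THREADS": "1",             # limit OpenMP to one thread
    "MKL_NUM_THREADS": "1",             # limit MKL to one thread
}


def build_env(base=None):
    """Return a copy of the environment with the single-thread settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(SINGLE_THREAD_ENV)
    return env


def run_training(script="scripts/run_train_simple.py"):
    """Run the training script in a child process that inherits the settings."""
    return subprocess.run([sys.executable, script], env=build_env(), check=True)
```

Calling `run_training()` from the project root would then start the simple script with these variables already in place before any library is imported.
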

## If You Still Get Errors

### Option 1: Reduce to Tiny Dataset

```bash
python scripts/sample_dataset.py data/ai_vs_human_text.csv data/tiny.csv -n 100
# Then edit configs/default.yaml:
#   data_path: data/tiny.csv
python scripts/run_train_simple.py
```
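
The internals of `scripts/sample_dataset.py` are not shown here. A script with that command line could work roughly like the following sketch, which assumes a CSV with a header row. `sample_csv` and the demo column names `text` and `label` are hypothetical.

```python
import csv
import random
import tempfile
from pathlib import Path


def sample_csv(src_path, dst_path, n, seed=0):
    """Write a random sample of up to n data rows (plus the header) to dst_path."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)
    sampled = random.Random(seed).sample(rows, min(n, len(rows)))
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(sampled)
    return len(sampled)


# Demo on a throwaway 500-row file so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "full.csv", Path(tmp) / "tiny.csv"
    with open(src, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows([f"sample {i}", i % 2] for i in range(500))
    written = sample_csv(src, dst, n=100)
    with open(dst, newline="", encoding="utf-8") as f:
        out_rows = list(csv.reader(f))

print(written, len(out_rows))
```

Fixing the seed makes the tiny dataset reproducible, which helps when comparing runs while debugging.
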

### Option 2: Run Outside venv

```bash
# Exit your virtualenv
deactivate
# Install into your user site-packages (outside the venv)
pip install --user -r requirements.txt
# Train
python scripts/run_train_simple.py
```

### Option 3: Use Colab/Cloud

If nothing works locally, train on Google Colab (free GPU):

- Upload your data to Google Drive
- Use the Colab notebook template

Training on a GPU is also much faster.

## Key Differences

### Simple Script (`run_train_simple.py`)

- ✅ No Trainer API (no multiprocessing issues)
- ✅ Works on macOS
- ✅ Good for small to medium-sized datasets
- ⚠️ Less efficient on large datasets

### Standard Script (`run_train.py`)

- Uses HuggingFace Trainer API
- ✅ Optimized for large datasets
- ⚠️ Multiprocessing issues on macOS

## Recommended Setup

1. **Dataset:** ✅ Downloaded (`data/ai_vs_human_text.csv`)
2. **Config:** ✅ Updated (`configs/default.yaml`)
3. **Training:** Use `run_train_simple.py`

## Start Training

```bash
python scripts/run_train_simple.py
```

You should see output like:

```
Starting training (simple mode - no multiprocessing)
============================================================
Loading data from data/ai_vs_human_text.csv...
Loaded 1,000 samples
Distribution: {0: 493, 1: 507}
Train: 800 | Val: 200
Loading model: roberta-base...
Creating datasets...
Training for 2 epochs...
```
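
The counts in this output are consistent with an 80/20 train/validation split of 1,000 rows. The sketch below reproduces those numbers. It assumes a plain shuffled split, which may not be what the script does (it could, for instance, stratify by label). `split_train_val` and the placeholder rows are hypothetical.

```python
import random
from collections import Counter


def split_train_val(rows, val_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, val) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder rows (assumption) mirroring the label counts in the sample output.
rows = [("text", 0)] * 493 + [("text", 1)] * 507
distribution = dict(sorted(Counter(label for _, label in rows).items()))
train_rows, val_rows = split_train_val(rows)

print(f"Loaded {len(rows):,} samples")
print(f"Distribution: {distribution}")
print(f"Train: {len(train_rows)} | Val: {len(val_rows)}")
```

If your own run reports very different label counts, check the class balance of the CSV before training.
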

Good luck!