# πŸ“Š Dataset Size Guide for M2 Mac
## 🎯 Quick Recommendation
**Use 10k-50k samples** for the best balance of performance and training time.
## πŸ“ˆ Comparison Table
| Dataset Size | Training Time | Memory Usage | Best For | Recommendation |
|-------------|---------------|--------------|----------|----------------|
| **1k** | ~5-10 min | Low | Quick testing | ⚠️ Too small - high overfitting risk |
| **10k** | ~20-40 min | Medium | **Recommended start** | βœ… Good balance |
| **50k** | ~1-2 hours | Medium-High | **Best balance** | βœ… **RECOMMENDED** |
| **500k** | ~6-12 hours | High | Maximum performance | ⚠️ Only if you have time |
## πŸš€ Recommended Workflow
### Step 1: Start Small (1k-5k)
Test your pipeline quickly:
```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_5k.csv -n 5000
python scripts/run_train.py --config configs/m2_small.yaml --data data/dataset_5k.csv
```
**Time:** ~10 minutes
**Purpose:** Validate your setup works
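The sampling step above draws a class-balanced subset before training. The repo's `scripts/sample_dataset.py` is not shown here, but the core idea is stratified sampling; a minimal stdlib-only sketch (the `label` key and the toy rows are assumptions for illustration, not the script's actual interface):

```python
import random

def sample_balanced(rows, n, label_key="label", seed=42):
    """Draw ~n rows total, split evenly across labels (stratified sampling)."""
    random.seed(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    per_class = n // len(by_label)
    sample = []
    for group in by_label.values():
        # Never ask for more rows than a class actually has.
        sample.extend(random.sample(group, min(per_class, len(group))))
    random.shuffle(sample)
    return sample

# Toy data: 6 "human" rows and 6 "ai" rows; draw a balanced subset of 4.
rows = [{"text": f"t{i}", "label": "human" if i % 2 else "ai"} for i in range(12)]
subset = sample_balanced(rows, 4)
```

Stratifying per class (rather than sampling uniformly and hoping) guarantees the 50/50 balance that the detection task needs, even when the source dataset is skewed.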
### Step 2: Scale Up (10k-50k) ⭐ RECOMMENDED
Train your production model:
```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_50k.csv -n 50000
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_50k.csv
```
**Time:** ~1-2 hours
**Purpose:** Best performance/time trade-off
### Step 3: Full Dataset (Optional)
Only if you need maximum performance:
```bash
python scripts/run_train.py --config configs/m2_large.yaml --data data/your_500k_dataset.csv
```
**Time:** ~6-12 hours
**Purpose:** Maximum accuracy (marginal gains)
## πŸ’‘ Why 10k-50k is Best
1. **Sufficient Diversity**: Enough examples to learn patterns without overfitting
2. **Manageable Time**: 1-2 hours vs 6-12 hours for 500k
3. **Good Performance**: For AI text detection, 50k is usually enough
4. **Quick Iterations**: You can experiment with hyperparameters faster
## πŸ”§ M2 Mac Optimizations
Your configs are optimized for:
- **CPU/MPS training** (Apple silicon has no CUDA; PyTorch can fall back to the MPS backend)
- **Unified memory** (8-24GB typical)
- **Batch size tuning** (smaller batches for larger datasets)
- **Gradient accumulation** (simulates larger batches)
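Gradient accumulation keeps peak memory low by splitting one large batch into micro-batches and averaging their gradients before a single optimizer step. A framework-free numeric sketch of why this is equivalent to the full batch (the linear model and loss here are illustrative, not the repo's training code):

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y_hat = w * x, averaged over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Full-batch gradient in one pass (what we want, but memory-hungry at scale).
full = grad_mse(w, data)

# Same gradient from two equal-size micro-batches: average the micro-batch
# gradients, then take one optimizer step.
micro_batches = [data[:2], data[2:]]
accumulated = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)

assert abs(full - accumulated) < 1e-12  # identical update, smaller peak memory
```

Because the micro-batches are equal-sized, the average of their gradients equals the full-batch gradient exactly, so a batch size of 32 with 4 accumulation steps behaves like a batch of 128 without the memory cost.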
## πŸ“ Example Commands
```bash
# Sample 10k balanced samples
python scripts/sample_dataset.py data/large_dataset.csv data/dataset_10k.csv -n 10000
# Train with medium config
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_10k.csv
# Or use the full dataset
python scripts/run_train.py --config configs/m2_large.yaml --data data/large_dataset.csv
```
## ⚑ Performance Tips
1. **Start with 10k** - Validate everything works
2. **Scale to 50k** - Get good performance
3. **Only use 500k** if:
- You have 6+ hours to spare
- You need every last % of accuracy
- You're doing research/comparison
## πŸŽ“ For AI Text Detection Specifically
AI text detection typically needs:
- βœ… **Diverse AI models** (GPT-3, GPT-4, Claude, etc.)
- βœ… **Diverse human writing** (essays, stories, technical, casual)
- βœ… **Balanced classes** (50/50 or close)
**10k-50k samples** with good diversity will outperform **500k samples** with poor diversity.
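Before committing hours to a run, it is worth confirming the 50/50 split. A small sketch that checks class ratios, assuming your CSV has a `label` column (the column name and the toy CSV are assumptions about your data):

```python
import csv
import io
from collections import Counter

def label_ratios(csv_text, label_col="label"):
    """Return the fraction of rows carrying each label value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[label_col] for row in reader)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy CSV standing in for a file like data/dataset_10k.csv.
toy = "text,label\na,human\nb,ai\nc,human\nd,ai\n"
ratios = label_ratios(toy)
assert ratios == {"human": 0.5, "ai": 0.5}
```

For a real file, read it with `open(path)` and pass the handle to `csv.DictReader` directly; anything far from 50/50 is a sign to re-run the sampling step.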
## 🚨 When to Use Each Size
- **1k**: ❌ Don't use for production - too small
- **10k**: βœ… Good for initial training and testing
- **50k**: βœ… **BEST CHOICE** - production ready
- **500k**: ⚠️ Only if you have time and need maximum accuracy