# 📊 Dataset Size Guide for M2 Mac

## 🎯 Quick Recommendation

**Use 10k-50k samples** for the best balance of performance and training time.

## 📈 Comparison Table

| Dataset Size | Training Time | Memory Usage | Best For | Recommendation |
|--------------|---------------|--------------|----------|----------------|
| **1k** | ~5-10 min | Low | Quick testing | ⚠️ Too small - high overfitting risk |
| **10k** | ~20-40 min | Medium | Initial training | ✅ Good balance |
| **50k** | ~1-2 hours | Medium-High | Production models | ✅ **RECOMMENDED** |
| **500k** | ~6-12 hours | High | Maximum performance | ⚠️ Only if you have time |

## 🚀 Recommended Workflow

### Step 1: Start Small (1k-5k)

Test your pipeline quickly:

```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_5k.csv -n 5000
python scripts/run_train.py --config configs/m2_small.yaml --data data/dataset_5k.csv
```

**Time:** ~10 minutes
**Purpose:** Validate that your setup works

### Step 2: Scale Up (10k-50k) ⭐ RECOMMENDED

Train your production model:

```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_50k.csv -n 50000
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_50k.csv
```

**Time:** ~1-2 hours
**Purpose:** Best performance/time trade-off

### Step 3: Full Dataset (Optional)

Only if you need maximum performance:

```bash
python scripts/run_train.py --config configs/m2_large.yaml --data data/your_500k_dataset.csv
```

**Time:** ~6-12 hours
**Purpose:** Maximum accuracy (marginal gains)

## 💡 Why 10k-50k Is Best

1. **Sufficient diversity**: Enough examples to learn patterns without overfitting
2. **Manageable time**: 1-2 hours vs. 6-12 hours for 500k
3. **Good performance**: For AI text detection, 50k is usually enough
4. **Quick iterations**: You can experiment with hyperparameters faster

## 🔧 M2 Mac Optimizations

Your configs are optimized for:

- **CPU training** (the M2 has no CUDA support)
- **Unified memory** (8-24 GB typical)
- **Batch size tuning** (smaller batches for larger datasets)
- **Gradient accumulation** (simulates larger batches)

## 📝 Example Commands

```bash
# Sample 10k balanced examples
python scripts/sample_dataset.py data/large_dataset.csv data/dataset_10k.csv -n 10000

# Train with the medium config
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_10k.csv

# Or use the full dataset
python scripts/run_train.py --config configs/m2_large.yaml --data data/large_dataset.csv
```

## ⚡ Performance Tips

1. **Start with 10k** - validate that everything works
2. **Scale to 50k** - get good performance
3. **Only use 500k if**:
   - You have 6+ hours to spare
   - You need every last percent of accuracy
   - You're doing research or model comparison

## 🎓 For AI Text Detection Specifically

AI text detection typically needs:

- ✅ **Diverse AI models** (GPT-3, GPT-4, Claude, etc.)
- ✅ **Diverse human writing** (essays, stories, technical, casual)
- ✅ **Balanced classes** (50/50 or close)

**10k-50k samples** with good diversity will outperform **500k samples** with poor diversity.

## 🚨 When to Use Each Size

- **1k**: ❌ Don't use for production - too small
- **10k**: ✅ Good for initial training and testing
- **50k**: ✅ **BEST CHOICE** - production ready
- **500k**: ⚠️ Only if you have time and need maximum accuracy
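The `scripts/sample_dataset.py` script referenced throughout this guide is repo-specific and not shown here. As a rough sketch of the balanced sampling it is used for, the helper below draws an equal number of rows per class; the `label` column name and the `sample_balanced` function are assumptions for illustration, not the actual script:

```python
import random

def sample_balanced(rows, n, label_key="label", seed=0):
    """Draw roughly n rows total, split evenly across label classes.

    rows: list of dicts (e.g. from csv.DictReader); each has a label column.
    A fixed seed keeps the sample reproducible across runs.
    """
    rng = random.Random(seed)

    # Group rows by class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    # Take an equal share from each class (capped at class size).
    per_class = n // len(by_label)
    sample = []
    for _, group in sorted(by_label.items()):
        sample.extend(rng.sample(group, min(per_class, len(group))))

    rng.shuffle(sample)
    return sample
```

To mirror the CLI usage above, you would read the large CSV with `csv.DictReader`, pass the rows through this helper with `n=10000` or `n=50000`, and write the result back out with `csv.DictWriter`.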
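The M2 optimizations above mention gradient accumulation "simulating" larger batches. A framework-free sketch of why that works: for a mean-style loss, averaging the gradients of several equal-sized micro-batches gives the same result as one gradient over the combined batch. The 1-D linear model and function names below are illustrative, not from the repo's training code:

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for a 1-D linear model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch):
    """Average per-micro-batch gradients instead of one big-batch gradient.

    With equal-sized micro-batches this equals grad_mse over the full batch,
    which is why accumulation lets a memory-constrained M2 use small batches
    while training as if the batch were larger.
    """
    grads = []
    for i in range(0, len(xs), micro_batch):
        grads.append(grad_mse(w, xs[i:i + micro_batch], ys[i:i + micro_batch]))
    return sum(grads) / len(grads)
```

In a real training loop the same idea shows up as running backward on several small batches before a single optimizer step, with the loss scaled by the number of accumulation steps.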