AITextDetector / DATASET_SIZE_GUIDE.md
ChauHPham's picture
Upload folder using huggingface_hub
25faba3 verified

A newer version of the Gradio SDK is available: 6.2.0

Upgrade

πŸ“Š Dataset Size Guide for M2 Mac

🎯 Quick Recommendation

Use 10k-50k samples for the best balance of performance and training time.

πŸ“ˆ Comparison Table

Dataset Size Training Time Memory Usage Best For Recommendation
1k ~5-10 min Low Quick testing ⚠️ Too small - high overfitting risk
10k ~20-40 min Medium Recommended start βœ… Good balance
50k ~1-2 hours Medium-High Best balance βœ… RECOMMENDED
500k ~6-12 hours High Maximum performance ⚠️ Only if you have time

πŸš€ Recommended Workflow

Step 1: Start Small (1k-5k)

Test your pipeline quickly:

python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_5k.csv -n 5000
python scripts/run_train.py --config configs/m2_small.yaml --data data/dataset_5k.csv

Time: ~10 minutes
Purpose: Validate your setup works

Step 2: Scale Up (10k-50k) ⭐ RECOMMENDED

Train your production model:

python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_50k.csv -n 50000
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_50k.csv

Time: ~1-2 hours
Purpose: Best performance/time trade-off

Step 3: Full Dataset (Optional)

Only if you need maximum performance:

python scripts/run_train.py --config configs/m2_large.yaml --data data/your_500k_dataset.csv

Time: ~6-12 hours
Purpose: Maximum accuracy (marginal gains)

πŸ’‘ Why 10k-50k is Best

  1. Sufficient Diversity: Enough examples to learn patterns without overfitting
  2. Manageable Time: 1-2 hours vs 6-12 hours for 500k
  3. Good Performance: For AI text detection, 50k is usually enough
  4. Quick Iterations: You can experiment with hyperparameters faster

πŸ”§ M2 Mac Optimizations

Your configs are optimized for:

  • CPU training (M2 doesn't have CUDA)
  • Unified memory (8-24GB typical)
  • Batch size tuning (smaller batches for larger datasets)
  • Gradient accumulation (simulates larger batches)

πŸ“ Example Commands

# Sample 10k balanced samples
python scripts/sample_dataset.py data/large_dataset.csv data/dataset_10k.csv -n 10000

# Train with medium config
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_10k.csv

# Or use the full dataset
python scripts/run_train.py --config configs/m2_large.yaml --data data/large_dataset.csv

⚑ Performance Tips

  1. Start with 10k - Validate everything works
  2. Scale to 50k - Get good performance
  3. Only use 500k if:
    • You have 6+ hours to spare
    • You need every last % of accuracy
    • You're doing research/comparison

πŸŽ“ For AI Text Detection Specifically

AI text detection typically needs:

  • βœ… Diverse AI models (GPT-3, GPT-4, Claude, etc.)
  • βœ… Diverse human writing (essays, stories, technical, casual)
  • βœ… Balanced classes (50/50 or close)

10k-50k samples with good diversity will outperform 500k samples with poor diversity.

🚨 When to Use Each Size

  • 1k: ❌ Don't use for production - too small
  • 10k: βœ… Good for initial training and testing
  • 50k: βœ… BEST CHOICE - production ready
  • 500k: ⚠️ Only if you have time and need maximum accuracy