# Dataset Size Guide for M2 Mac

## Quick Recommendation

Use **10k-50k samples** for the best balance of performance and training time.

## Comparison Table
| Dataset Size | Training Time | Memory Usage | Best For | Recommendation |
|---|---|---|---|---|
| 1k | ~5-10 min | Low | Quick testing | Too small - high overfitting risk |
| 10k | ~20-40 min | Medium | Recommended start | Good balance |
| 50k | ~1-2 hours | Medium-High | Best balance | RECOMMENDED |
| 500k | ~6-12 hours | High | Maximum performance | Only if you have time |
## Recommended Workflow

### Step 1: Start Small (1k-5k)
Test your pipeline quickly:
```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_5k.csv -n 5000
python scripts/run_train.py --config configs/m2_small.yaml --data data/dataset_5k.csv
```

- **Time:** ~10 minutes
- **Purpose:** Validate that your setup works
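The commands in this guide assume a helper script at `scripts/sample_dataset.py`. Its actual implementation isn't shown here, but a minimal balanced-sampling sketch could look like the following (the `label` column name and the CSV layout are assumptions about your data):

```python
# Hypothetical sketch of a balanced CSV sampler (not the actual scripts/sample_dataset.py).
import argparse

import pandas as pd


def main():
    parser = argparse.ArgumentParser(description="Sample a balanced subset of a labeled CSV.")
    parser.add_argument("input_csv", help="Path to the full dataset")
    parser.add_argument("output_csv", help="Where to write the sampled subset")
    parser.add_argument("-n", "--num-samples", type=int, default=10000)
    parser.add_argument("--label-column", default="label", help="Assumed name of the class column")
    args = parser.parse_args()

    df = pd.read_csv(args.input_csv)

    # Draw an equal number of rows from each class so the subset stays balanced.
    per_class = args.num_samples // df[args.label_column].nunique()
    sampled = df.groupby(args.label_column, group_keys=False).apply(
        lambda g: g.sample(min(len(g), per_class), random_state=42)
    )

    sampled.to_csv(args.output_csv, index=False)
    print(f"Wrote {len(sampled)} rows to {args.output_csv}")


if __name__ == "__main__":
    main()
```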
### Step 2: Scale Up (10k-50k) - RECOMMENDED
Train your production model:
```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_50k.csv -n 50000
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_50k.csv
```

- **Time:** ~1-2 hours
- **Purpose:** Best performance/time trade-off
### Step 3: Full Dataset (Optional)
Only if you need maximum performance:
```bash
python scripts/run_train.py --config configs/m2_large.yaml --data data/your_500k_dataset.csv
```

- **Time:** ~6-12 hours
- **Purpose:** Maximum accuracy (marginal gains over 50k)
## Why 10k-50k is Best
- Sufficient Diversity: Enough examples to learn patterns without overfitting
- Manageable Time: 1-2 hours vs 6-12 hours for 500k
- Good Performance: For AI text detection, 50k is usually enough
- Quick Iterations: You can experiment with hyperparameters faster
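As a concrete example of quick iteration, you could sweep a few subset sizes with the same CLI used elsewhere in this guide. This is only a rough sketch; it assumes the scripts accept exactly the flags shown in the steps above:

```python
# Rough sketch: compare a few subset sizes using the CLI from this guide.
import subprocess

for n in (5000, 10000, 50000):
    subset = f"data/dataset_{n}.csv"

    # Sample a balanced subset of the requested size.
    subprocess.run(
        ["python", "scripts/sample_dataset.py",
         "data/your_500k_dataset.csv", subset, "-n", str(n)],
        check=True,
    )

    # Train on that subset with the medium M2 config.
    subprocess.run(
        ["python", "scripts/run_train.py",
         "--config", "configs/m2_medium.yaml", "--data", subset],
        check=True,
    )
```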
## M2 Mac Optimizations
Your configs are optimized for:
- CPU training (the M2 has no CUDA-capable GPU)
- Unified memory (8-24GB typical)
- Batch size tuning (smaller batches for larger datasets)
- Gradient accumulation (simulates larger batches)
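As an illustration of the batch-size and gradient-accumulation points above, here is roughly what that looks like in a PyTorch training loop on Apple Silicon. The model, data, and accumulation factor below are placeholders, not the contents of the actual training scripts or configs:

```python
# Illustrative sketch of gradient accumulation on an M2 (placeholder model and data).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Prefer the Metal (MPS) backend if this PyTorch build supports it, otherwise CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Tiny dummy model and dataset just to make the loop runnable.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).to(device)
data = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=8)   # small per-step batch to keep memory low
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4                    # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    inputs, labels = inputs.to(device), labels.to(device)
    loss = loss_fn(model(inputs), labels)

    # Scale the loss so the accumulated gradients match one large batch.
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```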
## Example Commands

```bash
# Sample 10k balanced samples
python scripts/sample_dataset.py data/large_dataset.csv data/dataset_10k.csv -n 10000

# Train with the medium config
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_10k.csv

# Or use the full dataset
python scripts/run_train.py --config configs/m2_large.yaml --data data/large_dataset.csv
```
## Performance Tips

- **Start with 10k** - Validate that everything works
- **Scale to 50k** - Get good performance
- **Only use 500k** if:
  - You have 6+ hours to spare
  - You need every last % of accuracy
  - You're doing research/comparison
## For AI Text Detection Specifically
AI text detection typically needs:
- Diverse AI models (GPT-3, GPT-4, Claude, etc.)
- Diverse human writing (essays, stories, technical, casual)
- Balanced classes (50/50 or close)
10k-50k samples with good diversity will outperform 500k samples with poor diversity.
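Before sampling, it can help to verify how balanced and diverse your source data actually is. A quick check might look like the following (the `label` and `source` column names are assumptions about your CSV):

```python
# Quick sanity check of class balance and source diversity (column names assumed).
import pandas as pd

df = pd.read_csv("data/your_500k_dataset.csv")

# Class balance: ideally close to 50/50 human vs. AI.
print(df["label"].value_counts(normalize=True))

# Diversity: how many distinct generators / human sources are represented?
if "source" in df.columns:
    print(df["source"].value_counts())
```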
## When to Use Each Size
- **1k**: Don't use for production - too small
- **10k**: Good for initial training and testing
- **50k**: BEST CHOICE - production ready
- **500k**: Only if you have time and need maximum accuracy