📊 Dataset Size Guide for M2 Mac

🎯 Quick Recommendation

Use 10k-50k samples for the best balance of performance and training time.

📈 Comparison Table

Dataset Size	Training Time	Memory Usage	Best For	Recommendation
1k	~5-10 min	Low	Quick testing	⚠️ Too small - high overfitting risk
10k	~20-40 min	Medium	Recommended start	✅ Good balance
50k	~1-2 hours	Medium-High	Best balance	✅ RECOMMENDED
500k	~6-12 hours	High	Maximum performance	⚠️ Only if you have time

🚀 Recommended Workflow

Step 1: Start Small (1k-5k)

Test your pipeline quickly:

python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_5k.csv -n 5000
python scripts/run_train.py --config configs/m2_small.yaml --data data/dataset_5k.csv

Time: ~10 minutes
Purpose: Validate your setup works

Step 2: Scale Up (10k-50k) ⭐ RECOMMENDED

Train your production model:

python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_50k.csv -n 50000
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_50k.csv

Time: ~1-2 hours
Purpose: Best performance/time trade-off

Step 3: Full Dataset (Optional)

Only if you need maximum performance:

python scripts/run_train.py --config configs/m2_large.yaml --data data/your_500k_dataset.csv

Time: ~6-12 hours
Purpose: Maximum accuracy (gains over the 50k model are usually marginal)
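
The three steps above pair a dataset size with a config file. A minimal sketch of that pairing follows, in case you want to script it. The function names `pick_config` and `train_command` are hypothetical, and the 5,000 and 50,000 cut-offs are assumptions read off the step headings, not values taken from the repository.

```python
def pick_config(n_samples):
    """Map a sample count to the config used at that scale in the steps above."""
    # Cut-offs are assumptions based on the step headings (1k-5k, 10k-50k, full).
    if n_samples <= 5_000:
        return "configs/m2_small.yaml"
    if n_samples <= 50_000:
        return "configs/m2_medium.yaml"
    return "configs/m2_large.yaml"


def train_command(data_path, n_samples):
    """Build the argument list for the training script shown in this guide."""
    return [
        "python", "scripts/run_train.py",
        "--config", pick_config(n_samples),
        "--data", data_path,
    ]
```

The returned list can be passed to `subprocess.run` if you prefer to launch training from Python rather than from the shell.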

💡 Why 10k-50k is Best

Sufficient Diversity: Enough examples to learn patterns without overfitting
Manageable Time: 1-2 hours vs 6-12 hours for 500k
Good Performance: For AI text detection, 50k is usually enough
Quick Iterations: You can experiment with hyperparameters faster

🔧 M2 Mac Optimizations

Your configs are optimized for:

CPU training (Apple Silicon has no CUDA support)
Unified memory (8-24GB typical)
Batch size tuning (smaller batches for larger datasets)
Gradient accumulation (simulates larger batches)
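
To see why gradient accumulation "simulates larger batches", here is a toy sketch in plain Python. It is not the project's training code: the one-parameter model, the data, and the function names are all assumptions made for illustration. It shows that averaging the gradients of several equal-sized micro-batches gives the same gradient as one large batch, so only a micro-batch has to fit in memory at a time.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average gradient over equal-sized micro-batches, as accumulation does."""
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for batch in micro_batches:
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        total += grad_mse(w, batch) / len(micro_batches)
    return total


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 toy samples of y = 3x
full = grad_mse(0.0, data)                          # one batch of 8
accumulated = accumulated_grad(0.0, data, micro_batch_size=2)  # 4 batches of 2

# Effective batch size = micro-batch size * number of accumulation steps.
effective_batch_size = 2 * (len(data) // 2)
```

The equivalence is exact only when the micro-batches are the same size; a smaller final batch would need to be weighted by its length.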

📝 Example Commands

# Sample 10k balanced samples
python scripts/sample_dataset.py data/large_dataset.csv data/dataset_10k.csv -n 10000

# Train with medium config
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_10k.csv

# Or use the full dataset
python scripts/run_train.py --config configs/m2_large.yaml --data data/large_dataset.csv
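
The first command above asks for a balanced sample. As a rough illustration of what balanced sampling means, the sketch below draws an equal number of rows from each class using only the standard library. It is not the contents of `scripts/sample_dataset.py`; the function names and the `label` column name are assumptions.

```python
import csv
import random
from collections import defaultdict


def sample_balanced(rows, label_key, n, seed=42):
    """Return about n rows with an equal number drawn from each label."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    if not by_label:
        return []

    per_class = n // len(by_label)
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Cap at the group size so a rare class cannot raise ValueError.
        sampled.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(sampled)
    return sampled


def sample_csv(src_path, dst_path, n, label_key="label"):
    """Read a CSV, write a balanced subset, and return the rows written."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)
    subset = sample_balanced(rows, label_key, n)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(subset)
    return len(subset)
```

This version loads the whole file into memory, which is fine for a quick check but may be slow on a 500k-row dataset with long texts.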

⚡ Performance Tips

Start with 10k - Validate everything works
Scale to 50k - Get good performance
Only use 500k if:
- You have 6+ hours to spare
- You need every last % of accuracy
- You're doing research/comparison

🎓 For AI Text Detection Specifically

AI text detection typically needs:

✅ Text from diverse AI models (GPT-3, GPT-4, Claude, etc.)
✅ Diverse human writing (essays, stories, technical, casual)
✅ Balanced classes (50/50 or close)

10k-50k samples with good diversity will often outperform 500k samples with poor diversity.
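
Before training, it can help to confirm that the classes are in fact close to 50/50. The sketch below is one way to do that check on a list of labels. The function names are hypothetical, and the 0.05 tolerance is an assumed reading of "or close", not a threshold from this project.

```python
from collections import Counter


def class_balance(labels):
    """Return each label's share of the dataset as a fraction of 1.0."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def is_balanced(labels, tolerance=0.05):
    """True if every class share is within `tolerance` of an even split."""
    shares = class_balance(labels)
    if not shares:
        return False
    even = 1 / len(shares)
    return all(abs(share - even) <= tolerance for share in shares.values())
```

For example, a 52/48 split passes with the default tolerance, while an 80/20 split does not.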

🚨 When to Use Each Size

1k: ❌ Don't use for production - too small
10k: ✅ Good for initial training and testing
50k: ✅ BEST CHOICE - production ready
500k: ⚠️ Only if you have time and need maximum accuracy