# πŸ“Š Dataset Size Guide for M2 Mac
## 🎯 Quick Recommendation
**Use 10k-50k samples** for the best balance of performance and training time.
## πŸ“ˆ Comparison Table
| Dataset Size | Training Time | Memory Usage | Best For | Recommendation |
|-------------|---------------|--------------|----------|----------------|
| **1k** | ~5-10 min | Low | Quick testing | ⚠️ Too small - high overfitting risk |
| **10k** | ~20-40 min | Medium | **Recommended start** | βœ… Good balance |
| **50k** | ~1-2 hours | Medium-High | **Best balance** | βœ… **RECOMMENDED** |
| **500k** | ~6-12 hours | High | Maximum performance | ⚠️ Only if you have time |
## πŸš€ Recommended Workflow
### Step 1: Start Small (1k-5k)
Test your pipeline quickly:
```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_5k.csv -n 5000
python scripts/run_train.py --config configs/m2_small.yaml --data data/dataset_5k.csv
```
**Time:** ~10 minutes
**Purpose:** Validate your setup works
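The sampling step above draws a class-balanced subset before training. The repo's `scripts/sample_dataset.py` is not shown here, but the core idea is stratified sampling; a minimal stdlib-only sketch (the `label` key and the toy rows are assumptions for illustration, not the script's actual interface):

```python
import random

def sample_balanced(rows, n, label_key="label", seed=42):
    """Draw ~n rows total, split evenly across labels (stratified sampling)."""
    random.seed(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    per_class = n // len(by_label)
    sample = []
    for group in by_label.values():
        # Never ask for more rows than a class actually has.
        sample.extend(random.sample(group, min(per_class, len(group))))
    random.shuffle(sample)
    return sample

# Toy data: 6 "human" rows and 6 "ai" rows; draw a balanced subset of 4.
rows = [{"text": f"t{i}", "label": "human" if i % 2 else "ai"} for i in range(12)]
subset = sample_balanced(rows, 4)
```

Stratifying per class (rather than sampling uniformly and hoping) guarantees the 50/50 balance that the detection task needs, even when the source dataset is skewed.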
### Step 2: Scale Up (10k-50k) ⭐ RECOMMENDED
Train your production model:
```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_50k.csv -n 50000
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_50k.csv
```
**Time:** ~1-2 hours
**Purpose:** Best performance/time trade-off
### Step 3: Full Dataset (Optional)
Only if you need maximum performance:
```bash
python scripts/run_train.py --config configs/m2_large.yaml --data data/your_500k_dataset.csv
```
**Time:** ~6-12 hours
**Purpose:** Maximum accuracy (marginal gains)
## πŸ’‘ Why 10k-50k is Best
1. **Sufficient Diversity**: Enough examples to learn patterns without overfitting
2. **Manageable Time**: 1-2 hours vs 6-12 hours for 500k
3. **Good Performance**: For AI text detection, 50k is usually enough
4. **Quick Iterations**: You can experiment with hyperparameters faster
## πŸ”§ M2 Mac Optimizations
Your configs are optimized for:
- **CPU/MPS training** (Apple silicon has no CUDA; PyTorch can fall back to the MPS backend)
- **Unified memory** (8-24GB typical)
- **Batch size tuning** (smaller batches for larger datasets)
- **Gradient accumulation** (simulates larger batches)
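Gradient accumulation keeps peak memory low by splitting one large batch into micro-batches and averaging their gradients before a single optimizer step. A framework-free numeric sketch of why this is equivalent to the full batch (the linear model and loss here are illustrative, not the repo's training code):

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y_hat = w * x, averaged over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Full-batch gradient in one pass (what we want, but memory-hungry at scale).
full = grad_mse(w, data)

# Same gradient from two equal-size micro-batches: average the micro-batch
# gradients, then take one optimizer step.
micro_batches = [data[:2], data[2:]]
accumulated = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)

assert abs(full - accumulated) < 1e-12  # identical update, smaller peak memory
```

Because the micro-batches are equal-sized, the average of their gradients equals the full-batch gradient exactly, so a batch size of 32 with 4 accumulation steps behaves like a batch of 128 without the memory cost.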
## πŸ“ Example Commands
```bash
# Sample 10k balanced samples
python scripts/sample_dataset.py data/large_dataset.csv data/dataset_10k.csv -n 10000
# Train with medium config
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_10k.csv
# Or use the full dataset
python scripts/run_train.py --config configs/m2_large.yaml --data data/large_dataset.csv
```
## ⚑ Performance Tips
1. **Start with 10k** - Validate everything works
2. **Scale to 50k** - Get good performance
3. **Only use 500k** if:
- You have 6+ hours to spare
- You need every last % of accuracy
- You're doing research/comparison
## πŸŽ“ For AI Text Detection Specifically
AI text detection typically needs:
- βœ… **Diverse AI models** (GPT-3, GPT-4, Claude, etc.)
- βœ… **Diverse human writing** (essays, stories, technical, casual)
- βœ… **Balanced classes** (50/50 or close)
**10k-50k samples** with good diversity will outperform **500k samples** with poor diversity.
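Before committing hours to a run, it is worth confirming the 50/50 split. A small sketch that checks class ratios, assuming your CSV has a `label` column (the column name and the toy CSV are assumptions about your data):

```python
import csv
import io
from collections import Counter

def label_ratios(csv_text, label_col="label"):
    """Return the fraction of rows carrying each label value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[label_col] for row in reader)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy CSV standing in for a file like data/dataset_10k.csv.
toy = "text,label\na,human\nb,ai\nc,human\nd,ai\n"
ratios = label_ratios(toy)
assert ratios == {"human": 0.5, "ai": 0.5}
```

For a real file, read it with `open(path)` and pass the handle to `csv.DictReader` directly; anything far from 50/50 is a sign to re-run the sampling step.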
## 🚨 When to Use Each Size
- **1k**: ❌ Don't use for production - too small
- **10k**: βœ… Good for initial training and testing
- **50k**: βœ… **BEST CHOICE** - production ready
- **500k**: ⚠️ Only if you have time and need maximum accuracy