# 📊 Dataset Size Guide for M2 Mac
## 🎯 Quick Recommendation
**Use 10k-50k samples** for the best balance of performance and training time.
## 📊 Comparison Table
| Dataset Size | Training Time | Memory Usage | Best For | Recommendation |
|-------------|---------------|--------------|----------|----------------|
| **1k** | ~5-10 min | Low | Quick testing | ⚠️ Too small - high overfitting risk |
| **10k** | ~20-40 min | Medium | **Recommended start** | ✅ Good balance |
| **50k** | ~1-2 hours | Medium-High | **Best balance** | ✅ **RECOMMENDED** |
| **500k** | ~6-12 hours | High | Maximum performance | ⚠️ Only if you have time |
## 🚀 Recommended Workflow
### Step 1: Start Small (1k-5k)
Test your pipeline quickly:
```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_5k.csv -n 5000
python scripts/run_train.py --config configs/m2_small.yaml --data data/dataset_5k.csv
```
**Time:** ~10 minutes
**Purpose:** Validate your setup works
### Step 2: Scale Up (10k-50k) ✅ RECOMMENDED
Train your production model:
```bash
python scripts/sample_dataset.py data/your_500k_dataset.csv data/dataset_50k.csv -n 50000
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_50k.csv
```
**Time:** ~1-2 hours
**Purpose:** Best performance/time trade-off
### Step 3: Full Dataset (Optional)
Only if you need maximum performance:
```bash
python scripts/run_train.py --config configs/m2_large.yaml --data data/your_500k_dataset.csv
```
**Time:** ~6-12 hours
**Purpose:** Maximum accuracy (marginal gains)
## 💡 Why 10k-50k Is Best
1. **Sufficient Diversity**: Enough examples to learn patterns without overfitting
2. **Manageable Time**: 1-2 hours vs 6-12 hours for 500k
3. **Good Performance**: For AI text detection, 50k is usually enough
4. **Quick Iterations**: You can experiment with hyperparameters faster
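The time trade-off above can be sketched with a back-of-envelope estimate. The throughput (~25 samples/sec) and epoch count (3) below are illustrative assumptions, not measured numbers; real runs also include fixed startup cost, so actual times (like those in the table) scale somewhat sublinearly:

```python
# Rough training-time estimate, assuming a hypothetical CPU throughput
# of ~25 samples/sec on an M2 and 3 epochs. Illustrative only.
def estimate_hours(n_samples, samples_per_sec=25, epochs=3):
    return n_samples * epochs / samples_per_sec / 3600

for n in (10_000, 50_000, 500_000):
    print(f"{n:>7} samples: ~{estimate_hours(n):.1f} h")
```

Under these assumptions, 50k lands in the ~1-2 hour range while 500k costs an order of magnitude more wall-clock time for marginal accuracy gains.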
## 🔧 M2 Mac Optimizations
Your configs are optimized for:
- **CPU training** (M2 doesn't have CUDA)
- **Unified memory** (8-24GB typical)
- **Batch size tuning** (smaller batches for larger datasets)
- **Gradient accumulation** (simulates larger batches)
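To see why gradient accumulation "simulates" a larger batch, here is a minimal pure-Python sketch (a 1-D least-squares model, not the project's actual training loop): summing gradients over micro-batches and stepping once matches a single full-batch step on a sum-reduced loss.

```python
# Toy data for y = 2x, starting weight w0 = 0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
lr, w0 = 0.01, 0.0

def grad(w, xb, yb):
    # d/dw of sum((w*x - y)^2) over one micro-batch
    return sum(2 * (w * x - y) * x for x, y in zip(xb, yb))

# One step on the full batch of 4
w_full = w0 - lr * grad(w0, xs, ys)

# Accumulate gradients over micro-batches of 2, then step once
acc = 0.0
for i in range(0, len(xs), 2):
    acc += grad(w0, xs[i:i+2], ys[i:i+2])
w_accum = w0 - lr * acc

print(w_full, w_accum)  # the two updates are identical
```

With a mean-reduced loss (the common framework default) you would divide the accumulated gradient by the number of micro-batches, but the equivalence is the same.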
## 📝 Example Commands
```bash
# Sample 10k balanced samples
python scripts/sample_dataset.py data/large_dataset.csv data/dataset_10k.csv -n 10000
# Train with medium config
python scripts/run_train.py --config configs/m2_medium.yaml --data data/dataset_10k.csv
# Or use the full dataset
python scripts/run_train.py --config configs/m2_large.yaml --data data/large_dataset.csv
```
## ⚡ Performance Tips
1. **Start with 10k** - Validate everything works
2. **Scale to 50k** - Get good performance
3. **Only use 500k** if:
- You have 6+ hours to spare
- You need every last % of accuracy
- You're doing research/comparison
## 🎯 For AI Text Detection Specifically
AI text detection typically needs:
- ✅ **Diverse AI models** (GPT-3, GPT-4, Claude, etc.)
- ✅ **Diverse human writing** (essays, stories, technical, casual)
- ✅ **Balanced classes** (50/50 or close)
**10k-50k samples** with good diversity will outperform **500k samples** with poor diversity.
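The balanced-classes point is what a sampler like `scripts/sample_dataset.py` is for. That script's actual implementation isn't shown here, so the following is a hypothetical stdlib-only sketch of the idea: draw `n / num_classes` rows per label so the output stays balanced regardless of the source distribution.

```python
# Hypothetical balanced sampler: names and behavior are illustrative,
# not the real scripts/sample_dataset.py.
import random

def balanced_sample(rows, n, label_key="label", seed=0):
    rng = random.Random(seed)          # fixed seed for reproducibility
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    per_class = n // len(by_label)     # equal share per class
    out = []
    for group in by_label.values():
        out.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(out)
    return out

# 100 toy rows, alternating human (0) / AI (1) labels
data = [{"text": f"t{i}", "label": i % 2} for i in range(100)]
sample = balanced_sample(data, 10)
print(sum(r["label"] for r in sample))  # 5 → exactly half AI-labeled
```

For a real CSV you would load rows with `csv.DictReader` first; the balancing logic is the same.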
## 🚨 When to Use Each Size
- **1k**: ❌ Don't use for production - too small
- **10k**: ✅ Good for initial training and testing
- **50k**: ✅ **BEST CHOICE** - production ready
- **500k**: β οΈ Only if you have time and need maximum accuracy