Best Datasets for SFT Fine-Tuning - Verified Guide
Dataset Rankings (Quality → Model Performance)
#1: allenai/tulu-3-sft-mixture - BEST OVERALL
- Size: 939K examples from 19 curated sources
- Format: messages column (role/content) - ZERO PREPROCESSING
- Sources: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
- Reported results on Llama-3.1-8B (per the Tulu 3 SFT model card): MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
- Training Recipe: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
- Status: VALIDATED - column format confirmed via hf_inspect_dataset (loading sketch below)
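The messages format can be spot-checked in a few lines. A minimal sketch, assuming the Hugging Face `datasets` library; streaming avoids pulling all 939K rows:

```python
from datasets import load_dataset

# Stream one example from the Tulu 3 SFT mixture instead of downloading the full set.
ds = load_dataset("allenai/tulu-3-sft-mixture", split="train", streaming=True)
example = next(iter(ds))

# Each row carries a `messages` list of {"role", "content"} dicts,
# which TRL's SFTTrainer accepts without any preprocessing.
for message in example["messages"]:
    print(message["role"], "->", message["content"][:80])
```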
#2: open-thoughts/OpenThoughts-114k - REASONING (CoT)
- Size: 114K examples with DeepSeek-R1 reasoning traces
- Format: conversations column (from/value ShareGPT) - NEEDS CONVERSION
- Best For: Math, code, science with chain-of-thought
- Conversion: see train_openthoughts.py (a sketch of the idea follows this list)
- Training Recipe: LR=2e-4, batch=16, epochs=2, cosine schedule
- Status: VALIDATED - format confirmed, converter tested
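The converter itself lives in train_openthoughts.py; the sketch below only illustrates the ShareGPT-to-messages mapping, so the role aliases and the optional system column are assumptions to verify against the actual script and dataset.

```python
from datasets import load_dataset

# Common ShareGPT role aliases mapped to the role/content schema chat templates expect.
ROLE_MAP = {"system": "system", "human": "user", "user": "user",
            "gpt": "assistant", "assistant": "assistant"}

def to_messages(example):
    messages = []
    # Some releases keep a separate system prompt column; prepend it if present (assumption).
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    messages += [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return {"messages": messages}

ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
ds = ds.map(to_messages, remove_columns=ds.column_names)
```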
#3: HuggingFaceH4/ultrachat_200k - GENERAL CHAT
- Size: 208K multi-turn conversations
- Format: messages column - ZERO PREPROCESSING (use the train_sft split; loading snippet below)
- Best For: General conversational ability
- Training Recipe: LR=2e-4, batch=16, epochs=1
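A loading snippet, assuming only the `datasets` library; ultrachat_200k ships several splits and SFT uses train_sft (test_sft for evaluation):

```python
from datasets import load_dataset

# The dataset exposes train_sft/test_sft for SFT; separate *_gen splits also exist.
train = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
eval_set = load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
print(len(train), "training conversations; columns:", train.column_names)
```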
#4: mlabonne/FineTome-100k - COMPACT & CURATED
- Size: 100K quality-scored examples
- Format: conversations (ShareGPT) - NEEDS CONVERSION
- Best For: Quick fine-tune with curated quality (score-filter sketch below)
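A small sketch of pre-filtering by quality score before converting the ShareGPT conversations; it assumes the dataset exposes a numeric score column and uses an arbitrary 4.0 cutoff, so check the actual schema first.

```python
from datasets import load_dataset

ds = load_dataset("mlabonne/FineTome-100k", split="train")

# Keep only the highest-rated rows if a quality score column is present (assumption).
if "score" in ds.column_names:
    ds = ds.filter(lambda ex: ex["score"] >= 4.0)

print(len(ds), "examples retained; conversations still need ShareGPT -> messages conversion")
```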
#5: HuggingFaceH4/no_robots - HUMAN-WRITTEN
- Size: 9.5K examples (all human-written)
- Format: messages column - ZERO PREPROCESSING (chat-template sketch below)
- Best For: High-quality instruction following
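Because no_robots already uses the messages schema, it plugs straight into a tokenizer chat template. A sketch, with a small ungated model standing in for whatever base model is being tuned:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Any chat model with a template works here; Qwen2.5 is just an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
example = load_dataset("HuggingFaceH4/no_robots", split="train")[0]

# Render the conversation exactly as the trainer will serialize it.
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
print(text)
```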
How to Train
Full Training (Tulu 3, 940K) - A100 80GB, ~6h
python ai-ml/hf-finetuning/train_tulu3.py
Reasoning Training (OpenThoughts, 114K) - A100 80GB, ~2h
python ai-ml/hf-finetuning/train_openthoughts.py
Quick Test (100 steps) - Any GPU
python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push
LoRA Config (LoRA Without Regret - Schulman 2025)
| Parameter | Tulu 3 Recipe | OpenThoughts Recipe |
|---|---|---|
| lora_r | 256 | 256 |
| lora_alpha | 16 | 16 |
| target_modules | all-linear | all-linear |
| learning_rate | 5e-6 | 2e-4 |
| effective_batch | 128 | 16 |
| epochs | 2 | 2 |
| max_seq_length | 4096 | 4096 |
| lr_schedule | linear | cosine |
| packing | True (bfd_split) | True (bfd_split) |
| assistant_only_loss | True | True |
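The Tulu 3 column of the table translates roughly into the TRL + PEFT configuration below. This is a sketch, not the contents of train_tulu3.py: it assumes a recent TRL release where SFTConfig exposes max_length, packing_strategy, and assistant_only_loss, and the model name and per-device/accumulation split are illustrative.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=256,                        # lora_r
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="tulu3-sft-lora",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,   # 8 x 16 = 128 effective batch
    num_train_epochs=2,
    max_length=4096,                  # named max_seq_length in older TRL releases
    lr_scheduler_type="linear",
    packing=True,
    packing_strategy="bfd",           # best-fit-decreasing packing
    assistant_only_loss=True,         # needs a chat template with generation markers
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # illustrative; any causal LM checkpoint
    args=training_args,
    train_dataset=load_dataset("allenai/tulu-3-sft-mixture", split="train"),
    peft_config=peft_config,
)
trainer.train()
```

For the OpenThoughts column, swap learning_rate to 2e-4, the effective batch to 16, and lr_scheduler_type to cosine, per the table above.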
Key Research Sources
- Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
- LoRA Without Regret: Schulman et al., 2025
- Data quality > quantity: arXiv 2402.05123