# Best Datasets for SFT Fine-Tuning – Verified Guide
|
|
## Dataset Rankings (Quality → Model Performance)
|
|
### #1: allenai/tulu-3-sft-mixture – THE BEST
- **Size**: 939K examples from 19 curated sources
- **Format**: messages column (role/content) - ZERO PREPROCESSING
- **Sources**: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
- **Proven Results on Llama-3.1-8B**: MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
- **Training Recipe**: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
- **Status**: VALIDATED - column format confirmed via hf_inspect_dataset
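To make "zero preprocessing" concrete, here is a minimal sketch of the role/content schema the messages column follows (the sample row and the `validate_messages` helper are illustrative, not from the dataset or the repo). Because each row is already a list of role/content turns, it can be fed to a chat template or TRL's SFT pipeline as-is:

```python
# Sketch of the role/content chat schema used by the `messages` column.
# The sample row below is made up for illustration.

def validate_messages(messages):
    """Return True if a row matches the role/content chat schema."""
    allowed_roles = {"system", "user", "assistant"}
    return (
        isinstance(messages, list)
        and len(messages) > 0
        and all(
            m.get("role") in allowed_roles and isinstance(m.get("content"), str)
            for m in messages
        )
    )

example_row = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

assert validate_messages(example_row)
```

ShareGPT-style datasets (#2 and #4 below) fail this check, which is why they need a conversion step first.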
|
|
### #2: open-thoughts/OpenThoughts-114k – REASONING CoT
- **Size**: 114K examples with DeepSeek-R1 reasoning traces
- **Format**: conversations column (from/value ShareGPT) - NEEDS CONVERSION
- **Best For**: Math, code, science with chain-of-thought
- **Conversion**: See train_openthoughts.py
- **Training Recipe**: LR=2e-4, batch=16, epochs=2, cosine schedule
- **Status**: VALIDATED - format confirmed, converter tested
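The conversion from ShareGPT from/value turns to role/content messages can be sketched as below. This is a hypothetical converter for illustration (train_openthoughts.py remains the canonical one); the role mapping assumes the usual ShareGPT conventions (`human` → `user`, `gpt` → `assistant`):

```python
# Hypothetical ShareGPT -> messages converter (illustrative; see the repo's
# train_openthoughts.py for the tested version).

ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(example):
    """Convert one ShareGPT row ({'conversations': [...]}) to a messages row."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }
```

With Hugging Face `datasets`, this would typically be applied via `dataset.map(sharegpt_to_messages, remove_columns=["conversations"])`.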

### #3: HuggingFaceH4/ultrachat_200k – GENERAL CHAT
- **Size**: 208K multi-turn conversations
- **Format**: messages column - ZERO PREPROCESSING (use the train_sft split)
- **Best For**: General conversational ability
- **Training Recipe**: LR=2e-4, batch=16, epochs=1

### #4: mlabonne/FineTome-100k – CURATED COMPACT
- **Size**: 100K quality-scored examples
- **Format**: conversations (ShareGPT) - NEEDS CONVERSION
- **Best For**: Quick fine-tune with curated quality

### #5: HuggingFaceH4/no_robots – HUMAN-WRITTEN
- **Size**: 9.5K examples (all human-written)
- **Format**: messages column - ZERO PREPROCESSING
- **Best For**: High-quality instruction following
|
|
## How to Train
|
|
### Full Training (Tulu 3 - 940K) – A100 80GB, ~6h
```
python ai-ml/hf-finetuning/train_tulu3.py
```
|
|
### Reasoning Training (OpenThoughts - 114K) – A100 80GB, ~2h
```
python ai-ml/hf-finetuning/train_openthoughts.py
```
|
|
### Quick Test (100 steps) – Any GPU
```
python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push
```
|
|
## LoRA Config (LoRA Without Regret - Schulman 2025)
|
|
| Parameter | Tulu 3 Recipe | OpenThoughts Recipe |
|-----------|---------------|---------------------|
| lora_r | 256 | 256 |
| lora_alpha | 16 | 16 |
| target_modules | all-linear | all-linear |
| learning_rate | 5e-6 | 2e-4 |
| effective_batch | 128 | 16 |
| epochs | 2 | 2 |
| max_seq_length | 4096 | 4096 |
| lr_schedule | linear | cosine |
| packing | True (bfd_split) | True (bfd_split) |
| assistant_only_loss | True | True |
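The table maps onto peft/TRL configuration roughly as sketched below for the Tulu 3 column. This is an illustrative fragment, not the repo's scripts: `assistant_only_loss` and BFD packing assume a recent TRL release, the batch-size split (8 × 16) is one arbitrary way to reach an effective batch of 128, and argument names vary across TRL versions (e.g. `max_length` vs the older `max_seq_length`):

```python
from peft import LoraConfig
from trl import SFTConfig

# Tulu 3 recipe column from the table above (sketch; check names
# against your installed peft/trl versions).
lora_config = LoraConfig(
    r=256,
    lora_alpha=16,
    target_modules="all-linear",  # apply LoRA to every linear layer
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # 8 * 16 = effective batch of 128
    num_train_epochs=2,
    max_length=4096,
    lr_scheduler_type="linear",
    packing=True,                    # best-fit-decreasing sequence packing
    assistant_only_loss=True,        # mask loss on non-assistant tokens
)
```

Note the low lora_alpha relative to lora_r: with r=256, alpha=16 gives a scaling factor of 16/256 = 1/16, per the recipe in the table.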
|
|
## Key Research Sources
- Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
- LoRA Without Regret: Schulman et al., 2025
- Data quality > quantity: arXiv 2402.05123
|
|