refactor: merged structure - model at center, DevSecOps wrapped around it

9d4d5c7 verified 19 days ago

2.77 kB

	# Best Datasets for SFT Fine-Tuning — Verified Guide

	## Dataset Rankings (Quality → Model Performance)

	### #1: allenai/tulu-3-sft-mixture — THE BEST
	- Size: 939K examples from 19 curated sources
	- Format: messages column (role/content) - ZERO PREPROCESSING
	- Sources: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
	- Proven Results on Llama-3.1-8B: MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
	- Training Recipe: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
	- Status: VALIDATED - column format confirmed via hf_inspect_dataset

	### #2: open-thoughts/OpenThoughts-114k — REASONING CoT
	- Size: 114K examples with DeepSeek-R1 reasoning traces
	- Format: conversations column (from/value ShareGPT) - NEEDS CONVERSION
	- Best For: Math, code, science with chain-of-thought
	- Conversion: See train_openthoughts.py
	- Training Recipe: LR=2e-4, batch=16, epochs=2, cosine schedule
	- Status: VALIDATED - format confirmed, converter tested

	### #3: HuggingFaceH4/ultrachat_200k — GENERAL CHAT
	- Size: 208K multi-turn conversations
	- Format: messages column - ZERO PREPROCESSING (use train_sft split)
	- Best For: General conversational ability
	- Training Recipe: LR=2e-4, batch=16, epochs=1

	### #4: mlabonne/FineTome-100k — CURATED COMPACT
	- Size: 100K quality-scored examples
	- Format: conversations (ShareGPT) - NEEDS CONVERSION
	- Best For: Quick fine-tune with curated quality

	### #5: HuggingFaceH4/no_robots — HUMAN-WRITTEN
	- Size: 9.5K examples (all human-written)
	- Format: messages column - ZERO PREPROCESSING
	- Best For: High-quality instruction following

	## How to Train

	### Full Training (Tulu 3 - 940K) — A100 80GB, ~6h
	```
	python ai-ml/hf-finetuning/train_tulu3.py
	```

	### Reasoning Training (OpenThoughts - 114K) — A100 80GB, ~2h
	```
	python ai-ml/hf-finetuning/train_openthoughts.py
	```

	### Quick Test (100 steps) — Any GPU
	```
	python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push
	```

	## LoRA Config (LoRA Without Regret - Schulman 2025)

	\| Parameter \| Tulu 3 Recipe \| OpenThoughts Recipe \|
	\|-----------\|---------------\|---------------------\|
	\| lora_r \| 256 \| 256 \|
	\| lora_alpha \| 16 \| 16 \|
	\| target_modules \| all-linear \| all-linear \|
	\| learning_rate \| 5e-6 \| 2e-4 \|
	\| effective_batch \| 128 \| 16 \|
	\| epochs \| 2 \| 2 \|
	\| max_seq_length \| 4096 \| 4096 \|
	\| lr_schedule \| linear \| cosine \|
	\| packing \| True (bfd_split) \| True (bfd_split) \|
	\| assistant_only_loss \| True \| True \|

	## Key Research Sources
	- Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
	- LoRA Without Regret: Schulman et al., 2025
	- Data quality > quantity: arXiv 2402.05123

	# Best Datasets for SFT Fine-Tuning — Verified Guide

	## Dataset Rankings (Quality → Model Performance)

	### #1: allenai/tulu-3-sft-mixture — THE BEST
	- Size: 939K examples from 19 curated sources
	- Format: messages column (role/content) - ZERO PREPROCESSING
	- Sources: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
	- Proven Results on Llama-3.1-8B: MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
	- Training Recipe: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
	- Status: VALIDATED - column format confirmed via hf_inspect_dataset

	### #2: open-thoughts/OpenThoughts-114k — REASONING CoT
	- Size: 114K examples with DeepSeek-R1 reasoning traces
	- Format: conversations column (from/value ShareGPT) - NEEDS CONVERSION
	- Best For: Math, code, science with chain-of-thought
	- Conversion: See train_openthoughts.py
	- Training Recipe: LR=2e-4, batch=16, epochs=2, cosine schedule
	- Status: VALIDATED - format confirmed, converter tested

	### #3: HuggingFaceH4/ultrachat_200k — GENERAL CHAT
	- Size: 208K multi-turn conversations
	- Format: messages column - ZERO PREPROCESSING (use train_sft split)
	- Best For: General conversational ability
	- Training Recipe: LR=2e-4, batch=16, epochs=1

	### #4: mlabonne/FineTome-100k — CURATED COMPACT
	- Size: 100K quality-scored examples
	- Format: conversations (ShareGPT) - NEEDS CONVERSION
	- Best For: Quick fine-tune with curated quality

	### #5: HuggingFaceH4/no_robots — HUMAN-WRITTEN
	- Size: 9.5K examples (all human-written)
	- Format: messages column - ZERO PREPROCESSING
	- Best For: High-quality instruction following

	## How to Train

	### Full Training (Tulu 3 - 940K) — A100 80GB, ~6h
	```
	python ai-ml/hf-finetuning/train_tulu3.py
	```

	### Reasoning Training (OpenThoughts - 114K) — A100 80GB, ~2h
	```
	python ai-ml/hf-finetuning/train_openthoughts.py
	```

	### Quick Test (100 steps) — Any GPU
	```
	python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push
	```

	## LoRA Config (LoRA Without Regret - Schulman 2025)

	\| Parameter \| Tulu 3 Recipe \| OpenThoughts Recipe \|
	\|-----------\|---------------\|---------------------\|
	\| lora_r \| 256 \| 256 \|
	\| lora_alpha \| 16 \| 16 \|
	\| target_modules \| all-linear \| all-linear \|
	\| learning_rate \| 5e-6 \| 2e-4 \|
	\| effective_batch \| 128 \| 16 \|
	\| epochs \| 2 \| 2 \|
	\| max_seq_length \| 4096 \| 4096 \|
	\| lr_schedule \| linear \| cosine \|
	\| packing \| True (bfd_split) \| True (bfd_split) \|
	\| assistant_only_loss \| True \| True \|

	## Key Research Sources
	- Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
	- LoRA Without Regret: Schulman et al., 2025
	- Data quality > quantity: arXiv 2402.05123