
Best Datasets for SFT Fine-Tuning – Verified Guide

Dataset Rankings (Quality → Model Performance)

#1: allenai/tulu-3-sft-mixture – THE BEST

  • Size: 939K examples from 19 curated sources
  • Format: messages column (role/content) - ZERO PREPROCESSING
  • Sources: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
  • Proven Results on Llama-3.1-8B: MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
  • Training Recipe: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
  • Status: VALIDATED - column format confirmed via hf_inspect_dataset
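
Because the messages column already matches the role/content chat schema TRL expects, a fine-tune can be as small as the sketch below. This is illustrative only, not the repo's train_tulu3.py; the base model id and the per-device/grad-accum split of the 128 effective batch are assumptions.

```python
# Minimal sketch: load Tulu 3 and pass it straight to TRL's SFTTrainer.
# No preprocessing is needed because "messages" is already role/content.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

args = SFTConfig(
    output_dir="tulu3-sft",
    learning_rate=5e-6,               # Tulu 3 recipe
    per_device_train_batch_size=8,    # 8 x 16 grad-accum = effective batch 128
    gradient_accumulation_steps=16,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    max_seq_length=4096,              # named max_length in newer TRL releases
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # assumed base model; swap in your own
    args=args,
    train_dataset=dataset,
)
trainer.train()
```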

#2: open-thoughts/OpenThoughts-114k – REASONING CoT

  • Size: 114K examples with DeepSeek-R1 reasoning traces
  • Format: conversations column (from/value ShareGPT) - NEEDS CONVERSION
  • Best For: Math, code, science with chain-of-thought
  • Conversion: See train_openthoughts.py
  • Training Recipe: LR=2e-4, batch=16, epochs=2, cosine schedule
  • Status: VALIDATED - format confirmed, converter tested
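
The conversion referenced above is a straightforward column remap. A minimal sketch, assuming the standard ShareGPT layout (from/value turns); the repo's train_openthoughts.py may differ in details:

```python
# Remap ShareGPT-style "conversations" (from/value) into the role/content
# "messages" format expected by chat templates and SFTTrainer.
from datasets import load_dataset

ROLE_MAP = {"system": "system", "human": "user", "user": "user",
            "gpt": "assistant", "assistant": "assistant"}

def to_messages(example):
    return {"messages": [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]}

dataset = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
dataset = dataset.map(to_messages, remove_columns=["conversations"])
```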

#3: HuggingFaceH4/ultrachat_200k – GENERAL CHAT

  • Size: 208K multi-turn conversations
  • Format: messages column - ZERO PREPROCESSING (use train_sft split)
  • Best For: General conversational ability
  • Training Recipe: LR=2e-4, batch=16, epochs=1
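
One detail worth flagging: UltraChat keeps its SFT data under a non-default split name, so it must be requested explicitly (illustrative snippet):

```python
from datasets import load_dataset

# UltraChat's SFT data lives in the "train_sft" split, not "train"
ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
```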

#4: mlabonne/FineTome-100k – CURATED COMPACT

  • Size: 100K quality-scored examples
  • Format: conversations (ShareGPT) - NEEDS CONVERSION
  • Best For: Quick fine-tune with curated quality

#5: HuggingFaceH4/no_robots – HUMAN-WRITTEN

  • Size: 9.5K examples (all human-written)
  • Format: messages column - ZERO PREPROCESSING
  • Best For: High-quality instruction following

How to Train

Full Training (Tulu 3 - 940K) – A100 80GB, ~6h

python ai-ml/hf-finetuning/train_tulu3.py

Reasoning Training (OpenThoughts - 114K) – A100 80GB, ~2h

python ai-ml/hf-finetuning/train_openthoughts.py

Quick Test (100 steps) – Any GPU

python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push

LoRA Config (LoRA Without Regret - Schulman 2025)

| Parameter           | Tulu 3 Recipe     | OpenThoughts Recipe |
|---------------------|-------------------|---------------------|
| lora_r              | 256               | 256                 |
| lora_alpha          | 16                | 16                  |
| target_modules      | all-linear        | all-linear          |
| learning_rate       | 5e-6              | 2e-4                |
| effective_batch     | 128               | 16                  |
| epochs              | 2                 | 2                   |
| max_seq_length      | 4096              | 4096                |
| lr_schedule         | linear            | cosine              |
| packing             | True (bfd_split)  | True (bfd_split)    |
| assistant_only_loss | True              | True                |
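
Expressed as a peft LoraConfig, the adapter side of both recipes is identical; a sketch assuming the peft library (task_type is an addition not listed in the table):

```python
from peft import LoraConfig

# Shared adapter settings for both recipes; the recipes differ only in the
# optimizer/schedule settings, which live in SFTConfig rather than here.
lora_config = LoraConfig(
    r=256,                        # lora_r
    lora_alpha=16,
    target_modules="all-linear",  # wrap every linear layer with a LoRA adapter
    task_type="CAUSAL_LM",
)
# Pass as peft_config=lora_config to SFTTrainer.
```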

Key Research Sources

  • Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
  • LoRA Without Regret: Schulman et al., 2025
  • Data quality > quantity: arXiv 2402.05123