Best Datasets for SFT Fine-Tuning - Verified Guide
Dataset Rankings (Quality → Model Performance)
#1: allenai/tulu-3-sft-mixture - BEST OVERALL
- Size: 939K examples from 19 curated sources
- Format: messages column (role/content) - ZERO PREPROCESSING
- Sources: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
- Reported results on Llama-3.1-8B (per the Tulu 3 SFT model card): MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
- Training Recipe: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
- Status: VALIDATED - column format confirmed via hf_inspect_dataset (loading sketch below)
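The messages format can be spot-checked in a few lines. A minimal sketch, assuming the Hugging Face `datasets` library; streaming avoids pulling all 939K rows:

```python
from datasets import load_dataset

# Stream one example from the Tulu 3 SFT mixture instead of downloading the full set.
ds = load_dataset("allenai/tulu-3-sft-mixture", split="train", streaming=True)
example = next(iter(ds))

# Each row carries a `messages` list of {"role", "content"} dicts,
# which TRL's SFTTrainer accepts without any preprocessing.
for message in example["messages"]:
    print(message["role"], "->", message["content"][:80])
```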
#2: open-thoughts/OpenThoughts-114k - REASONING (CoT)
- Size: 114K examples with DeepSeek-R1 reasoning traces
- Format: conversations column (from/value ShareGPT) - NEEDS CONVERSION
- Best For: Math, code, science with chain-of-thought
- Conversion: see train_openthoughts.py (a sketch of the idea follows this list)
- Training Recipe: LR=2e-4, batch=16, epochs=2, cosine schedule
- Status: VALIDATED - format confirmed, converter tested
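The converter itself lives in train_openthoughts.py; the sketch below only illustrates the ShareGPT-to-messages mapping, so the role aliases and the optional system column are assumptions to verify against the actual script and dataset.

```python
from datasets import load_dataset

# Common ShareGPT role aliases mapped to the role/content schema chat templates expect.
ROLE_MAP = {"system": "system", "human": "user", "user": "user",
            "gpt": "assistant", "assistant": "assistant"}

def to_messages(example):
    messages = []
    # Some releases keep a separate system prompt column; prepend it if present (assumption).
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    messages += [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return {"messages": messages}

ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
ds = ds.map(to_messages, remove_columns=ds.column_names)
```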
#3: HuggingFaceH4/ultrachat_200k - GENERAL CHAT
- Size: 208K multi-turn conversations
- Format: messages column - ZERO PREPROCESSING (use the train_sft split; loading snippet below)
- Best For: General conversational ability
- Training Recipe: LR=2e-4, batch=16, epochs=1
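A loading snippet, assuming only the `datasets` library; ultrachat_200k ships several splits and SFT uses train_sft (test_sft for evaluation):

```python
from datasets import load_dataset

# The dataset exposes train_sft/test_sft for SFT; separate *_gen splits also exist.
train = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
eval_set = load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
print(len(train), "training conversations; columns:", train.column_names)
```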
#4: mlabonne/FineTome-100k - COMPACT & CURATED
- Size: 100K quality-scored examples
- Format: conversations (ShareGPT) - NEEDS CONVERSION
- Best For: Quick fine-tune with curated quality (score-filter sketch below)
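A small sketch of pre-filtering by quality score before converting the ShareGPT conversations; it assumes the dataset exposes a numeric score column and uses an arbitrary 4.0 cutoff, so check the actual schema first.

```python
from datasets import load_dataset

ds = load_dataset("mlabonne/FineTome-100k", split="train")

# Keep only the highest-rated rows if a quality score column is present (assumption).
if "score" in ds.column_names:
    ds = ds.filter(lambda ex: ex["score"] >= 4.0)

print(len(ds), "examples retained; conversations still need ShareGPT -> messages conversion")
```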
#5: HuggingFaceH4/no_robots - HUMAN-WRITTEN
- Size: 9.5K examples (all human-written)
- Format: messages column - ZERO PREPROCESSING (chat-template sketch below)
- Best For: High-quality instruction following
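Because no_robots already uses the messages schema, it plugs straight into a tokenizer chat template. A sketch, with a small ungated model standing in for whatever base model is being tuned:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Any chat model with a template works here; Qwen2.5 is just an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
example = load_dataset("HuggingFaceH4/no_robots", split="train")[0]

# Render the conversation exactly as the trainer will serialize it.
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
print(text)
```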
How to Train
Full Training (Tulu 3, 940K) - A100 80GB, ~6h
python ai-ml/hf-finetuning/train_tulu3.py
Reasoning Training (OpenThoughts, 114K) - A100 80GB, ~2h
python ai-ml/hf-finetuning/train_openthoughts.py
Quick Test (100 steps) - Any GPU
python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push
LoRA Config (LoRA Without Regret - Schulman 2025)
| Parameter | Tulu 3 Recipe | OpenThoughts Recipe |
|---|---|---|
| lora_r | 256 | 256 |
| lora_alpha | 16 | 16 |
| target_modules | all-linear | all-linear |
| learning_rate | 5e-6 | 2e-4 |
| effective_batch | 128 | 16 |
| epochs | 2 | 2 |
| max_seq_length | 4096 | 4096 |
| lr_schedule | linear | cosine |
| packing | True (bfd_split) | True (bfd_split) |
| assistant_only_loss | True | True |
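The Tulu 3 column of the table translates roughly into the TRL + PEFT configuration below. This is a sketch, not the contents of train_tulu3.py: it assumes a recent TRL release where SFTConfig exposes max_length, packing_strategy, and assistant_only_loss, and the model name and per-device/accumulation split are illustrative.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(
    r=256,                        # lora_r
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="tulu3-sft-lora",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,   # 8 x 16 = 128 effective batch
    num_train_epochs=2,
    max_length=4096,                  # named max_seq_length in older TRL releases
    lr_scheduler_type="linear",
    packing=True,
    packing_strategy="bfd",           # best-fit-decreasing packing
    assistant_only_loss=True,         # needs a chat template with generation markers
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # illustrative; any causal LM checkpoint
    args=training_args,
    train_dataset=load_dataset("allenai/tulu-3-sft-mixture", split="train"),
    peft_config=peft_config,
)
trainer.train()
```

For the OpenThoughts column, swap learning_rate to 2e-4, the effective batch to 16, and lr_scheduler_type to cosine, per the table above.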
Key Research Sources
- Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
- LoRA Without Regret: Schulman et al., 2025
- Data quality > quantity: arXiv 2402.05123