# Best Datasets for SFT Fine-Tuning - Verified Guide

## Dataset Rankings (Quality → Model Performance)

### #1: allenai/tulu-3-sft-mixture - THE BEST
- **Size**: 939K examples from 19 curated sources
- **Format**: messages column (role/content) - ZERO PREPROCESSING (see the loading check below)
- **Sources**: FLAN v2, Persona MATH, Evol CodeAlpaca, WildChat, Aya, NuminaMath, WildGuard, WildJailbreak, no_robots, OASST1, SciRIFF, etc.
- **Reported Results on Llama-3.1-8B** (from the Tulu 3 SFT model card): MMLU 53.5, GSM8K 79.9, IFEval 63.6, HumanEval 76.8
- **Training Recipe**: LR=5e-6, batch=128, epochs=2, max_seq=4096, linear schedule
- **Status**: VALIDATED - column format confirmed via hf_inspect_dataset
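
A quick sanity check of the zero-preprocessing claim, as a minimal sketch; `datasets.load_dataset` and the dataset id are as published on the Hub, and the printed keys should match the role/content schema described above:

```python
from datasets import load_dataset

# Load the SFT mixture and inspect one example's "messages" column.
ds = load_dataset("allenai/tulu-3-sft-mixture", split="train")
print(ds[0]["messages"][0].keys())  # expect role/content keys, per the schema above
```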

### #2: open-thoughts/OpenThoughts-114k - REASONING CoT
- **Size**: 114K examples with DeepSeek-R1 reasoning traces
- **Format**: conversations column (from/value ShareGPT) - NEEDS CONVERSION
- **Best For**: Math, code, science with chain-of-thought
- **Conversion**: See train_openthoughts.py (a minimal sketch of the mapping follows below)
- **Training Recipe**: LR=2e-4, batch=16, epochs=2, cosine schedule
- **Status**: VALIDATED - format confirmed, converter tested
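
The from/value → role/content mapping is mechanical. This is a hedged sketch of that conversion (train_openthoughts.py itself is not reproduced here), using the standard ShareGPT field names:

```python
# Hypothetical helper: convert ShareGPT-style "conversations" (from/value)
# into the "messages" (role/content) schema that chat templates expect.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(example):
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }

# Usage with datasets.map, dropping the old column:
# ds = ds.map(sharegpt_to_messages, remove_columns=["conversations"])
```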

### #3: HuggingFaceH4/ultrachat_200k - GENERAL CHAT
- **Size**: 208K multi-turn conversations
- **Format**: messages column - ZERO PREPROCESSING (use the train_sft split, as in the one-liner below)
- **Best For**: General conversational ability
- **Training Recipe**: LR=2e-4, batch=16, epochs=1
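
Note that ultrachat_200k ships non-default split names (SFT and generation variants), so loading the wrong split is an easy mistake; train_sft is the split documented on the dataset card:

```python
from datasets import load_dataset

# train_sft is the SFT split; the default "train" split does not exist here.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
```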

### #4: mlabonne/FineTome-100k - CURATED COMPACT
- **Size**: 100K quality-scored examples
- **Format**: conversations (ShareGPT) - NEEDS CONVERSION (same from/value mapping as OpenThoughts above)
- **Best For**: Quick fine-tune with curated quality

### #5: HuggingFaceH4/no_robots - HUMAN-WRITTEN
- **Size**: 9.5K examples (all human-written)
- **Format**: messages column - ZERO PREPROCESSING
- **Best For**: High-quality instruction following

## How to Train

### Full Training (Tulu 3 - 940K) - A100 80GB, ~6h
```
python ai-ml/hf-finetuning/train_tulu3.py
```

### Reasoning Training (OpenThoughts - 114K) - A100 80GB, ~2h
```
python ai-ml/hf-finetuning/train_openthoughts.py
```

### Quick Test (100 steps) - Any GPU
```
python ai-ml/hf-finetuning/train_tulu3.py --max_steps 100 --no_push
```

## LoRA Config (LoRA Without Regret - Schulman 2025)

| Parameter | Tulu 3 Recipe | OpenThoughts Recipe |
|-----------|---------------|---------------------|
| lora_r | 256 | 256 |
| lora_alpha | 16 | 16 |
| target_modules | all-linear | all-linear |
| learning_rate | 5e-6 | 2e-4 |
| effective_batch | 128 | 16 |
| epochs | 2 | 2 |
| max_seq_length | 4096 | 4096 |
| lr_schedule | linear | cosine |
| packing | True (bfd_split) | True (bfd_split) |
| assistant_only_loss | True | True |
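
As a hedged sketch, the Tulu 3 column of this table maps onto peft's LoraConfig plus TRL's SFTConfig roughly as below. Parameter names such as max_seq_length, packing, and assistant_only_loss follow recent TRL releases and should be checked against your installed version; the per-device batch / accumulation split is an assumption (only the product of 128 matters):

```python
from peft import LoraConfig
from trl import SFTConfig

peft_config = LoraConfig(
    r=256,                        # lora_r
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    output_dir="tulu3-sft-lora",       # hypothetical path
    learning_rate=5e-6,
    per_device_train_batch_size=8,     # 8 x 16 accumulation = effective batch 128
    gradient_accumulation_steps=16,
    num_train_epochs=2,
    max_seq_length=4096,
    lr_scheduler_type="linear",
    packing=True,                      # bfd_split-style packing per the table
    assistant_only_loss=True,          # mask loss on non-assistant turns
)
```

Both configs are then passed to trl.SFTTrainer together with the loaded dataset; the OpenThoughts column differs only in learning_rate, batch size, and lr_scheduler_type.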

## Key Research Sources
- Tulu 3: allenai/Llama-3.1-Tulu-3-8B-SFT model card
- LoRA Without Regret: Schulman et al., 2025
- Data quality > quantity: arXiv 2402.05123