shaikhsalman
/

devsecops-platform

+# Model Enhancement — Dataset & Training Recipe vNext
+## What Changed (v1 → v2)
+| Parameter | v1 (Old) | v2 (LoRA Without Regret) | Why |
+|-----------|----------|--------------------------|-----|
+| **Dataset** | ultrachat_200k (5K subset) | **tulu-3-sft-mixture** (940K) | 19 curated sources > single source |
+| **LoRA r** | 16 | **256** | SFT-scale datasets need r=256 to match full FT |
+| **LoRA alpha** | 32 | **16** | Stable scaling with high rank |
+| **Target modules** | q/k/v/o_proj only | **all-linear** | Attention-only underperforms even at higher rank |
+| **Effective batch** | 32 | **16** | LoRA less tolerant of large batches |
+| **Learning rate** | 2e-4 | **2e-4** (same) | 10x full FT rate — correct in v1 |
+| **Packing** | False | **True (bfd_split)** | Preserves all tokens, 2-3x throughput |
+| **assistant_only_loss** | False | **True** | Loss only on assistant tokens |
+| **EOS token** | Not set | **<\|eot_id\|>** | Llama 3.1 chat template |
+| **LR scheduler** | linear | **cosine** | Better convergence for LoRA |
+| **Epochs** | 3 | **1** | 940K examples = 1 epoch sufficient |
+## Dataset Comparison
+| Dataset | Size | Format | Best For | Quality |
+|---------|------|--------|----------|---------|
+| **tulu-3-sft-mixture** | 940K | messages ✅ | General SFT (code, math, IF, safety, science) | ⭐⭐⭐⭐⭐ |
+| **OpenThoughts-114k** | 114K | conversations (needs conversion) | Reasoning, CoT traces | ⭐⭐⭐⭐ |
+| ultrachat_200k | 200K | messages ✅ | Multi-turn chat baseline | ⭐⭐⭐ |
+## Key Research: "LoRA Without Regret" (Schulman et al., 2025)
+Four findings that change how we fine-tune:
+1. **Target ALL linear layers** — not just attention. Increasing rank does NOT compensate for skipping layers.
+2. **Use r=256 for SFT** — sufficient capacity for post-training scale datasets.
+3. **Use 10x higher LR** (2e-4 vs 2e-5 for full FT) — 1/r scaling makes optimal LR rank-independent.
+4. **Keep batch size < 32** — LoRA is less tolerant of large batches. Cannot be mitigated by increasing rank.
+## Recommended Training Matrix
+### SFT (Supervised Fine-Tuning)
+| Model | Dataset | Hardware | Time | Cost |
+|-------|---------|----------|------|------|
+| Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A100 (80GB) | ~6h | ~$24 |
+| Llama-3.1-8B-Instruct | OpenThoughts-114k | A100 (80GB) | ~2h | ~$8 |
+| Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A10G (24GB) + QLoRA | ~12h | ~$24 |
+### GRPO (Reinforcement Learning)
+| Model | Dataset | LoRA r | Hardware |
+|-------|---------|--------|----------|
+| Qwen3-0.6B | OpenR1-Math-220k | 1 | A100 |
+| Llama-3.1-8B-Base | GSM8k | 1-32 | A100 |
+## Source Attribution
+- LoRA Without Regret: Schulman et al., 2025, Thinking Machines Lab
+- tulu-3-sft-mixture: Allen AI, used by Tulu 3 (SOTA open instruction-tuned)
+- OpenThoughts-114k: Open community, reasoning-heavy CoT data
+- LoRA Land: Predibase 2024, 224/310 LoRA models surpassed GPT-4