LESS: Selecting Influential Data for Targeted Instruction Tuning
Paper: [LESS: Selecting Influential Data for Targeted Instruction Tuning](https://arxiv.org/abs/2402.04333)
LoRA warmup checkpoints for Llama-2-7b-hf, trained following the LESS data selection pipeline. These checkpoints are used as the basis for gradient collection and influence scoring.
Four epoch-end checkpoints are provided, one per warmup epoch:
| Checkpoint | Epoch | Step | Loss | Learning Rate |
|---|---|---|---|---|
| checkpoint-106 | 1 | 106 | 0.7571 | 1.80e-05 |
| checkpoint-212 | 2 | 212 | 0.8417 | 1.09e-05 |
| checkpoint-318 | 3 | 318 | 0.7988 | 3.30e-06 |
| checkpoint-424 | 4 | 424 | 0.7691 | 3.05e-10 |
Training data: a 5% warmup fraction of princeton-nlp/less_data, packed with the best-fit-decreasing (BFD) packing strategy.
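As an illustration of what BFD packing does, the sketch below packs variable-length sequences into fixed-capacity token bins: sequences are sorted longest-first, and each goes into the fullest bin that still has room. This is a generic BFD sketch, not the exact packing code used for these checkpoints.

```python
def bfd_pack(lengths, capacity):
    """Best-fit-decreasing bin packing of sequence lengths into token bins.

    Returns the remaining capacity of each bin and, for each sequence
    (in descending-length order), the index of the bin it was placed in.
    """
    bins = []         # remaining capacity per bin
    assignments = []  # bin index per (sorted) sequence
    for length in sorted(lengths, reverse=True):
        # Best fit: among bins with enough room, pick the one with the
        # least remaining space; open a new bin only if none fits.
        best = min(
            (i for i, free in enumerate(bins) if free >= length),
            key=lambda i: bins[i],
            default=None,
        )
        if best is None:
            bins.append(capacity - length)
            assignments.append(len(bins) - 1)
        else:
            bins[best] -= length
            assignments.append(best)
    return bins, assignments
```

For example, with a max sequence length of 8192, `bfd_pack([5000, 4000, 3000, 2000, 1000], 8192)` packs five sequences into two bins instead of five.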
**LoRA configuration**

| Parameter | Value |
|---|---|
| Rank (r) | 128 |
| Alpha | 512 |
| Dropout | 0.1 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Task type | CAUSAL_LM |
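The table above corresponds to a `peft` `LoraConfig` along the following lines (a sketch assuming the standard `peft` API; field names follow its current signature):

```python
from peft import LoraConfig

# LoRA configuration matching the table above: rank-128 adapters with
# alpha 512 on the four attention projections of each transformer block.
lora_config = LoraConfig(
    r=128,
    lora_alpha=512,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```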
**Training configuration**

| Parameter | Value |
|---|---|
| Base model dtype | float32 |
| Training precision | bf16 |
| Epochs | 4 |
| Effective batch size | 128 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 4 |
| Number of GPUs | 8 |
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Max sequence length | 8192 |
| Packing | True |
| Gradient checkpointing | True |
| Optimizer | AdamW (torch) |
| Adam betas | (0.9, 0.999) |
| Adam epsilon | 1e-8 |
| Weight decay | 0.0 |
| Max grad norm | 1.0 |
| Seed | 42 |
| Total training steps | 424 |
| Total tokens seen | ~6.8M |
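The per-checkpoint learning rates in the first table follow from this configuration: the effective batch size is 4 per-device × 4 accumulation steps × 8 GPUs = 128, and the learning rate traces a cosine decay after a linear warmup. The sketch below uses the Hugging Face-style cosine-with-warmup formula (assumed here, with warmup steps taken as ceil(0.05 × 424) = 22) to reproduce the logged values:

```python
import math

def cosine_lr(step, base_lr=2e-5, total_steps=424, warmup_steps=22):
    """Cosine learning-rate schedule with linear warmup (HF-style formula)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

`cosine_lr(106)` ≈ 1.79e-05 and `cosine_lr(212)` ≈ 1.09e-05, matching the table; the final logged value 3.05e-10 corresponds to the step just before the schedule reaches zero (`cosine_lr(423)`).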
Launch command:

```sh
torchrun --nproc_per_node 8 -m examples.less --pdbs 4
```
To load a warmup checkpoint:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the LoRA adapter from the
# desired epoch-end checkpoint via `subfolder`.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(
    base_model,
    "EleutherAI/Llama-2-7b-hf-warmup",
    subfolder="checkpoint-106",
)
```
Base model: meta-llama/Llama-2-7b-hf