Safetensors
qwen2
leave-one-out
loo-domain-knowledge
max_seq_length=16384
lr=2e-5
batch_size=1
grad_accum=16
epochs=1
qlora
quantize=4bit_nf4
lora_rank=64
lora_alpha=128
lora_dropout=0.05
completion_only_loss
eval_size=0.1
cosine_schedule
warmup=0.05
bf16
ddp_workers=2
ray_job=raysubmit_A55M5NnZckrXmfWN
| datasets: | |
| - tinyllms/game24-trajectories | |
| - tinyllms/aime-1983-2023-trajectories | |
| base_model: | |
| - Qwen/Qwen2.5-7B-Instruct | |
| tags: | |
| - leave-one-out | |
| - loo-domain-knowledge | |
| - max_seq_length=16384 | |
| - lr=2e-5 | |
| - batch_size=1 | |
| - grad_accum=16 | |
| - epochs=1 | |
| - qlora | |
| - quantize=4bit_nf4 | |
| - lora_rank=64 | |
| - lora_alpha=128 | |
| - lora_dropout=0.05 | |
| - completion_only_loss | |
| - eval_size=0.1 | |
| - cosine_schedule | |
| - warmup=0.05 | |
| - bf16 | |
| - ddp_workers=2 | |
| - ray_job=raysubmit_A55M5NnZckrXmfWN | |
| # Qwen2.5-7B-Instruct SFT — LOO Domain Knowledge | |
| Fine-tuned from **Qwen/Qwen2.5-7B-Instruct** using QLoRA (4-bit NF4 quantization + LoRA adapters, merged before upload). | |
| This is the **SFT stage** of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately **excluding domain knowledge (GPQA)** data. The held-out domain is later used to measure cross-domain transfer. | |
| ## Training Configuration | |
| - **Learning rate:** 2e-5 (cosine schedule, 5% warmup) | |
| - **Batch size:** 1 per device, gradient accumulation 16 (effective batch size 32 with 2 workers) | |
| - **Epochs:** 1 | |
| - **Max sequence length:** 16384 | |
| - **Precision:** bf16 | |
| - **Weight decay:** 0.01 | |
| ## QLoRA | |
| - **Quantization:** 4-bit NF4 with double quantization | |
| - **LoRA rank:** 64 | |
| - **LoRA alpha:** 128 | |
| - **LoRA dropout:** 0.05 | |
| - **Target modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | |
| ## Loss | |
| - **completion_only_loss:** prompt tokens are masked; loss is computed only on assistant completion tokens | |
| - Dataset is converted from `messages` to `prompt`/`completion` format before training | |
| ## Datasets | |
| Trained on two datasets (domain knowledge held out): | |
| | Dataset | Domain | | |
| |---------|--------| | |
| | `tinyllms/game24-trajectories` | Game of 24 — arithmetic reasoning | | |
| | `tinyllms/aime-1983-2023-trajectories` | AIME — competition math | | |
| Examples exceeding `max_seq_len` are filtered out. A 10% holdout is used for evaluation (eval runs every 10 steps). | |
| ## Leave-One-Out Design | |
| | Domain | Role | | |
| |--------|------| | |
| | Game24 | Train | | |
| | AIME | Train | | |
| | Domain Knowledge (GPQA) | **Held out** | | |
| The GRPO stage follows using `tinyllms/qwen2.5-7b-instruct-grpo-loo-domain-knowledge`, trained on the same two datasets. Transfer is measured by evaluating on GPQA Diamond. | |
| ## Infrastructure | |
| - **GPU:** 2x NVIDIA H100 80GB (DDP) | |
| - **Framework:** TRL 0.29 + Ray Train | |
| - **Tracking:** [Weights & Biases](https://wandb.ai/psr-labs/pocket-sheet-sft/runs/pzs50igz) (project: `pocket-sheet-sft`) | |
| - **Ray Job ID:** raysubmit_A55M5NnZckrXmfWN |